Forgetting Strategies ===================== Three strategies for shrinking the posterior precision matrix before a recursive Bayesian update. Each addresses a limitation of the previous one. The update and sampling steps are unchanged; see :doc:`normal` for those details and :doc:`/howto/decay` for practical tuning guidance. Symbols ------- .. list-table:: :header-rows: 1 :widths: 15 85 * - Symbol - Meaning * - :math:`\boldsymbol{\Lambda}` - Posterior precision matrix before forgetting * - :math:`\bar{\boldsymbol{\Lambda}}` - Precision matrix after forgetting (input to the update step) * - :math:`\gamma` - Forgetting factor, in :math:`(0, 1]` * - :math:`\alpha` - Prior precision scalar * - :math:`\beta` - Noise precision * - :math:`\mathbf{X}` - Design matrix (batch of observations) * - :math:`\bar{\mathbf{X}}` - Filtered design matrix (rank-:math:`q` approximation) * - :math:`q` - Rank of the filtered batch (:math:`q \le \min(n_{\text{batch}}, p)`) * - :math:`p` - Number of features * - :math:`\varepsilon` - Eigenvalue threshold for batch filtering The update loop --------------- The forgetting rule maps :math:`\boldsymbol{\Lambda} \to \bar{\boldsymbol{\Lambda}}`. The subsequent update is standard: .. math:: \boldsymbol{\Lambda}_n &= \bar{\boldsymbol{\Lambda}} + \beta\,\mathbf{X}^\top \mathbf{W} \mathbf{X} \\ \boldsymbol{\eta}_n &= \bar{\boldsymbol{\Lambda}}\,\boldsymbol{\mu}_{\text{old}} + \beta\,\mathbf{X}^\top \mathbf{W}\mathbf{y} \\ \boldsymbol{\mu}_n &= \boldsymbol{\Lambda}_n^{-1}\,\boldsymbol{\eta}_n The three strategies below differ only in the forgetting rule. Exponential forgetting ---------------------- .. math:: \bar{\boldsymbol{\Lambda}} = \gamma\,\boldsymbol{\Lambda} Scalar multiply on the precision. Preserves sparsity, no parameters beyond :math:`\gamma`, reduces to no forgetting when :math:`\gamma = 1`. Equivalent to the predict step of a Kalman filter for a random-walk model with process noise :math:`\mathbf{Q} = (1 - \gamma)\,\boldsymbol{\Lambda}^{-1}`. **Covariance windup.** Each step scales every eigenvalue by :math:`\gamma`. The update replenishes precision only in directions excited by the current batch. Unexcited directions lose :math:`1 - \gamma` per step and never recover; their eigenvalues approach zero and the Cholesky fails. The prior :math:`\alpha\mathbf{I}` decays along with everything else. Stabilized forgetting (Kulhavy--Zarrop) ---------------------------------------- .. math:: \bar{\boldsymbol{\Lambda}} = \gamma\,\boldsymbol{\Lambda} + (1 - \gamma)\,\alpha\,\mathbf{I} After scaling by :math:`\gamma`, a fraction of the prior is added back [1]_. No eigenvalue of :math:`\bar{\boldsymbol{\Lambda}}` can fall below :math:`(1 - \gamma)\,\alpha`. **Convergence.** The prior scalar :math:`s_t` on the diagonal evolves as :math:`s_{t+1} = \gamma\,s_t + (1 - \gamma)\,\alpha`, which has fixed point :math:`s^* = \alpha`. See :doc:`empirical-bayes` for the interaction with EB hyperparameter tuning. **Isotropy.** Every direction loses the same fraction of precision and gets the same floor, regardless of excitation. Directional forgetting (SIFt) ------------------------------ .. math:: \bar{\boldsymbol{\Lambda}} = \boldsymbol{\Lambda} - (1 - \gamma)\, \boldsymbol{\Lambda}\bar{\mathbf{X}}^\top \bigl(\bar{\mathbf{X}}\,\boldsymbol{\Lambda}\, \bar{\mathbf{X}}^\top\bigr)^{-1} \bar{\mathbf{X}}\,\boldsymbol{\Lambda} The correction term projects the precision onto the subspace excited by the batch and removes :math:`1 - \gamma` of it. Directions orthogonal to the batch are unchanged; fully excited directions receive the same :math:`\gamma` scaling as exponential forgetting [2]_. Before applying the formula, the batch is filtered: eigenvalues of the Gram matrix below :math:`\varepsilon` are discarded, leaving a rank-:math:`q` approximation :math:`\bar{\mathbf{X}}` that preserves :math:`\mathbf{X}^\top\!\mathbf{X}` and :math:`\mathbf{X}^\top\!\mathbf{y}` in the surviving subspace [3]_. Precision retention ~~~~~~~~~~~~~~~~~~~ :math:`\bar{\boldsymbol{\Lambda}} \succeq \gamma\,\boldsymbol{\Lambda}`. Directional forgetting retains at least as much precision as exponential forgetting. Strictly more when :math:`q < p` (Proposition 5 in [3]_). Eigenvalue floor ~~~~~~~~~~~~~~~~ After arbitrarily many forget–update cycles: .. math:: \lambda_{\min}(\boldsymbol{\Lambda}_k) \ge \min\!\Bigl( \frac{\varepsilon}{1 - \gamma},\; \lambda_{\min}(\boldsymbol{\Lambda}_0) \Bigr) No prior injection is needed (Theorem 3 in [3]_). Compare with the stabilized floor :math:`(1 - \gamma)\,\alpha`. Computational cost ~~~~~~~~~~~~~~~~~~ **Dense.** The inner Gram :math:`\bar{\mathbf{X}}\,\boldsymbol{\Lambda}\,\bar{\mathbf{X}}^\top` is :math:`q \times q`. Cholesky is :math:`O(q^3)`. The dominant cost is the rank-:math:`q` symmetric update via ``dsyrk`` at :math:`O(p^2 q)`, the same asymptotic cost as the standard RLS update. **Sparse.** If the batch activates :math:`k_0` columns of a sparse design matrix, the linear algebra operates on a :math:`k_0 \times k_0` dense submatrix extracted from the sparse precision. The sparse path uses pivoted Cholesky (``dpstrf``) rather than eigendecomposition, which handles rank deficiency and is faster for the small Gram matrices that arise in practice. Choosing a strategy ------------------- .. list-table:: :header-rows: 1 :widths: 20 25 25 30 * - Strategy - Strengths - Limitations - Best for * - Exponential - Simplest; no extra parameters; preserves sparsity - Covariance windup under non-uniform excitation - Uniformly excited features; prototyping * - Stabilized - Prior floor prevents collapse; integrates with Empirical Bayes - Isotropic; decays rare features at the same rate as common ones - Environments where EB tunes :math:`\alpha`; safety net against collapse * - Directional (SIFt) - Preserves precision in unexcited directions; natural eigenvalue floor - Slightly more compute per step - Sparse / high-dimensional features with heterogeneous excitation Stabilized and directional forgetting address orthogonal problems (prior collapse vs. isotropic decay) and can be combined. Adaptation from SIFt-RLS ------------------------- SIFt-RLS [3]_ is formulated for classical recursive least squares. Adapting it to the precision-parameterized Bayesian estimators in this library required several changes. This section documents each and why correctness is preserved. Precision-only updates ~~~~~~~~~~~~~~~~~~~~~~ Algorithm 1 of [3]_ maintains both the information matrix :math:`R_k` and the covariance :math:`P_k \triangleq R_k^{-1}`, updating the covariance via the matrix inversion lemma (equation 31 in [3]_). Only the precision is kept here: the SIFt step (equation 27) and the information update (equation 29) are both defined on :math:`R_k`, and the posterior mean is recovered by Cholesky solve. The covariance is never formed. :math:`\boldsymbol{\Lambda}` is positive definite at every step (Corollary 1 in [3]_), so the solve is well-conditioned. Gram eigendecomposition instead of SVD ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [3]_ filters the regressor via compact SVD, thresholding singular values below :math:`\sqrt{\varepsilon}` (Section 4.1). The Gram matrix :math:`\mathbf{G} = \mathbf{X}\mathbf{X}^\top` is eigendecomposed instead. Its eigenvalues are the squared singular values of :math:`\mathbf{X}`, so thresholding at :math:`\varepsilon` is equivalent to thresholding singular values at :math:`\sqrt{\varepsilon}`. The eigenvectors of :math:`\mathbf{G}` are the left singular vectors :math:`\mathbf{U}_k`, the only factor needed to form :math:`\bar{\mathbf{X}} = \mathbf{U}_q^\top \mathbf{X}` (:math:`\mathbf{V}_k` need not be computed; footnote 3 in [3]_). Pivoted Cholesky for sparse batch filtering ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When :math:`\mathbf{X}` is sparse with :math:`k_0` active (nonzero) columns, the dense :math:`k_0 \times k_0` active-column Gram :math:`\mathbf{G}_a = \mathbf{X}_a^\top \mathbf{X}_a` is factored via LAPACK pivoted Cholesky (``dpstrf``) with tolerance :math:`\varepsilon`: .. math:: \mathbf{P}^\top \mathbf{G}_a \mathbf{P} = \mathbf{L}_{:r}\,\mathbf{L}_{:r}^\top where :math:`r` is the numerical rank and :math:`\mathbf{P}` is a permutation. Setting :math:`\bar{\mathbf{X}}_a = \mathbf{L}_{:r}[\mathbf{P}^{-1}]^\top` gives :math:`\bar{\mathbf{X}}_a^\top \bar{\mathbf{X}}_a = \mathbf{G}_a`, preserving sufficient statistics in the surviving subspace. The nonzero eigenvalues of :math:`\mathbf{G}_a` are identical to those of the full :math:`p \times p` Gram, so the rank determination is equivalent. Pivoted Cholesky is :math:`O(k_0^2 r)` versus :math:`O(k_0^3)` for eigendecomposition. Symmetry preservation ~~~~~~~~~~~~~~~~~~~~~ Computing the SIFt correction via ``solve(H, w.T)`` produces an asymmetric result due to rounding, and the error accumulates over many steps. Algorithm 1 of [3]_ corrects this with :math:`R_k \leftarrow \tfrac{1}{2}(R_k + R_k^\top)` (line 12). Instead, :math:`\mathbf{H} = \bar{\mathbf{X}}\,\boldsymbol{\Lambda}\,\bar{\mathbf{X}}^\top` is Cholesky-factored as :math:`\mathbf{L}\mathbf{L}^\top` and the correction is computed as :math:`\mathbf{V}^\top\!\mathbf{V}` where :math:`\mathbf{V} = \mathbf{L}^{-1}\mathbf{w}^\top` and :math:`\mathbf{w} = \boldsymbol{\Lambda}\bar{\mathbf{X}}^\top`. :math:`\mathbf{V}^\top\!\mathbf{V}` is a Gram matrix, hence symmetric by construction: .. math:: \mathbf{V}^\top\!\mathbf{V} = \mathbf{w}\,\mathbf{L}^{-\top}\mathbf{L}^{-1}\mathbf{w}^\top = \mathbf{w}\,\mathbf{H}^{-1}\mathbf{w}^\top = \boldsymbol{\Lambda}\bar{\mathbf{X}}^\top (\bar{\mathbf{X}}\,\boldsymbol{\Lambda}\, \bar{\mathbf{X}}^\top)^{-1} \bar{\mathbf{X}}\,\boldsymbol{\Lambda} The rank-:math:`q` update is applied via ``dsyrk``, which writes only one triangle, so no post-hoc symmetrization is needed. Sparse precision downdate ~~~~~~~~~~~~~~~~~~~~~~~~~ [3]_ does not discuss sparse precision matrices. When :math:`\bar{\mathbf{X}}` has nonzero entries only in :math:`k_0` columns, :math:`\boldsymbol{\Lambda}\bar{\mathbf{X}}^\top` is nonzero only in the rows of :math:`\boldsymbol{\Lambda}` that interact with those columns. Call this row set :math:`\mathcal{N}`. The correction :math:`\mathbf{V}^\top\!\mathbf{V}` is a :math:`|\mathcal{N}| \times |\mathcal{N}|` dense block embedded in the :math:`p \times p` sparse matrix; all linear algebra is done on that block. Entries outside :math:`\mathcal{N}` are structurally zero, so no approximation is introduced. References ---------- .. [1] Kulhavy, R. & Zarrop, M. B. (1993). "On a general concept of forgetting." *International Journal of Control*, 58(4), 905--924. .. [2] Cao, L. & Schwartz, H. M. (2000). "A directional forgetting algorithm based on the decomposition of the information matrix." *Automatica*, 36(11), 1725--1731. .. [3] Lai, B. & Bernstein, D. S. (2024). "SIFt-RLS: Subspace of Information Forgetting Recursive Least Squares." *arXiv:2404.10844*.