Forgetting Strategies#
Three strategies for shrinking the posterior precision matrix before a recursive Bayesian update. Each addresses a limitation of the previous one. The update and sampling steps are unchanged; see NormalRegressor for those details and Choosing and Tuning a Decay Rate for practical tuning guidance.
Symbols#
Symbol |
Meaning |
|---|---|
\(\boldsymbol{\Lambda}\) |
Posterior precision matrix before forgetting |
\(\bar{\boldsymbol{\Lambda}}\) |
Precision matrix after forgetting (input to the update step) |
\(\gamma\) |
Forgetting factor, in \((0, 1]\) |
\(\alpha\) |
Prior precision scalar |
\(\beta\) |
Noise precision |
\(\mathbf{X}\) |
Design matrix (batch of observations) |
\(\bar{\mathbf{X}}\) |
Filtered design matrix (rank-\(q\) approximation) |
\(q\) |
Rank of the filtered batch (\(q \le \min(n_{\text{batch}}, p)\)) |
\(p\) |
Number of features |
\(\varepsilon\) |
Eigenvalue threshold for batch filtering |
The update loop#
The forgetting rule maps \(\boldsymbol{\Lambda} \to \bar{\boldsymbol{\Lambda}}\). The subsequent update is standard:
The three strategies below differ only in the forgetting rule.
Exponential forgetting#
Scalar multiply on the precision. Preserves sparsity, no parameters beyond \(\gamma\), reduces to no forgetting when \(\gamma = 1\).
Equivalent to the predict step of a Kalman filter for a random-walk model with process noise \(\mathbf{Q} = (1 - \gamma)\,\boldsymbol{\Lambda}^{-1}\).
Covariance windup. Each step scales every eigenvalue by \(\gamma\). The update replenishes precision only in directions excited by the current batch. Unexcited directions lose \(1 - \gamma\) per step and never recover; their eigenvalues approach zero and the Cholesky fails. The prior \(\alpha\mathbf{I}\) decays along with everything else.
Stabilized forgetting (Kulhavy–Zarrop)#
After scaling by \(\gamma\), a fraction of the prior is added back [1]. No eigenvalue of \(\bar{\boldsymbol{\Lambda}}\) can fall below \((1 - \gamma)\,\alpha\).
Convergence. The prior scalar \(s_t\) on the diagonal evolves as \(s_{t+1} = \gamma\,s_t + (1 - \gamma)\,\alpha\), which has fixed point \(s^* = \alpha\). See EmpiricalBayesNormalRegressor for the interaction with EB hyperparameter tuning.
Isotropy. Every direction loses the same fraction of precision and gets the same floor, regardless of excitation.
Directional forgetting (SIFt)#
The correction term projects the precision onto the subspace excited by the batch and removes \(1 - \gamma\) of it. Directions orthogonal to the batch are unchanged; fully excited directions receive the same \(\gamma\) scaling as exponential forgetting [2].
Before applying the formula, the batch is filtered: eigenvalues of the Gram matrix below \(\varepsilon\) are discarded, leaving a rank-\(q\) approximation \(\bar{\mathbf{X}}\) that preserves \(\mathbf{X}^\top\!\mathbf{X}\) and \(\mathbf{X}^\top\!\mathbf{y}\) in the surviving subspace [3].
Precision retention#
\(\bar{\boldsymbol{\Lambda}} \succeq \gamma\,\boldsymbol{\Lambda}\).
Directional forgetting retains at least as much precision as exponential forgetting. Strictly more when \(q < p\) (Proposition 5 in [3]).
Eigenvalue floor#
After arbitrarily many forget–update cycles:
No prior injection is needed (Theorem 3 in [3]). Compare with the stabilized floor \((1 - \gamma)\,\alpha\).
Computational cost#
Dense.
The inner Gram
\(\bar{\mathbf{X}}\,\boldsymbol{\Lambda}\,\bar{\mathbf{X}}^\top\)
is \(q \times q\). Cholesky is \(O(q^3)\). The dominant
cost is the rank-\(q\) symmetric update via dsyrk at
\(O(p^2 q)\), the same asymptotic cost as the standard RLS
update.
Sparse.
If the batch activates \(k_0\) columns of a sparse design matrix,
the linear algebra operates on a \(k_0 \times k_0\) dense
submatrix extracted from the sparse precision. The sparse path uses
pivoted Cholesky (dpstrf) rather than eigendecomposition, which
handles rank deficiency and is faster for the small Gram matrices
that arise in practice.
Choosing a strategy#
Strategy |
Strengths |
Limitations |
Best for |
|---|---|---|---|
Exponential |
Simplest; no extra parameters; preserves sparsity |
Covariance windup under non-uniform excitation |
Uniformly excited features; prototyping |
Stabilized |
Prior floor prevents collapse; integrates with Empirical Bayes |
Isotropic; decays rare features at the same rate as common ones |
Environments where EB tunes \(\alpha\); safety net against collapse |
Directional (SIFt) |
Preserves precision in unexcited directions; natural eigenvalue floor |
Slightly more compute per step |
Sparse / high-dimensional features with heterogeneous excitation |
Stabilized and directional forgetting address orthogonal problems (prior collapse vs. isotropic decay) and can be combined.
Adaptation from SIFt-RLS#
SIFt-RLS [3] is formulated for classical recursive least squares. Adapting it to the precision-parameterized Bayesian estimators in this library required several changes. This section documents each and why correctness is preserved.
Precision-only updates#
Algorithm 1 of [3] maintains both the information matrix \(R_k\) and the covariance \(P_k \triangleq R_k^{-1}\), updating the covariance via the matrix inversion lemma (equation 31 in [3]). Only the precision is kept here: the SIFt step (equation 27) and the information update (equation 29) are both defined on \(R_k\), and the posterior mean is recovered by Cholesky solve. The covariance is never formed. \(\boldsymbol{\Lambda}\) is positive definite at every step (Corollary 1 in [3]), so the solve is well-conditioned.
Gram eigendecomposition instead of SVD#
[3] filters the regressor via compact SVD, thresholding singular values below \(\sqrt{\varepsilon}\) (Section 4.1). The Gram matrix \(\mathbf{G} = \mathbf{X}\mathbf{X}^\top\) is eigendecomposed instead. Its eigenvalues are the squared singular values of \(\mathbf{X}\), so thresholding at \(\varepsilon\) is equivalent to thresholding singular values at \(\sqrt{\varepsilon}\). The eigenvectors of \(\mathbf{G}\) are the left singular vectors \(\mathbf{U}_k\), the only factor needed to form \(\bar{\mathbf{X}} = \mathbf{U}_q^\top \mathbf{X}\) (\(\mathbf{V}_k\) need not be computed; footnote 3 in [3]).
Pivoted Cholesky for sparse batch filtering#
When \(\mathbf{X}\) is sparse with \(k_0\) active (nonzero)
columns, the dense \(k_0 \times k_0\) active-column Gram
\(\mathbf{G}_a = \mathbf{X}_a^\top \mathbf{X}_a\) is factored
via LAPACK pivoted Cholesky (dpstrf) with tolerance
\(\varepsilon\):
where \(r\) is the numerical rank and \(\mathbf{P}\) is a permutation. Setting \(\bar{\mathbf{X}}_a = \mathbf{L}_{:r}[\mathbf{P}^{-1}]^\top\) gives \(\bar{\mathbf{X}}_a^\top \bar{\mathbf{X}}_a = \mathbf{G}_a\), preserving sufficient statistics in the surviving subspace. The nonzero eigenvalues of \(\mathbf{G}_a\) are identical to those of the full \(p \times p\) Gram, so the rank determination is equivalent. Pivoted Cholesky is \(O(k_0^2 r)\) versus \(O(k_0^3)\) for eigendecomposition.
Symmetry preservation#
Computing the SIFt correction via solve(H, w.T) produces an
asymmetric result due to rounding, and the error accumulates over
many steps. Algorithm 1 of [3] corrects this with
\(R_k \leftarrow \tfrac{1}{2}(R_k + R_k^\top)\) (line 12).
Instead, \(\mathbf{H} = \bar{\mathbf{X}}\,\boldsymbol{\Lambda}\,\bar{\mathbf{X}}^\top\) is Cholesky-factored as \(\mathbf{L}\mathbf{L}^\top\) and the correction is computed as \(\mathbf{V}^\top\!\mathbf{V}\) where \(\mathbf{V} = \mathbf{L}^{-1}\mathbf{w}^\top\) and \(\mathbf{w} = \boldsymbol{\Lambda}\bar{\mathbf{X}}^\top\). \(\mathbf{V}^\top\!\mathbf{V}\) is a Gram matrix, hence symmetric by construction:
The rank-\(q\) update is applied via dsyrk, which writes only
one triangle, so no post-hoc symmetrization is needed.
Sparse precision downdate#
[3] does not discuss sparse precision matrices. When \(\bar{\mathbf{X}}\) has nonzero entries only in \(k_0\) columns, \(\boldsymbol{\Lambda}\bar{\mathbf{X}}^\top\) is nonzero only in the rows of \(\boldsymbol{\Lambda}\) that interact with those columns. Call this row set \(\mathcal{N}\). The correction \(\mathbf{V}^\top\!\mathbf{V}\) is a \(|\mathcal{N}| \times |\mathcal{N}|\) dense block embedded in the \(p \times p\) sparse matrix; all linear algebra is done on that block. Entries outside \(\mathcal{N}\) are structurally zero, so no approximation is introduced.