NormalRegressor#

Bayesian linear regression with known noise precision \(\beta\) (equivalently, known noise variance \(1/\beta\)). The weights have a Gaussian prior, and updates are exact and conjugate, carried out in the precision parameterization.

See also: NormalInverseGammaRegressor (unknown variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).

Symbols#

Symbol	Meaning
\(p\)	Number of features (dimensionality of the weight vector)
\(n\)	Number of observations in a batch
\(\mathbf{w}\)	Weight vector (posterior mean stored as `coef_`)
\(\alpha\)	Prior precision scalar
\(\beta\)	Noise precision (inverse noise variance, \(1/\sigma^2\))
\(\boldsymbol{\Lambda}\)	Posterior precision matrix (stored as `cov_inv_`)
\(\boldsymbol{\mu}\)	Posterior mean of the weight vector (stored as `coef_`)
\(\gamma\)	Learning rate / decay factor, in \((0, 1]\)
\(\mathbf{W}\)	Diagonal matrix of effective sample weights
\(w_i\)	User-supplied sample weight for observation \(i\)

Generative model#

Prior:

\[\mathbf{w} \sim \mathcal{N}(\mathbf{0},\; \alpha^{-1} \mathbf{I})\]

The prior precision matrix is \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\) and the prior mean is \(\boldsymbol{\mu}_0 = \mathbf{0}\).

Likelihood:

\[y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \beta^{-1})\]

Here \(\beta\) is the noise precision (inverse variance), not the variance itself.

References: Bishop (2006) Section 3.3 [1], Murphy (2012) Chapter 7 [2].
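To make the precision parameterization concrete, the following is a minimal NumPy sketch that simulates data from the generative model above. The feature distribution, the sizes, and the values of `alpha` and `beta` are arbitrary choices for illustration and are not taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100_000
alpha, beta = 2.0, 4.0                             # illustrative values only

# Prior draw: w ~ N(0, alpha^{-1} I), so the standard deviation is alpha^(-1/2).
w = rng.normal(scale=alpha ** -0.5, size=p)

# Features are arbitrary here; the model only conditions on them.
X = rng.normal(size=(n, p))

# Likelihood: noise has precision beta, so its standard deviation is beta^(-1/2).
noise = rng.normal(scale=beta ** -0.5, size=n)
y = X @ w + noise

noise_var = noise.var()                            # close to 1 / beta
```

With `beta = 4.0` the noise standard deviation is 0.5, so the empirical noise variance should land near 0.25.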

Update equations#

The posterior is Gaussian: \(\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1})\).

Precision update for a batch of \(n\) observations (the subscript "old" marks the state before the batch):

\[\boldsymbol{\Lambda}_n = \gamma^n \,\boldsymbol{\Lambda}_{\text{old}} + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{X}\]

Mean update (via the information vector):

\[\begin{split}\boldsymbol{\eta}_n &= \gamma^n \,\boldsymbol{\Lambda}_{\text{old}}\,\boldsymbol{\mu}_{\text{old}} + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{y} \\ \boldsymbol{\mu}_n &= \boldsymbol{\Lambda}_n^{-1}\, \boldsymbol{\eta}_n\end{split}\]

The implementation stores \(\boldsymbol{\Lambda}\) and computes \(\boldsymbol{\eta}\) as an intermediate; the mean is recovered by a single Cholesky solve.

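When \(\gamma = 1\) (the default) and no sample weights are supplied, \(\mathbf{W} = \mathbf{I}\) and these reduce to the standard conjugate update \(\boldsymbol{\Lambda}_n = \boldsymbol{\Lambda}_{\text{old}} + \beta\,\mathbf{X}^\top\mathbf{X}\), \(\boldsymbol{\eta}_n = \boldsymbol{\Lambda}_{\text{old}}\,\boldsymbol{\mu}_{\text{old}} + \beta\,\mathbf{X}^\top\mathbf{y}\).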
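The update equations above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the library's code: the function name `conjugate_update` and its signature are hypothetical, and generic `np.linalg.solve` calls on the Cholesky factor stand in for dedicated triangular solves.

```python
import numpy as np

def conjugate_update(Lam_old, mu_old, X, y, beta, gamma=1.0, sample_weight=None):
    """One batch update in the precision parameterization (hypothetical helper)."""
    n = X.shape[0]
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    # Effective weights gamma^(n-1-i): later rows in the batch are decayed less.
    w_eff = w * gamma ** np.arange(n - 1, -1, -1)
    XtW = X.T * w_eff                      # X^T W without forming the diagonal matrix

    Lam_new = gamma**n * Lam_old + beta * XtW @ X
    eta = gamma**n * Lam_old @ mu_old + beta * XtW @ y

    # Recover the mean from Lam_new mu = eta via a Cholesky factor L L^T = Lam_new.
    L = np.linalg.cholesky(Lam_new)
    mu_new = np.linalg.solve(L.T, np.linalg.solve(L, eta))
    return Lam_new, mu_new

# Usage: start from the prior Lambda_0 = alpha I, mu_0 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)
alpha, beta = 1.0, 4.0
Lam_n, mu_n = conjugate_update(alpha * np.eye(3), np.zeros(3), X, y, beta)
```

With \(\gamma = 1\) and a zero-mean prior, the resulting mean coincides with the ridge-style closed form \((\alpha \mathbf{I} + \beta \mathbf{X}^\top \mathbf{X})^{-1} \beta \mathbf{X}^\top \mathbf{y}\).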

Effective weights within a batch#

When \(\gamma < 1\) and a batch of \(n\) observations is processed in a single `partial_fit` call, each observation receives an effective weight that depends on its position in the batch:

\[w_{\text{eff}, i} = w_i \cdot \gamma^{\,n - 1 - i}, \qquad i = 0, 1, \ldots, n{-}1\]

Earlier observations in the batch are decayed more than later ones; the last observation (\(i = n - 1\)) is not decayed at all. This ensures that processing a batch of \(n\) observations in one call gives the same posterior as processing them one at a time with decay between each. When \(\gamma = 1\), all effective weights equal the user-supplied weights (or 1 if none are provided).
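The batch/sequential equivalence can be checked numerically. The sketch below is a self-contained NumPy illustration under simplifying assumptions: user-supplied weights are all 1, the state is tracked as the pair \((\boldsymbol{\Lambda}, \boldsymbol{\eta})\), and the helper name `update` is hypothetical.

```python
import numpy as np

def update(Lam, eta, X, y, beta, gamma):
    """Decayed update of the precision matrix and information vector."""
    n = X.shape[0]
    w_eff = gamma ** np.arange(n - 1, -1, -1)    # gamma^(n-1-i) for i = 0..n-1
    XtW = X.T * w_eff
    return gamma**n * Lam + beta * XtW @ X, gamma**n * eta + beta * XtW @ y

rng = np.random.default_rng(0)
p, n, alpha, beta, gamma = 3, 5, 1.0, 2.0, 0.9
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Lam0, eta0 = alpha * np.eye(p), np.zeros(p)

# One call with the whole batch.
Lam_batch, eta_batch = update(Lam0, eta0, X, y, beta, gamma)

# n calls with one row each; every call decays the running state by gamma.
Lam_seq, eta_seq = Lam0, eta0
for i in range(n):
    Lam_seq, eta_seq = update(Lam_seq, eta_seq, X[i:i + 1], y[i:i + 1], beta, gamma)

batch_matches_sequential = bool(
    np.allclose(Lam_batch, Lam_seq) and np.allclose(eta_batch, eta_seq)
)
```

Because both the precision matrices and the information vectors agree, the recovered posterior means agree as well.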

Sampling#

The `sample` method draws weight vectors from the posterior and projects them through the design matrix:

\[\mathbf{w}_s \sim \mathcal{N}(\boldsymbol{\mu}_n,\; \boldsymbol{\Lambda}_n^{-1}), \qquad \hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s\]

Sampling from the precision parameterization uses \(\mathbf{w}_s = \boldsymbol{\mu}_n + \mathbf{L}^{-\top}\mathbf{z}\) where \(\mathbf{L}\mathbf{L}^\top = \boldsymbol{\Lambda}_n\) and \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).

Note

`sample` produces draws of the expected reward \(\mathbf{X}\mathbf{w}_s\), not noisy reward realizations: observation noise \(\varepsilon \sim \mathcal{N}(0, \beta^{-1})\) is not added. This is what Thompson sampling requires, since it needs samples of the expected value under parameter uncertainty rather than noisy outcomes.
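As a rough illustration of sampling from the precision parameterization, the NumPy sketch below draws weight vectors using a Cholesky factor of \(\boldsymbol{\Lambda}_n\) and projects them through a design matrix. The function name `sample_weights`, the example posterior, and the generic solve used in place of a triangular solve are assumptions made for this example, not the library's implementation.

```python
import numpy as np

def sample_weights(mu, Lam, size, rng):
    """Draw `size` weight vectors from N(mu, Lam^{-1}) via a Cholesky factor of Lam."""
    L = np.linalg.cholesky(Lam)                  # lower triangular, L @ L.T == Lam
    z = rng.standard_normal((mu.shape[0], size))
    # Solve L^T u = z, so u = L^{-T} z has covariance L^{-T} L^{-1} = Lam^{-1}.
    u = np.linalg.solve(L.T, z)
    return mu[:, None] + u                       # shape (p, size), one draw per column

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Lam = np.array([[4.0, 1.0], [1.0, 3.0]])
W_s = sample_weights(mu, Lam, size=200_000, rng=rng)

X = np.array([[1.0, 0.0], [0.5, 0.5]])
y_hat = X @ W_s          # draws of the expected value; no observation noise is added

emp_cov = np.cov(W_s)    # empirical covariance, close to np.linalg.inv(Lam)
```

No explicit inverse of the precision matrix is formed during sampling; the inverse appears only implicitly through the solve against \(\mathbf{L}^\top\).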

Standalone decay#

The `decay` method scales the precision matrix without observing new data:

\[\boldsymbol{\Lambda} \leftarrow \gamma^n \,\boldsymbol{\Lambda}\]

where \(n\) is the number of rows in the `X` argument. The posterior mean is unchanged because \((\gamma^n \boldsymbol{\Lambda})^{-1}(\gamma^n \boldsymbol{\Lambda}\,\boldsymbol{\mu}) = \boldsymbol{\mu}\).

This corresponds to a random-walk state-space model in which each time step multiplies the precision by \(\gamma\), which is the same as inflating the posterior covariance by a factor of \(1/\gamma\). It is exponential forgetting, the simplest of three strategies. See Forgetting Strategies for stabilized and directional alternatives, and Choosing and Tuning a Decay Rate for practical guidance.
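The effect of a standalone decay step can be verified with a few lines of NumPy. The posterior values below are made up for illustration; the sketch only demonstrates the algebra stated above and does not call the library.

```python
import numpy as np

Lam = np.array([[4.0, 1.0], [1.0, 3.0]])    # example posterior precision
mu = np.array([1.0, -2.0])                  # example posterior mean
gamma, n = 0.9, 3                           # decay factor and number of rows in X

eta = Lam @ mu                              # information vector
Lam_decayed = gamma**n * Lam
eta_decayed = gamma**n * eta

# The mean is unchanged: the gamma^n factors cancel in the solve.
mu_decayed = np.linalg.solve(Lam_decayed, eta_decayed)

# The covariance is inflated by 1 / gamma^n in every entry.
cov_ratio = np.linalg.inv(Lam_decayed) / np.linalg.inv(Lam)
```

Here `cov_ratio` is constant and equal to `gamma ** -n`, so uncertainty grows while the point estimate stays put.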

Hyperparameter semantics#

Parameter	Controls	Practical guidance
`alpha`	Prior precision. Sets \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\). Higher values give stronger regularization toward zero.	Start with 1.0. Range: 0.01 to 100.
`beta`	Noise precision \(1/\sigma^2\). Scales the data contribution to the precision matrix. Higher values mean observations are trusted more (lower noise).	Set to \(1/\sigma^2\) if the noise scale is known. Otherwise consider `NormalInverseGammaRegressor`.
`learning_rate`	Decay factor \(\gamma\). Applied per observation during `partial_fit` and per row during `decay`.	1.0 (default) for stationary environments. See Choosing and Tuning a Decay Rate for tuning advice.
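A small NumPy experiment can illustrate how `alpha` and `beta` interact in the stationary case (\(\gamma = 1\)). The helper `posterior` and the synthetic data are hypothetical and exist only for this sketch; it evaluates the closed-form posterior directly rather than calling the library.

```python
import numpy as np

def posterior(X, y, alpha, beta):
    """Closed-form posterior for a zero-mean prior with precision alpha * I."""
    p = X.shape[1]
    Lam = alpha * np.eye(p) + beta * X.T @ X
    mu = np.linalg.solve(Lam, beta * X.T @ y)
    return mu, Lam

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
# Noise standard deviation 0.5 corresponds to beta = 1 / 0.5**2 = 4.
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=20)

# Larger alpha pulls the posterior mean more strongly toward zero.
norms = [np.linalg.norm(posterior(X, y, a, 4.0)[0]) for a in (0.01, 1.0, 100.0)]

# Scaling alpha and beta together leaves the mean unchanged
# but scales the precision, and hence the confidence, by the same factor.
mu_a, Lam_a = posterior(X, y, alpha=1.0, beta=4.0)
mu_b, Lam_b = posterior(X, y, alpha=10.0, beta=40.0)
```

The posterior mean depends only on the ratio `alpha / beta`, while the absolute scale of the two sets how wide the posterior is, which in turn affects how much exploration posterior sampling produces.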