NormalRegressor#

Bayesian linear regression with known noise variance, specified through the noise precision \(\beta = 1/\sigma^2\). Gaussian prior on weights, exact conjugate updates in the precision parameterization.

See also: NormalInverseGammaRegressor (unknown variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).

Symbols#

Symbol	Meaning
\(p\)	Number of features (dimensionality of the weight vector)
\(n\)	Number of observations in a batch
\(\mathbf{w}\)	Weight vector (posterior mean stored as `coef_`)
\(\alpha\)	Prior precision scalar
\(\beta\)	Noise precision (inverse noise variance, \(1/\sigma^2\))
\(\boldsymbol{\Lambda}\)	Posterior precision matrix (stored as `cov_inv_`)
\(\boldsymbol{\mu}\)	Posterior mean of the weight vector (stored as `coef_`)
\(\gamma\)	Learning rate / decay factor, in \((0, 1]\)
\(\mathbf{W}\)	Diagonal matrix of effective sample weights
\(w_i\)	User-supplied sample weight for observation \(i\)

Generative model#

Prior:

\[\mathbf{w} \sim \mathcal{N}(\mathbf{0},\; \alpha^{-1} \mathbf{I})\]

The prior precision matrix is \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\) and the prior mean is \(\boldsymbol{\mu}_0 = \mathbf{0}\).

Likelihood:

\[y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \beta^{-1})\]

Here \(\beta\) is the noise precision (inverse variance), not the variance itself.

References: Bishop (2006) Section 3.3 [1], Murphy (2012) Chapter 7 [2].
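As a rough illustration of the generative model, the following NumPy sketch draws a weight vector from the prior and then simulates observations from the likelihood. The sizes, the seed, and the variable names (`w_true`, `noise_std`) are assumptions made for this example only; the point is that both \(\alpha\) and \(\beta\) enter as precisions, so the corresponding standard deviations are \(\alpha^{-1/2}\) and \(\beta^{-1/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 3, 500            # number of features, number of observations
alpha, beta = 1.0, 4.0   # prior precision, noise precision (illustrative values)

# Prior draw: w ~ N(0, alpha^{-1} I), so each weight has std alpha^{-1/2}.
w_true = rng.normal(0.0, alpha ** -0.5, size=p)

# Likelihood: y_i ~ N(x_i^T w, beta^{-1}), so the noise std is beta^{-1/2}.
noise_std = beta ** -0.5
X = rng.normal(size=(n, p))
y = X @ w_true + rng.normal(0.0, noise_std, size=n)

# The empirical residual spread should be close to beta^{-1/2} = 0.5.
residual_std = np.std(y - X @ w_true)
```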

Update equations#

The posterior is Gaussian: \(\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1})\).

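Precision update for a batch of \(n\) observations, where \(\boldsymbol{\Lambda}_{\text{old}}\) and \(\boldsymbol{\mu}_{\text{old}}\) are the precision and mean before the batch (initially \(\boldsymbol{\Lambda}_0\) and \(\boldsymbol{\mu}_0\)):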

\[\boldsymbol{\Lambda}_n = \gamma^n \,\boldsymbol{\Lambda}_{\text{old}} + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{X}\]

Mean update (via the information vector):

\[\begin{split}\boldsymbol{\eta}_n &= \gamma^n \,\boldsymbol{\Lambda}_{\text{old}}\,\boldsymbol{\mu}_{\text{old}} + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{y} \\ \boldsymbol{\mu}_n &= \boldsymbol{\Lambda}_n^{-1}\, \boldsymbol{\eta}_n\end{split}\]

The implementation stores \(\boldsymbol{\Lambda}\) and computes \(\boldsymbol{\eta}\) as an intermediate; the mean is recovered by a single Cholesky solve.

When \(\gamma = 1\) (the default), these reduce to the standard conjugate update.
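The update equations above can be sketched in plain NumPy as follows. This is a minimal illustration, not the library's implementation: the helper name `conjugate_update` and its signature are hypothetical, and the effective weights are passed in directly as `w_eff` (defaulting to ones). With \(\gamma = 1\) the result should match the closed-form posterior mean \((\alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X})^{-1}\beta\mathbf{X}^\top\mathbf{y}\), and updating in two batches should match updating in one.

```python
import numpy as np

def conjugate_update(Lam_old, mu_old, X, y, beta, gamma=1.0, w_eff=None):
    """One batch update in the precision parameterization (hypothetical helper)."""
    n = X.shape[0]
    w_eff = np.ones(n) if w_eff is None else np.asarray(w_eff, dtype=float)
    decay = gamma ** n
    XtW = X.T * w_eff                      # X^T W for a diagonal W
    Lam_new = decay * Lam_old + beta * (XtW @ X)
    eta = decay * (Lam_old @ mu_old) + beta * (XtW @ y)
    # Recover the mean from Lam_new @ mu = eta via the Cholesky factor.
    # np.linalg.solve keeps the sketch NumPy-only; it does not exploit the
    # triangular structure (scipy.linalg.cho_solve would).
    L = np.linalg.cholesky(Lam_new)
    mu_new = np.linalg.solve(L.T, np.linalg.solve(L, eta))
    return Lam_new, mu_new

rng = np.random.default_rng(0)
n, p, alpha, beta = 20, 3, 1.0, 4.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Single batch starting from the prior (Lambda_0 = alpha I, mu_0 = 0).
Lam_n, mu_n = conjugate_update(alpha * np.eye(p), np.zeros(p), X, y, beta)

# Closed-form conjugate posterior mean for comparison.
mu_closed = np.linalg.solve(alpha * np.eye(p) + beta * X.T @ X, beta * X.T @ y)

# Two half-batches with gamma = 1 should give the same posterior.
Lam_a, mu_a = conjugate_update(alpha * np.eye(p), np.zeros(p), X[:10], y[:10], beta)
Lam_b, mu_b = conjugate_update(Lam_a, mu_a, X[10:], y[10:], beta)
```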

Effective weights within a batch#

When \(\gamma < 1\) and a batch of \(n\) observations is processed in a single `partial_fit` call, each observation receives an effective weight that depends on its position in the batch:

\[w_{\text{eff}, i} = w_i \cdot \gamma^{\,n - 1 - i}, \qquad i = 0, 1, \ldots, n{-}1\]

Earlier observations in the batch are decayed more than later ones. This ensures that processing a batch of \(n\) observations in one call gives the same posterior as processing them one at a time with decay between each. When \(\gamma = 1\), all effective weights equal the user-supplied weights (or 1 if none are provided).
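The batch/sequential equivalence can be checked numerically. The sketch below is an illustration under stated assumptions, not library code: `effective_weights` is a hypothetical helper, the prior mean is zero, and the data are random. It compares one batch update using the position-dependent weights against \(n\) single-observation updates with one decay step each.

```python
import numpy as np

def effective_weights(n, gamma, sample_weight=None):
    """w_eff[i] = w[i] * gamma**(n - 1 - i); the last observation is undecayed."""
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    return w * gamma ** (n - 1 - np.arange(n))

rng = np.random.default_rng(1)
n, p = 5, 2
alpha, beta, gamma = 1.0, 2.0, 0.9
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Batch update from the prior (Lambda_0 = alpha I, mu_0 = 0, so eta_0 = 0).
w_eff = effective_weights(n, gamma)
Lam_batch = gamma ** n * alpha * np.eye(p) + beta * ((X.T * w_eff) @ X)
eta_batch = beta * ((X.T * w_eff) @ y)

# Sequential updates: decay by gamma, then add one observation at a time.
Lam_seq = alpha * np.eye(p)
eta_seq = np.zeros(p)
for x_i, y_i in zip(X, y):
    Lam_seq = gamma * Lam_seq + beta * np.outer(x_i, x_i)
    eta_seq = gamma * eta_seq + beta * y_i * x_i

mu_batch = np.linalg.solve(Lam_batch, eta_batch)
mu_seq = np.linalg.solve(Lam_seq, eta_seq)
```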

Sampling#

The `sample` method draws weight vectors from the posterior and projects them through the design matrix:

\[\mathbf{w}_s \sim \mathcal{N}(\boldsymbol{\mu}_n,\; \boldsymbol{\Lambda}_n^{-1}), \qquad \hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s\]

Sampling from the precision parameterization uses \(\mathbf{w}_s = \boldsymbol{\mu}_n + \mathbf{L}^{-\top}\mathbf{z}\) where \(\mathbf{L}\mathbf{L}^\top = \boldsymbol{\Lambda}_n\) and \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).

Note

`sample` produces draws of the expected reward \(\mathbf{X}\mathbf{w}_s\), not noisy reward realizations: observation noise \(\varepsilon \sim \mathcal{N}(0, \beta^{-1})\) is not added. This is deliberate, because Thompson sampling needs samples of the expected value under parameter uncertainty, not noisy outcomes.
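A minimal NumPy sketch of sampling from the precision parameterization is given below. The function name `sample_weights` and the example values of \(\boldsymbol{\mu}_n\) and \(\boldsymbol{\Lambda}_n\) are assumptions for illustration. It factors \(\boldsymbol{\Lambda}_n = \mathbf{L}\mathbf{L}^\top\), solves \(\mathbf{L}^\top \mathbf{u} = \mathbf{z}\), and checks that the empirical covariance of the draws approaches \(\boldsymbol{\Lambda}_n^{-1}\). No observation noise is added to the projected values.

```python
import numpy as np

def sample_weights(mu, Lam, size, rng):
    """Draw `size` weight vectors from N(mu, Lam^{-1}) (hypothetical helper)."""
    L = np.linalg.cholesky(Lam)                  # lower triangular, L @ L.T == Lam
    z = rng.standard_normal((mu.shape[0], size))
    # u = L^{-T} z has covariance L^{-T} L^{-1} = (L L^T)^{-1} = Lam^{-1}.
    u = np.linalg.solve(L.T, z)
    return mu + u.T                              # shape (size, p)

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Lam = np.array([[4.0, 1.0],
                [1.0, 3.0]])

W = sample_weights(mu, Lam, size=200_000, rng=rng)

# Project through a design matrix: draws of the expected reward X w_s only.
X_new = np.array([[1.0, 0.0],
                  [0.5, 0.5]])
y_hat = W @ X_new.T                              # shape (size, n_rows)

emp_mean = W.mean(axis=0)
emp_cov = np.cov(W, rowvar=False)
```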

Standalone decay#

The `decay` method scales the precision matrix without observing new data:

\[\boldsymbol{\Lambda} \leftarrow \gamma^n \,\boldsymbol{\Lambda}\]

where \(n\) is the number of rows in the `X` argument. The posterior mean is unchanged because \((\gamma^n \boldsymbol{\Lambda})^{-1}(\gamma^n \boldsymbol{\Lambda}\,\boldsymbol{\mu}) = \boldsymbol{\mu}\).
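The effect of a standalone decay step can be verified with a few lines of NumPy. The values of \(\gamma\), \(n\), \(\boldsymbol{\Lambda}\) and \(\boldsymbol{\mu}\) below are arbitrary choices for this sketch. Scaling the precision (and hence the information vector) by \(\gamma^n\) leaves the mean where it was and inflates every entry of the covariance by \(\gamma^{-n}\).

```python
import numpy as np

gamma, n = 0.95, 10                      # decay factor, number of rows in X
Lam = np.array([[4.0, 1.0],
                [1.0, 3.0]])
mu = np.array([1.0, -2.0])

eta = Lam @ mu                           # information vector before decay
scale = gamma ** n
Lam_decayed = scale * Lam
eta_decayed = scale * eta                # equals Lam_decayed @ mu

# The mean recovered from the decayed quantities is unchanged.
mu_decayed = np.linalg.solve(Lam_decayed, eta_decayed)

# The covariance grows by 1 / gamma**n in every entry.
cov_ratio = np.linalg.inv(Lam_decayed) / np.linalg.inv(Lam)
```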

This is equivalent to a random-walk state-space model where the transition shrinks the effective precision by a factor of \(\gamma\) per time step, which inflates the posterior covariance by \(1/\gamma\) and leaves the mean in place. See Choosing and Tuning a Decay Rate for practical guidance.

Hyperparameter semantics#

Parameter	Controls	Practical guidance
`alpha`	Prior precision. Sets \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\). Higher values give stronger regularization toward zero.	Start with 1.0. Range: 0.01 to 100.
`beta`	Noise precision \(1/\sigma^2\). Scales the data contribution to the precision matrix. Higher values mean observations are trusted more (lower noise).	Set to \(1/\sigma^2\) if the noise scale is known. Otherwise consider `NormalInverseGammaRegressor`.
`learning_rate`	Decay factor \(\gamma\). Applied per observation during `partial_fit` and per row during `decay`.	1.0 (default) for stationary environments. See Choosing and Tuning a Decay Rate for tuning advice.
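One way to build intuition for `alpha` and `beta` is the standard correspondence with ridge regression: with \(\gamma = 1\) and a zero prior mean, the posterior mean equals the ridge estimate with penalty \(\alpha / \beta\). The sketch below checks this numerically; the helper names `posterior_mean` and `ridge`, the data, and the chosen values are assumptions for illustration and do not use the library's API.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.5 * rng.normal(size=n)

def posterior_mean(X, y, alpha, beta):
    """Posterior mean for prior N(0, alpha^{-1} I) and noise precision beta."""
    Lam = alpha * np.eye(X.shape[1]) + beta * (X.T @ X)
    return np.linalg.solve(Lam, beta * (X.T @ y))

def ridge(X, y, lam):
    """Ridge estimate minimizing ||y - Xw||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

alpha, beta = 2.0, 4.0
mu = posterior_mean(X, y, alpha, beta)
mu_ridge = ridge(X, y, alpha / beta)     # same estimate, penalty alpha / beta

# A larger alpha pulls the posterior mean more strongly toward zero.
norm_weak = np.linalg.norm(posterior_mean(X, y, 0.01, beta))
norm_strong = np.linalg.norm(posterior_mean(X, y, 100.0, beta))
```

Only the ratio \(\alpha / \beta\) affects the posterior mean in this setting; the absolute scale of the two precisions still determines the posterior covariance and therefore the spread of sampled weights.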