NormalRegressor
Bayesian linear regression with known noise variance (equivalently, known noise precision \(\beta\)). The model places a Gaussian prior on the weights and applies exact conjugate updates in the precision parameterization.
See also: NormalInverseGammaRegressor (unknown variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).
Symbols
| Symbol | Meaning |
|---|---|
| \(p\) | Number of features (dimensionality of the weight vector) |
| \(n\) | Number of observations in a batch |
| \(\mathbf{w}\) | Weight vector (posterior mean stored as …) |
| \(\alpha\) | Prior precision scalar |
| \(\beta\) | Noise precision (inverse noise variance, \(1/\sigma^2\)) |
| \(\boldsymbol{\Lambda}\) | Posterior precision matrix (stored as …) |
| \(\boldsymbol{\mu}\) | Posterior mean of the weight vector (stored as …) |
| \(\gamma\) | Learning rate / decay factor, in \((0, 1]\) |
| \(\mathbf{W}\) | Diagonal matrix of effective sample weights |
| \(w_i\) | User-supplied sample weight for observation \(i\) |
Generative model
Prior:
\[\mathbf{w} \sim \mathcal{N}(\mathbf{0},\, \alpha^{-1}\mathbf{I})\]
The prior precision matrix is \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\) and the prior mean is \(\boldsymbol{\mu}_0 = \mathbf{0}\).
Likelihood:
\[y \mid \mathbf{x}, \mathbf{w} \sim \mathcal{N}(\mathbf{x}^\top \mathbf{w},\, \beta^{-1})\]
Here \(\beta\) is the noise precision (inverse variance), not the variance itself.
References: Bishop (2006) Section 3.3 [1], Murphy (2012) Chapter 7 [2].
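As a rough illustration of the generative model, the following sketch draws a weight vector from the prior and then simulates noisy targets from the likelihood. The sizes, seed, and variable names (`w_true`, `noise_std`) are illustrative assumptions and are not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 3, 200            # features, observations (illustrative sizes)
alpha, beta = 1.0, 4.0   # prior precision, noise precision

# Prior: w ~ N(0, alpha^{-1} I), so each weight has std alpha^{-1/2}
w_true = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=p)

# Likelihood: y | x, w ~ N(x^T w, beta^{-1}).
# beta is a precision, so the noise std is beta^{-1/2} (0.5 here), not beta.
X = rng.normal(size=(n, p))
noise_std = 1.0 / np.sqrt(beta)
y = X @ w_true + rng.normal(0.0, noise_std, size=n)
```

Passing `beta` directly as a standard deviation or variance is the mistake this parameterization most easily invites, which is why the conversion is written out explicitly.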
Update equations
The posterior is Gaussian: \(\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1})\).
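Precision update:
\[\boldsymbol{\Lambda}_n = \gamma^n \boldsymbol{\Lambda}_0 + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{X}\]
Mean update (via the information vector):
\[\boldsymbol{\eta}_n = \gamma^n \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{y}, \qquad \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1} \boldsymbol{\eta}_n\]
Here \(\mathbf{X}\) is the \(n \times p\) design matrix for the batch and \(\mathbf{y}\) is the corresponding vector of \(n\) targets.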
The implementation stores \(\boldsymbol{\Lambda}\) and computes \(\boldsymbol{\eta}\) as an intermediate; the mean is recovered by a single Cholesky solve.
When \(\gamma = 1\) (the default), these reduce to the standard conjugate update.
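For the default \(\gamma = 1\) case, the update can be sketched in NumPy as follows. The function name `conjugate_update` and its signature are hypothetical and are not the library's API. The two `np.linalg.solve` calls stand in for the triangular solves that a dedicated Cholesky routine would perform.

```python
import numpy as np

def conjugate_update(Lambda0, mu0, X, y, beta, sample_weight=None):
    """Return (Lambda_n, mu_n) for the gamma = 1 case (hypothetical helper)."""
    n = X.shape[0]
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    Xw = X * w[:, None]                          # rows scaled by weights, i.e. W X
    Lambda_n = Lambda0 + beta * (X.T @ Xw)       # precision update
    eta_n = Lambda0 @ mu0 + beta * (Xw.T @ y)    # information vector
    L = np.linalg.cholesky(Lambda_n)             # Lambda_n = L L^T
    # Solve L L^T mu = eta in two steps (general solver used for simplicity)
    mu_n = np.linalg.solve(L.T, np.linalg.solve(L, eta_n))
    return Lambda_n, mu_n

rng = np.random.default_rng(0)
p, alpha, beta = 2, 1.0, 4.0
X = rng.normal(size=(50, p))
y = X @ np.array([1.0, -2.0]) + rng.normal(0.0, 0.5, size=50)
Lambda_n, mu_n = conjugate_update(alpha * np.eye(p), np.zeros(p), X, y, beta)
```

With \(\boldsymbol{\mu}_0 = \mathbf{0}\) the resulting mean coincides with a ridge-regression solution whose penalty is \(\alpha / \beta\).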
Effective weights within a batch
When \(\gamma < 1\) and a batch of \(n\) observations is
processed in a single partial_fit call, each observation receives an
effective weight that depends on its position in the batch:
\[W_{ii} = w_i\, \gamma^{\,n-i}, \qquad i = 1, \dots, n\]
Earlier observations in the batch are decayed more than later ones. This ensures that processing a batch of \(n\) observations in one call gives the same posterior as processing them one at a time with decay between each. When \(\gamma = 1\), all effective weights equal the user-supplied weights (or 1 if none are provided).
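The batch and one-at-a-time equivalence can be checked numerically. The sketch below assumes the decay is applied before each observation is absorbed, so the exponents on \(\gamma\) run from \(n-1\) for the first row down to 0 for the last. The helper `decayed_step` is hypothetical and works on the precision matrix and information vector directly.

```python
import numpy as np

def decayed_step(Lambda, eta, X, y, beta, gamma, w=None):
    """One decayed update on (precision, information vector); hypothetical helper."""
    n = X.shape[0]
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    # Effective weights: gamma^(n-1) for the first row, ..., gamma^0 for the last
    w_eff = w * gamma ** np.arange(n - 1, -1, -1)
    Xw = X * w_eff[:, None]
    Lambda_new = gamma**n * Lambda + beta * (X.T @ Xw)
    eta_new = gamma**n * eta + beta * (Xw.T @ y)
    return Lambda_new, eta_new

rng = np.random.default_rng(1)
p, n, alpha, beta, gamma = 3, 5, 1.0, 2.0, 0.9
X, y = rng.normal(size=(n, p)), rng.normal(size=n)
Lambda0, eta0 = alpha * np.eye(p), np.zeros(p)

# One call on the whole batch
Lam_batch, eta_batch = decayed_step(Lambda0, eta0, X, y, beta, gamma)

# n calls on single rows, with decay applied at each step
Lam_seq, eta_seq = Lambda0, eta0
for i in range(n):
    Lam_seq, eta_seq = decayed_step(Lam_seq, eta_seq, X[i:i + 1], y[i:i + 1], beta, gamma)
```

Both routes produce the same precision matrix and information vector up to floating-point error, and therefore the same posterior.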
Sampling
The sample method draws weight vectors from the posterior and
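projects them through the design matrix:
\[\mathbf{w}_s \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1}), \qquad \hat{\mathbf{y}}_s = \mathbf{X}\mathbf{w}_s\]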
Sampling from the precision parameterization uses \(\mathbf{w}_s = \boldsymbol{\mu}_n + \mathbf{L}^{-\top}\mathbf{z}\) where \(\mathbf{L}\mathbf{L}^\top = \boldsymbol{\Lambda}_n\) and \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
Note
sample produces draws of the expected reward
\(\mathbf{X}\mathbf{w}_s\), not noisy reward realizations.
Observation noise \(\varepsilon \sim \mathcal{N}(0, \beta^{-1})\)
is not added. Thompson sampling needs samples of the expected value
under parameter uncertainty, not noisy outcomes.
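A minimal sketch of sampling from the precision parameterization, assuming a dense precision matrix, is shown below. The function name `sample_expected_reward` and the `(size, n_rows)` output layout are assumptions made for illustration and may not match the library's `sample` method.

```python
import numpy as np

def sample_expected_reward(mu, Lambda, X, size, rng):
    """Draw `size` samples of X @ w_s with w_s ~ N(mu, Lambda^{-1}); hypothetical helper."""
    L = np.linalg.cholesky(Lambda)               # lower triangular, L L^T = Lambda
    z = rng.standard_normal((mu.shape[0], size))
    # w_s = mu + L^{-T} z, so Cov(w_s) = L^{-T} L^{-1} = Lambda^{-1}
    w_s = mu[:, None] + np.linalg.solve(L.T, z)
    return (X @ w_s).T                           # expected rewards only; no noise added

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Lambda = np.array([[4.0, 1.0], [1.0, 3.0]])
X = np.eye(2)                                    # identity design: draws equal w_s
draws = sample_expected_reward(mu, Lambda, X, size=100_000, rng=rng)
```

With an identity design matrix, the empirical mean and covariance of `draws` should approach \(\boldsymbol{\mu}_n\) and \(\boldsymbol{\Lambda}_n^{-1}\). Working from the Cholesky factor of the precision avoids forming the covariance matrix explicitly.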
Standalone decay
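The decay method scales the precision matrix without observing new data:
\[\boldsymbol{\Lambda} \leftarrow \gamma^n \boldsymbol{\Lambda}\]
where \(n\) is the number of rows in the X argument. The posterior mean is unchanged because scaling \(\boldsymbol{\Lambda}\) scales the information vector \(\boldsymbol{\eta} = \boldsymbol{\Lambda}\boldsymbol{\mu}\) by the same factor:
\((\gamma^n \boldsymbol{\Lambda})^{-1}(\gamma^n \boldsymbol{\Lambda}\,\boldsymbol{\mu}) = \boldsymbol{\mu}\).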
This is equivalent to a random-walk state-space model where the transition shrinks the effective precision by \(\gamma\) per time step. See Choosing and Tuning a Decay Rate for practical guidance.
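The effect of a standalone decay can be illustrated with a small numerical example. The values below are arbitrary and the variable names are illustrative. The point is that the mean is preserved while the posterior covariance is inflated by \(\gamma^{-n}\).

```python
import numpy as np

gamma, n = 0.95, 10
Lambda = np.array([[5.0, 1.0], [1.0, 2.0]])   # arbitrary positive-definite precision
mu = np.array([0.5, -0.3])
eta = Lambda @ mu                             # information vector

scale = gamma ** n
Lambda_decayed = scale * Lambda               # decayed precision
eta_decayed = scale * eta                     # information vector scales identically

# Mean recovered from the decayed quantities
mu_decayed = np.linalg.solve(Lambda_decayed, eta_decayed)

# Elementwise ratio of decayed to original covariance
cov_ratio = np.linalg.inv(Lambda_decayed) / np.linalg.inv(Lambda)
```

Every entry of `cov_ratio` equals `1 / scale`, so uncertainty grows uniformly in all directions while the point estimate stays where it was.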
Hyperparameter semantics
| Parameter | Controls | Practical guidance |
|---|---|---|
| \(\alpha\) | Prior precision. Sets \(\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}\). Higher values give stronger regularization toward zero. | Start with 1.0. Range: 0.01 to 100. |
| \(\beta\) | Noise precision \(1/\sigma^2\). Scales the data contribution to the precision matrix. Higher values mean observations are trusted more (lower noise). | Set to \(1/\sigma^2\) if the noise scale is known. Otherwise consider … |
| \(\gamma\) | Decay factor \(\gamma\). Applied per observation during … | 1.0 (default) for stationary environments. See Choosing and Tuning a Decay Rate for tuning advice. |