NormalInverseGammaRegressor#
Bayesian linear regression with unknown noise variance. Places a joint Normal-Inverse-Gamma prior on the weights and noise variance. The marginal posterior over weights is a multivariate t, giving heavier-tailed uncertainty when data is scarce.
See also: NormalRegressor (known variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).
Symbols#
| Symbol | Meaning |
|---|---|
| \(p\) | Number of features |
| \(n\) | Number of observations in a batch |
| \(\mathbf{w}\) | Weight vector (posterior mean stored as |
| \(\sigma^2\) | Noise variance (random, integrated out) |
| \(\boldsymbol{\mu}_0\) | Prior mean of the weights |
| \(\boldsymbol{\Lambda}_0\) | Prior precision matrix of the weights (conditioned on \(\sigma^2\)) |
| \(a_0, b_0\) | Prior shape and rate of the Inverse-Gamma on \(\sigma^2\) |
| \(\boldsymbol{\Lambda}_n\) | Posterior precision matrix (stored as |
| \(\boldsymbol{\mu}_n\) | Posterior mean (stored as |
| \(a_n, b_n\) | Posterior shape and rate (stored as |
| \(\gamma\) | Decay factor, in \((0, 1]\) |
| \(w_i\) | Effective sample weight for observation \(i\) |
Generative model#
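Joint prior:

\[
\sigma^2 \sim \mathrm{IG}(a_0, b_0), \qquad
\mathbf{w} \mid \sigma^2 \sim \mathcal{N}\!\left(\boldsymbol{\mu}_0,\; \sigma^2\,\boldsymbol{\Lambda}_0^{-1}\right)
\]

Note that the covariance of \(\mathbf{w}\) scales with \(\sigma^2\).

Likelihood:

\[
\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2 \sim \mathcal{N}\!\left(\mathbf{X}\mathbf{w},\; \sigma^2\,\mathbf{I}_n\right)
\]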
Reference: Murphy (2012) Chapter 7 [1].
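As a rough illustration of the generative model, the NumPy sketch below draws one noise variance and weight vector from a Normal-Inverse-Gamma prior and then simulates targets. It assumes the standard parameterization (\(\sigma^2 \sim \mathrm{IG}(a_0, b_0)\) with \(b_0\) as the rate of the corresponding Gamma on \(1/\sigma^2\)). The function names `sample_nig_prior` and `simulate_targets` are hypothetical and not part of the library.

```python
import numpy as np

def sample_nig_prior(mu0, Lambda0, a0, b0, rng):
    """Draw (w, sigma2) from a Normal-Inverse-Gamma prior (illustrative helper)."""
    # If g ~ Gamma(shape=a0, rate=b0) then 1/g ~ IG(a0, b0).
    # NumPy parameterizes the Gamma by scale, so scale = 1 / rate.
    sigma2 = 1.0 / rng.gamma(shape=a0, scale=1.0 / b0)
    # w | sigma2 ~ N(mu0, sigma2 * Lambda0^{-1}): covariance scales with sigma2.
    cov = sigma2 * np.linalg.inv(Lambda0)
    w = rng.multivariate_normal(mu0, cov)
    return w, sigma2

def simulate_targets(X, w, sigma2, rng):
    """Draw y ~ N(X w, sigma2 * I) (illustrative helper)."""
    return X @ w + np.sqrt(sigma2) * rng.standard_normal(X.shape[0])

rng = np.random.default_rng(0)
p = 3
w, sigma2 = sample_nig_prior(np.zeros(p), np.eye(p), a0=3.0, b0=2.0, rng=rng)
X = rng.standard_normal((5, p))
y = simulate_targets(X, w, sigma2, rng)
```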
Update equations#
Posterior update:
where \(\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_n)\) contains effective sample weights (incorporating within-batch decay, as in NormalRegressor). When \(\gamma = 1\) and all weights are 1, these reduce to the standard NIG conjugate update.
Note that the known noise precision \(\beta\) used by NormalRegressor does not appear in the precision update here. The noise precision \(1/\sigma^2\) is unknown and integrated out, so \(\boldsymbol{\Lambda}_n\) is the conditional precision given \(\sigma^2\) (up to the factor \(1/\sigma^2\)), not the marginal precision.
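For orientation, here is a minimal NumPy sketch of the textbook NIG conjugate update in the no-decay case (\(\gamma = 1\)). Treating sample weights as fractional observation counts (so that \(n/2\) becomes \(\tfrac{1}{2}\sum_i w_i\)) is an assumption of this sketch, and it does not show how the library applies \(\gamma\) to the prior statistics. The function name `nig_update` is hypothetical.

```python
import numpy as np

def nig_update(mu0, Lambda0, a0, b0, X, y, sample_weight=None):
    """Weighted NIG conjugate update without decay (illustrative sketch)."""
    n = X.shape[0]
    sw = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    XtW = X.T * sw                       # X^T W, with W = diag(sw)
    Lambda_n = Lambda0 + XtW @ X         # conditional precision given sigma^2
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + XtW @ y)
    a_n = a0 + 0.5 * sw.sum()            # assumption: weights count as observations
    b_n = b0 + 0.5 * (y @ (sw * y) + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
y = X @ np.array([1.5, -0.5]) + 0.3 * rng.standard_normal(20)
mu0, Lambda0, a0, b0 = np.zeros(2), np.eye(2), 0.1, 0.1
mu_n, Lambda_n, a_n, b_n = nig_update(mu0, Lambda0, a0, b0, X, y)
```

Using `np.linalg.solve` rather than forming \(\boldsymbol{\Lambda}_n^{-1}\) explicitly is a numerical-stability choice of this sketch, not a statement about the library's internals.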
Marginal posterior#
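Integrating out \(\sigma^2\), the marginal posterior of the weights is a multivariate t:

\[
\mathbf{w} \mid \mathbf{X}, \mathbf{y} \sim t_{2a_n}\!\left(\boldsymbol{\mu}_n,\; \frac{b_n}{a_n}\,\boldsymbol{\Lambda}_n^{-1}\right)
\]

with \(2a_n\) degrees of freedom, location \(\boldsymbol{\mu}_n\), and shape (scale) matrix \((b_n / a_n)\,\boldsymbol{\Lambda}_n^{-1}\).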
With little data, \(a_n\) is small, so the t distribution has few degrees of freedom and heavy tails. As \(a_n\) grows, it approaches a Gaussian.
Sampling#
The sample method draws from the multivariate t in three steps:

1. Draw \(u \sim \chi^2(2a_n) / (2a_n)\)
2. Draw \(\mathbf{z} \sim \mathcal{N}(\mathbf{0},\, (b_n/a_n)\,\boldsymbol{\Lambda}_n^{-1})\)
3. Return \(\boldsymbol{\mu}_n + \mathbf{z} / \sqrt{u}\)
Predictions are \(\hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s\), where \(\mathbf{w}_s\) is a sampled weight vector, as in NormalRegressor. No observation noise is added.
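The sampling steps above can be sketched in NumPy as follows. This is an illustration of the scale-mixture construction of the multivariate t, not the library's code. The function name `sample_weights_t` and the example values are hypothetical.

```python
import numpy as np

def sample_weights_t(mu_n, Lambda_n, a_n, b_n, rng, size=1):
    """Draw `size` weight vectors from the multivariate t (illustrative sketch)."""
    df = 2.0 * a_n
    # Step 1: u ~ chi^2(df) / df, one mixing variable per draw.
    u = rng.chisquare(df, size=size) / df
    # Step 2: z ~ N(0, (b_n / a_n) * Lambda_n^{-1}).
    cov = (b_n / a_n) * np.linalg.inv(Lambda_n)
    z = rng.multivariate_normal(np.zeros(len(mu_n)), cov, size=size)
    # Step 3: shift to the location and divide by sqrt(u).
    return mu_n + z / np.sqrt(u)[:, None]

rng = np.random.default_rng(0)
mu_n = np.array([1.0, -2.0])
Lambda_n = 4.0 * np.eye(2)
samples = sample_weights_t(mu_n, Lambda_n, a_n=5.0, b_n=5.0, rng=rng, size=20000)

# Predictions for each sampled weight vector; no observation noise is added.
X = np.array([[1.0, 0.0], [1.0, 1.0]])
y_hat = samples @ X.T  # one row of predictions per sampled weight vector
```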
Decay#
Standalone decay scales all three posterior statistics:
The mean is unchanged. The t distribution widens: fewer effective degrees of freedom and higher scale.
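The effect described here can be checked numerically. The sketch below assumes that one decay step multiplies \(\boldsymbol{\Lambda}_n\), \(a_n\), and \(b_n\) by \(\gamma\). The exact factor the library applies (for example, a power of \(\gamma\) over several steps) is not shown in this section, so treat the helper `decay_nig` as hypothetical.

```python
import numpy as np

def decay_nig(Lambda_n, a_n, b_n, gamma):
    """Scale the three posterior statistics by gamma (assumed form of one decay step)."""
    return gamma * Lambda_n, gamma * a_n, gamma * b_n

def t_scale(Lambda_n, a_n, b_n):
    """Scale matrix of the marginal t: (b_n / a_n) * Lambda_n^{-1}."""
    return (b_n / a_n) * np.linalg.inv(Lambda_n)

Lambda_n = np.array([[4.0, 1.0], [1.0, 3.0]])
a_n, b_n, gamma = 5.0, 2.0, 0.9

Lambda_d, a_d, b_d = decay_nig(Lambda_n, a_n, b_n, gamma)

df_before, df_after = 2.0 * a_n, 2.0 * a_d    # degrees of freedom shrink
scale_before = t_scale(Lambda_n, a_n, b_n)
scale_after = t_scale(Lambda_d, a_d, b_d)      # equals scale_before / gamma
```

Under this assumption the ratio \(b_n / a_n\) is unchanged, so the widening comes entirely from the smaller precision matrix and the reduced degrees of freedom.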
See Choosing and Tuning a Decay Rate for practical guidance.
Hyperparameter semantics#
| Parameter | Controls | Practical guidance |
|---|---|---|
| | Prior mean of the weights. | 0.0 (default). A scalar is broadcast to all features. |
| | Prior precision of the weights (conditioned on \(\sigma^2\)). Scalar gives \(\lambda \mathbf{I}\), vector gives \(\mathrm{diag}(\boldsymbol{\lambda})\). | 1.0 (default). Higher values regularize more strongly. |
| | Prior shape of the IG on \(\sigma^2\). Controls how informative the prior variance estimate is. | 0.1 (default). Small values give a diffuse prior on the noise level. |
| | Prior rate of the IG. The prior mean of \(\sigma^2\) is \(b/(a-1)\) for \(a > 1\). | 0.1 (default). |
| | Decay factor \(\gamma\). See Choosing and Tuning a Decay Rate. | 1.0 (default) for stationary environments. |
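To make the scalar-versus-vector precision convention and the prior mean of \(\sigma^2\) concrete, here is a small NumPy sketch. The helper names `prior_precision` and `prior_noise_mean` are hypothetical and only mirror the descriptions in the table. They are not the library's API.

```python
import numpy as np

def prior_precision(lam, p):
    """Build the prior precision: scalar -> lam * I, vector -> diag(lam)."""
    lam = np.asarray(lam, dtype=float)
    if lam.ndim == 0:
        return float(lam) * np.eye(p)
    if lam.shape != (p,):
        raise ValueError(f"expected a scalar or a vector of length {p}")
    return np.diag(lam)

def prior_noise_mean(a, b):
    """Prior mean of sigma^2 under IG(a, b); returns None when a <= 1 (mean undefined)."""
    if a <= 1:
        return None
    return b / (a - 1)

Lambda_scalar = prior_precision(2.0, 3)              # 2 * I_3
Lambda_vector = prior_precision([1.0, 2.0, 3.0], 3)  # diag(1, 2, 3)
default_mean = prior_noise_mean(0.1, 0.1)            # None: a <= 1
informative_mean = prior_noise_mean(3.0, 2.0)        # 2 / (3 - 1)
```

With the default shape of 0.1 the prior mean of \(\sigma^2\) does not exist, which is consistent with the table's description of small values as a diffuse prior on the noise level.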