NormalInverseGammaRegressor#

Bayesian linear regression with unknown noise variance. Places a joint Normal-Inverse-Gamma prior on the weights and noise variance. The marginal posterior over weights is a multivariate t, giving heavier-tailed uncertainty when data is scarce.

See also: NormalRegressor (known variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).

Symbols#

Symbol	Meaning
\(p\)	Number of features
\(n\)	Number of observations in a batch
\(\mathbf{w}\)	Weight vector (posterior mean stored as `coef_`)
\(\sigma^2\)	Noise variance (random, integrated out)
\(\boldsymbol{\mu}_0\)	Prior mean of the weights
\(\boldsymbol{\Lambda}_0\)	Prior precision matrix of the weights (conditioned on \(\sigma^2\))
\(a_0, b_0\)	Prior shape and rate of the Inverse-Gamma on \(\sigma^2\)
\(\boldsymbol{\Lambda}_n\)	Posterior precision matrix (stored as `cov_inv_`)
\(\boldsymbol{\mu}_n\)	Posterior mean (stored as `coef_`)
\(a_n, b_n\)	Posterior shape and rate (stored as `a_`, `b_`)
\(\gamma\)	Decay factor, in \((0, 1]\)
\(w_i\)	Effective sample weight for observation \(i\)

Generative model#

Joint prior:

\[\begin{split}\mathbf{w} \mid \sigma^2 &\sim \mathcal{N}(\boldsymbol{\mu}_0,\; \sigma^2 \boldsymbol{\Lambda}_0^{-1}) \\ \sigma^2 &\sim \mathrm{IG}(a_0,\, b_0)\end{split}\]

Note the covariance of \(\mathbf{w}\) scales with \(\sigma^2\).

Likelihood:

\[y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \sigma^2)\]

Reference: Murphy (2012) Chapter 7 [1].
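
The generative model above can be simulated directly, which is a useful sanity check on the parameterization. The following is a minimal sketch, not the library's code: the function name `sample_from_prior` and its arguments are hypothetical, and it assumes \(\mathrm{IG}(a_0, b_0)\) means that \(1/\sigma^2\) follows a Gamma distribution with shape \(a_0\) and rate \(b_0\).

```python
import numpy as np

def sample_from_prior(X, mu0, Lambda0, a0, b0, rng):
    """Draw (w, sigma2, y) from the NIG prior and Gaussian likelihood (illustrative helper)."""
    # sigma^2 ~ IG(a0, b0): if g ~ Gamma(shape=a0, rate=b0) then 1/g ~ IG(a0, b0).
    # NumPy's gamma is parameterized by scale, so scale = 1 / rate.
    sigma2 = 1.0 / rng.gamma(shape=a0, scale=1.0 / b0)
    # w | sigma^2 ~ N(mu0, sigma^2 * Lambda0^{-1}): the covariance scales with sigma^2.
    cov_w = sigma2 * np.linalg.inv(Lambda0)
    w = rng.multivariate_normal(mu0, cov_w)
    # y_i | x_i, w, sigma^2 ~ N(x_i^T w, sigma^2)
    y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=X.shape[0])
    return w, sigma2, y

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
w, sigma2, y = sample_from_prior(X, np.zeros(2), np.eye(2), a0=3.0, b0=2.0, rng=rng)
```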

Update equations#

Posterior update:

\[\boldsymbol{\Lambda}_n = \gamma^n\,\boldsymbol{\Lambda}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{X}\]

\[\boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}\!\left( \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{y} \right)\]

\[a_n = \gamma^n\, a_{\text{old}} + \tfrac{1}{2} \sum_i w_i\]

\[b_n = \gamma^n\, b_{\text{old}} + \tfrac{1}{2}\!\left( \mathbf{y}^\top \mathbf{W}\, \mathbf{y} + \gamma^n\,\boldsymbol{\mu}_{\text{old}}^\top \boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_n^\top \boldsymbol{\Lambda}_n\,\boldsymbol{\mu}_n \right)\]

where \(\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_n)\) contains effective sample weights (incorporating within-batch decay, as in NormalRegressor). When \(\gamma = 1\) and all weights are 1, these reduce to the standard NIG conjugate update.

Note that the noise precision \(\beta\) used by NormalRegressor does not appear in the precision update. Here the noise precision \(1/\sigma^2\) is unknown and integrated out, so \(\boldsymbol{\Lambda}_n\) describes the weights conditional on \(\sigma^2\) (the conditional covariance is \(\sigma^2 \boldsymbol{\Lambda}_n^{-1}\)); it is not the marginal precision.
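
The update equations above translate almost line for line into NumPy. This is a sketch under stated assumptions rather than the library's implementation: the function name `nig_update` is hypothetical, and the effective sample weights \(w_i\) are taken as given (how within-batch decay enters them is not reproduced here).

```python
import numpy as np

def nig_update(Lambda_old, mu_old, a_old, b_old, X, y, sample_weight=None, gamma=1.0):
    """One NIG posterior update for a batch (X, y); illustrative helper."""
    n = X.shape[0]
    # Effective sample weights w_i; all ones if none are supplied (assumption).
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    decay = gamma ** n
    XtW = X.T * w  # X^T W with W = diag(w), shape (p, n)

    # Lambda_n = gamma^n Lambda_old + X^T W X
    Lambda_n = decay * Lambda_old + XtW @ X
    # mu_n = Lambda_n^{-1} (gamma^n Lambda_old mu_old + X^T W y), via a linear solve
    mu_n = np.linalg.solve(Lambda_n, decay * (Lambda_old @ mu_old) + XtW @ y)
    # a_n = gamma^n a_old + (1/2) sum_i w_i
    a_n = decay * a_old + 0.5 * w.sum()
    # b_n = gamma^n b_old + (1/2)(y^T W y + gamma^n mu_old^T Lambda_old mu_old - mu_n^T Lambda_n mu_n)
    b_n = decay * b_old + 0.5 * (
        y @ (w * y)
        + decay * (mu_old @ Lambda_old @ mu_old)
        - mu_n @ Lambda_n @ mu_n
    )
    return Lambda_n, mu_n, a_n, b_n

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=20)
Lambda_n, mu_n, a_n, b_n = nig_update(np.eye(2), np.zeros(2), 0.1, 0.1, X, y)
```

With \(\gamma = 1\) and unit weights, applying the update to two consecutive batches gives the same posterior as a single update on the concatenated data, which is a convenient check for any implementation.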

Marginal posterior#

Integrating out \(\sigma^2\), the marginal posterior of the weights is a multivariate t:

\[\mathbf{w} \mid \mathcal{D} \sim t_{2a_n}\!\left( \boldsymbol{\mu}_n,\; \frac{b_n}{a_n}\,\boldsymbol{\Lambda}_n^{-1} \right)\]

with \(2a_n\) degrees of freedom, location \(\boldsymbol{\mu}_n\), and scale matrix \((b_n / a_n)\,\boldsymbol{\Lambda}_n^{-1}\).

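With little data, \(a_n\) is small, so the degrees of freedom \(2a_n\) are low and the tails are heavy. As \(a_n\) grows, the marginal posterior approaches a Gaussian.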
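
As an illustration, the parameters of the marginal t can be computed from the four posterior statistics. The helper name `marginal_t_params` is hypothetical. The covariance line uses the standard multivariate t result that the covariance is \(\nu/(\nu-2)\) times the scale matrix when \(\nu > 2\), which here simplifies to \(b_n/(a_n - 1)\,\boldsymbol{\Lambda}_n^{-1}\) for \(a_n > 1\).

```python
import numpy as np

def marginal_t_params(Lambda_n, a_n, b_n):
    """Degrees of freedom, scale matrix and covariance of the marginal t (illustrative helper)."""
    df = 2.0 * a_n
    scale = (b_n / a_n) * np.linalg.inv(Lambda_n)
    # The covariance of a multivariate t exists only for df > 2 (i.e. a_n > 1).
    cov = df / (df - 2.0) * scale if df > 2.0 else None
    return df, scale, cov

df, scale, cov = marginal_t_params(2.0 * np.eye(2), a_n=3.0, b_n=2.0)
# Heavy-tailed regime: with a_n <= 1 the covariance is undefined.
_, _, cov_small = marginal_t_params(2.0 * np.eye(2), a_n=0.6, b_n=2.0)
```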

Sampling#

The sample method draws from the multivariate t:

1. Draw \(u \sim \chi^2(2a_n) / (2a_n)\)
2. Draw \(\mathbf{z} \sim \mathcal{N}(\mathbf{0},\, (b_n/a_n)\,\boldsymbol{\Lambda}_n^{-1})\)
3. Return \(\boldsymbol{\mu}_n + \mathbf{z} / \sqrt{u}\)

Predictions are \(\hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s\), where \(\mathbf{w}_s\) is a sampled weight vector, as in NormalRegressor. No observation noise is added.
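
The three sampling steps can be sketched in NumPy as follows. The function name `sample_weights` and its signature are hypothetical and need not match the library's `sample` method; the sketch forms \(\boldsymbol{\Lambda}_n^{-1}\) explicitly for readability, which a production implementation would likely avoid.

```python
import numpy as np

def sample_weights(mu_n, Lambda_n, a_n, b_n, rng, size=1):
    """Draw `size` weight vectors from the marginal multivariate t (illustrative helper)."""
    df = 2.0 * a_n
    scale = (b_n / a_n) * np.linalg.inv(Lambda_n)
    # Step 1: u ~ chi^2(2 a_n) / (2 a_n), one draw per sample
    u = rng.chisquare(df, size=size) / df
    # Step 2: z ~ N(0, (b_n / a_n) Lambda_n^{-1}), shape (size, p)
    z = rng.multivariate_normal(np.zeros(len(mu_n)), scale, size=size)
    # Step 3: mu_n + z / sqrt(u)
    return mu_n + z / np.sqrt(u)[:, None]

rng = np.random.default_rng(0)
mu_n = np.array([1.0, -2.0])
Lambda_n = 100.0 * np.eye(2)
W = sample_weights(mu_n, Lambda_n, a_n=5.0, b_n=5.0, rng=rng, size=20000)

# Prediction from a single sampled weight vector; no observation noise is added.
X_new = np.array([[1.0, 0.0], [0.5, 0.5]])
y_hat = X_new @ W[0]
```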

Decay#

Standalone decay scales all three posterior statistics:

\[\begin{split}\boldsymbol{\Lambda} &\leftarrow \gamma^n\,\boldsymbol{\Lambda} \\ a &\leftarrow \gamma^n\, a \\ b &\leftarrow \gamma^n\, b\end{split}\]

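The mean is unchanged. The t distribution widens: the degrees of freedom \(2a\) shrink, and because \(b/a\) is unchanged while \(\boldsymbol{\Lambda}^{-1}\) grows by \(1/\gamma^n\), the scale matrix \((b/a)\,\boldsymbol{\Lambda}^{-1}\) grows by the same factor.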

See Choosing and Tuning a Decay Rate for practical guidance.
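
A small sketch of the standalone decay step and its effect on the marginal t. The function name `decay_posterior` is hypothetical, and treating \(n\) as a plain exponent argument is an assumption made for illustration.

```python
import numpy as np

def decay_posterior(Lambda, a, b, gamma, n=1):
    """Scale Lambda, a and b by gamma**n; the mean is left untouched (illustrative helper)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    factor = gamma ** n
    return factor * Lambda, factor * a, factor * b

Lambda = np.array([[4.0, 1.0], [1.0, 3.0]])
Lambda_d, a_d, b_d = decay_posterior(Lambda, a=5.0, b=2.0, gamma=0.9, n=2)

# Degrees of freedom drop from 2a to 2 * gamma^n * a.
df_before, df_after = 2.0 * 5.0, 2.0 * a_d
# b/a is unchanged, so the scale matrix grows only through Lambda^{-1}.
scale_before = (2.0 / 5.0) * np.linalg.inv(Lambda)
scale_after = (b_d / a_d) * np.linalg.inv(Lambda_d)
```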

Hyperparameter semantics#

Parameter	Controls	Practical guidance
`mu`	Prior mean of the weights.	0.0 (default). A scalar is broadcast to all features.
`lam`	Prior precision of the weights (conditioned on \(\sigma^2\)). Scalar gives \(\lambda \mathbf{I}\), vector gives \(\mathrm{diag}(\boldsymbol{\lambda})\).	1.0 (default). Higher values regularize more strongly.
`a`	Prior shape of the IG on \(\sigma^2\). Controls how informative the prior variance estimate is.	0.1 (default). Small values give a diffuse prior on the noise level.
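`b`	Prior rate of the IG. The prior mean of \(\sigma^2\) is \(b/(a-1)\) for \(a > 1\).	0.1 (default). With the default `a` of 0.1 the prior mean of \(\sigma^2\) does not exist, which reflects a very diffuse prior.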
`learning_rate`	Decay factor \(\gamma\). See Choosing and Tuning a Decay Rate.	1.0 (default) for stationary environments.
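
To make the table concrete, the sketch below shows one way the scalar and vector forms of `mu` and `lam` could expand into \(\boldsymbol{\mu}_0\) and \(\boldsymbol{\Lambda}_0\). The helper `build_prior` is hypothetical and only mirrors the semantics described in the table; it is not the library's constructor.

```python
import numpy as np

def build_prior(p, mu=0.0, lam=1.0, a=0.1, b=0.1):
    """Expand hyperparameters into prior statistics for p features (illustrative helper)."""
    # A scalar mu is broadcast to all features.
    mu0 = np.broadcast_to(np.asarray(mu, dtype=float), (p,)).copy()
    # Scalar lam gives lam * I; a length-p vector gives diag(lam).
    Lambda0 = np.diag(np.broadcast_to(np.asarray(lam, dtype=float), (p,)))
    # The prior mean of sigma^2 is b / (a - 1), defined only for a > 1.
    prior_mean_sigma2 = b / (a - 1.0) if a > 1.0 else None
    return mu0, Lambda0, prior_mean_sigma2

# Defaults: zero mean, identity precision, no finite prior mean for sigma^2.
mu0, Lambda0, m_default = build_prior(3)
# Per-feature regularization and an informative noise prior.
_, Lambda_vec, m_informative = build_prior(2, lam=[1.0, 4.0], a=3.0, b=4.0)
```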
