NormalInverseGammaRegressor

Bayesian linear regression with unknown noise variance. Places a joint Normal-Inverse-Gamma prior on the weights and noise variance. The marginal posterior over weights is a multivariate t, giving heavier-tailed uncertainty when data is scarce.

See also: NormalRegressor (known variance), EmpiricalBayesNormalRegressor (automatic hyperparameter tuning).

Symbols

| Symbol | Meaning |
| --- | --- |
| \(p\) | Number of features |
| \(n\) | Number of observations in a batch |
| \(\mathbf{w}\) | Weight vector (posterior mean stored as coef_) |
| \(\sigma^2\) | Noise variance (random, integrated out) |
| \(\boldsymbol{\mu}_0\) | Prior mean of the weights |
| \(\boldsymbol{\Lambda}_0\) | Prior precision matrix of the weights (conditioned on \(\sigma^2\)) |
| \(a_0, b_0\) | Prior shape and rate of the Inverse-Gamma on \(\sigma^2\) |
| \(\boldsymbol{\Lambda}_n\) | Posterior precision matrix (stored as cov_inv_) |
| \(\boldsymbol{\mu}_n\) | Posterior mean (stored as coef_) |
| \(a_n, b_n\) | Posterior shape and rate (stored as a_, b_) |
| \(\gamma\) | Decay factor, in \((0, 1]\) |
| \(w_i\) | Effective sample weight for observation \(i\) |

Generative model

Joint prior:

\[\begin{split}\mathbf{w} \mid \sigma^2 &\sim \mathcal{N}(\boldsymbol{\mu}_0,\; \sigma^2 \boldsymbol{\Lambda}_0^{-1}) \\ \sigma^2 &\sim \mathrm{IG}(a_0,\, b_0)\end{split}\]

Note that the prior covariance of \(\mathbf{w}\) scales with \(\sigma^2\); this coupling is what makes the joint prior conjugate to the Gaussian likelihood.

Likelihood:

\[y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \sigma^2)\]

Reference: Murphy (2012) Chapter 7 [1].
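The generative model above can be simulated directly, which is a convenient way to produce synthetic data for testing. The following is a minimal sketch, not the library's code: the function name `sample_from_prior` and the values `a0=3.0`, `b0=2.0` are illustrative assumptions. It relies on the fact that \(\sigma^2 \sim \mathrm{IG}(a_0, b_0)\) is equivalent to \(1/\sigma^2 \sim \mathrm{Gamma}(a_0, \text{rate } b_0)\).

```python
import numpy as np


def sample_from_prior(X, mu0, Lambda0, a0, b0, rng):
    """Draw (w, sigma2, y) from the NIG generative model (hypothetical helper)."""
    # sigma^2 ~ IG(a0, b0)  <=>  1 / sigma^2 ~ Gamma(shape=a0, rate=b0)
    sigma2 = 1.0 / rng.gamma(shape=a0, scale=1.0 / b0)
    # w | sigma^2 ~ N(mu0, sigma^2 * Lambda0^{-1})
    cov = sigma2 * np.linalg.inv(Lambda0)
    w = rng.multivariate_normal(mu0, cov)
    # y_i | x_i, w, sigma^2 ~ N(x_i^T w, sigma^2)
    y = X @ w + np.sqrt(sigma2) * rng.standard_normal(X.shape[0])
    return w, sigma2, y


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # 5 observations, 2 features
w, sigma2, y = sample_from_prior(
    X, mu0=np.zeros(2), Lambda0=np.eye(2), a0=3.0, b0=2.0, rng=rng
)
```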

Update equations

Posterior update:

\[\boldsymbol{\Lambda}_n = \gamma^n\,\boldsymbol{\Lambda}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{X}\]
\[\boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}\!\left( \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{y} \right)\]
\[a_n = \gamma^n\, a_{\text{old}} + \tfrac{1}{2} \sum_i w_i\]
\[b_n = \gamma^n\, b_{\text{old}} + \tfrac{1}{2}\!\left( \mathbf{y}^\top \mathbf{W}\, \mathbf{y} + \gamma^n\,\boldsymbol{\mu}_{\text{old}}^\top \boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_n^\top \boldsymbol{\Lambda}_n\,\boldsymbol{\mu}_n \right)\]

where \(\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_n)\) contains the effective sample weights (incorporating within-batch decay, as in NormalRegressor), the subscript "old" marks the statistics held before the batch, and \(n\) is the batch size. When \(\gamma = 1\) and all weights are 1, these reduce to the standard NIG conjugate update.

Note that \(\beta\), which appears in the NormalRegressor precision update, does not appear here. The noise precision \(1/\sigma^2\) is unknown and integrated out; \(\boldsymbol{\Lambda}_n\) is the precision of the weights conditional on \(\sigma^2\), expressed in units of \(1/\sigma^2\) (the conditional covariance is \(\sigma^2 \boldsymbol{\Lambda}_n^{-1}\)), not the marginal precision.
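The four update equations translate almost line for line into NumPy. The sketch below is an illustration under stated assumptions rather than the library's implementation: `nig_update` is a hypothetical name, and the effective sample weights are passed in as a ready-made vector `w` instead of being derived from within-batch decay.

```python
import numpy as np


def nig_update(Lambda_old, mu_old, a_old, b_old, X, y, w, gamma=1.0):
    """One batch update of (Lambda, mu, a, b); w holds the weights w_i."""
    n = X.shape[0]
    g = gamma ** n              # decay applied to the old statistics
    XtW = X.T * w               # X^T W with W = diag(w), shape (p, n)
    Lambda_n = g * Lambda_old + XtW @ X
    rhs = g * (Lambda_old @ mu_old) + XtW @ y
    mu_n = np.linalg.solve(Lambda_n, rhs)
    a_n = g * a_old + 0.5 * w.sum()
    b_n = g * b_old + 0.5 * (
        y @ (w * y)                          # y^T W y
        + g * (mu_old @ Lambda_old @ mu_old)
        - mu_n @ Lambda_n @ mu_n
    )
    return Lambda_n, mu_n, a_n, b_n


# Three observations, two features, unit weights, no decay.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
Lambda_n, mu_n, a_n, b_n = nig_update(
    np.eye(2), np.zeros(2), 0.1, 0.1, X, y, w=np.ones(3)
)
```

Solving the linear system with `np.linalg.solve` avoids forming \(\boldsymbol{\Lambda}_n^{-1}\) explicitly, which is usually the more stable choice.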

Marginal posterior

Integrating out \(\sigma^2\), the marginal posterior of the weights is a multivariate t:

\[\mathbf{w} \mid \mathcal{D} \sim t_{2a_n}\!\left( \boldsymbol{\mu}_n,\; \frac{b_n}{a_n}\,\boldsymbol{\Lambda}_n^{-1} \right)\]

with \(2a_n\) degrees of freedom, location \(\boldsymbol{\mu}_n\), and scale matrix \((b_n / a_n)\,\boldsymbol{\Lambda}_n^{-1}\).

With little data, \(a_n\) is small and the tails are heavy. As observations accumulate, \(a_n\) grows and the distribution approaches a Gaussian.
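As a small illustration, the parameters of the marginal t can be read off the four posterior statistics. The helper names below are hypothetical. The covariance formula uses the standard result that a multivariate t with \(\nu\) degrees of freedom and scale matrix \(\boldsymbol{\Sigma}\) has covariance \(\tfrac{\nu}{\nu - 2}\boldsymbol{\Sigma}\) for \(\nu > 2\), which here simplifies to \(\tfrac{b_n}{a_n - 1}\boldsymbol{\Lambda}_n^{-1}\).

```python
import numpy as np


def marginal_t_params(Lambda_n, mu_n, a_n, b_n):
    """Return (df, location, scale matrix) of the marginal t over the weights."""
    df = 2.0 * a_n
    scale = (b_n / a_n) * np.linalg.inv(Lambda_n)
    return df, mu_n, scale


def marginal_covariance(Lambda_n, a_n, b_n):
    """Covariance of the marginal t; finite only when 2 * a_n > 2."""
    if a_n <= 1.0:
        raise ValueError("no finite covariance for a_n <= 1")
    return (b_n / (a_n - 1.0)) * np.linalg.inv(Lambda_n)


Lambda_n = 2.0 * np.eye(2)
mu_n = np.array([1.0, -1.0])
df, loc, scale = marginal_t_params(Lambda_n, mu_n, a_n=2.0, b_n=4.0)
cov = marginal_covariance(Lambda_n, a_n=2.0, b_n=4.0)
```

The covariance is larger than the scale matrix by the factor \(\nu / (\nu - 2)\), which is one way to quantify the heavier tails at small \(a_n\).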

Sampling

The sample method draws from the multivariate t:

  1. Draw \(u \sim \chi^2(2a_n) / (2a_n)\)

  2. Draw \(\mathbf{z} \sim \mathcal{N}(\mathbf{0},\, (b_n/a_n)\,\boldsymbol{\Lambda}_n^{-1})\)

  3. Return \(\boldsymbol{\mu}_n + \mathbf{z} / \sqrt{u}\)

Predictions are \(\hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s\), where \(\mathbf{w}_s\) is a sampled weight vector, as in NormalRegressor. No observation noise is added.
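The three sampling steps can be sketched as follows. This is an assumption-laden illustration, not the library's `sample` method: the function name `sample_weights`, its signature, and the `size` argument are hypothetical, and the scale matrix is formed by explicit inversion for clarity.

```python
import numpy as np


def sample_weights(Lambda_n, mu_n, a_n, b_n, rng, size=1):
    """Draw `size` weight vectors from the marginal multivariate t."""
    df = 2.0 * a_n
    scale = (b_n / a_n) * np.linalg.inv(Lambda_n)
    # Step 1: u ~ chi^2(2 a_n) / (2 a_n), one draw per sample
    u = rng.chisquare(df, size=size) / df
    # Step 2: z ~ N(0, (b_n / a_n) Lambda_n^{-1}), shape (size, p)
    z = rng.multivariate_normal(np.zeros(len(mu_n)), scale, size=size)
    # Step 3: mu_n + z / sqrt(u)
    return mu_n + z / np.sqrt(u)[:, None]


rng = np.random.default_rng(0)
mu_n = np.array([1.0, -2.0])
draws = sample_weights(10.0 * np.eye(2), mu_n, 50.0, 50.0, rng, size=4000)

# Predictions for new inputs: one column per sampled weight vector,
# with no observation noise added.
X_new = np.array([[1.0, 0.0], [1.0, 1.0]])
preds = X_new @ draws.T
```

Sharing a single \(u\) across all coordinates of one draw is what produces a multivariate t rather than independent univariate t marginals.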

Decay

Standalone decay scales all three posterior statistics:

\[\begin{split}\boldsymbol{\Lambda} &\leftarrow \gamma^n\,\boldsymbol{\Lambda} \\ a &\leftarrow \gamma^n\, a \\ b &\leftarrow \gamma^n\, b\end{split}\]

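The mean is unchanged. The t distribution widens: the degrees of freedom \(2a\) shrink, and because \(b/a\) is unchanged while \(\boldsymbol{\Lambda}^{-1}\) grows by \(1/\gamma^n\), the scale matrix \((b/a)\,\boldsymbol{\Lambda}^{-1}\) grows by the same factor.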
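A short numerical check makes the effect of standalone decay concrete. The sketch below uses a hypothetical `decay` helper and treats `n` as the exponent applied to \(\gamma\); how the library chooses that exponent outside a batch update is not stated here, so it is left as a plain argument.

```python
import numpy as np


def decay(Lambda, a, b, gamma, n=1):
    """Scale the precision, shape, and rate by gamma ** n; the mean is untouched."""
    g = gamma ** n
    return g * Lambda, g * a, g * b


Lambda, a, b = 4.0 * np.eye(2), 3.0, 6.0
Lambda_d, a_d, b_d = decay(Lambda, a, b, gamma=0.5, n=1)

df_before, df_after = 2.0 * a, 2.0 * a_d
scale_before = (b / a) * np.linalg.inv(Lambda)
scale_after = (b_d / a_d) * np.linalg.inv(Lambda_d)
```

With \(\gamma = 0.5\) and one step, the degrees of freedom halve and the scale matrix doubles, while the ratio \(b/a\) stays fixed.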

See Choosing and Tuning a Decay Rate for practical guidance.

Hyperparameter semantics

| Parameter | Controls | Practical guidance |
| --- | --- | --- |
| mu | Prior mean of the weights. | 0.0 (default). A scalar is broadcast to all features. |
| lam | Prior precision of the weights (conditioned on \(\sigma^2\)). Scalar gives \(\lambda \mathbf{I}\), vector gives \(\mathrm{diag}(\boldsymbol{\lambda})\). | 1.0 (default). Higher values regularize more strongly. |
| a | Prior shape of the IG on \(\sigma^2\). Controls how informative the prior variance estimate is. | 0.1 (default). Small values give a diffuse prior on the noise level. |
| b | Prior rate of the IG. The prior mean of \(\sigma^2\) is \(b/(a-1)\) for \(a > 1\). | 0.1 (default). With the default a = 0.1 the prior has no finite mean. |
| learning_rate | Decay factor \(\gamma\). See Choosing and Tuning a Decay Rate. | 1.0 (default) for stationary environments. |
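To illustrate the scalar and vector conventions in the table, the sketch below expands `mu` and `lam` into a full prior mean and precision matrix and evaluates the prior mean of \(\sigma^2\). Both helper names are hypothetical, and the library may perform this expansion differently.

```python
import numpy as np


def build_prior(p, mu=0.0, lam=1.0):
    """Expand scalar or vector mu / lam into (mu_0, Lambda_0) for p features."""
    mu0 = np.broadcast_to(np.asarray(mu, dtype=float), (p,)).copy()
    lam_arr = np.asarray(lam, dtype=float)
    if lam_arr.ndim == 0:
        Lambda0 = lam_arr * np.eye(p)   # scalar -> lambda * I
    else:
        Lambda0 = np.diag(lam_arr)      # vector -> diag(lambda)
    return mu0, Lambda0


def prior_mean_sigma2(a, b):
    """Prior mean b / (a - 1) of sigma^2, or None when a <= 1 (no finite mean)."""
    return b / (a - 1.0) if a > 1.0 else None


mu0, Lambda0 = build_prior(3)                       # defaults: zeros, identity
mu0_v, Lambda0_v = build_prior(2, lam=[1.0, 2.0])   # per-feature precision
m_default = prior_mean_sigma2(0.1, 0.1)             # default a, b
m_informative = prior_mean_sigma2(3.0, 2.0)
```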

References