NormalInverseGammaRegressor
===========================

Bayesian linear regression with unknown noise variance. Places a joint
Normal-Inverse-Gamma prior on the weights and noise variance. The marginal
posterior over weights is a multivariate t, giving heavier-tailed uncertainty
when data is scarce.

See also: :doc:`normal` (known variance), :doc:`empirical-bayes` (automatic
hyperparameter tuning).

Symbols
-------

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Symbol
     - Meaning
   * - :math:`p`
     - Number of features
   * - :math:`n`
     - Number of observations in a batch
   * - :math:`\mathbf{w}`
     - Weight vector (posterior mean stored as ``coef_``)
   * - :math:`\sigma^2`
     - Noise variance (random, integrated out)
   * - :math:`\boldsymbol{\mu}_0`
     - Prior mean of the weights
   * - :math:`\boldsymbol{\Lambda}_0`
     - Prior precision matrix of the weights (conditioned on :math:`\sigma^2`)
   * - :math:`a_0, b_0`
     - Prior shape and rate of the Inverse-Gamma on :math:`\sigma^2`
   * - :math:`\boldsymbol{\Lambda}_n`
     - Posterior precision matrix (stored as ``cov_inv_``)
   * - :math:`\boldsymbol{\mu}_n`
     - Posterior mean (stored as ``coef_``)
   * - :math:`a_n, b_n`
     - Posterior shape and rate (stored as ``a_``, ``b_``)
   * - :math:`\gamma`
     - Decay factor, in :math:`(0, 1]`
   * - :math:`w_i`
     - Effective sample weight for observation :math:`i`

Generative model
----------------

**Joint prior:**

.. math::

   \mathbf{w} \mid \sigma^2 &\sim \mathcal{N}(\boldsymbol{\mu}_0,\; \sigma^2 \boldsymbol{\Lambda}_0^{-1}) \\
   \sigma^2 &\sim \mathrm{IG}(a_0,\, b_0)

Note the covariance of :math:`\mathbf{w}` scales with :math:`\sigma^2`.

**Likelihood:**

.. math::

   y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \sigma^2)

**Reference:** Murphy (2012) Chapter 7 [1]_.

Update equations
----------------

Posterior update:

.. math::

   \boldsymbol{\Lambda}_n = \gamma^n\,\boldsymbol{\Lambda}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{X}

.. math::

   \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}\!\left( \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{y} \right)

.. math::

   a_n = \gamma^n\, a_{\text{old}} + \tfrac{1}{2} \sum_i w_i

.. math::

   b_n = \gamma^n\, b_{\text{old}} + \tfrac{1}{2}\!\left( \mathbf{y}^\top \mathbf{W} \mathbf{y} + \gamma^n\,\boldsymbol{\mu}_{\text{old}}^\top \boldsymbol{\Lambda}_{\text{old}}\, \boldsymbol{\mu}_{\text{old}} - \boldsymbol{\mu}_n^\top \boldsymbol{\Lambda}_n\,\boldsymbol{\mu}_n \right)

where :math:`\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_n)` contains effective
sample weights (incorporating within-batch decay, as in :doc:`normal`).

When :math:`\gamma = 1` and all weights are 1, these reduce to the standard
NIG conjugate update.

Note that :math:`\beta` does not appear in the precision update (contrast with
:doc:`normal`). The noise precision :math:`1/\sigma^2` is unknown and
integrated out; the precision matrix here is the *conditional* precision given
:math:`\sigma^2`, not the marginal precision.

Marginal posterior
------------------

Integrating out :math:`\sigma^2`, the marginal posterior of the weights is a
multivariate t:

.. math::

   \mathbf{w} \mid \mathcal{D} \sim t_{2a_n}\!\left( \boldsymbol{\mu}_n,\; \frac{b_n}{a_n}\,\boldsymbol{\Lambda}_n^{-1} \right)

with :math:`2a_n` degrees of freedom, location :math:`\boldsymbol{\mu}_n`, and
shape covariance :math:`(b_n / a_n)\,\boldsymbol{\Lambda}_n^{-1}`. With little
data, :math:`a_n` is small and the tails are heavy.

Sampling
--------

The ``sample`` method draws from the multivariate t:

1. Draw :math:`u \sim \chi^2(2a_n) / (2a_n)`
2. Draw :math:`\mathbf{z} \sim \mathcal{N}(\mathbf{0},\, (b_n/a_n)\,\boldsymbol{\Lambda}_n^{-1})`
3. Return :math:`\boldsymbol{\mu}_n + \mathbf{z} / \sqrt{u}`

Predictions are :math:`\hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s` as in
:doc:`normal`. No observation noise is added.
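
The update equations and the three sampling steps above can be sketched in
NumPy as follows. This is a minimal illustration, not the library's
implementation: the helper names ``nig_update`` and ``sample_weights`` are
hypothetical, and the effective sample weights are passed in directly
(defaulting to 1) because the within-batch decay scheme is described in
:doc:`normal` and is not reproduced here.

```python
import numpy as np


def nig_update(Lam, mu, a, b, X, y, gamma=1.0, weights=None):
    """One batch update of the NIG statistics (hypothetical helper)."""
    n = X.shape[0]
    # Effective sample weights w_i; assumed to be all ones unless supplied.
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    g = gamma ** n  # batch-level decay applied to the old statistics

    Xw = X * w[:, None]                  # W X: each row of X scaled by w_i
    Lam_n = g * Lam + X.T @ Xw           # posterior precision
    rhs = g * (Lam @ mu) + Xw.T @ y      # gamma^n Lam_old mu_old + X^T W y
    mu_n = np.linalg.solve(Lam_n, rhs)   # posterior mean
    a_n = g * a + 0.5 * w.sum()
    b_n = g * b + 0.5 * (
        y @ (w * y) + g * (mu @ Lam @ mu) - mu_n @ Lam_n @ mu_n
    )
    return Lam_n, mu_n, a_n, b_n


def sample_weights(Lam_n, mu_n, a_n, b_n, rng, size=1):
    """Draw weight vectors from the multivariate t (hypothetical helper)."""
    p = mu_n.shape[0]
    cov = (b_n / a_n) * np.linalg.inv(Lam_n)
    u = rng.chisquare(2.0 * a_n, size=size) / (2.0 * a_n)      # step 1
    z = rng.multivariate_normal(np.zeros(p), cov, size=size)   # step 2
    return mu_n + z / np.sqrt(u)[:, None]                      # step 3


# Usage on synthetic data, starting from the default prior
# (mu=0, lam=1, a=0.1, b=0.1).
rng = np.random.default_rng(0)
p = 3
X = rng.normal(size=(50, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

Lam, mu, a, b = nig_update(np.eye(p), np.zeros(p), 0.1, 0.1, X, y)
draws = sample_weights(Lam, mu, a, b, rng, size=5)
```

Solving the linear system with ``np.linalg.solve`` avoids forming
:math:`\boldsymbol{\Lambda}_n^{-1}` for the mean; the explicit inverse in
``sample_weights`` is kept only for readability.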
Decay
-----

Standalone ``decay`` scales all three posterior statistics:

.. math::

   \boldsymbol{\Lambda} &\leftarrow \gamma^n\,\boldsymbol{\Lambda} \\
   a &\leftarrow \gamma^n\, a \\
   b &\leftarrow \gamma^n\, b

The mean is unchanged. The t distribution widens: fewer effective degrees of
freedom and higher scale.

See :doc:`/howto/decay` for practical guidance.

Hyperparameter semantics
-------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``mu``
     - Prior mean of the weights.
     - 0.0 (default). A scalar is broadcast to all features.
   * - ``lam``
     - Prior precision of the weights (conditioned on :math:`\sigma^2`). Scalar gives :math:`\lambda \mathbf{I}`, vector gives :math:`\mathrm{diag}(\boldsymbol{\lambda})`.
     - 1.0 (default). Higher values regularize more strongly.
   * - ``a``
     - Prior shape of the IG on :math:`\sigma^2`. Controls how informative the prior variance estimate is.
     - 0.1 (default). Small values give a diffuse prior on the noise level.
   * - ``b``
     - Prior rate of the IG. The prior mean of :math:`\sigma^2` is :math:`b/(a-1)` for :math:`a > 1`.
     - 0.1 (default).
   * - ``learning_rate``
     - Decay factor :math:`\gamma`. See :doc:`/howto/decay`.
     - 1.0 (default) for stationary environments.

References
----------

.. [1] Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*,
   Chapter 7. MIT Press.