Bayesian GLM#
Bayesian generalized linear model for non-Gaussian likelihoods (binary and count data). The posterior is not conjugate, so it is approximated by a Gaussian centered at the MAP estimate (a Laplace approximation), with the MAP found by IRLS.
See also: NormalRegressor (exact conjugate, Gaussian likelihood).
Symbols#
| Symbol | Meaning |
|---|---|
| \(p\) | Number of features |
| \(n\) | Number of observations in a batch |
| \(\mathbf{w}\) | Weight vector (MAP estimate stored as |
| \(\alpha\) | Prior precision scalar |
| \(\boldsymbol{\Lambda}\) | Posterior precision matrix (stored as |
| \(\boldsymbol{\eta}\) | Linear predictor \(\mathbf{X}\mathbf{w}\) |
| \(g^{-1}\) | Inverse link function (sigmoid or exp) |
| \(\boldsymbol{\mu}\) | Mean response \(g^{-1}(\boldsymbol{\eta})\) |
| \(\mathbf{W}\) | Diagonal IRLS weight matrix |
| \(\mathbf{z}\) | Working response (pseudo-targets) |
| \(\gamma\) | Decay factor, in \((0, 1]\) |
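Prior#
The weights have an isotropic Gaussian prior with precision \(\alpha\):
\[\mathbf{w} \sim \mathcal{N}\!\left(\mathbf{0},\, \alpha^{-1}\mathbf{I}\right)\]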
Likelihood#
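Logit link (Bernoulli):
\[y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathrm{Bernoulli}\!\left(\sigma(\mathbf{x}_i^\top \mathbf{w})\right), \qquad \sigma(\eta) = \frac{1}{1 + e^{-\eta}}\]
Log link (Poisson):
\[y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathrm{Poisson}\!\left(\exp(\mathbf{x}_i^\top \mathbf{w})\right)\]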
Neither likelihood is conjugate to the Gaussian prior.
Reference: Murphy (2012) Chapter 8 [1].
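The two likelihoods can be sketched in NumPy as follows. This is a minimal illustration, not the library's code: the function names `inverse_link` and `log_likelihood` and the string values `"logit"` and `"log"` are assumptions made for this example.

```python
import numpy as np


def inverse_link(eta, link):
    """Map the linear predictor eta to the mean response mu (hypothetical helper)."""
    if link == "logit":
        return 1.0 / (1.0 + np.exp(-eta))  # sigmoid, values in (0, 1)
    if link == "log":
        return np.exp(eta)  # positive rate
    raise ValueError(f"unknown link: {link}")


def log_likelihood(X, y, w, link):
    """Log-likelihood of y given X and w, up to terms that do not depend on w."""
    eta = X @ w
    if link == "logit":
        # Bernoulli: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        return float(np.sum(y * eta - np.logaddexp(0.0, eta)))
    if link == "log":
        # Poisson: sum_i [ y_i * eta_i - exp(eta_i) ], dropping -log(y_i!)
        return float(np.sum(y * eta - np.exp(eta)))
    raise ValueError(f"unknown link: {link}")
```

Both expressions are linear in \(\mathbf{w}\) only through \(\boldsymbol{\eta}\), and neither is quadratic in \(\mathbf{w}\), which is why the Gaussian prior does not yield a Gaussian posterior.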
Laplace approximation via IRLS#
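The posterior is approximated as:
\[p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \approx \mathcal{N}\!\left(\hat{\mathbf{w}},\, \boldsymbol{\Lambda}^{-1}\right)\]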
where \(\hat{\mathbf{w}}\) is the MAP estimate and \(\boldsymbol{\Lambda}\) is the Hessian of the negative log-posterior evaluated at the MAP.
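As a brief worked derivation of that Hessian, assume \(\gamma = 1\), the two canonical links above, and a Gaussian prior (or previous posterior) with mean \(\mathbf{w}_0\) and precision \(\boldsymbol{\Lambda}_{\text{old}}\). The symbol \(\mathbf{w}_0\) is notation introduced only for this sketch. With \(\mathbf{W} = \mathrm{diag}(d\boldsymbol{\mu}/d\boldsymbol{\eta})\):

\[\begin{split}-\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) &= \tfrac{1}{2}(\mathbf{w} - \mathbf{w}_0)^\top \boldsymbol{\Lambda}_{\text{old}} (\mathbf{w} - \mathbf{w}_0) - \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) + \text{const} \\ \nabla_{\mathbf{w}} &= \boldsymbol{\Lambda}_{\text{old}} (\mathbf{w} - \mathbf{w}_0) - \mathbf{X}^\top (\mathbf{y} - \boldsymbol{\mu}) \\ \nabla^2_{\mathbf{w}} &= \boldsymbol{\Lambda}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{X}\end{split}\]

The gradient and Hessian of the likelihood term take this simple form because both links are canonical for their families.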
The MAP is found by iteratively reweighted least squares (IRLS). Each iteration:
1. Compute the linear predictor, the mean response, and the derivative of the inverse link:
\[\begin{split}\boldsymbol{\eta} &= \mathbf{X}\mathbf{w} \\ \boldsymbol{\mu} &= g^{-1}(\boldsymbol{\eta}) \\ \frac{d\boldsymbol{\mu}}{d\boldsymbol{\eta}} &= \begin{cases} \boldsymbol{\mu}(1 - \boldsymbol{\mu}) & \text{logit} \\ \boldsymbol{\mu} & \text{log} \end{cases}\end{split}\]
2. Form IRLS weights and working response:
\[\begin{split}\mathbf{W} &= \mathrm{diag}\!\left( \frac{d\boldsymbol{\mu}}{d\boldsymbol{\eta}} \right) \\ \mathbf{z} &= \boldsymbol{\eta} + \frac{\mathbf{y} - \boldsymbol{\mu}} {d\boldsymbol{\mu}/d\boldsymbol{\eta}}\end{split}\]
3. Solve the weighted least squares problem:
\[\begin{split}\boldsymbol{\Lambda} &= \gamma^n\,\boldsymbol{\Lambda}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{X} \\ \mathbf{w} &\leftarrow \boldsymbol{\Lambda}^{-1}\!\left( \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}\, \mathbf{w}_{\text{old}} + \mathbf{X}^\top \mathbf{W} \mathbf{z} \right)\end{split}\]
4. Check convergence: \(\|\mathbf{w}_{\text{new}} - \mathbf{w}_{\text{old}}\|_\infty < \texttt{tol}\).
The default runs 5 iterations per partial_fit call
(LaplaceApproximator(n_iter=5)). For fast online updates where the
previous posterior is a good initialization, n_iter=1 (a single
Newton step) is often sufficient.
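The IRLS iteration described in this section can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the library's implementation: the function name `irls_update`, its signature, and the small floor on the weights are hypothetical, and \(\mathbf{w}_{\text{old}}\) is taken to be the posterior mean before the batch, held fixed across iterations.

```python
import numpy as np


def irls_update(X, y, w_old, Lambda_old, link="logit", gamma=1.0, n_iter=5, tol=1e-6):
    """One batch update: returns (w, Lambda). Hypothetical helper; assumes n_iter >= 1."""
    n = X.shape[0]
    prior_prec = (gamma ** n) * Lambda_old  # decayed precision from before the batch
    prior_term = prior_prec @ w_old         # decayed precision-weighted mean
    w = w_old.copy()
    for _ in range(n_iter):
        # Step 1: linear predictor, mean response, inverse-link derivative.
        eta = X @ w
        if link == "logit":
            mu = 1.0 / (1.0 + np.exp(-eta))
            dmu = mu * (1.0 - mu)
        elif link == "log":
            mu = np.exp(eta)
            dmu = mu
        else:
            raise ValueError(f"unknown link: {link}")
        dmu = np.maximum(dmu, 1e-10)  # assumption: floor to avoid division by zero

        # Step 2: working response (the diagonal of W is stored as the vector dmu).
        z = eta + (y - mu) / dmu

        # Step 3: weighted least squares solve.
        Lambda = prior_prec + X.T @ (dmu[:, None] * X)
        w_new = np.linalg.solve(Lambda, prior_term + X.T @ (dmu * z))

        # Step 4: convergence check in the infinity norm.
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    # Lambda was built from the weights of the last iterate that was linearized.
    return w, Lambda
```

Storing the diagonal of \(\mathbf{W}\) as a vector avoids forming an \(n \times n\) matrix, so each iteration costs \(O(p^2 n)\) for the Gram matrix plus a \(p \times p\) solve.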
Sampling#
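Weight vectors are drawn from the Gaussian approximation:
\[\tilde{\mathbf{w}} \sim \mathcal{N}\!\left(\hat{\mathbf{w}},\, \boldsymbol{\Lambda}^{-1}\right)\]
Predictions are transformed through the inverse link:
\[\tilde{\boldsymbol{\mu}} = g^{-1}\!\left(\mathbf{X}\tilde{\mathbf{w}}\right)\]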
For the logit link, samples are probabilities in \((0, 1)\). For the log link, samples are positive rates.
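A minimal NumPy sketch of this sampling step, assuming the MAP estimate and precision matrix are available as arrays. The function name `sample_predictions` and its arguments are hypothetical and not part of the library's API.

```python
import numpy as np


def sample_predictions(X, w_hat, Lambda, link="logit", n_samples=1000, rng=None):
    """Draw posterior samples of the mean response; returns shape (n_samples, n)."""
    rng = np.random.default_rng(rng)
    # Factor Lambda = L L^T. Then w = w_hat + L^{-T} eps has covariance Lambda^{-1}.
    L = np.linalg.cholesky(Lambda)
    eps = rng.standard_normal((Lambda.shape[0], n_samples))
    W_samples = w_hat[:, None] + np.linalg.solve(L.T, eps)  # shape (p, n_samples)

    eta = X @ W_samples  # shape (n, n_samples)
    if link == "logit":
        mu = 1.0 / (1.0 + np.exp(-eta))  # probabilities
    elif link == "log":
        mu = np.exp(eta)  # positive rates
    else:
        raise ValueError(f"unknown link: {link}")
    return mu.T
```

Working from the Cholesky factor of the precision avoids inverting \(\boldsymbol{\Lambda}\) explicitly.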
Decay#
As in NormalRegressor, decay multiplies the previous precision by \(\gamma^n\) before each update, which widens the posterior and leaves its mean unchanged. See Choosing and Tuning a Decay Rate.
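The effect of decay on its own, before any new data are added, can be checked numerically. The sketch below uses made-up values for the precision, mean, \(\gamma\), and \(n\); all variable names are illustrative.

```python
import numpy as np

# Illustrative posterior before decay (values are arbitrary).
Lambda_old = np.array([[4.0, 1.0],
                       [1.0, 3.0]])
w_old = np.array([0.5, -1.0])
gamma, n = 0.99, 10

# Decay scales both the precision and the precision-weighted mean by gamma**n.
Lambda_decayed = (gamma ** n) * Lambda_old
b_decayed = (gamma ** n) * (Lambda_old @ w_old)

# The mean is unchanged because the scale factor cancels in the solve.
w_decayed = np.linalg.solve(Lambda_decayed, b_decayed)

# The covariance grows by a factor of 1 / gamma**n.
cov_old = np.linalg.inv(Lambda_old)
cov_decayed = np.linalg.inv(Lambda_decayed)
```

With \(\gamma = 1\) the scale factor is 1 and the posterior is left untouched, which corresponds to the stationary setting.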
Hyperparameter semantics#
| Parameter | Controls | Practical guidance |
|---|---|---|
| | Prior precision. Sets \(\boldsymbol{\Lambda}_0 = \alpha\mathbf{I}\). | 1.0 (default). Higher values regularize more. |
| | Likelihood family. | Match to your outcome type. |
| | Posterior approximation strategy. Default is | See the IRLS section above for |
| | Decay factor \(\gamma\). See Choosing and Tuning a Decay Rate. | 1.0 (default) for stationary environments. |
Trade-offs vs. conjugate models#
The Laplace approximation buys flexible likelihoods (logistic, Poisson) at a cost:
- The posterior is Gaussian at the MAP, not exact. Tail behavior and multimodality are lost.
- IRLS iterations are more expensive than a single conjugate update (\(O(\texttt{n\_iter} \cdot p^2 n)\) vs. \(O(p^2 n)\)).
- With n_iter=1, the cost matches a conjugate update, but the approximation is only good when the previous posterior is close to the new MAP.
For Gaussian outcomes, NormalRegressor is exact and cheaper. For binary or count outcomes, NormalRegressor is also viable: by a Bernstein-von Mises argument, the linear model’s posterior still concentrates as data accumulate even under likelihood misspecification (around the best-fitting linear predictor rather than the GLM weights), and in practice it performs well (see the Comparing Bayesian Approaches to Disjoint Linear Bandits example). The GLM buys you a proper likelihood at the cost of IRLS iterations.