Bayesian GLM
============

Bayesian generalized linear model for non-Gaussian likelihoods (binary
and count data). The posterior is not conjugate; it is approximated as
a Gaussian at the MAP via Laplace approximation / IRLS.

See also: :doc:`normal` (exact conjugate, Gaussian likelihood).


Symbols
-------

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Symbol
     - Meaning
   * - :math:`p`
     - Number of features
   * - :math:`n`
     - Number of observations in a batch
   * - :math:`\mathbf{w}`
     - Weight vector (MAP estimate stored as ``coef_``)
   * - :math:`\alpha`
     - Prior precision scalar
   * - :math:`\boldsymbol{\Lambda}`
     - Posterior precision matrix (stored as ``cov_inv_``)
   * - :math:`\boldsymbol{\eta}`
     - Linear predictor :math:`\mathbf{X}\mathbf{w}`
   * - :math:`g^{-1}`
     - Inverse link function (sigmoid or exp)
   * - :math:`\boldsymbol{\mu}`
     - Mean response :math:`g^{-1}(\boldsymbol{\eta})`
   * - :math:`\mathbf{W}`
     - Diagonal IRLS weight matrix
   * - :math:`\mathbf{z}`
     - Working response (pseudo-targets)
   * - :math:`\gamma`
     - Decay factor, in :math:`(0, 1]`


Prior
-----

.. math::

   \mathbf{w} \sim \mathcal{N}(\mathbf{0},\; \alpha^{-1}\mathbf{I})


Likelihood
----------

**Logit link** (Bernoulli):

.. math::

   p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w})
   = \sigma(\mathbf{x}_i^\top \mathbf{w})
   = \frac{1}{1 + \exp(-\mathbf{x}_i^\top \mathbf{w})}

**Log link** (Poisson):

.. math::

   \mathbb{E}[y_i \mid \mathbf{x}_i, \mathbf{w}]
   = \exp(\mathbf{x}_i^\top \mathbf{w})

Neither likelihood is conjugate to the Gaussian prior.

**Reference:** Murphy (2012) Chapter 8 [1]_.


Laplace approximation via IRLS
-------------------------------

The posterior is approximated as:

.. math::

   p(\mathbf{w} \mid \mathcal{D})
   \approx \mathcal{N}(\hat{\mathbf{w}},\;
     \boldsymbol{\Lambda}^{-1})

where :math:`\hat{\mathbf{w}}` is the MAP estimate and
:math:`\boldsymbol{\Lambda}` is the Hessian of the negative
log-posterior evaluated at the MAP.

The MAP is found by iteratively reweighted least squares (IRLS). Each
iteration:

1. Compute the linear predictor and link derivatives:

   .. math::

      \boldsymbol{\eta} &= \mathbf{X}\mathbf{w} \\
      \boldsymbol{\mu} &= g^{-1}(\boldsymbol{\eta}) \\
      \frac{d\boldsymbol{\mu}}{d\boldsymbol{\eta}}
      &= \begin{cases}
        \boldsymbol{\mu}(1 - \boldsymbol{\mu}) & \text{logit} \\
        \boldsymbol{\mu} & \text{log}
      \end{cases}

2. Form IRLS weights and working response:

   .. math::

      \mathbf{W} &= \mathrm{diag}\!\left(
        \frac{d\boldsymbol{\mu}}{d\boldsymbol{\eta}}
      \right) \\
      \mathbf{z} &= \boldsymbol{\eta}
        + \frac{\mathbf{y} - \boldsymbol{\mu}}
               {d\boldsymbol{\mu}/d\boldsymbol{\eta}}

3. Solve the weighted least squares problem:

   .. math::

      \boldsymbol{\Lambda}
      &= \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}
        + \mathbf{X}^\top \mathbf{W} \mathbf{X} \\
      \mathbf{w}
      &\leftarrow \boldsymbol{\Lambda}^{-1}\!\left(
        \gamma^n\,\boldsymbol{\Lambda}_{\text{old}}\,
          \mathbf{w}_{\text{old}}
        + \mathbf{X}^\top (\mathbf{W} \odot \mathbf{z})
      \right)

4. Check convergence:
   :math:`\|\mathbf{w}_{\text{new}} - \mathbf{w}_{\text{old}}\|_\infty < \texttt{tol}`.

The default runs 5 iterations per ``partial_fit`` call
(``LaplaceApproximator(n_iter=5)``). For fast online updates where the
previous posterior is a good initialization, ``n_iter=1`` (a single
Newton step) is often sufficient.


Sampling
--------

Weight vectors are drawn from the Gaussian approximation:

.. math::

   \mathbf{w}_s \sim \mathcal{N}(\hat{\mathbf{w}},\;
     \boldsymbol{\Lambda}^{-1})

Predictions are transformed through the inverse link:

.. math::

   \hat{\mathbf{y}} = g^{-1}(\mathbf{X}\,\mathbf{w}_s)

For the logit link, samples are probabilities in :math:`(0, 1)`. For
the log link, samples are positive rates.


Decay
-----

.. math::

   \boldsymbol{\Lambda} \leftarrow \gamma^n\,\boldsymbol{\Lambda}

Same as :doc:`normal`: scales the precision, widens the posterior,
mean unchanged. See :doc:`/howto/decay`.


Hyperparameter semantics
-------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``alpha``
     - Prior precision. Sets
       :math:`\boldsymbol{\Lambda}_0 = \alpha\mathbf{I}`.
     - 1.0 (default). Higher values regularize more.
   * - ``link``
     - Likelihood family. ``'logit'`` for binary outcomes,
       ``'log'`` for counts.
     - Match to your outcome type.
   * - ``approximator``
     - Posterior approximation strategy. Default is
       ``LaplaceApproximator(n_iter=5, tol=1e-4)``.
     - See the IRLS section above for ``n_iter`` guidance.
   * - ``learning_rate``
     - Decay factor :math:`\gamma`. See :doc:`/howto/decay`.
     - 1.0 (default) for stationary environments.


Trade-offs vs. conjugate models
--------------------------------

The Laplace approximation buys flexible likelihoods (logistic, Poisson)
at a cost:

- The posterior is Gaussian at the MAP, not exact. Tail behavior and
  multimodality are lost.
- IRLS iterations are more expensive than a single conjugate update
  (:math:`O(n\_iter \cdot p^2 n)` vs. :math:`O(p^2 n)`).
- With ``n_iter=1``, the cost matches a conjugate update, but the
  approximation is only good when the previous posterior is close to
  the new MAP.

For Gaussian outcomes, :class:`~bayesianbandits.NormalRegressor` is
exact and cheaper. For binary or count outcomes,
:class:`~bayesianbandits.NormalRegressor` is also viable: by
Bernstein-von Mises, the linear model's posterior concentrates
correctly even under likelihood misspecification, and in practice
it performs well (see the :doc:`/notebooks/linear-bandits` example).
The GLM buys you a proper likelihood at the cost of IRLS iterations.


References
----------

.. [1] Murphy, K. P. (2012). *Machine Learning: A Probabilistic
   Perspective*, Chapter 8. MIT Press.