EmpiricalBayesGammaRegressor#

Tunes the Gamma prior parameters via the Negative Binomial marginal likelihood, using Minka’s EM algorithm [1] with stabilized forgetting to prevent prior collapse under exponential decay.

Builds on the posterior update from Intercept-Only Models, which is inherited unchanged. Read that page first.

Symbols#

Symbol	Meaning
\(\alpha\)	Gamma shape parameter (prior)
\(\beta\)	Gamma rate parameter (prior)
\(\lambda_g\)	Latent rate for group \(g\)
\(c_g\)	Effective count in group \(g\): \(c_g = \alpha_g^{\text{post}} - \alpha\)
\(n_g\)	Effective exposure in group \(g\): \(n_g = \beta_g^{\text{post}} - \beta\)
\(G\)	Number of groups (unique values of the first feature)
\(\psi\)	Digamma function
\(\psi'\)	Trigamma function

Negative Binomial marginal likelihood#

The Gamma-Poisson marginal for \(G\) groups with effective counts \(c_g\) and exposures \(n_g\) under a shared prior \(\text{Gamma}(\alpha, \beta)\):

\[\log p(\text{data} \mid \alpha, \beta) = \sum_{g=1}^{G} \left[ \log\Gamma(c_g + \alpha) - \log\Gamma(\alpha) - \log\Gamma(c_g + 1) + \alpha \log\frac{\beta}{\beta + n_g} + c_g \log\frac{n_g}{\beta + n_g} \right]\]

EM derivation (Minka 2002, section 2.1)#

The marginal likelihood is maximized via EM, treating the per-group rate \(\lambda_g\) as a latent variable [1].

E-step#

Compute posterior moments of \(\lambda_g\) under \(\text{Gamma}(\alpha + c_g,\, \beta + n_g)\):

\[\begin{split}\mathbb{E}[\lambda_g] &= \frac{\alpha + c_g}{\beta + n_g} \\ \mathbb{E}[\log \lambda_g] &= \psi(\alpha + c_g) - \log(\beta + n_g)\end{split}\]

M-step#

Solve the Gamma MLE on expected sufficient statistics.

Rate update (closed-form given \(\alpha\)):

\[\beta = \frac{\alpha}{\bar{\mathbb{E}}[\lambda]}\]

where \(\bar{\mathbb{E}}[\lambda] = \frac{1}{G} \sum_g \mathbb{E}[\lambda_g]\).

Shape update (generalized Newton [1], section 1). The fixed-point equation is:

\[\log \alpha - \psi(\alpha) = \log \bar{\mathbb{E}}[\lambda] - \bar{\mathbb{E}}[\log \lambda]\]

solved by the iteration:

\[\frac{1}{\alpha^{\text{new}}} = \frac{1}{\alpha} + \frac{\log \bar{\mathbb{E}}[\lambda] - \bar{\mathbb{E}}[\log \lambda] - \log \alpha + \psi(\alpha)} {\alpha^2 \left(\frac{1}{\alpha} - \psi'(\alpha)\right)}\]

Counts and exposure extraction#

The effective counts and exposures are derived from the difference between posterior and prior parameters:

\[c_g = \alpha_g^{\text{post}} - \alpha, \qquad n_g = \beta_g^{\text{post}} - \beta\]

With stabilized forgetting (see below), this gives the decayed counts and exposures exactly.

Stabilized forgetting#

The problem. Uniform decay \((\alpha_g, \beta_g) \leftarrow \gamma\,(\alpha_g, \beta_g)\) drives all parameters to zero.

The solution. Every decay (both explicit via decay() and implicit via learning_rate in partial_fit) re-injects the EB-tuned prior:

\[\mathbf{p}_g \leftarrow \gamma^n \, \mathbf{p}_g + (1 - \gamma^n) \, \mathbf{p}^{\text{prior}}\]

where \(\mathbf{p}_g = (\alpha_g, \beta_g)\) and \(n\) is the number of observations. This maintains the invariant:

\[\mathbf{p}_g - \mathbf{p}^{\text{prior}} = \gamma^n \, (c_g, n_g)\]

so effective counts and exposures are correctly decayed and the prior contribution never vanishes.

fit vs. partial_fit#

fit iterates the EM step to convergence on fixed data (up to n_eb_iter times, checking \(|\Delta \log p| < \texttt{eb\_tol}\)), then refits the posterior with the converged prior. partial_fit runs one EM step per call, correcting all group posteriors by the prior change \(\mathbf{p}^{\text{new}} - \mathbf{p}^{\text{old}}\).

Hyperparameter semantics#

Parameter	Controls	Practical guidance
`alpha`	Initial Gamma shape. EB tunes this.	Start at 1.0.
`beta`	Initial Gamma rate. EB tunes this.	Start at 1.0.
`n_eb_iter`	Maximum EM iterations during `fit`.	10 (default). Set to 0 to disable EB during `fit`.
`eb_tol`	Convergence tolerance on log evidence change.	1e-4 (default).
`learning_rate`	Decay factor \(\gamma\). See Choosing and Tuning a Decay Rate.	1.0 (default) for stationary environments.

Robustness to misspecified priors#

The EB prior mean \(\alpha/\beta\) converges to the average group rate regardless of initialization. The prior strength (concentration) converges more slowly but is less critical for predictions.

import numpy as np
from bayesianbandits import EmpiricalBayesGammaRegressor

rng = np.random.default_rng(42)
true_alpha, true_beta = 3.0, 1.0  # true mean rate = 3.0

# Wrong initial prior: mean rate = 0.1
model = EmpiricalBayesGammaRegressor(
    alpha=1.0, beta=10.0, random_state=0
)

for g in range(200):
    rate = rng.gamma(true_alpha, 1.0 / true_beta)
    obs = rng.poisson(rate, size=rng.poisson(10) + 1)
    model.partial_fit(np.full((len(obs), 1), g), obs)

# Recovered prior mean: alpha/beta ~ 3.0
print(f"Prior mean rate: {model.alpha / model.beta:.2f}")