EmpiricalBayesGammaRegressor ============================ Tunes the Gamma prior parameters via the Negative Binomial marginal likelihood, using Minka's EM algorithm [1]_ with stabilized forgetting to prevent prior collapse under exponential decay. Builds on the posterior update from :doc:`intercept-only`, which is inherited unchanged. Read that page first. Symbols ------- .. list-table:: :header-rows: 1 :widths: 15 85 * - Symbol - Meaning * - :math:`\alpha` - Gamma shape parameter (prior) * - :math:`\beta` - Gamma rate parameter (prior) * - :math:`\lambda_g` - Latent rate for group :math:`g` * - :math:`c_g` - Effective count in group :math:`g`: :math:`c_g = \alpha_g^{\text{post}} - \alpha` * - :math:`n_g` - Effective exposure in group :math:`g`: :math:`n_g = \beta_g^{\text{post}} - \beta` * - :math:`G` - Number of groups (unique values of the first feature) * - :math:`\psi` - Digamma function * - :math:`\psi'` - Trigamma function Negative Binomial marginal likelihood -------------------------------------- The Gamma-Poisson marginal for :math:`G` groups with effective counts :math:`c_g` and exposures :math:`n_g` under a shared prior :math:`\text{Gamma}(\alpha, \beta)`: .. math:: \log p(\text{data} \mid \alpha, \beta) = \sum_{g=1}^{G} \left[ \log\Gamma(c_g + \alpha) - \log\Gamma(\alpha) - \log\Gamma(c_g + 1) + \alpha \log\frac{\beta}{\beta + n_g} + c_g \log\frac{n_g}{\beta + n_g} \right] EM derivation (Minka 2002, section 2.1) ---------------------------------------- The marginal likelihood is maximized via EM, treating the per-group rate :math:`\lambda_g` as a latent variable [1]_. E-step ~~~~~~ Compute posterior moments of :math:`\lambda_g` under :math:`\text{Gamma}(\alpha + c_g,\, \beta + n_g)`: .. math:: \mathbb{E}[\lambda_g] &= \frac{\alpha + c_g}{\beta + n_g} \\ \mathbb{E}[\log \lambda_g] &= \psi(\alpha + c_g) - \log(\beta + n_g) M-step ~~~~~~ Solve the Gamma MLE on expected sufficient statistics. **Rate update** (closed-form given :math:`\alpha`): .. math:: \beta = \frac{\alpha}{\bar{\mathbb{E}}[\lambda]} where :math:`\bar{\mathbb{E}}[\lambda] = \frac{1}{G} \sum_g \mathbb{E}[\lambda_g]`. **Shape update** (generalized Newton [1]_, section 1). The fixed-point equation is: .. math:: \log \alpha - \psi(\alpha) = \log \bar{\mathbb{E}}[\lambda] - \bar{\mathbb{E}}[\log \lambda] solved by the iteration: .. math:: \frac{1}{\alpha^{\text{new}}} = \frac{1}{\alpha} + \frac{\log \bar{\mathbb{E}}[\lambda] - \bar{\mathbb{E}}[\log \lambda] - \log \alpha + \psi(\alpha)} {\alpha^2 \left(\frac{1}{\alpha} - \psi'(\alpha)\right)} Counts and exposure extraction ------------------------------- The effective counts and exposures are derived from the difference between posterior and prior parameters: .. math:: c_g = \alpha_g^{\text{post}} - \alpha, \qquad n_g = \beta_g^{\text{post}} - \beta With stabilized forgetting (see below), this gives the decayed counts and exposures exactly. Stabilized forgetting --------------------- **The problem.** Uniform decay :math:`(\alpha_g, \beta_g) \leftarrow \gamma\,(\alpha_g, \beta_g)` drives all parameters to zero. **The solution.** Every decay (both explicit via ``decay()`` and implicit via ``learning_rate`` in ``partial_fit``) re-injects the EB-tuned prior: .. math:: \mathbf{p}_g \leftarrow \gamma^n \, \mathbf{p}_g + (1 - \gamma^n) \, \mathbf{p}^{\text{prior}} where :math:`\mathbf{p}_g = (\alpha_g, \beta_g)` and :math:`n` is the number of observations. This maintains the invariant: .. math:: \mathbf{p}_g - \mathbf{p}^{\text{prior}} = \gamma^n \, (c_g, n_g) so effective counts and exposures are correctly decayed and the prior contribution never vanishes. fit vs. partial_fit ------------------- ``fit`` iterates the EM step to convergence on fixed data (up to ``n_eb_iter`` times, checking :math:`|\Delta \log p| < \texttt{eb\_tol}`), then refits the posterior with the converged prior. ``partial_fit`` runs one EM step per call, correcting all group posteriors by the prior change :math:`\mathbf{p}^{\text{new}} - \mathbf{p}^{\text{old}}`. Hyperparameter semantics ------------------------- .. list-table:: :header-rows: 1 :widths: 20 50 30 * - Parameter - Controls - Practical guidance * - ``alpha`` - Initial Gamma shape. EB tunes this. - Start at 1.0. * - ``beta`` - Initial Gamma rate. EB tunes this. - Start at 1.0. * - ``n_eb_iter`` - Maximum EM iterations during ``fit``. - 10 (default). Set to 0 to disable EB during ``fit``. * - ``eb_tol`` - Convergence tolerance on log evidence change. - 1e-4 (default). * - ``learning_rate`` - Decay factor :math:`\gamma`. See :doc:`/howto/decay`. - 1.0 (default) for stationary environments. Robustness to misspecified priors ---------------------------------- The EB prior mean :math:`\alpha/\beta` converges to the average group rate regardless of initialization. The prior strength (concentration) converges more slowly but is less critical for predictions. .. code-block:: python import numpy as np from bayesianbandits import EmpiricalBayesGammaRegressor rng = np.random.default_rng(42) true_alpha, true_beta = 3.0, 1.0 # true mean rate = 3.0 # Wrong initial prior: mean rate = 0.1 model = EmpiricalBayesGammaRegressor( alpha=1.0, beta=10.0, random_state=0 ) for g in range(200): rate = rng.gamma(true_alpha, 1.0 / true_beta) obs = rng.poisson(rate, size=rng.poisson(10) + 1) model.partial_fit(np.full((len(obs), 1), g), obs) # Recovered prior mean: alpha/beta ~ 3.0 print(f"Prior mean rate: {model.alpha / model.beta:.2f}") References ---------- .. [1] Minka, T. P. (2002). "Estimating a Gamma distribution." Microsoft Research Technical Report.