NormalRegressor
===============

Bayesian linear regression with known noise variance (known :math:`\beta`).
Gaussian prior on weights, exact conjugate updates in the precision
parameterization.

See also: :doc:`normal-inverse-gamma` (unknown variance),
:doc:`empirical-bayes` (automatic hyperparameter tuning).

Symbols
-------

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Symbol
     - Meaning
   * - :math:`p`
     - Number of features (dimensionality of the weight vector)
   * - :math:`n`
     - Number of observations in a batch
   * - :math:`\mathbf{w}`
     - Weight vector (posterior mean stored as ``coef_``)
   * - :math:`\alpha`
     - Prior precision scalar
   * - :math:`\beta`
     - Noise precision (inverse noise variance, :math:`1/\sigma^2`)
   * - :math:`\boldsymbol{\Lambda}`
     - Posterior precision matrix (stored as ``cov_inv_``)
   * - :math:`\boldsymbol{\mu}`
     - Posterior mean of the weight vector (stored as ``coef_``)
   * - :math:`\gamma`
     - Learning rate / decay factor, in :math:`(0, 1]`
   * - :math:`\mathbf{W}`
     - Diagonal matrix of effective sample weights
   * - :math:`w_i`
     - User-supplied sample weight for observation :math:`i`

Generative model
----------------

**Prior:**

.. math::

   \mathbf{w} \sim \mathcal{N}(\mathbf{0},\; \alpha^{-1} \mathbf{I})

The prior precision matrix is :math:`\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}`
and the prior mean is :math:`\boldsymbol{\mu}_0 = \mathbf{0}`.

**Likelihood:**

.. math::

   y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{w},\; \beta^{-1})

Here :math:`\beta` is the noise *precision* (inverse variance), not the
variance itself.

**References:** Bishop (2006) Section 3.3 [1]_, Murphy (2012) Chapter 7 [2]_.

Update equations
----------------

The posterior is Gaussian:
:math:`\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1})`.

**Precision update:**

.. math::

   \boldsymbol{\Lambda}_n = \gamma^n \,\boldsymbol{\Lambda}_{\text{old}}
       + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{X}

**Mean update (via the information vector):**

.. math::

   \boldsymbol{\eta}_n &= \gamma^n \,\boldsymbol{\Lambda}_{\text{old}}\,\boldsymbol{\mu}_{\text{old}}
       + \beta\, \mathbf{X}^\top \mathbf{W} \mathbf{y} \\
   \boldsymbol{\mu}_n &= \boldsymbol{\Lambda}_n^{-1}\, \boldsymbol{\eta}_n

The implementation stores :math:`\boldsymbol{\Lambda}` and computes
:math:`\boldsymbol{\eta}` as an intermediate; the mean is recovered by a
single Cholesky solve.

When :math:`\gamma = 1` (the default), these reduce to the standard conjugate
update.

Effective weights within a batch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When :math:`\gamma < 1` and a batch of :math:`n` observations is processed in
a single ``partial_fit`` call, each observation receives an effective weight
that depends on its position in the batch:

.. math::

   w_{\text{eff}, i} = w_i \cdot \gamma^{\,n - 1 - i},
   \qquad i = 0, 1, \ldots, n{-}1

Earlier observations in the batch are decayed more than later ones. This
ensures that processing a batch of :math:`n` observations in one call gives
the same posterior as processing them one at a time with decay between each.

When :math:`\gamma = 1`, all effective weights equal the user-supplied weights
(or 1 if none are provided).

Sampling
--------

The ``sample`` method draws weight vectors from the posterior and projects
them through the design matrix:

.. math::

   \mathbf{w}_s \sim \mathcal{N}(\boldsymbol{\mu}_n,\; \boldsymbol{\Lambda}_n^{-1}),
   \qquad
   \hat{\mathbf{y}} = \mathbf{X}\,\mathbf{w}_s

Sampling from the precision parameterization uses
:math:`\mathbf{w}_s = \boldsymbol{\mu}_n + \mathbf{L}^{-\top}\mathbf{z}`
where :math:`\mathbf{L}\mathbf{L}^\top = \boldsymbol{\Lambda}_n` and
:math:`\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})`.

.. note::

   ``sample`` produces draws of the *expected reward*
   :math:`\mathbf{X}\mathbf{w}_s`, not noisy reward realizations.
   Observation noise :math:`\varepsilon \sim \mathcal{N}(0, \beta^{-1})` is
   not added. Thompson sampling needs samples of the expected value under
   parameter uncertainty, not noisy outcomes.

Standalone decay
----------------

The ``decay`` method scales the precision matrix without observing new data:

.. math::

   \boldsymbol{\Lambda} \leftarrow \gamma^n \,\boldsymbol{\Lambda}

where :math:`n` is the number of rows in the ``X`` argument. The posterior
mean is unchanged because
:math:`(\gamma^n \boldsymbol{\Lambda})^{-1}(\gamma^n \boldsymbol{\Lambda}\,\boldsymbol{\mu}) = \boldsymbol{\mu}`.

This is equivalent to a random-walk state-space model where the transition
shrinks the effective precision by :math:`\gamma` per time step. See
:doc:`/howto/decay` for practical guidance.

Hyperparameter semantics
-------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``alpha``
     - Prior precision. Sets :math:`\boldsymbol{\Lambda}_0 = \alpha \mathbf{I}`.
       Higher values give stronger regularization toward zero.
     - Start with 1.0. Range: 0.01 to 100.
   * - ``beta``
     - Noise precision :math:`1/\sigma^2`. Scales the data contribution to the
       precision matrix. Higher values mean observations are trusted more
       (lower noise).
     - Set to :math:`1/\sigma^2` if the noise scale is known. Otherwise
       consider :class:`~bayesianbandits.NormalInverseGammaRegressor`.
   * - ``learning_rate``
     - Decay factor :math:`\gamma`. Applied per observation during
       ``partial_fit`` and per row during ``decay``.
     - 1.0 (default) for stationary environments. See :doc:`/howto/decay` for
       tuning advice.

References
----------

.. [1] Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*,
   Section 3.3. Springer.
.. [2] Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*,
   Chapter 7. MIT Press.
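
Numerical sketch
----------------

The update, effective-weight, and sampling formulas on this page can be
checked numerically. The sketch below is an illustration written in plain
NumPy, not the library's implementation: the helpers ``update`` and
``sample_weights`` are hypothetical names, and the data and hyperparameter
values are made up. It applies the precision and information-vector updates
once to a whole batch and once row by row, then draws weight vectors through
the Cholesky factor of the precision matrix.

.. code-block:: python

   import numpy as np


   def update(Lam_old, mu_old, X, y, beta, gamma=1.0, sample_weight=None):
       """Hypothetical helper: one batch update in the precision form."""
       n = X.shape[0]
       w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
       # Effective weights w_i * gamma^(n - 1 - i): earlier rows decay more.
       w_eff = w * gamma ** np.arange(n - 1, -1, -1)
       decay = gamma ** n
       # Lambda_n = gamma^n Lambda_old + beta X^T W X
       Lam_new = decay * Lam_old + beta * (X.T * w_eff) @ X
       # eta_n = gamma^n Lambda_old mu_old + beta X^T W y
       eta_new = decay * (Lam_old @ mu_old) + beta * X.T @ (w_eff * y)
       mu_new = np.linalg.solve(Lam_new, eta_new)
       return Lam_new, mu_new


   def sample_weights(Lam, mu, size, rng):
       """Hypothetical helper: draw w_s = mu + L^{-T} z with L L^T = Lam."""
       L = np.linalg.cholesky(Lam)  # lower-triangular factor
       z = rng.standard_normal((mu.shape[0], size))
       return (mu[:, None] + np.linalg.solve(L.T, z)).T  # shape (size, p)


   rng = np.random.default_rng(0)
   p, n = 3, 5
   alpha, beta, gamma = 1.0, 4.0, 0.9  # illustrative values only
   X = rng.normal(size=(n, p))
   y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=beta ** -0.5, size=n)

   Lam0, mu0 = alpha * np.eye(p), np.zeros(p)

   # One call on the whole batch.
   Lam_batch, mu_batch = update(Lam0, mu0, X, y, beta, gamma)

   # The same rows processed one at a time, with decay between each.
   Lam_seq, mu_seq = Lam0, mu0
   for i in range(n):
       Lam_seq, mu_seq = update(Lam_seq, mu_seq, X[i:i + 1], y[i:i + 1], beta, gamma)

   batch_matches_sequential = (
       np.allclose(Lam_batch, Lam_seq) and np.allclose(mu_batch, mu_seq)
   )

   # Draws of the expected reward X w_s; no observation noise is added.
   draws = sample_weights(Lam_batch, mu_batch, size=4, rng=rng)
   expected_rewards = X @ draws.T  # shape (n, 4)

Under these assumptions ``batch_matches_sequential`` should be true up to
floating-point tolerance, which is the property the effective weights are
designed to give.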