Exploration Policies
====================

A policy maps posterior samples to arm selections. All four share a
common interface:

1. Draw posterior predictive samples for each arm:
   :math:`\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a)`,
   then :math:`\tilde{r}_a = g_a(\tilde{\theta}_a, \mathbf{x}_t)`
   where :math:`g_a` is the reward function.
2. Compute a decision statistic from those samples (policy-specific).
3. Select the arm that maximizes the statistic.

The update rule is shared by Thompson sampling, UCB, and
epsilon-greedy: call ``arm.update(X, y)`` directly. EXP3A overrides
this with importance-weighted updates.

.. note::

   The regret bounds listed below are proven for stationary,
   non-contextual bandits with exact conjugate posteriors. This library
   targets contextual, non-stationary settings with approximate
   posteriors and decay. The formal guarantees do not directly apply.

.. list-table:: Samples per arm
   :header-rows: 1
   :widths: 30 20 50

   * - Policy
     - Samples
     - Reason
   * - :class:`~bayesianbandits.ThompsonSampling`
     - 1
     - One posterior draw suffices for probability matching
   * - :class:`~bayesianbandits.UpperConfidenceBound`
     - 1000
     - Monte Carlo quantile estimation needs many draws
   * - :class:`~bayesianbandits.EpsilonGreedy`
     - 1000
     - Accurate posterior mean via Monte Carlo
   * - :class:`~bayesianbandits.EXP3A`
     - 100
     - Only needs mean estimates; 100 draws are typically sufficient


Thompson Sampling
-----------------

**Selection rule.** Draw a single sample from each arm's posterior
predictive and select the arm with the highest sample:

.. math::

   \tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a)
   \quad \forall a, \qquad
   a^* = \arg\max_a \; g_a(\tilde{\theta}_a, \mathbf{x}_t)

This is *probability matching*: each arm is selected with probability
equal to its posterior probability of being optimal.

**Hyperparameters.** None.

**Update rule.** Default: ``arm.update(X, y)``.

**Regret bounds.** For the :math:`K`-armed stochastic bandit with
exact conjugate posteriors [1]_ [2]_:

.. math::

   \mathbb{E}[\mathrm{Regret}(T)]
   = O\!\left(\sqrt{KT \ln K}\right)

with an asymptotically optimal problem-dependent bound of
:math:`O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)`
matching the Lai-Robbins lower bound.

Upper Confidence Bound
----------------------

Strictly, this is an upper *credible* bound (a posterior quantile),
not the frequentist confidence bound of classical UCB. The name is
conventional.

**Selection rule.** Compute the :math:`\alpha`-quantile of each arm's
posterior predictive and select the arm with the highest quantile:

.. math::

   a^* = \arg\max_a \; Q_\alpha\!\left(
     g_a(\theta_a, \mathbf{x}_t) \mid \mathcal{D}_a
   \right)

The quantile is estimated via Monte Carlo: draw :math:`S` samples per
arm, compute ``np.quantile(samples, alpha)``, then argmax.

**Hyperparameters.**

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``alpha``
     - Quantile level (default 0.68). Controls optimism.
       :math:`\alpha = 0.5` is the median (greedy on the posterior
       mean). :math:`\alpha = 0.68` is roughly one standard deviation
       for Gaussian posteriors. :math:`\alpha = 0.95` is aggressive
       exploration.
     - Start with 0.68. Increase for more exploration.
   * - ``samples``
     - Number of Monte Carlo draws per arm (default 1000).
     - 1000 is usually sufficient. Reduce for lower latency if
       quantile accuracy is not critical.

**Update rule.** Default: ``arm.update(X, y)``.

**Regret bounds.** With a quantile schedule
:math:`\alpha_t = 1 - 1/(t \log^c(T))`, Bayesian UCB achieves [3]_:

.. math::

   \mathbb{E}[\mathrm{Regret}(T)]
   = O\!\left(\sum_{a:\Delta_a>0}
   \frac{\ln T}{\Delta_a}\right)

With a fixed :math:`\alpha` (as implemented), the problem-dependent
bound is not guaranteed, but a minimax rate of
:math:`O(\sqrt{KT \ln T})` holds under mild conditions.

Epsilon-Greedy
--------------

**Selection rule.** Exploit the posterior mean with probability
:math:`1 - \varepsilon`, explore uniformly at random with probability
:math:`\varepsilon`:

.. math::

   a_t =
   \begin{cases}
   \displaystyle\arg\max_a \; \hat{\mu}_a(\mathbf{x}_t)
     & \text{with probability } 1 - \varepsilon \\[4pt]
   \text{Uniform}\{1, \ldots, K\}
     & \text{with probability } \varepsilon
   \end{cases}

where :math:`\hat{\mu}_a(\mathbf{x}_t)` is the Monte Carlo posterior
mean for arm :math:`a` in context :math:`\mathbf{x}_t`.

**Hyperparameters.**

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``epsilon``
     - Exploration probability (default 0.1).
       :math:`\varepsilon = 0` is pure greedy (no exploration).
       :math:`\varepsilon = 1` is pure random.
     - 0.01 to 0.2 is typical. Start with 0.1.
   * - ``samples``
     - Number of Monte Carlo draws per arm (default 1000).
     - 1000 is usually sufficient for accurate mean estimation.

**Update rule.** Default: ``arm.update(X, y)``.

**Regret bounds.** For fixed :math:`\varepsilon` [4]_:

.. math::

   \mathbb{E}[\mathrm{Regret}(T)]
   \le \varepsilon\,T\,\Delta_{\max}
   + \sum_{a:\Delta_a>0} \frac{C}{\Delta_a}

With a decaying schedule :math:`\varepsilon_t = O(K/t)`, the minimax
rate :math:`O(\sqrt{KT})` is achievable. This implementation uses a
fixed :math:`\varepsilon` (no schedule).


EXP3A
------

Standard EXP3 [4]_ is noncontextual and model-free: it maintains its
own cumulative weight tables with no underlying belief system. This
implementation keeps the parts that give adversarial robustness
(uniform mixing, importance-weighted updates) but replaces the weight
tables with Bayesian posterior means. That swap is what makes it
contextual, compatible with decay, and anytime: posterior means don't
grow with :math:`T`, so :math:`\eta` needs no horizon-dependent
tuning. The default variant is EXP3-IX [5]_. The
:doc:`/notebooks/adversarial` notebook demonstrates both adversarial
robustness and exploitation of a weak opponent.

Unlike the other policies, EXP3A overrides the update rule with
importance-weighted observations.


Selection rule
~~~~~~~~~~~~~~

1. Compute posterior mean rewards via Monte Carlo:

   .. math::

      \hat{\mu}_a(\mathbf{x}_t)
      = \frac{1}{S} \sum_{s=1}^{S}
        g_a(\tilde{\theta}_a^{(s)}, \mathbf{x}_t)

2. Compute exponential weights (with log-sum-exp stability):

   .. math::

      w_a = \exp\!\bigl(\eta \cdot
        (\hat{\mu}_a - \max_j \hat{\mu}_j)\bigr)

3. Mix with uniform exploration:

   .. math::

      P_t(a) = (1 - \gamma_{\text{forced}})
        \frac{w_a}{\sum_j w_j}
      + \frac{\gamma_{\text{forced}}}{K}

4. Sample arm :math:`a_t` according to :math:`P_t`.

When :math:`\gamma_{\text{forced}} = 0` (default), this is a pure
softmax over posterior means.


Update rule (importance-weighted)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

EXP3A overrides the default update. After observing reward :math:`y`
for arm :math:`a_t`:

1. Recompute the selection probabilities :math:`P_t(a)` using the
   current posterior (stateless design).

2. Compute the importance weight:

   .. math::

      w_t = \frac{1}{P_t(a_t) + \gamma_{\text{ix}}}

3. Update the selected arm with the weighted observation:

   .. math::

      \text{arm}_{a_t}.\text{update}(\mathbf{X},\, \mathbf{y},\,
        \text{sample\_weight} = w_t)

The importance weighting corrects for selection bias: arms pulled
frequently get down-weighted, arms pulled rarely get up-weighted.


Two variants
~~~~~~~~~~~~

**EXP3-IX** (default): ``gamma=0``, ``ix_gamma=eta/2``. No forced
exploration; the IX regularization term :math:`\gamma_{\text{ix}}`
in the denominator bounds importance weights at
:math:`1/\gamma_{\text{ix}}`, preventing catastrophic updates when an
arm has very low selection probability. Recommended by Neu (2015) [5]_
for better empirical performance and high-probability bounds.

**Standard EXP3**: ``gamma>0``, ``ix_gamma=0``. Forced exploration via
:math:`\gamma`-mixing ensures every arm has at least
:math:`\gamma/K` selection probability. Unbiased importance weights
(no IX regularization). Original formulation from Auer et al. (2002)
[4]_.


Hyperparameters
~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 50 30

   * - Parameter
     - Controls
     - Practical guidance
   * - ``gamma``
     - Forced exploration rate (default 0.0). Each arm is guaranteed
       at least :math:`\gamma/K` selection probability.
     - 0.0 for EXP3-IX (recommended). Set > 0 for standard EXP3.
   * - ``eta``
     - Temperature for exponential weights (default 1.0). Higher
       values create sharper distinctions between arms (more
       exploitation).
     - Must be calibrated to reward scale. If rewards are in
       :math:`[0, 1]`, :math:`\eta = 1.0` is reasonable. If rewards
       are in :math:`[0, 100]`, use :math:`\eta = 0.01`.
   * - ``ix_gamma``
     - IX regularization (default ``eta/2``). Bounds importance
       weights at :math:`1/\gamma_{\text{ix}}`.
     - Use the default (``eta/2``) for EXP3-IX. Set to 0 for
       standard EXP3 (unbiased weights).
   * - ``samples``
     - Number of Monte Carlo draws per arm (default 100).
     - 100 is usually sufficient for mean estimation.


Regret bounds
~~~~~~~~~~~~~

EXP3-IX achieves a high-probability bound of [5]_:

.. math::

   \mathrm{Regret}(T) = O\!\left(\sqrt{KT \ln K}\right)

Standard EXP3 achieves the same minimax rate in expectation [4]_.

References
----------

.. [1] Chapelle, O. & Li, L. (2011). "An empirical evaluation of
   Thompson sampling." *NeurIPS 24*, 2249--2257.

.. [2] Agrawal, S. & Goyal, N. (2012). "Analysis of Thompson sampling
   for the multi-armed bandit problem." *COLT*, JMLR W&CP 23,
   39.1--39.26.

.. [3] Kaufmann, E., Cappe, O., & Garivier, A. (2012). "On Bayesian
   upper confidence bounds for bandit problems." *AISTATS*, JMLR W&CP
   22, 592--600.

.. [4] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002).
   "Finite-time analysis of the multiarmed bandit problem."
   *Machine Learning*, 47(2--3), 235--256.

.. [5] Neu, G. (2015). "Explore no more: Improved high-probability
   regret bounds for non-stochastic bandits." *NeurIPS 28*,
   3168--3176.