Exploration Policies#

A policy maps posterior samples to arm selections. All four share a common interface:

  1. Draw posterior predictive samples for each arm: \(\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a)\), then \(\tilde{r}_a = g_a(\tilde{\theta}_a, \mathbf{x}_t)\) where \(g_a\) is the reward function.

  2. Compute a decision statistic from those samples (policy-specific).

  3. Select the arm that maximizes the statistic.

The update rule is shared by Thompson sampling, UCB, and epsilon-greedy: call arm.update(X, y) directly. EXP3A overrides this with importance-weighted updates.

Note

The regret bounds listed below are proven for stationary, non-contextual bandits with exact conjugate posteriors. This library targets contextual, non-stationary settings with approximate posteriors and decay. The formal guarantees do not directly apply.

Samples per arm#

Policy

Samples

Reason

ThompsonSampling

1

One posterior draw suffices for probability matching

UpperConfidenceBound

1000

Monte Carlo quantile estimation needs many draws

EpsilonGreedy

1000

Accurate posterior mean via Monte Carlo

EXP3A

100

Only needs mean estimates; 100 draws are typically sufficient

Thompson Sampling#

Selection rule. Draw a single sample from each arm’s posterior predictive and select the arm with the highest sample:

\[\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a) \quad \forall a, \qquad a^* = \arg\max_a \; g_a(\tilde{\theta}_a, \mathbf{x}_t)\]

This is probability matching: each arm is selected with probability equal to its posterior probability of being optimal.

Hyperparameters. None.

Update rule. Default: arm.update(X, y).

Regret bounds. For the \(K\)-armed stochastic bandit with exact conjugate posteriors [1] [2]:

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sqrt{KT \ln K}\right)\]

with an asymptotically optimal problem-dependent bound of \(O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\) matching the Lai-Robbins lower bound.

Upper Confidence Bound#

Strictly, this is an upper credible bound (a posterior quantile), not the frequentist confidence bound of classical UCB. The name is conventional.

Selection rule. Compute the \(\alpha\)-quantile of each arm’s posterior predictive and select the arm with the highest quantile:

\[a^* = \arg\max_a \; Q_\alpha\!\left( g_a(\theta_a, \mathbf{x}_t) \mid \mathcal{D}_a \right)\]

The quantile is estimated via Monte Carlo: draw \(S\) samples per arm, compute np.quantile(samples, alpha), then argmax.

Hyperparameters.

Parameter

Controls

Practical guidance

alpha

Quantile level (default 0.68). Controls optimism. \(\alpha = 0.5\) is the median (greedy on the posterior mean). \(\alpha = 0.68\) is roughly one standard deviation for Gaussian posteriors. \(\alpha = 0.95\) is aggressive exploration.

Start with 0.68. Increase for more exploration.

samples

Number of Monte Carlo draws per arm (default 1000).

1000 is usually sufficient. Reduce for lower latency if quantile accuracy is not critical.

Update rule. Default: arm.update(X, y).

Regret bounds. With a quantile schedule \(\alpha_t = 1 - 1/(t \log^c(T))\), Bayesian UCB achieves [3]:

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\]

With a fixed \(\alpha\) (as implemented), the problem-dependent bound is not guaranteed, but a minimax rate of \(O(\sqrt{KT \ln T})\) holds under mild conditions.

Epsilon-Greedy#

Selection rule. Exploit the posterior mean with probability \(1 - \varepsilon\), explore uniformly at random with probability \(\varepsilon\):

\[\begin{split}a_t = \begin{cases} \displaystyle\arg\max_a \; \hat{\mu}_a(\mathbf{x}_t) & \text{with probability } 1 - \varepsilon \\[4pt] \text{Uniform}\{1, \ldots, K\} & \text{with probability } \varepsilon \end{cases}\end{split}\]

where \(\hat{\mu}_a(\mathbf{x}_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(\mathbf{x}_t\).

Hyperparameters.

Parameter

Controls

Practical guidance

epsilon

Exploration probability (default 0.1). \(\varepsilon = 0\) is pure greedy (no exploration). \(\varepsilon = 1\) is pure random.

0.01 to 0.2 is typical. Start with 0.1.

samples

Number of Monte Carlo draws per arm (default 1000).

1000 is usually sufficient for accurate mean estimation.

Update rule. Default: arm.update(X, y).

Regret bounds. For fixed \(\varepsilon\) [4]:

\[\mathbb{E}[\mathrm{Regret}(T)] \le \varepsilon\,T\,\Delta_{\max} + \sum_{a:\Delta_a>0} \frac{C}{\Delta_a}\]

With a decaying schedule \(\varepsilon_t = O(K/t)\), the minimax rate \(O(\sqrt{KT})\) is achievable. This implementation uses a fixed \(\varepsilon\) (no schedule).

EXP3A#

Standard EXP3 [4] is noncontextual and model-free: it maintains its own cumulative weight tables with no underlying belief system. This implementation keeps the parts that give adversarial robustness (uniform mixing, importance-weighted updates) but replaces the weight tables with Bayesian posterior means. That swap is what makes it contextual, compatible with decay, and anytime: posterior means don’t grow with \(T\), so \(\eta\) needs no horizon-dependent tuning. The default variant is EXP3-IX [5]. The Adversarial Bandits: Playing Rock-Paper-Scissors Against an Exploitative Opponent notebook demonstrates both adversarial robustness and exploitation of a weak opponent.

Unlike the other policies, EXP3A overrides the update rule with importance-weighted observations.

Selection rule#

  1. Compute posterior mean rewards via Monte Carlo:

    \[\hat{\mu}_a(\mathbf{x}_t) = \frac{1}{S} \sum_{s=1}^{S} g_a(\tilde{\theta}_a^{(s)}, \mathbf{x}_t)\]
  2. Compute exponential weights (with log-sum-exp stability):

    \[w_a = \exp\!\bigl(\eta \cdot (\hat{\mu}_a - \max_j \hat{\mu}_j)\bigr)\]
  3. Mix with uniform exploration:

    \[P_t(a) = (1 - \gamma_{\text{forced}}) \frac{w_a}{\sum_j w_j} + \frac{\gamma_{\text{forced}}}{K}\]
  4. Sample arm \(a_t\) according to \(P_t\).

When \(\gamma_{\text{forced}} = 0\) (default), this is a pure softmax over posterior means.

Update rule (importance-weighted)#

EXP3A overrides the default update. After observing reward \(y\) for arm \(a_t\):

  1. Recompute the selection probabilities \(P_t(a)\) using the current posterior (stateless design).

  2. Compute the importance weight:

    \[w_t = \frac{1}{P_t(a_t) + \gamma_{\text{ix}}}\]
  3. Update the selected arm with the weighted observation:

    \[\text{arm}_{a_t}.\text{update}(\mathbf{X},\, \mathbf{y},\, \text{sample\_weight} = w_t)\]

The importance weighting corrects for selection bias: arms pulled frequently get down-weighted, arms pulled rarely get up-weighted.

Two variants#

EXP3-IX (default): gamma=0, ix_gamma=eta/2. No forced exploration; the IX regularization term \(\gamma_{\text{ix}}\) in the denominator bounds importance weights at \(1/\gamma_{\text{ix}}\), preventing catastrophic updates when an arm has very low selection probability. Recommended by Neu (2015) [5] for better empirical performance and high-probability bounds.

Standard EXP3: gamma>0, ix_gamma=0. Forced exploration via \(\gamma\)-mixing ensures every arm has at least \(\gamma/K\) selection probability. Unbiased importance weights (no IX regularization). Original formulation from Auer et al. (2002) [4].

Hyperparameters#

Parameter

Controls

Practical guidance

gamma

Forced exploration rate (default 0.0). Each arm is guaranteed at least \(\gamma/K\) selection probability.

0.0 for EXP3-IX (recommended). Set > 0 for standard EXP3.

eta

Temperature for exponential weights (default 1.0). Higher values create sharper distinctions between arms (more exploitation).

Must be calibrated to reward scale. If rewards are in \([0, 1]\), \(\eta = 1.0\) is reasonable. If rewards are in \([0, 100]\), use \(\eta = 0.01\).

ix_gamma

IX regularization (default eta/2). Bounds importance weights at \(1/\gamma_{\text{ix}}\).

Use the default (eta/2) for EXP3-IX. Set to 0 for standard EXP3 (unbiased weights).

samples

Number of Monte Carlo draws per arm (default 100).

100 is usually sufficient for mean estimation.

Regret bounds#

EXP3-IX achieves a high-probability bound of [5]:

\[\mathrm{Regret}(T) = O\!\left(\sqrt{KT \ln K}\right)\]

Standard EXP3 achieves the same minimax rate in expectation [4].

References#