Exploration Policies#
A policy maps posterior samples to arm selections. All four share a common interface:
Draw posterior predictive samples for each arm: \(\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a)\), then \(\tilde{r}_a = g_a(\tilde{\theta}_a, \mathbf{x}_t)\) where \(g_a\) is the reward function.
Compute a decision statistic from those samples (policy-specific).
Select the arm that maximizes the statistic.
The update rule is shared by Thompson sampling, UCB, and
epsilon-greedy: call arm.update(X, y) directly. EXP3A overrides
this with importance-weighted updates.
Note
The regret bounds listed below are proven for stationary, non-contextual bandits with exact conjugate posteriors. This library targets contextual, non-stationary settings with approximate posteriors and decay. The formal guarantees do not directly apply.
Policy |
Samples |
Reason |
|---|---|---|
1 |
One posterior draw suffices for probability matching |
|
1000 |
Monte Carlo quantile estimation needs many draws |
|
1000 |
Accurate posterior mean via Monte Carlo |
|
100 |
Only needs mean estimates; 100 draws are typically sufficient |
Thompson Sampling#
Selection rule. Draw a single sample from each arm’s posterior predictive and select the arm with the highest sample:
This is probability matching: each arm is selected with probability equal to its posterior probability of being optimal.
Hyperparameters. None.
Update rule. Default: arm.update(X, y).
Regret bounds. For the \(K\)-armed stochastic bandit with exact conjugate posteriors [1] [2]:
with an asymptotically optimal problem-dependent bound of \(O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\) matching the Lai-Robbins lower bound.
Upper Confidence Bound#
Strictly, this is an upper credible bound (a posterior quantile), not the frequentist confidence bound of classical UCB. The name is conventional.
Selection rule. Compute the \(\alpha\)-quantile of each arm’s posterior predictive and select the arm with the highest quantile:
The quantile is estimated via Monte Carlo: draw \(S\) samples per
arm, compute np.quantile(samples, alpha), then argmax.
Hyperparameters.
Parameter |
Controls |
Practical guidance |
|---|---|---|
|
Quantile level (default 0.68). Controls optimism. \(\alpha = 0.5\) is the median (greedy on the posterior mean). \(\alpha = 0.68\) is roughly one standard deviation for Gaussian posteriors. \(\alpha = 0.95\) is aggressive exploration. |
Start with 0.68. Increase for more exploration. |
|
Number of Monte Carlo draws per arm (default 1000). |
1000 is usually sufficient. Reduce for lower latency if quantile accuracy is not critical. |
Update rule. Default: arm.update(X, y).
Regret bounds. With a quantile schedule \(\alpha_t = 1 - 1/(t \log^c(T))\), Bayesian UCB achieves [3]:
With a fixed \(\alpha\) (as implemented), the problem-dependent bound is not guaranteed, but a minimax rate of \(O(\sqrt{KT \ln T})\) holds under mild conditions.
Epsilon-Greedy#
Selection rule. Exploit the posterior mean with probability \(1 - \varepsilon\), explore uniformly at random with probability \(\varepsilon\):
where \(\hat{\mu}_a(\mathbf{x}_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(\mathbf{x}_t\).
Hyperparameters.
Parameter |
Controls |
Practical guidance |
|---|---|---|
|
Exploration probability (default 0.1). \(\varepsilon = 0\) is pure greedy (no exploration). \(\varepsilon = 1\) is pure random. |
0.01 to 0.2 is typical. Start with 0.1. |
|
Number of Monte Carlo draws per arm (default 1000). |
1000 is usually sufficient for accurate mean estimation. |
Update rule. Default: arm.update(X, y).
Regret bounds. For fixed \(\varepsilon\) [4]:
With a decaying schedule \(\varepsilon_t = O(K/t)\), the minimax rate \(O(\sqrt{KT})\) is achievable. This implementation uses a fixed \(\varepsilon\) (no schedule).
EXP3A#
Standard EXP3 [4] is noncontextual and model-free: it maintains its own cumulative weight tables with no underlying belief system. This implementation keeps the parts that give adversarial robustness (uniform mixing, importance-weighted updates) but replaces the weight tables with Bayesian posterior means. That swap is what makes it contextual, compatible with decay, and anytime: posterior means don’t grow with \(T\), so \(\eta\) needs no horizon-dependent tuning. The default variant is EXP3-IX [5]. The Adversarial Bandits: Playing Rock-Paper-Scissors Against an Exploitative Opponent notebook demonstrates both adversarial robustness and exploitation of a weak opponent.
Unlike the other policies, EXP3A overrides the update rule with importance-weighted observations.
Selection rule#
Compute posterior mean rewards via Monte Carlo:
\[\hat{\mu}_a(\mathbf{x}_t) = \frac{1}{S} \sum_{s=1}^{S} g_a(\tilde{\theta}_a^{(s)}, \mathbf{x}_t)\]Compute exponential weights (with log-sum-exp stability):
\[w_a = \exp\!\bigl(\eta \cdot (\hat{\mu}_a - \max_j \hat{\mu}_j)\bigr)\]Mix with uniform exploration:
\[P_t(a) = (1 - \gamma_{\text{forced}}) \frac{w_a}{\sum_j w_j} + \frac{\gamma_{\text{forced}}}{K}\]Sample arm \(a_t\) according to \(P_t\).
When \(\gamma_{\text{forced}} = 0\) (default), this is a pure softmax over posterior means.
Update rule (importance-weighted)#
EXP3A overrides the default update. After observing reward \(y\) for arm \(a_t\):
Recompute the selection probabilities \(P_t(a)\) using the current posterior (stateless design).
Compute the importance weight:
\[w_t = \frac{1}{P_t(a_t) + \gamma_{\text{ix}}}\]Update the selected arm with the weighted observation:
\[\text{arm}_{a_t}.\text{update}(\mathbf{X},\, \mathbf{y},\, \text{sample\_weight} = w_t)\]
The importance weighting corrects for selection bias: arms pulled frequently get down-weighted, arms pulled rarely get up-weighted.
Two variants#
EXP3-IX (default): gamma=0, ix_gamma=eta/2. No forced
exploration; the IX regularization term \(\gamma_{\text{ix}}\)
in the denominator bounds importance weights at
\(1/\gamma_{\text{ix}}\), preventing catastrophic updates when an
arm has very low selection probability. Recommended by Neu (2015) [5]
for better empirical performance and high-probability bounds.
Standard EXP3: gamma>0, ix_gamma=0. Forced exploration via
\(\gamma\)-mixing ensures every arm has at least
\(\gamma/K\) selection probability. Unbiased importance weights
(no IX regularization). Original formulation from Auer et al. (2002)
[4].
Hyperparameters#
Parameter |
Controls |
Practical guidance |
|---|---|---|
|
Forced exploration rate (default 0.0). Each arm is guaranteed at least \(\gamma/K\) selection probability. |
0.0 for EXP3-IX (recommended). Set > 0 for standard EXP3. |
|
Temperature for exponential weights (default 1.0). Higher values create sharper distinctions between arms (more exploitation). |
Must be calibrated to reward scale. If rewards are in \([0, 1]\), \(\eta = 1.0\) is reasonable. If rewards are in \([0, 100]\), use \(\eta = 0.01\). |
|
IX regularization (default |
Use the default ( |
|
Number of Monte Carlo draws per arm (default 100). |
100 is usually sufficient for mean estimation. |
Regret bounds#
EXP3-IX achieves a high-probability bound of [5]:
Standard EXP3 achieves the same minimax rate in expectation [4].