Exploration Policies#

A policy maps posterior samples to arm selections. All four share a common interface:

Draw posterior predictive samples for each arm: \(\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a)\), then \(\tilde{r}_a = g_a(\tilde{\theta}_a, \mathbf{x}_t)\) where \(g_a\) is the reward function.
Compute a decision statistic from those samples (policy-specific).
Select the arm that maximizes the statistic.

The update rule is shared by Thompson sampling, UCB, and epsilon-greedy: call arm.update(X, y) directly. EXP3A overrides this with importance-weighted updates.

Note

The regret bounds listed below are proven for stationary, non-contextual bandits with exact conjugate posteriors. This library targets contextual, non-stationary settings with approximate posteriors and decay. The formal guarantees do not directly apply.

Samples per arm#
Policy	Samples	Reason
`ThompsonSampling`	1	One posterior draw suffices for probability matching
`UpperConfidenceBound`	1000	Monte Carlo quantile estimation needs many draws
`EpsilonGreedy`	1000	Accurate posterior mean via Monte Carlo
`EXP3A`	100	Only needs mean estimates; 100 draws are typically sufficient

Thompson Sampling#

Selection rule. Draw a single sample from each arm’s posterior predictive and select the arm with the highest sample:

\[\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a) \quad \forall a, \qquad a^* = \arg\max_a \; g_a(\tilde{\theta}_a, \mathbf{x}_t)\]

This is probability matching: each arm is selected with probability equal to its posterior probability of being optimal.

Hyperparameters. None.

Update rule. Default: arm.update(X, y).

Regret bounds. For the \(K\)-armed stochastic bandit with exact conjugate posteriors [1] [2]:

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sqrt{KT \ln K}\right)\]

with an asymptotically optimal problem-dependent bound of \(O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\) matching the Lai-Robbins lower bound.

Upper Confidence Bound#

Strictly, this is an upper credible bound (a posterior quantile), not the frequentist confidence bound of classical UCB. The name is conventional.

Selection rule. Compute the \(\alpha\)-quantile of each arm’s posterior predictive and select the arm with the highest quantile:

\[a^* = \arg\max_a \; Q_\alpha\!\left( g_a(\theta_a, \mathbf{x}_t) \mid \mathcal{D}_a \right)\]

The quantile is estimated via Monte Carlo: draw \(S\) samples per arm, compute np.quantile(samples, alpha), then argmax.

Hyperparameters.

Parameter	Controls	Practical guidance
`alpha`	Quantile level (default 0.68). Controls optimism. \(\alpha = 0.5\) is the median (greedy on the posterior mean). \(\alpha = 0.68\) is roughly one standard deviation for Gaussian posteriors. \(\alpha = 0.95\) is aggressive exploration.	Start with 0.68. Increase for more exploration.
`samples`	Number of Monte Carlo draws per arm (default 1000).	1000 is usually sufficient. Reduce for lower latency if quantile accuracy is not critical.

Update rule. Default: arm.update(X, y).

Regret bounds. With a quantile schedule \(\alpha_t = 1 - 1/(t \log^c(T))\), Bayesian UCB achieves [3]:

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\]

With a fixed \(\alpha\) (as implemented), the problem-dependent bound is not guaranteed, but a minimax rate of \(O(\sqrt{KT \ln T})\) holds under mild conditions.

Epsilon-Greedy#

Selection rule. Exploit the posterior mean with probability \(1 - \varepsilon\), explore uniformly at random with probability \(\varepsilon\):

\[\begin{split}a_t = \begin{cases} \displaystyle\arg\max_a \; \hat{\mu}_a(\mathbf{x}_t) & \text{with probability } 1 - \varepsilon \\[4pt] \text{Uniform}\{1, \ldots, K\} & \text{with probability } \varepsilon \end{cases}\end{split}\]

where \(\hat{\mu}_a(\mathbf{x}_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(\mathbf{x}_t\).

Hyperparameters.

Parameter	Controls	Practical guidance
`epsilon`	Exploration probability (default 0.1). \(\varepsilon = 0\) is pure greedy (no exploration). \(\varepsilon = 1\) is pure random.	0.01 to 0.2 is typical. Start with 0.1.
`samples`	Number of Monte Carlo draws per arm (default 1000).	1000 is usually sufficient for accurate mean estimation.

Update rule. Default: arm.update(X, y).

Regret bounds. For fixed \(\varepsilon\) [4]:

\[\mathbb{E}[\mathrm{Regret}(T)] \le \varepsilon\,T\,\Delta_{\max} + \sum_{a:\Delta_a>0} \frac{C}{\Delta_a}\]

With a decaying schedule \(\varepsilon_t = O(K/t)\), the minimax rate \(O(\sqrt{KT})\) is achievable. This implementation uses a fixed \(\varepsilon\) (no schedule).

EXP3A#

Standard EXP3 [4] is noncontextual and model-free: it maintains its own cumulative weight tables with no underlying belief system. This implementation keeps the parts that give adversarial robustness (uniform mixing, importance-weighted updates) but replaces the weight tables with Bayesian posterior means. That swap is what makes it contextual, compatible with decay, and anytime: posterior means don’t grow with \(T\), so \(\eta\) needs no horizon-dependent tuning. The default variant is EXP3-IX [5]. The Adversarial Bandits: Playing Rock-Paper-Scissors Against an Exploitative Opponent notebook demonstrates both adversarial robustness and exploitation of a weak opponent.

Unlike the other policies, EXP3A overrides the update rule with importance-weighted observations.

Selection rule#

Compute posterior mean rewards via Monte Carlo:

\[\hat{\mu}_a(\mathbf{x}_t) = \frac{1}{S} \sum_{s=1}^{S} g_a(\tilde{\theta}_a^{(s)}, \mathbf{x}_t)\]
Compute exponential weights (with log-sum-exp stability):

\[w_a = \exp\!\bigl(\eta \cdot (\hat{\mu}_a - \max_j \hat{\mu}_j)\bigr)\]
Mix with uniform exploration:

\[P_t(a) = (1 - \gamma_{\text{forced}}) \frac{w_a}{\sum_j w_j} + \frac{\gamma_{\text{forced}}}{K}\]
Sample arm \(a_t\) according to \(P_t\).

When \(\gamma_{\text{forced}} = 0\) (default), this is a pure softmax over posterior means.

Update rule (importance-weighted)#

EXP3A overrides the default update. After observing reward \(y\) for arm \(a_t\):

Recompute the selection probabilities \(P_t(a)\) using the current posterior (stateless design).
Compute the importance weight:

\[w_t = \frac{1}{P_t(a_t) + \gamma_{\text{ix}}}\]
Update the selected arm with the weighted observation:

\[\text{arm}_{a_t}.\text{update}(\mathbf{X},\, \mathbf{y},\, \text{sample\_weight} = w_t)\]

The importance weighting corrects for selection bias: arms pulled frequently get down-weighted, arms pulled rarely get up-weighted.

Two variants#

EXP3-IX (default): gamma=0, ix_gamma=eta/2. No forced exploration; the IX regularization term \(\gamma_{\text{ix}}\) in the denominator bounds importance weights at \(1/\gamma_{\text{ix}}\), preventing catastrophic updates when an arm has very low selection probability. Recommended by Neu (2015) [5] for better empirical performance and high-probability bounds.

Standard EXP3: gamma>0, ix_gamma=0. Forced exploration via \(\gamma\)-mixing ensures every arm has at least \(\gamma/K\) selection probability. Unbiased importance weights (no IX regularization). Original formulation from Auer et al. (2002) [4].

Hyperparameters#

Parameter	Controls	Practical guidance
`gamma`	Forced exploration rate (default 0.0). Each arm is guaranteed at least \(\gamma/K\) selection probability.	0.0 for EXP3-IX (recommended). Set > 0 for standard EXP3.
`eta`	Temperature for exponential weights (default 1.0). Higher values create sharper distinctions between arms (more exploitation).	Must be calibrated to reward scale. If rewards are in \([0, 1]\), \(\eta = 1.0\) is reasonable. If rewards are in \([0, 100]\), use \(\eta = 0.01\).
`ix_gamma`	IX regularization (default `eta/2`). Bounds importance weights at \(1/\gamma_{\text{ix}}\).	Use the default (`eta/2`) for EXP3-IX. Set to 0 for standard EXP3 (unbiased weights).
`samples`	Number of Monte Carlo draws per arm (default 100).	100 is usually sufficient for mean estimation.

Regret bounds#

EXP3-IX achieves a high-probability bound of [5]:

\[\mathrm{Regret}(T) = O\!\left(\sqrt{KT \ln K}\right)\]

Standard EXP3 achieves the same minimax rate in expectation [4].

Exploration Policies#

Thompson Sampling#

Upper Confidence Bound#

Epsilon-Greedy#

EXP3A#

Selection rule#

Update rule (importance-weighted)#

Two variants#

Hyperparameters#

Regret bounds#

References#

This Page