bayesianbandits.EXP3A#
- class bayesianbandits.EXP3A(gamma: float = 0.0, eta: float = 1.0, ix_gamma: float | None = None, samples: int = 100)#
Bases:
objectEXP3.A: Average-based anytime variant of EXP3 for adversarial bandits.
This algorithm provides adversarial robustness while solving several practical limitations of traditional adversarial bandit algorithms like EXP3/EXP4.
At each round the algorithm computes arm selection probabilities from exponential weights over estimated average rewards:
\[P_t(a) = (1 - \gamma)\, \frac{\exp\bigl(\eta\,\hat{\mu}_a(x_t)\bigr)} {\sum_{j}\exp\bigl(\eta\,\hat{\mu}_j(x_t)\bigr)} \;+\; \frac{\gamma}{K}\]where \(\hat{\mu}_a(x_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(x_t\). After observing reward \(r_t\), the selected arm’s learner is updated with importance-weighted observations:
\[w_t = \frac{1}{P_t(a_t) + \gamma_{\mathrm{ix}}}\]The IX (Implicit eXploration) regularisation \(\gamma_{\mathrm{ix}}\) [2] replaces the forced-exploration mixture in classical EXP3.
Core adversarial mechanism:
Importance weighting: Uses \(w = 1/(P(a) + \gamma_{\mathrm{ix}})\) to debias reward estimates
Optional forced exploration: \(\gamma\)-mixing guarantees \(P(a) \ge \gamma/K\) when
gamma > 0No assumptions: Works with arbitrary reward sequences, including adversarial
Algorithm variants:
EXP3-IX (Neu, 2015) [2] - Default: - Set gamma = 0 and ix_gamma > 0 (default: eta/2) - No forced exploration (pure exponential weights) - Regularized importance weights: 1/(P(arm) + γ_ix) - Better empirical performance and high-probability bounds
Standard EXP3 (Auer et al., 2002) [1]: - Set gamma > 0 and ix_gamma = 0 - Uses forced exploration via γ-mixing - Unbiased importance weights: 1/P(arm)
Key practical advantages:
Contextual by default: Handles contexts naturally through Bayesian learners without the complexity of explicit policy enumeration (EXP4)
Anytime algorithm: No need to know time horizon T in advance or tune η based on expected runtime - just set η based on reward scale
Adaptive to change: Tracks moving targets and changing adversaries by using this library’s Bayesian learners, which support both sample weights and Bayesian forgetting via variance-increasing decay
Numerical stability: Weights stay bounded by exp(η * max_reward), avoiding the numerical overflow issues of cumulative approaches
When to use this algorithm:
Adversarial or non-stationary environments
Unknown or variable time horizons
Contextual problems where adversary can adapt based on context
Long-running experiments where numerical stability matters
When you need to detect and adapt to strategy changes
The importance weighting ensures each learner estimates the average reward it would receive under uniform sampling, preventing adversaries from exploiting the algorithm’s selection bias.
- Parameters:
gamma (
float, default0.0) – Exploration rate for forced exploration. Each arm pulled with probability at least γ/K when gamma > 0. Default is 0 (no forced exploration).eta (
float, default1.0) –Temperature for exponential weights. Higher values create sharper distinctions between arms. Unlike standard EXP3, doesn’t need scaling with horizon since we use averages not sums.
Note: The appropriate value of eta depends on the scale of your rewards. Larger reward scales require smaller eta values to avoid numerical issues and maintain meaningful exploration.
ix_gamma (
floatorNone, defaultNone) –Regularization parameter for importance weights. If None, defaults to eta/2 as recommended by Neu (2015). Use 0 for unbiased weights. IX stands for Implicit eXploration.
Note: This is unfortunately also called “gamma” in Neu (2015), but it serves a different purpose than the exploration rate gamma. It does serve to encourage exploration, but by regularizing the importance weights (smaller updates, spend more time close to the prior for each arm) rather than by mixing exploration into the arm selection probabilities.
samples (
int, default100) – Number of samples to use for computing expected rewards via Monte Carlo. Higher values improve accuracy but increase computational cost. The more parameters your Bayesian learners have, the more samples you may need to get stable estimates.
References
Examples
EXP3-IX variant (default, recommended):
>>> from bayesianbandits import Arm, NormalInverseGammaRegressor >>> from bayesianbandits import ContextualAgent >>> >>> # Create arms with Bayesian learners >>> arms = [ ... Arm(action_token=f"action_{i}", ... learner=NormalInverseGammaRegressor()) ... for i in range(3) ... ] >>> >>> # Initialize EXP3-IX policy (default) >>> policy = EXP3A(eta=2.0) # ix_gamma defaults to eta/2 = 1.0 >>> >>> # Create agent >>> agent = ContextualAgent(arms, policy)
Standard EXP3 with forced exploration:
>>> # Initialize standard EXP3 policy >>> policy = EXP3A(gamma=0.1, eta=2.0, ix_gamma=0.0) >>> >>> # Create agent >>> agent = ContextualAgent(arms, policy)
See also
ThompsonSamplingRandomized exploration via posterior sampling. Generally the default choice for stochastic environments.
UpperConfidenceBoundDeterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.
EpsilonGreedySimple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.
Notes
Regret bounds (standard setting). In the classical non-stochastic \(K\)-armed bandit with \(T\) rounds, EXP3-IX achieves a high-probability regret bound of
\[\mathrm{Regret}(T) = O\!\left(\sqrt{KT\ln K}\right)\]with appropriately tuned \(\eta\) and \(\gamma_{\mathrm{ix}}\) [2]. The original EXP3 with forced exploration achieves the same minimax rate in expectation [1].
Applicability to this library. This implementation is inspired by EXP3/EXP3-IX but makes several practical modifications that affect the formal guarantees:
Uses average-based rewards instead of cumulative sums
Integrates with Bayesian learners rather than maintaining explicit per-round weight tables
Employs Monte Carlo estimation for expected rewards
Supports variance-increasing decay for non-stationary environments
Under these modifications the classical regret bounds do not formally apply. However, the core adversarial-robustness mechanism—importance- weighted updates that debias reward estimates under non-uniform selection—is preserved, and the algorithm should be viewed as a practical variant that maintains the adversarial robustness intuition of EXP3 while adapting to contextual, anytime, and non-stationary settings.
- __init__(gamma: float = 0.0, eta: float = 1.0, ix_gamma: float | None = None, samples: int = 100)#
- property samples_needed: int#
Number of samples per arm per context needed for decision making.
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]
Select arms based on pre-generated samples using EXP3A.
- update(arm: Arm[ContextType, TokenType], X: ContextType, y: ndarray[tuple[int, ...], dtype[float64]], all_arms: List[Arm[ContextType, TokenType]], rng: Generator, sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) None#
Update arm with importance-weighted reward.
The importance weighting ensures unbiased estimates under adversarial reward generation, making the algorithm robust to arbitrary adversaries.
- Parameters:
arm (
Arm) – The arm that was pulledX (
array-likeofshape (n_samples,n_features)) – Context matrixy (
array-likeofshape (n_samples,)) – Observed rewardsall_arms (
List[Arm]) – All available arms (needed to recompute probabilities)rng (
np.random.Generator) – Random number generator (unused but part of interface)sample_weight (
Optional[NDArray[np.float64]], defaultNone) – Sample weights for each observation. If provided, these are multiplied with the importance weights.