bayesianbandits.EXP3A#

class bayesianbandits.EXP3A(gamma: float = 0.0, eta: float = 1.0, ix_gamma: float | None = None, samples: int = 100)#

Bases: object

EXP3.A: Average-based anytime variant of EXP3 for adversarial bandits.

This algorithm provides adversarial robustness while solving several practical limitations of traditional adversarial bandit algorithms like EXP3/EXP4.

At each round the algorithm computes arm selection probabilities from exponential weights over estimated average rewards:

\[P_t(a) = (1 - \gamma)\, \frac{\exp\bigl(\eta\,\hat{\mu}_a(x_t)\bigr)} {\sum_{j}\exp\bigl(\eta\,\hat{\mu}_j(x_t)\bigr)} \;+\; \frac{\gamma}{K}\]

where \(\hat{\mu}_a(x_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(x_t\). After observing reward \(r_t\), the selected arm’s learner is updated with importance-weighted observations:

\[w_t = \frac{1}{P_t(a_t) + \gamma_{\mathrm{ix}}}\]

The IX (Implicit eXploration) regularisation \(\gamma_{\mathrm{ix}}\) [2] replaces the forced-exploration mixture in classical EXP3.

Core adversarial mechanism:

Importance weighting: Uses \(w = 1/(P(a) + \gamma_{\mathrm{ix}})\) to debias reward estimates
Optional forced exploration: \(\gamma\)-mixing guarantees \(P(a) \ge \gamma/K\) when gamma > 0
No assumptions: Works with arbitrary reward sequences, including adversarial

Algorithm variants:

EXP3-IX (Neu, 2015) [2] - Default: - Set gamma = 0 and ix_gamma > 0 (default: eta/2) - No forced exploration (pure exponential weights) - Regularized importance weights: 1/(P(arm) + γ_ix) - Better empirical performance and high-probability bounds
Standard EXP3 (Auer et al., 2002) [1]: - Set gamma > 0 and ix_gamma = 0 - Uses forced exploration via γ-mixing - Unbiased importance weights: 1/P(arm)

Key practical advantages:

Contextual by default: Handles contexts naturally through Bayesian learners without the complexity of explicit policy enumeration (EXP4)
Anytime algorithm: No need to know time horizon T in advance or tune η based on expected runtime - just set η based on reward scale
Adaptive to change: Tracks moving targets and changing adversaries by using this library’s Bayesian learners, which support both sample weights and Bayesian forgetting via variance-increasing decay
Numerical stability: Weights stay bounded by exp(η * max_reward), avoiding the numerical overflow issues of cumulative approaches

When to use this algorithm:

Adversarial or non-stationary environments
Unknown or variable time horizons
Contextual problems where adversary can adapt based on context
Long-running experiments where numerical stability matters
When you need to detect and adapt to strategy changes

The importance weighting ensures each learner estimates the average reward it would receive under uniform sampling, preventing adversaries from exploiting the algorithm’s selection bias.

Parameters:

gamma (float, default 0.0) – Exploration rate for forced exploration. Each arm pulled with probability at least γ/K when gamma > 0. Default is 0 (no forced exploration).
eta (float, default 1.0) –
Temperature for exponential weights. Higher values create sharper distinctions between arms. Unlike standard EXP3, doesn’t need scaling with horizon since we use averages not sums.

Note: The appropriate value of eta depends on the scale of your rewards. Larger reward scales require smaller eta values to avoid numerical issues and maintain meaningful exploration.
ix_gamma (float or None, default None) –
Regularization parameter for importance weights. If None, defaults to eta/2 as recommended by Neu (2015). Use 0 for unbiased weights. IX stands for Implicit eXploration.

Note: This is unfortunately also called “gamma” in Neu (2015), but it serves a different purpose than the exploration rate gamma. It does serve to encourage exploration, but by regularizing the importance weights (smaller updates, spend more time close to the prior for each arm) rather than by mixing exploration into the arm selection probabilities.
samples (int, default 100) – Number of samples to use for computing expected rewards via Monte Carlo. Higher values improve accuracy but increase computational cost. The more parameters your Bayesian learners have, the more samples you may need to get stable estimates.

References

Examples

EXP3-IX variant (default, recommended):

>>> from bayesianbandits import Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import ContextualAgent
>>>
>>> # Create arms with Bayesian learners
>>> arms = [
...     Arm(action_token=f"action_{i}",
...         learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>>
>>> # Initialize EXP3-IX policy (default)
>>> policy = EXP3A(eta=2.0)  # ix_gamma defaults to eta/2 = 1.0
>>>
>>> # Create agent
>>> agent = ContextualAgent(arms, policy)

Standard EXP3 with forced exploration:

>>> # Initialize standard EXP3 policy
>>> policy = EXP3A(gamma=0.1, eta=2.0, ix_gamma=0.0)
>>>
>>> # Create agent
>>> agent = ContextualAgent(arms, policy)

bayesianbandits.EXP3A#

This Page