bayesianbandits.EXP3A#

class bayesianbandits.EXP3A(gamma: float = 0.0, eta: float = 1.0, ix_gamma: float | None = None, samples: int = 100)#

Bases: object

EXP3.A: Average-based anytime variant of EXP3 for adversarial bandits.

This algorithm provides adversarial robustness while solving several practical limitations of traditional adversarial bandit algorithms like EXP3/EXP4.

At each round the algorithm computes arm selection probabilities from exponential weights over estimated average rewards:

\[P_t(a) = (1 - \gamma)\, \frac{\exp\bigl(\eta\,\hat{\mu}_a(x_t)\bigr)} {\sum_{j}\exp\bigl(\eta\,\hat{\mu}_j(x_t)\bigr)} \;+\; \frac{\gamma}{K}\]

where \(\hat{\mu}_a(x_t)\) is the Monte Carlo posterior mean for arm \(a\) in context \(x_t\). After observing reward \(r_t\), the selected arm’s learner is updated with importance-weighted observations:

\[w_t = \frac{1}{P_t(a_t) + \gamma_{\mathrm{ix}}}\]

The IX (Implicit eXploration) regularisation \(\gamma_{\mathrm{ix}}\) [2] replaces the forced-exploration mixture in classical EXP3.

Core adversarial mechanism:

  • Importance weighting: Uses \(w = 1/(P(a) + \gamma_{\mathrm{ix}})\) to debias reward estimates

  • Optional forced exploration: \(\gamma\)-mixing guarantees \(P(a) \ge \gamma/K\) when gamma > 0

  • No assumptions: Works with arbitrary reward sequences, including adversarial

Algorithm variants:

  1. EXP3-IX (Neu, 2015) [2] - Default: - Set gamma = 0 and ix_gamma > 0 (default: eta/2) - No forced exploration (pure exponential weights) - Regularized importance weights: 1/(P(arm) + γ_ix) - Better empirical performance and high-probability bounds

  2. Standard EXP3 (Auer et al., 2002) [1]: - Set gamma > 0 and ix_gamma = 0 - Uses forced exploration via γ-mixing - Unbiased importance weights: 1/P(arm)

Key practical advantages:

  1. Contextual by default: Handles contexts naturally through Bayesian learners without the complexity of explicit policy enumeration (EXP4)

  2. Anytime algorithm: No need to know time horizon T in advance or tune η based on expected runtime - just set η based on reward scale

  3. Adaptive to change: Tracks moving targets and changing adversaries by using this library’s Bayesian learners, which support both sample weights and Bayesian forgetting via variance-increasing decay

  4. Numerical stability: Weights stay bounded by exp(η * max_reward), avoiding the numerical overflow issues of cumulative approaches

When to use this algorithm:

  • Adversarial or non-stationary environments

  • Unknown or variable time horizons

  • Contextual problems where adversary can adapt based on context

  • Long-running experiments where numerical stability matters

  • When you need to detect and adapt to strategy changes

The importance weighting ensures each learner estimates the average reward it would receive under uniform sampling, preventing adversaries from exploiting the algorithm’s selection bias.

Parameters:
  • gamma (float, default 0.0) – Exploration rate for forced exploration. Each arm pulled with probability at least γ/K when gamma > 0. Default is 0 (no forced exploration).

  • eta (float, default 1.0) –

    Temperature for exponential weights. Higher values create sharper distinctions between arms. Unlike standard EXP3, doesn’t need scaling with horizon since we use averages not sums.

    Note: The appropriate value of eta depends on the scale of your rewards. Larger reward scales require smaller eta values to avoid numerical issues and maintain meaningful exploration.

  • ix_gamma (float or None, default None) –

    Regularization parameter for importance weights. If None, defaults to eta/2 as recommended by Neu (2015). Use 0 for unbiased weights. IX stands for Implicit eXploration.

    Note: This is unfortunately also called “gamma” in Neu (2015), but it serves a different purpose than the exploration rate gamma. It does serve to encourage exploration, but by regularizing the importance weights (smaller updates, spend more time close to the prior for each arm) rather than by mixing exploration into the arm selection probabilities.

  • samples (int, default 100) – Number of samples to use for computing expected rewards via Monte Carlo. Higher values improve accuracy but increase computational cost. The more parameters your Bayesian learners have, the more samples you may need to get stable estimates.

References

Examples

EXP3-IX variant (default, recommended):

>>> from bayesianbandits import Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import ContextualAgent
>>>
>>> # Create arms with Bayesian learners
>>> arms = [
...     Arm(action_token=f"action_{i}",
...         learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>>
>>> # Initialize EXP3-IX policy (default)
>>> policy = EXP3A(eta=2.0)  # ix_gamma defaults to eta/2 = 1.0
>>>
>>> # Create agent
>>> agent = ContextualAgent(arms, policy)

Standard EXP3 with forced exploration:

>>> # Initialize standard EXP3 policy
>>> policy = EXP3A(gamma=0.1, eta=2.0, ix_gamma=0.0)
>>>
>>> # Create agent
>>> agent = ContextualAgent(arms, policy)

See also

ThompsonSampling

Randomized exploration via posterior sampling. Generally the default choice for stochastic environments.

UpperConfidenceBound

Deterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.

EpsilonGreedy

Simple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.

Notes

Regret bounds (standard setting). In the classical non-stochastic \(K\)-armed bandit with \(T\) rounds, EXP3-IX achieves a high-probability regret bound of

\[\mathrm{Regret}(T) = O\!\left(\sqrt{KT\ln K}\right)\]

with appropriately tuned \(\eta\) and \(\gamma_{\mathrm{ix}}\) [2]. The original EXP3 with forced exploration achieves the same minimax rate in expectation [1].

Applicability to this library. This implementation is inspired by EXP3/EXP3-IX but makes several practical modifications that affect the formal guarantees:

  • Uses average-based rewards instead of cumulative sums

  • Integrates with Bayesian learners rather than maintaining explicit per-round weight tables

  • Employs Monte Carlo estimation for expected rewards

  • Supports variance-increasing decay for non-stationary environments

Under these modifications the classical regret bounds do not formally apply. However, the core adversarial-robustness mechanism—importance- weighted updates that debias reward estimates under non-uniform selection—is preserved, and the algorithm should be viewed as a practical variant that maintains the adversarial robustness intuition of EXP3 while adapting to contextual, anytime, and non-stationary settings.

__init__(gamma: float = 0.0, eta: float = 1.0, ix_gamma: float | None = None, samples: int = 100)#
property samples_needed: int#

Number of samples per arm per context needed for decision making.

select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]

Select arms based on pre-generated samples using EXP3A.

update(arm: Arm[ContextType, TokenType], X: ContextType, y: ndarray[tuple[int, ...], dtype[float64]], all_arms: List[Arm[ContextType, TokenType]], rng: Generator, sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) None#

Update arm with importance-weighted reward.

The importance weighting ensures unbiased estimates under adversarial reward generation, making the algorithm robust to arbitrary adversaries.

Parameters:
  • arm (Arm) – The arm that was pulled

  • X (array-like of shape (n_samples, n_features)) – Context matrix

  • y (array-like of shape (n_samples,)) – Observed rewards

  • all_arms (List[Arm]) – All available arms (needed to recompute probabilities)

  • rng (np.random.Generator) – Random number generator (unused but part of interface)

  • sample_weight (Optional[NDArray[np.float64]], default None) – Sample weights for each observation. If provided, these are multiplied with the importance weights.