bayesianbandits.EpsilonGreedy#

class bayesianbandits.EpsilonGreedy(epsilon: float = 0.1, samples: int = 1000)#

Bases: PolicyDefaultUpdate[ContextType, TokenType]

Policy object for \(\varepsilon\)-greedy selection.

With probability \(1 - \varepsilon\) the arm with the highest estimated posterior mean reward is selected (exploit); with probability \(\varepsilon\) an arm is chosen uniformly at random (explore):

\[\begin{split}a_t = \begin{cases} \arg\max_a \; \hat{\mu}_a(x_t) & \text{with probability } 1 - \varepsilon, \\ \text{Uniform}\{1, \ldots, K\} & \text{with probability } \varepsilon, \end{cases}\end{split}\]

where \(\hat{\mu}_a(x_t) = \mathbb{E}_{\theta_a \mid \mathcal{D}_a}[g_a(\theta_a, x_t)]\) is estimated via Monte Carlo with samples posterior draws.

Parameters:
  • epsilon (float, default 0.1) – Probability of exploration.

  • samples (int, default 1000) – Number of posterior samples used to estimate the arm means.

Examples

>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import EpsilonGreedy
>>>
>>> arms = [
...     Arm(f"arm_{i}", learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>> agent = Agent(arms, EpsilonGreedy(epsilon=0.1))

See also

ThompsonSampling

Randomized exploration via posterior sampling. Generally the default choice; explores proportionally to uncertainty.

UpperConfidenceBound

Deterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.

EXP3A

Adversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.

Notes

Regret bounds (standard setting). For a \(K\)-armed stochastic bandit with fixed \(\varepsilon\), the expected regret is bounded by

\[\mathbb{E}[\mathrm{Regret}(T)] \;\le\; \varepsilon\,T\,\Delta_{\max} \;+\; \sum_{a:\Delta_a>0} \frac{C}{\Delta_a}\]

where \(\Delta_a = \mu^* - \mu_a\) is the sub-optimality gap and \(C\) depends on the concentration of the mean estimator [2]. With a decaying schedule \(\varepsilon_t = O(K / t)\) the minimax rate \(O(\sqrt{KT})\) is achievable.

Applicability to this library. The classical analysis assumes stationary rewards and empirical sample means with known concentration properties [1]. This implementation replaces the empirical mean with a Bayesian posterior mean (which may use approximate inference) and supports contextual features and variance-increasing decay for non-stationarity. Under these modifications the formal bounds do not directly apply, but the basic explore-exploit trade-off controlled by \(\varepsilon\) is preserved.

References

__init__(epsilon: float = 0.1, samples: int = 1000)#
property samples_needed: int#

Number of samples per arm per context needed for decision making.

select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]

Select arms based on pre-generated samples using epsilon-greedy.