bayesianbandits.EpsilonGreedy#
- class bayesianbandits.EpsilonGreedy(epsilon: float = 0.1, samples: int = 1000)#
Bases:
PolicyDefaultUpdate[ContextType,TokenType]Policy object for \(\varepsilon\)-greedy selection.
With probability \(1 - \varepsilon\) the arm with the highest estimated posterior mean reward is selected (exploit); with probability \(\varepsilon\) an arm is chosen uniformly at random (explore):
\[\begin{split}a_t = \begin{cases} \arg\max_a \; \hat{\mu}_a(x_t) & \text{with probability } 1 - \varepsilon, \\ \text{Uniform}\{1, \ldots, K\} & \text{with probability } \varepsilon, \end{cases}\end{split}\]where \(\hat{\mu}_a(x_t) = \mathbb{E}_{\theta_a \mid \mathcal{D}_a}[g_a(\theta_a, x_t)]\) is estimated via Monte Carlo with
samplesposterior draws.- Parameters:
epsilon (
float, default0.1) – Probability of exploration.samples (
int, default1000) – Number of posterior samples used to estimate the arm means.
Examples
>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor >>> from bayesianbandits import EpsilonGreedy >>> >>> arms = [ ... Arm(f"arm_{i}", learner=NormalInverseGammaRegressor()) ... for i in range(3) ... ] >>> agent = Agent(arms, EpsilonGreedy(epsilon=0.1))
See also
ThompsonSamplingRandomized exploration via posterior sampling. Generally the default choice; explores proportionally to uncertainty.
UpperConfidenceBoundDeterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.
EXP3AAdversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.
Notes
Regret bounds (standard setting). For a \(K\)-armed stochastic bandit with fixed \(\varepsilon\), the expected regret is bounded by
\[\mathbb{E}[\mathrm{Regret}(T)] \;\le\; \varepsilon\,T\,\Delta_{\max} \;+\; \sum_{a:\Delta_a>0} \frac{C}{\Delta_a}\]where \(\Delta_a = \mu^* - \mu_a\) is the sub-optimality gap and \(C\) depends on the concentration of the mean estimator [2]. With a decaying schedule \(\varepsilon_t = O(K / t)\) the minimax rate \(O(\sqrt{KT})\) is achievable.
Applicability to this library. The classical analysis assumes stationary rewards and empirical sample means with known concentration properties [1]. This implementation replaces the empirical mean with a Bayesian posterior mean (which may use approximate inference) and supports contextual features and variance-increasing decay for non-stationarity. Under these modifications the formal bounds do not directly apply, but the basic explore-exploit trade-off controlled by \(\varepsilon\) is preserved.
References
- __init__(epsilon: float = 0.1, samples: int = 1000)#
- property samples_needed: int#
Number of samples per arm per context needed for decision making.
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]
Select arms based on pre-generated samples using epsilon-greedy.