bayesianbandits.UpperConfidenceBound#
- class bayesianbandits.UpperConfidenceBound(alpha: float = 0.68, samples: int = 1000)#
Bases:
PolicyDefaultUpdate[ContextType,TokenType]Policy object for Bayesian upper confidence bound.
At each round, posterior samples are used to estimate the \(\alpha\)-quantile of each arm’s reward distribution, and the arm with the highest quantile is selected:
\[a^* = \arg\max_a \; Q_{\alpha}\!\bigl(g_a(\theta_a) \mid \mathcal{D}_a\bigr)\]where \(Q_{\alpha}\) denotes the \(\alpha\)-quantile of the posterior predictive reward, estimated via Monte Carlo with
samplesdraws.- Parameters:
alpha (
float, default0.68) – Quantile level used as the upper confidence bound. Higher values produce more optimistic estimates and encourage exploration. The default of 0.68 corresponds roughly to a one-standard-deviation bound for a Gaussian posterior.samples (
int, default1000) – Number of posterior samples used to estimate the quantile.
Examples
Default one-sigma bound:
>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor >>> from bayesianbandits import UpperConfidenceBound >>> >>> arms = [ ... Arm(f"arm_{i}", learner=NormalInverseGammaRegressor()) ... for i in range(3) ... ] >>> agent = Agent(arms, UpperConfidenceBound())
More aggressive exploration with a higher quantile:
>>> agent = Agent(arms, UpperConfidenceBound(alpha=0.95))
See also
ThompsonSamplingRandomized exploration via posterior sampling. Generally the default choice; explores proportionally to uncertainty.
EpsilonGreedySimple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.
EXP3AAdversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.
Notes
Regret bounds (standard setting). Kaufmann et al. (2012) show that Bayesian UCB with a quantile schedule \(\alpha_t = 1 - 1/(t \log^c(T))\) achieves
\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\]matching the Lai-Robbins lower bound up to constants [2]. With a fixed \(\alpha\) (as used here), the problem-dependent bound is not guaranteed, but a minimax rate of \(O(\sqrt{KT \ln T})\) still holds under mild conditions.
Applicability to this library. The bounds above assume stationary rewards, exact conjugate posteriors, and a non-contextual setting [1]. This library uses a fixed quantile level, supports contextual features, approximate posteriors, and variance-increasing decay for non-stationarity. Under these modifications the formal guarantees do not directly apply, though the optimism-in-the-face-of-uncertainty principle that drives UCB’s exploration is preserved.
References
- __init__(alpha: float = 0.68, samples: int = 1000)#
- property samples_needed: int#
Number of samples per arm per context needed for decision making.
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
- select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]
Select arms based on pre-generated samples using upper confidence bounds.