bayesianbandits.UpperConfidenceBound#

class bayesianbandits.UpperConfidenceBound(alpha: float = 0.68, samples: int = 1000)#

Bases: PolicyDefaultUpdate[ContextType, TokenType]

Policy object for Bayesian upper confidence bound.

At each round, posterior samples are used to estimate the \(\alpha\)-quantile of each arm’s reward distribution, and the arm with the highest quantile is selected:

\[a^* = \arg\max_a \; Q_{\alpha}\!\bigl(g_a(\theta_a) \mid \mathcal{D}_a\bigr)\]

where \(Q_{\alpha}\) denotes the \(\alpha\)-quantile of the posterior predictive reward, estimated via Monte Carlo with samples draws.

Parameters:
  • alpha (float, default 0.68) – Quantile level used as the upper confidence bound. Higher values produce more optimistic estimates and encourage exploration. The default of 0.68 corresponds roughly to a one-standard-deviation bound for a Gaussian posterior.

  • samples (int, default 1000) – Number of posterior samples used to estimate the quantile.

Examples

Default one-sigma bound:

>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import UpperConfidenceBound
>>>
>>> arms = [
...     Arm(f"arm_{i}", learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>> agent = Agent(arms, UpperConfidenceBound())

More aggressive exploration with a higher quantile:

>>> agent = Agent(arms, UpperConfidenceBound(alpha=0.95))

See also

ThompsonSampling

Randomized exploration via posterior sampling. Generally the default choice; explores proportionally to uncertainty.

EpsilonGreedy

Simple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.

EXP3A

Adversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.

Notes

Regret bounds (standard setting). Kaufmann et al. (2012) show that Bayesian UCB with a quantile schedule \(\alpha_t = 1 - 1/(t \log^c(T))\) achieves

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\]

matching the Lai-Robbins lower bound up to constants [2]. With a fixed \(\alpha\) (as used here), the problem-dependent bound is not guaranteed, but a minimax rate of \(O(\sqrt{KT \ln T})\) still holds under mild conditions.

Applicability to this library. The bounds above assume stationary rewards, exact conjugate posteriors, and a non-contextual setting [1]. This library uses a fixed quantile level, supports contextual features, approximate posteriors, and variance-increasing decay for non-stationarity. Under these modifications the formal guarantees do not directly apply, though the optimism-in-the-face-of-uncertainty principle that drives UCB’s exploration is preserved.

References

__init__(alpha: float = 0.68, samples: int = 1000)#
property samples_needed: int#

Number of samples per arm per context needed for decision making.

select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]

Select arms based on pre-generated samples using upper confidence bounds.