bayesianbandits.ThompsonSampling#

class bayesianbandits.ThompsonSampling#

Bases: PolicyDefaultUpdate[ContextType, TokenType]

Policy object for Thompson sampling.

Thompson sampling chooses the best arm with probability equal to the probability that the arm is the best arm. At each round, a single sample is drawn from each arm’s posterior and the arm with the highest sample is selected:

\[\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a) \quad \forall a, \qquad a^* = \arg\max_a \; g_a(\tilde{\theta}_a)\]

where \(g_a\) is the reward function (possibly context-dependent) and \(\mathcal{D}_a\) is the data observed for arm \(a\).

Examples

>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import ThompsonSampling
>>>
>>> arms = [
...     Arm(f"arm_{i}", learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>> agent = Agent(arms, ThompsonSampling())

See also

UpperConfidenceBound

Deterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.

EpsilonGreedy

Simple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.

EXP3A

Adversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.

Notes

Regret bounds (standard setting). For the \(K\)-armed stochastic bandit with Beta-Bernoulli or Gaussian conjugate models, Thompson sampling achieves a Bayesian expected regret of

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sqrt{KT \ln K}\right)\]

and an asymptotically optimal problem-dependent bound of \(O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\) matching the Lai-Robbins lower bound [2].

Applicability to this library. The bounds above are proven for stationary, non-contextual bandits with exact conjugate posteriors [1]. This library targets anytime, contextual, and potentially non-stationary problems, and the Bayesian learners may use approximate posteriors (e.g. Laplace approximations) or variance-increasing decay. Under these modifications the classical regret guarantees do not formally apply, though the core explore-exploit mechanism—probability matching via posterior sampling—is preserved.

References

property samples_needed: int#

Number of samples per arm per context needed for decision making.

select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) List[Arm[ContextType, TokenType]]#
select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) List[List[Arm[ContextType, TokenType]]]

Select arms based on pre-generated samples.