bayesianbandits.ThompsonSampling#

class bayesianbandits.ThompsonSampling#

Bases: PolicyDefaultUpdate[ContextType, TokenType]

Policy object for Thompson sampling.

Thompson sampling chooses the best arm with probability equal to the probability that the arm is the best arm. At each round, a single sample is drawn from each arm’s posterior and the arm with the highest sample is selected:

\[\tilde{\theta}_a \sim p(\theta_a \mid \mathcal{D}_a) \quad \forall a, \qquad a^* = \arg\max_a \; g_a(\tilde{\theta}_a)\]

where \(g_a\) is the reward function (possibly context-dependent) and \(\mathcal{D}_a\) is the data observed for arm \(a\).

Examples

>>> from bayesianbandits import Agent, Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import ThompsonSampling
>>>
>>> arms = [
...     Arm(f"arm_{i}", learner=NormalInverseGammaRegressor())
...     for i in range(3)
... ]
>>> agent = Agent(arms, ThompsonSampling())

See also

UpperConfidenceBound: Deterministic optimism via posterior quantiles. Explores more aggressively than Thompson sampling for uncertain arms.
EpsilonGreedy: Simple exploration via random arm selection. Does not use posterior uncertainty to guide exploration.
EXP3A: Adversarial robustness via importance-weighted updates. Use when reward generation may be adversarial or non-stochastic.

Notes

Regret bounds (standard setting). For the \(K\)-armed stochastic bandit with Beta-Bernoulli or Gaussian conjugate models, Thompson sampling achieves a Bayesian expected regret of

\[\mathbb{E}[\mathrm{Regret}(T)] = O\!\left(\sqrt{KT \ln K}\right)\]

and an asymptotically optimal problem-dependent bound of \(O\!\left(\sum_{a:\Delta_a>0} \frac{\ln T}{\Delta_a}\right)\) matching the Lai-Robbins lower bound [2].

Applicability to this library. The bounds above are proven for stationary, non-contextual bandits with exact conjugate posteriors [1]. This library targets anytime, contextual, and potentially non-stationary problems, and the Bayesian learners may use approximate posteriors (e.g. Laplace approximations) or variance-increasing decay. Under these modifications the classical regret guarantees do not formally apply, though the core explore-exploit mechanism—probability matching via posterior sampling—is preserved.

References

property samples_needed: int#: Number of samples per arm per context needed for decision making.

select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: None = None) → List[Arm[ContextType, TokenType]]#
select(samples: NDArray[np.float64], arms: List[Arm[ContextType, TokenType]], rng: Generator, top_k: int) → List[List[Arm[ContextType, TokenType]]]: Select arms based on pre-generated samples.

bayesianbandits.ThompsonSampling#

This Page