bayesianbandits

A Python library for Bayesian Multi-Armed Bandits.

This library implements a variety of multi-armed bandit algorithms, including epsilon-greedy, Thompson sampling, and upper confidence bound. It also handles a number of common extensions of the multi-armed bandit problem, including contextual bandits, delayed reward, and restless bandits.

This library is designed to be easy to use and extend. It is built on top of scikit-learn and uses scikit-learn-style estimators to model the arms. This allows you to use any scikit-learn-style estimator that supports the partial_fit and sample methods as the model for an arm in a bandit. Restless bandits also require a decay method.
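To make the required interface concrete, here is a minimal sketch of a conjugate learner exposing partial_fit, sample, and decay. This is a hypothetical illustration (class and parameter names are invented for this example), not one of the library's estimators:

```python
import numpy as np

class BetaBernoulliLearner:
    """Hypothetical sketch of the learner interface described above.

    Tracks a Beta(alpha, beta) posterior over a Bernoulli reward
    probability; not an estimator shipped by bayesianbandits.
    """

    def __init__(self, alpha=1.0, beta=1.0, random_state=None):
        self.alpha_ = alpha
        self.beta_ = beta
        self.rng_ = np.random.default_rng(random_state)

    def partial_fit(self, X, y):
        # Conjugate update: successes add to alpha, failures to beta.
        y = np.asarray(y)
        self.alpha_ += y.sum()
        self.beta_ += len(y) - y.sum()
        return self

    def sample(self, X, size=1):
        # Draw from the posterior -- this is what Thompson-style
        # policies use to choose between arms.
        return self.rng_.beta(self.alpha_, self.beta_, size=size)

    def decay(self, X, decay_rate=0.9):
        # Shrink the pseudo-counts toward the prior, increasing
        # posterior variance -- useful when rewards drift (restless bandits).
        self.alpha_ *= decay_rate
        self.beta_ *= decay_rate
        return self
```

Any estimator with these three methods (plus the usual scikit-learn conventions) can, in principle, back an arm.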

The Agent API found in bayesianbandits.api is reasonably stable and is currently used in production.

Agent API

The Agent API is the most ergonomic way to use this library in production. It is designed to maximize your IDE’s ability to autocomplete and type-check your code. Additionally, it is designed to make it easy to modify the arms and the policies of your bandit as your needs change.

The Agent API requires a slightly different interface for choice policies than the old Bandit API, but the new policy objects and the old policy decorators share the same underlying code. Both remain available for backwards compatibility.

Agent(arms, policy[, random_seed])

Agent for a non-contextual multi-armed bandit problem.

ContextualAgent(arms, policy[, random_seed])

Agent for a contextual multi-armed bandit problem.

EpsilonGreedy([epsilon, samples])

Policy object for epsilon-greedy.


ThompsonSampling([samples])

Policy object for Thompson sampling.

UpperConfidenceBound([alpha, samples])

Policy object for upper confidence bound.
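The three policies above differ only in how they turn posterior samples into an arm choice. The following standalone sketch illustrates the selection rules on a matrix of posterior draws; the function names echo the policy objects, but this is an illustration of the underlying ideas, not the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior draws for three arms: rows are Monte Carlo samples,
# columns are arms. These stand in for the draws a policy would
# request from each arm's learner.
draws = rng.normal(loc=[0.1, 0.5, 0.3], scale=0.2, size=(1000, 3))

def epsilon_greedy(draws, epsilon=0.1):
    # Exploit the arm with the best posterior mean,
    # explore uniformly at random with probability epsilon.
    if rng.random() < epsilon:
        return int(rng.integers(draws.shape[1]))
    return int(np.argmax(draws.mean(axis=0)))

def thompson_sampling(draws):
    # Pick the arm that wins under a single joint posterior draw.
    return int(np.argmax(draws[rng.integers(draws.shape[0])]))

def upper_confidence_bound(draws, alpha=0.84):
    # Pick the arm with the highest posterior alpha-quantile
    # (optimism in the face of uncertainty).
    return int(np.argmax(np.quantile(draws, alpha, axis=0)))
```

All three collapse to the same greedy choice as posteriors concentrate; they differ in how aggressively they explore while uncertainty remains.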

Bandit and Arm Classes

The Arm class is the base class for all arms in a bandit. Its constructor takes an action_token, which identifies the action returned by the arm's pull method, and an optional reward_function, which computes the reward from the outcome of that action.

Bandit([rng, cache])

Base class for bandits.

Arm(action_token[, reward_function, learner])

Arm of a bandit.
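The division of labor between action token, reward function, and learner can be sketched as follows. This is a hypothetical, simplified stand-in (SimpleArm is not the library's Arm class, and its method signatures are invented for illustration):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SimpleArm:
    """Hypothetical sketch of the Arm concept.

    action_token identifies the real-world action; reward_function
    maps the observed outcome of that action to a scalar reward;
    learner is the Bayesian estimator modeling this arm's rewards.
    """
    action_token: Any
    reward_function: Callable[[Any], float] = lambda outcome: float(outcome)
    learner: Any = None

    def pull(self):
        # Return the token so the caller can perform the action.
        return self.action_token

    def update(self, outcome):
        # Convert the outcome to a reward and update the posterior.
        reward = self.reward_function(outcome)
        if self.learner is not None:
            self.learner.partial_fit([[1.0]], [reward])
        return reward
```

The key point is that the arm itself never executes the action; it only hands back the token and learns from whatever outcome the caller reports.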

Bandit Decorators

These class decorators can be used to specialize Bandit subclasses for particular problems.


contextual(cls)

Decorator for making a bandit contextual.


restless(cls)

Decorator for restless bandits.


Policy Functions

These functions can be used to create policy functions for bandits. They should be passed to the policy argument of the bandit decorator.

epsilon_greedy([epsilon, samples])

Creates an epsilon-greedy choice algorithm.


thompson_sampling([samples])

Creates a Thompson sampling choice algorithm.

upper_confidence_bound([alpha, samples])

Creates a UCB choice algorithm.


Estimators

These estimators are the underlying models for the arms in a bandit. They should be passed to the learner argument of the bandit decorator. Each of these Bayesian estimators can be converted to a recursive estimator by passing a learning_rate argument less than 1 to its constructor. Each implements a decay method that uses the learning_rate to increase the variance of the prior. This is a type of state-space model that is useful for restless bandits.

DirichletClassifier(alphas, *[, ...])

Intercept-only Dirichlet Classifier

GammaRegressor(alpha, beta, *[, ...])

Intercept-only Gamma regression model.

NormalRegressor(alpha, beta, *[, ...])

A Bayesian linear regression model that assumes a Gaussian noise distribution.

NormalInverseGammaRegressor(*[, mu, lam, a, ...])

Bayesian linear regression with unknown variance.
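The learning_rate/decay mechanism described above can be sketched for a simple intercept-only Normal model with known noise variance. This is a hypothetical illustration of the idea (class and attribute names invented here), not the library's NormalRegressor:

```python
import numpy as np

class RecursiveNormalMean:
    """Hypothetical sketch of recursive estimation via decay.

    Posterior over an intercept-only mean is N(mu, sigma2); decay
    divides the posterior precision by learning_rate, inflating the
    variance so old evidence is gradually forgotten.
    """

    def __init__(self, mu=0.0, sigma2=1.0, noise=1.0, learning_rate=0.9):
        self.mu_ = mu
        self.sigma2_ = sigma2
        self.noise_ = noise
        self.learning_rate_ = learning_rate

    def partial_fit(self, y):
        # Standard conjugate Normal update, one observation at a time.
        for obs in np.atleast_1d(y):
            precision = 1.0 / self.sigma2_ + 1.0 / self.noise_
            self.mu_ = (self.mu_ / self.sigma2_ + obs / self.noise_) / precision
            self.sigma2_ = 1.0 / precision
        return self

    def decay(self):
        # Inflate posterior variance: the "forgetting" step of a
        # simple state-space model, as used for restless bandits.
        self.sigma2_ /= self.learning_rate_
        return self
```

With learning_rate equal to 1, decay is a no-op and the model reduces to ordinary conjugate updating; below 1, uncertainty about stale arms grows over time, prompting the policy to revisit them.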


Exceptions

These are custom exceptions raised by the bandit classes.


DelayedRewardException

Exception raised when the user does not handle delayed reward bandits correctly.


DelayedRewardWarning

Warning raised when the user does not handle delayed reward bandits correctly.