bayesianbandits.Agent

class bayesianbandits.Agent(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None)

Bases: Generic[TokenType]

Agent for a non-contextual (classic) multi-armed bandit problem.

Implements the \(K\)-armed bandit where the agent selects an arm \(a_t\) without observing any side information. Internally this is a thin wrapper around ContextualAgent with the context fixed to a single intercept column \(x = [1]\):

\[a_t = \pi\bigl(\{p(\theta_a \mid \mathcal{D}_a)\}_{a=1}^{K} \bigr), \qquad \mathcal{D}_{a_t} \leftarrow \mathcal{D}_{a_t} \cup \{r_t\}\]

All pull, update, and decay calls automatically synthesize the intercept context, so the caller never needs to provide a feature matrix.
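The selection-and-update rule above can be sketched without the library. The following is a minimal illustration, not the Agent implementation: it assumes Bernoulli rewards with Beta posteriors and uses Thompson sampling as the policy \(\pi\). The names thompson_pull and run are hypothetical.

```python
import random


def thompson_pull(posteriors, rng):
    """Draw one sample per arm and return the token with the largest draw.

    posteriors maps an action token to its [alpha, beta] Beta parameters.
    """
    draws = {tok: rng.betavariate(a, b) for tok, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)


def run(true_rates, n_rounds, seed=0):
    """Simulate n_rounds of a K-armed Bernoulli bandit (toy environment)."""
    rng = random.Random(seed)
    # Beta(1, 1) prior for every arm: each D_a starts empty.
    posteriors = {tok: [1.0, 1.0] for tok in true_rates}
    pulls = {tok: 0 for tok in true_rates}
    for _ in range(n_rounds):
        a_t = thompson_pull(posteriors, rng)   # a_t = pi({p(theta_a | D_a)})
        r_t = 1 if rng.random() < true_rates[a_t] else 0
        posteriors[a_t][0] += r_t              # D_{a_t} <- D_{a_t} U {r_t}
        posteriors[a_t][1] += 1 - r_t
        pulls[a_t] += 1
    return posteriors, pulls


posteriors, pulls = run({0: 0.2, 1: 0.8}, n_rounds=500)
```

No feature matrix appears anywhere in the loop: only the pulled arm's posterior changes, which is the behavior the intercept-only context reproduces.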

Parameters:
  • arms (Sequence[Arm[Any, TokenType]]) – Arms to choose from. Each arm must carry a fitted or unfitted learner and a unique action token that identifies it.

  • policy (PolicyProtocol[Any, TokenType]) – Policy object that implements arm selection given posteriors. Built-in options: ThompsonSampling, UpperConfidenceBound, EpsilonGreedy.

  • random_seed (int, np.random.Generator, or None, default None) – Controls the random number generator shared by the policy and all learners. Pass an int for reproducible results across runs.

See also

ContextualAgent

Agent that conditions decisions on a feature matrix.

LipschitzContextualAgent

Shared-learner agent with configurable design matrix; generalizes both Agent and ContextualAgent.

Notes

Because the context is always a single intercept, every learner reduces to an intercept-only model. For example, NormalRegressor becomes a simple Bayesian estimate of the mean reward, and DirichletClassifier maintains a posterior over class probabilities. See [1] for an empirical comparison of policies in this setting.
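To see why an intercept-only model reduces to a Bayesian mean estimate, consider a conjugate Bayesian linear regression with a zero-mean Gaussian prior. The sketch below is a hand-written illustration, not the NormalRegressor code; the function name and the parameters alpha (prior precision) and beta (noise precision) are assumptions made for this example.

```python
def intercept_only_posterior(rewards, alpha=1.0, beta=1.0):
    """Posterior over a single intercept weight when every row of X is [1].

    General form: precision = alpha * I + beta * X^T X
                  mean      = beta * precision^{-1} X^T y
    With x = [1], X^T X is the number of rewards and X^T y is their sum.
    """
    xtx = sum(1.0 * 1.0 for _ in rewards)   # equals n
    xty = sum(1.0 * r for r in rewards)     # equals sum of rewards
    precision = alpha + beta * xtx
    mean = beta * xty / precision
    return mean, precision


mean, precision = intercept_only_posterior([1.0, 2.0, 3.0])
# The posterior mean is the sample mean (2.0) shrunk toward the prior mean 0.
```

Under these assumptions the posterior mean equals n / (n + alpha / beta) times the sample mean, so the estimate approaches the plain average reward as more rewards arrive.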

References

Examples

Create a non-contextual agent and pull:

>>> from bayesianbandits import Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import Agent, ThompsonSampling
>>> arms = [
...     Arm(0, learner=NormalInverseGammaRegressor()),
...     Arm(1, learner=NormalInverseGammaRegressor()),
... ]
>>> agent = Agent(arms, ThompsonSampling(), random_seed=0)
>>> agent.pull()
[1]

No context matrix is needed. The update method takes only a reward vector, and decay takes only an optional decay rate:

>>> import numpy as np
>>> y = np.array([100.0])
>>> agent.update(y)
>>> agent.select_for_update(0).update(y)
__init__(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None) → None
add_arm(arm: Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]) → None

Add an arm to the bandit.

Parameters:

arm (Arm) – Arm to add to the bandit.

Raises:

ValueError – If the arm’s action token is already in the bandit.

arm(token: TokenType) → Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]

Get an arm by its action token.

Parameters:

token (TokenType) – Action token of the arm to get.

Returns:

Arm with the given action token.

Return type:

Arm[NDArray[np.float64], TokenType]

Raises:

KeyError – If no arm with the given action token is in the bandit.

property arm_to_update: Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]
property arms: List[Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]]
decay(decay_rate: float | None = None) → None

Decay all arms of the bandit.

Parameters:

decay_rate (Optional[float], default None) – Decay rate applied to every arm. If None, each arm is decayed with the decay rate of its own learner.
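The fallback rule for decay_rate can be illustrated with a toy model. This sketch assumes that decaying means scaling a learner's posterior precision by a rate in (0, 1], which is one common way to discount old data; whether a given learner does exactly this is an assumption. The dictionary keys precision and learning_rate are hypothetical.

```python
def decay_all(arms, decay_rate=None):
    """Scale every arm's posterior precision in place.

    arms maps a token to a dict with 'precision' and 'learning_rate' keys.
    An explicit decay_rate overrides each arm's own learning_rate.
    """
    for arm in arms.values():
        rate = arm["learning_rate"] if decay_rate is None else decay_rate
        arm["precision"] *= rate


arms = {
    0: {"precision": 10.0, "learning_rate": 0.9},
    1: {"precision": 4.0, "learning_rate": 0.5},
}
decay_all(arms)                  # each arm uses its own rate
after_default = {tok: arm["precision"] for tok, arm in arms.items()}
decay_all(arms, decay_rate=0.5)  # one rate for every arm
after_override = {tok: arm["precision"] for tok, arm in arms.items()}
```

Lower precision means a wider posterior, so under this model a decayed arm is explored more readily on later pulls.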

property policy: PolicyProtocol[ndarray[tuple[int, ...], dtype[float64]], TokenType]
pull() → List[TokenType]
pull(*, top_k: int) → List[List[TokenType]]

Choose arm(s) and pull.

Parameters:

top_k (int, optional) – Number of arms to select. If None (default), selects the single best arm. If specified, selects the top k arms.

Returns:

If top_k is None: a list containing a single action token, [token].
If top_k is an int: a list containing one list of k action tokens, [[token1, token2, …]].

Return type:

List[TokenType] or List[List[TokenType]]

Notes

When top_k is None, arm_to_update is set to the selected arm. When top_k is specified, arm_to_update is NOT changed, so you must explicitly call select_for_update() before update() to specify which arm’s feedback you’re providing.
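The contract between pull, select_for_update, and update can be sketched with a stand-in class. ToyAgent below is hypothetical and is not the library's Agent: it assumes Beta posteriors with Thompson-style ranking, and it starts with arm_to_update set to None purely for illustration.

```python
import random


class ToyAgent:
    """Stand-in that mimics the pull / select_for_update contract."""

    def __init__(self, posteriors, seed=0):
        self.posteriors = posteriors   # token -> [alpha, beta]
        self.rng = random.Random(seed)
        self.arm_to_update = None

    def _ranked(self):
        # One posterior draw per arm, tokens sorted from best to worst draw.
        draws = {t: self.rng.betavariate(a, b) for t, (a, b) in self.posteriors.items()}
        return sorted(draws, key=draws.get, reverse=True)

    def pull(self, *, top_k=None):
        ranked = self._ranked()
        if top_k is None:
            self.arm_to_update = ranked[0]   # single pull records the arm
            return [ranked[0]]
        return [ranked[:top_k]]              # top-k pull leaves arm_to_update alone

    def select_for_update(self, token):
        if token not in self.posteriors:
            raise KeyError(token)
        self.arm_to_update = token
        return self                          # enables chaining

    def update(self, reward):
        params = self.posteriors[self.arm_to_update]
        params[0] += reward
        params[1] += 1 - reward


agent = ToyAgent({"a": [1, 1], "b": [1, 1], "c": [1, 1]})
single = agent.pull()              # e.g. ["b"]; sets arm_to_update
after_single = agent.arm_to_update
shown = agent.pull(top_k=2)        # e.g. [["c", "a"]]; arm_to_update unchanged
after_top_k = agent.arm_to_update
chosen = shown[0][1]               # suppose feedback arrives for the second arm shown
agent.select_for_update(chosen).update(1)
```

After a top-k pull, calling update without select_for_update would credit the reward to whichever arm was recorded earlier, which is why the explicit selection step matters.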

remove_arm(token: Any) → None

Remove an arm from the bandit.

Parameters:

token (Any) – Action token of the arm to remove.

Raises:

KeyError – If no arm with the given action token is in the bandit.
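The error behavior documented for add_arm, arm, and remove_arm matches what a token-keyed dictionary gives naturally. The ArmRegistry class below is a hypothetical sketch of that contract, not the library's internal storage.

```python
class ArmRegistry:
    """Toy token -> learner store with the documented error contract."""

    def __init__(self):
        self._arms = {}

    def add_arm(self, token, learner):
        if token in self._arms:
            raise ValueError(f"arm {token!r} is already in the bandit")
        self._arms[token] = learner

    def arm(self, token):
        return self._arms[token]   # raises KeyError for an unknown token

    def remove_arm(self, token):
        del self._arms[token]      # raises KeyError for an unknown token


registry = ArmRegistry()
registry.add_arm("control", learner="learner-0")
registry.add_arm("variant", learner="learner-1")
registry.remove_arm("control")
remaining = list(registry._arms)
```

Callers that add or retire arms at runtime can catch ValueError and KeyError respectively to keep such operations idempotent.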

property rng: Generator
select_for_update(token: TokenType) → Self

Set the arm_to_update and return self for chaining.

Parameters:

token (TokenType) – Action token of the arm to update.

Returns:

Self for chaining.

Return type:

Self

Raises:

KeyError – If no arm with the given action token is in the bandit.

update(y: ndarray[tuple[int, ...], dtype[float64]], sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) → None

Update the arm_to_update with an observed reward.

Parameters:
  • y (NDArray[np.float64]) – Reward(s) to use for updating the arm.

  • sample_weight (Optional[NDArray[np.float64]], default None) – Sample weights to use for updating the arm. If None, all samples are weighted equally.
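One way to read sample_weight is as a multiplier on each reward's contribution to the posterior. The sketch below assumes a Gaussian mean model with known noise precision and is not taken from any learner in the library; the function name and the beta parameter are hypothetical.

```python
def weighted_update(mean, precision, y, sample_weight=None, beta=1.0):
    """Conjugate update of a Gaussian mean with per-sample weights.

    Each reward r with weight w adds w * beta to the precision and
    w * beta * r to the precision-weighted sum.
    """
    if sample_weight is None:
        sample_weight = [1.0] * len(y)   # all samples weighted equally
    new_precision = precision + beta * sum(sample_weight)
    weighted_sum = beta * sum(w * r for w, r in zip(sample_weight, y))
    new_mean = (precision * mean + weighted_sum) / new_precision
    return new_mean, new_precision


# A single reward with weight 3 ...
once = weighted_update(0.0, 1.0, [2.0], sample_weight=[3.0])
# ... matches the same reward observed three times with equal weights.
thrice = weighted_update(0.0, 1.0, [2.0, 2.0, 2.0])
```

Under this model an integer weight behaves like repeating the observation that many times, and fractional weights down-weight less trusted rewards.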