bayesianbandits.Agent

class bayesianbandits.Agent(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None)

Bases: Generic[TokenType]

Agent for a non-contextual (classic) multi-armed bandit problem.

Implements the \(K\)-armed bandit where the agent selects an arm \(a_t\) without observing any side information. Internally this is a thin wrapper around ContextualAgent with the context fixed to a single intercept column \(x = [1]\):

\[a_t = \pi\bigl(\{p(\theta_a \mid \mathcal{D}_a)\}_{a=1}^{K} \bigr), \qquad \mathcal{D}_{a_t} \leftarrow \mathcal{D}_{a_t} \cup \{r_t\}\]

All pull, update, and decay calls automatically synthesize the intercept context, so the caller never needs to provide a feature matrix.
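
The select-then-update loop in the equation above can be sketched without the library. The following is a minimal illustration, assuming Bernoulli rewards with Beta posteriors and Thompson sampling as the policy \(\pi\); the helper names `thompson_step` and `run` are hypothetical and are not part of bayesianbandits.

```python
import random


def thompson_step(posteriors, rng):
    """Draw one sample per arm from its Beta posterior and return the argmax index."""
    draws = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(draws)), key=draws.__getitem__)


def run(true_means, n_rounds, seed=0):
    """Simulate a K-armed Bernoulli bandit (illustrative only, not library code)."""
    rng = random.Random(seed)
    # One [alpha, beta] pair per arm, starting from a uniform Beta(1, 1) prior.
    posteriors = [[1.0, 1.0] for _ in true_means]
    pulls = [0] * len(true_means)
    for _ in range(n_rounds):
        a = thompson_step(posteriors, rng)          # a_t = pi({p(theta_a | D_a)})
        r = 1 if rng.random() < true_means[a] else 0
        posteriors[a][0] += r                       # D_a <- D_a U {r_t}: successes
        posteriors[a][1] += 1 - r                   # ... and failures
        pulls[a] += 1
    return posteriors, pulls


posteriors, pulls = run([0.2, 0.8], n_rounds=500)
print(pulls)
```

Only the pulled arm's posterior changes in each round, which mirrors the update \(\mathcal{D}_{a_t} \leftarrow \mathcal{D}_{a_t} \cup \{r_t\}\).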

Parameters:

arms (Sequence[Arm[Any, TokenType]]) – Arms to choose from. Each arm must carry a fitted or unfitted learner and a unique action token that identifies it.
policy (PolicyProtocol[Any, TokenType]) – Policy object that implements arm selection given posteriors. Built-in options: ThompsonSampling, UpperConfidenceBound, EpsilonGreedy.
random_seed (int, np.random.Generator, or None, default None) – Controls the random number generator shared by the policy and all learners. Pass an int for reproducible results across calls.

See also

ContextualAgent: Agent that conditions decisions on a feature matrix.
LipschitzContextualAgent: Shared-learner agent with configurable design matrix; generalizes both Agent and ContextualAgent.

Notes

Because the context is always a single intercept, every learner reduces to an intercept-only model. For example, NormalRegressor becomes a simple Bayesian estimate of the mean reward, and DirichletClassifier maintains a posterior over class probabilities. See [1] for an empirical comparison of policies in this setting.
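
To see why an intercept-only model amounts to a mean estimate, consider a standalone sketch of Bayesian linear regression with a single all-ones column. It assumes a zero-mean Gaussian prior with precision `alpha` and a known noise precision `beta`; these names and defaults are illustrative and may not match the parameterization NormalRegressor actually uses.

```python
def intercept_only_posterior(rewards, alpha=1.0, beta=1.0):
    """Posterior mean and precision of the single weight when x = [1] for every row.

    Assumes prior w ~ N(0, 1/alpha) and Gaussian noise with known precision beta.
    """
    x = [1.0] * len(rewards)                           # the synthesized intercept column
    xtx = sum(xi * xi for xi in x)                     # X^T X collapses to n
    xty = sum(xi * yi for xi, yi in zip(x, rewards))   # X^T y collapses to sum(rewards)
    precision = alpha + beta * xtx                     # posterior precision
    mean = beta * xty / precision                      # posterior mean
    return mean, precision


rewards = [1.0, 2.0, 3.0, 6.0]
mean, precision = intercept_only_posterior(rewards)
print(mean, precision)  # 2.4 5.0
```

Under these assumptions the posterior mean is the sample mean (3.0 here) shrunk toward the prior mean of zero, and the shrinkage fades as more rewards arrive.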

References

Examples

Create a non-contextual agent and pull:

>>> from bayesianbandits import Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import Agent, ThompsonSampling
>>> arms = [
...     Arm(0, learner=NormalInverseGammaRegressor()),
...     Arm(1, learner=NormalInverseGammaRegressor()),
... ]
>>> agent = Agent(arms, ThompsonSampling(), random_seed=0)
>>> agent.pull()
[1]

No context matrix is needed. The update and decay methods similarly take only a reward vector:

>>> import numpy as np
>>> y = np.array([100.0])
>>> agent.update(y)
>>> agent.select_for_update(0).update(y)

__init__(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None) → None

add_arm(arm: Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]) → None

Add an arm to the bandit.

Parameters: arm (Arm) – Arm to add to the bandit.
Raises: ValueError – If the arm’s action token is already in the bandit.

arm(token: TokenType) → Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]

Get an arm by its action token.

Parameters: token (TokenType) – Action token of the arm to get.
Returns: Arm with the given action token.
Return type: Arm[NDArray[np.float64], TokenType]
Raises: KeyError – If no arm with the given action token is in the bandit.

property arm_to_update: Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]

property arms: List[Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]]

decay(decay_rate: float | None = None) → None

Decay all arms of the bandit.

Parameters: decay_rate (Optional[float], default None) – Decay rate to apply to every arm. If None, each arm’s learner uses its own decay rate.

property policy: PolicyProtocol[ndarray[tuple[int, ...], dtype[float64]], TokenType]

pull() → List[TokenType]

pull(*, top_k: int) → List[List[TokenType]]

Choose arm(s) and pull.

Parameters: top_k (int, optional) – Number of arms to select. If None (default), selects the single best arm. If specified, selects the top k arms.
Returns: If top_k is None, a list containing a single action token, [token]. If top_k is an int, a list containing one list of k action tokens, [[token1, token2, …]].
Return type: List[TokenType] or List[List[TokenType]]

Notes

When top_k is None, arm_to_update is set to the selected arm. When top_k is specified, arm_to_update is not changed; you must explicitly call select_for_update() before update() to specify which arm’s feedback you’re providing.
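
The contract described in this note can be illustrated with a toy stand-in. `ToyAgent` below is hypothetical: it is not the library's implementation, it ranks arms greedily by running mean rather than through a policy object, and its initial `arm_to_update` (the first arm) is an assumption. It only mimics how `pull`, `select_for_update`, and `update` interact with `arm_to_update`.

```python
class ToyAgent:
    """Hypothetical stand-in that mirrors the documented arm_to_update contract."""

    def __init__(self, tokens):
        # token -> [reward sum, observation count]
        self._stats = {t: [0.0, 0] for t in tokens}
        self.arm_to_update = next(iter(self._stats))  # assumed initial value

    def _ranked(self):
        def mean(token):
            total, count = self._stats[token]
            return total / count if count else 0.0

        # Stable sort: ties keep their insertion order.
        return sorted(self._stats, key=mean, reverse=True)

    def pull(self, *, top_k=None):
        ranked = self._ranked()
        if top_k is None:
            self.arm_to_update = ranked[0]   # single pull records the chosen arm
            return [ranked[0]]
        return [ranked[:top_k]]              # top-k pull leaves arm_to_update alone

    def select_for_update(self, token):
        if token not in self._stats:
            raise KeyError(token)
        self.arm_to_update = token
        return self                          # returned for chaining

    def update(self, reward):
        stats = self._stats[self.arm_to_update]
        stats[0] += reward
        stats[1] += 1


agent = ToyAgent(["a", "b", "c"])
first = agent.pull()                          # sets arm_to_update to "a"
agent.update(1.0)                             # feedback goes to "a"
(ranked,) = agent.pull(top_k=2)               # arm_to_update is still "a"
target_after_top_k = agent.arm_to_update
agent.select_for_update("b").update(5.0)      # route feedback to "b" explicitly
final = agent.pull()
```

After a top-k pull, any call to `update` without a preceding `select_for_update` would credit the reward to whichever arm was selected last, which is why the explicit selection step matters.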

remove_arm(token: Any) → None

Remove an arm from the bandit.

Parameters: token (Any) – Action token of the arm to remove.
Raises: KeyError – If no arm with the given action token is in the bandit.

property rng: Generator

select_for_update(token: TokenType) → Self

Set the arm_to_update and return self for chaining.

Parameters: token (TokenType) – Action token of the arm to update.
Returns: Self for chaining.
Return type: Self
Raises: KeyError – If no arm with the given action token is in the bandit.

update(y: ndarray[tuple[int, ...], dtype[float64]], sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) → None

Update the arm_to_update with an observed reward.

Parameters:

y (NDArray[np.float64]) – Reward(s) to use for updating the arm.
sample_weight (Optional[NDArray[np.float64]], default None) – Sample weights to use for updating the arm. If None, all samples are weighted equally.
