bayesianbandits.Agent
- class bayesianbandits.Agent(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None)
Bases: Generic[TokenType]
Agent for a non-contextual (classic) multi-armed bandit problem.
Implements the \(K\)-armed bandit where the agent selects an arm \(a_t\) without observing any side information. Internally this is a thin wrapper around ContextualAgent with the context fixed to a single intercept column \(x = [1]\):
\[a_t = \pi\bigl(\{p(\theta_a \mid \mathcal{D}_a)\}_{a=1}^{K} \bigr), \qquad \mathcal{D}_{a_t} \leftarrow \mathcal{D}_{a_t} \cup \{r_t\}\]
All pull, update, and decay calls automatically synthesize the intercept context, so the caller never needs to provide a feature matrix.
- Parameters:
arms (Sequence[Arm[Any, TokenType]]) – Arms to choose from. Each arm must carry a fitted or unfitted learner and a unique action token that identifies it.
policy (PolicyProtocol[Any, TokenType]) – Policy object that implements arm selection given posteriors. Built-in options: ThompsonSampling, UpperConfidenceBound, EpsilonGreedy.
random_seed (int, np.random.Generator, or None, default None) – Controls the random number generator shared by the policy and all learners. Pass an int for reproducible results across calls.
See also
ContextualAgent – Agent that conditions decisions on a feature matrix.
LipschitzContextualAgent – Shared-learner agent with configurable design matrix; generalizes both Agent and ContextualAgent.
Notes
Because the context is always a single intercept, every learner reduces to an intercept-only model. For example, NormalRegressor becomes a simple Bayesian estimate of the mean reward, and DirichletClassifier maintains a posterior over class probabilities. See [1] for an empirical comparison of policies in this setting.
References
Examples
Create a non-contextual agent and pull:
>>> from bayesianbandits import Arm, NormalInverseGammaRegressor
>>> from bayesianbandits import Agent, ThompsonSampling
>>> arms = [
...     Arm(0, learner=NormalInverseGammaRegressor()),
...     Arm(1, learner=NormalInverseGammaRegressor()),
... ]
>>> agent = Agent(arms, ThompsonSampling(), random_seed=0)
>>> agent.pull()
[1]
No context matrix is needed. Likewise, update takes only a reward vector, and decay takes no context argument:
>>> import numpy as np
>>> y = np.array([100.0])
>>> agent.update(y)
>>> agent.select_for_update(0).update(y)
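The selection rule in the class description can be illustrated without the library. The following is a minimal standard-library sketch, assuming a Thompson-sampling policy and a Normal posterior over each arm's mean reward with known noise variance. The names MeanPosterior, thompson_select, and true_means are hypothetical and are not part of bayesianbandits; the library's own learners and policies may work differently.

```python
import random
from dataclasses import dataclass


@dataclass
class MeanPosterior:
    """Normal posterior over one arm's mean reward (noise variance assumed known)."""

    mean: float = 0.0
    precision: float = 1.0  # inverse variance of the posterior

    def sample(self, rng: random.Random) -> float:
        # Draw one plausible value of the arm's mean reward.
        return rng.gauss(self.mean, self.precision ** -0.5)

    def update(self, reward: float, noise_precision: float = 1.0) -> None:
        # Conjugate Normal update: precisions add, means are precision-weighted.
        new_precision = self.precision + noise_precision
        self.mean = (self.precision * self.mean + noise_precision * reward) / new_precision
        self.precision = new_precision


def thompson_select(posteriors: dict, rng: random.Random):
    """Draw one sample per arm and return the token with the largest draw."""
    draws = {token: post.sample(rng) for token, post in posteriors.items()}
    return max(draws, key=draws.get)


rng = random.Random(0)
posteriors = {0: MeanPosterior(), 1: MeanPosterior()}
true_means = {0: 0.0, 1: 1.0}  # made-up environment, for the demo only

for _ in range(200):
    token = thompson_select(posteriors, rng)    # a_t = pi({p(theta_a | D_a)})
    reward = rng.gauss(true_means[token], 1.0)  # observe r_t
    posteriors[token].update(reward)            # D_{a_t} <- D_{a_t} U {r_t}
```

Only the pulled arm's posterior changes on each round, which mirrors the update rule in the formula: the reward is appended to the data of the selected arm and no other.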
- __init__(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], random_seed: int | None | Generator = None) None
- add_arm(arm: Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]) None
Add an arm to the bandit.
- Parameters:
arm (Arm) – Arm to add to the bandit.
- Raises:
ValueError – If the arm’s action token is already in the bandit.
- arm(token: TokenType) Arm[ndarray[tuple[int, ...], dtype[float64]], TokenType]
Get an arm by its action token.
- Parameters:
token (TokenType) – Action token of the arm to get.
- Returns:
Arm with the given action token.
- Return type:
Arm[NDArray[np.float64], TokenType]
- Raises:
KeyError – If the arm’s action token is not in the bandit.
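The error contract of add_arm and arm (ValueError for a duplicate token, KeyError for an unknown one) resembles a dictionary keyed by action token. The sketch below assumes exactly that; ArmRegistry is a hypothetical stand-in written for illustration and does not show how the library stores its arms.

```python
class ArmRegistry:
    """Hypothetical stand-in for token-keyed arm bookkeeping."""

    def __init__(self):
        self._arms = {}

    def add_arm(self, token, learner):
        # Tokens must be unique, so a duplicate is rejected up front.
        if token in self._arms:
            raise ValueError(f"An arm with token {token!r} already exists.")
        self._arms[token] = learner

    def arm(self, token):
        # A plain dict lookup raises KeyError for an unknown token.
        return self._arms[token]


registry = ArmRegistry()
registry.add_arm("control", object())

try:
    registry.add_arm("control", object())
except ValueError as exc:
    duplicate_message = str(exc)

try:
    registry.arm("missing")
except KeyError:
    missing_found = False
```

Callers that add arms dynamically can catch ValueError to skip tokens that already exist, and catch KeyError when looking up a token that may have been removed.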
- decay(decay_rate: float | None = None) None
Decay all arms of the bandit.
- Parameters:
decay_rate (Optional[float], default None) – Decay rate to apply to every arm. If None, each arm’s learner uses its own decay rate.
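As an illustration of what decaying a posterior can mean, the sketch below applies one common form of exponential forgetting to Beta pseudo-counts: both counts are multiplied by the decay rate, which keeps the posterior mean and widens the posterior. Whether the library's learners decay in this way is an assumption; decay_counts is a hypothetical helper, not a bayesianbandits function.

```python
def decay_counts(alpha: float, beta: float, decay_rate: float) -> tuple:
    """Scale Beta pseudo-counts so that older evidence carries less weight."""
    if not 0.0 < decay_rate <= 1.0:
        raise ValueError("decay_rate must be in (0, 1]")
    return alpha * decay_rate, beta * decay_rate


alpha, beta = 80.0, 20.0                      # 100 pseudo-observations, mean 0.8
mean_before = alpha / (alpha + beta)

alpha, beta = decay_counts(alpha, beta, 0.5)  # forget half of the evidence
mean_after = alpha / (alpha + beta)
effective_sample_size = alpha + beta
```

The point estimate is unchanged while the effective sample size halves, so new rewards move the posterior faster after a decay step. This is the usual motivation for decay in non-stationary problems.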
- property policy: PolicyProtocol[ndarray[tuple[int, ...], dtype[float64]], TokenType]
- pull() List[TokenType]
- pull(*, top_k: int) List[List[TokenType]]
Choose arm(s) and pull.
- Parameters:
top_k (int, optional) – Number of arms to select. If None (default), selects the single best arm. If specified, selects the top k arms.
- Returns:
If top_k is None: List containing a single action token [token]
If top_k is int: List containing a list of k action tokens [[token1, token2, …]]
- Return type:
List[TokenType] or List[List[TokenType]]
Notes
When top_k is None, arm_to_update is set to the selected arm. When top_k is specified, arm_to_update is NOT updated; you must explicitly call select_for_update() before update() to specify which arm’s feedback you’re providing.
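The interaction between top_k and arm_to_update described in the Notes can be seen in a toy model. ToyAgent below is a hypothetical miniature written for illustration: it ranks arms by fixed scores instead of a real policy and mimics only the documented return shapes and the arm_to_update bookkeeping. It is not the library's implementation.

```python
class ToyAgent:
    """Hypothetical miniature of the pull / select_for_update / update flow."""

    def __init__(self, scores):
        self.scores = dict(scores)                     # token -> fixed score
        self.rewards = {token: [] for token in self.scores}
        self.arm_to_update = None

    def pull(self, *, top_k=None):
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        if top_k is None:
            self.arm_to_update = ranked[0]             # single pull records the choice
            return [ranked[0]]
        return [ranked[:top_k]]                        # top-k pull leaves arm_to_update alone

    def select_for_update(self, token):
        if token not in self.scores:
            raise KeyError(token)
        self.arm_to_update = token
        return self                                    # returning self enables chaining

    def update(self, reward):
        self.rewards[self.arm_to_update].append(reward)


agent = ToyAgent({"a": 0.2, "b": 0.9, "c": 0.5})

(shown,) = agent.pull(top_k=2)                         # one inner list of two tokens
clicked = shown[1]                                     # suppose feedback arrives for the second arm
agent.select_for_update(clicked).update(1.0)           # name the arm, then record its reward
```

After a top-k pull, the caller knows which of the returned arms produced the feedback and the agent does not, which is why the explicit select_for_update call is required before update.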
- remove_arm(token: Any) None
Remove an arm from the bandit.
- Parameters:
token (Any) – Action token of the arm to remove.
- Raises:
KeyError – If the arm’s action token is not in the bandit.
- property rng: Generator
- select_for_update(token: TokenType) Self
Set the arm_to_update and return self for chaining.
- Parameters:
token (Any) – Action token of the arm to update.
- Returns:
Self for chaining.
- Return type:
Self
- Raises:
KeyError – If the arm’s action token is not in the bandit.
- update(y: ndarray[tuple[int, ...], dtype[float64]], sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) None
Update the arm_to_update with an observed reward.
- Parameters:
y (NDArray[np.float64]) – Reward(s) to use for updating the arm.
sample_weight (Optional[NDArray[np.float64]], default None) – Sample weights to use for updating the arm. If None, all samples are weighted equally.
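To make the role of sample_weight concrete, the sketch below works through a weighted conjugate update for an intercept-only Normal model with known noise precision. It treats each weight as a multiplicity on the corresponding reward's likelihood, which is one conventional reading of sample weights and an assumption here, not a statement about the library's learners. weighted_mean_update is a hypothetical helper.

```python
def weighted_mean_update(prior_mean, prior_precision, rewards,
                         sample_weight=None, noise_precision=1.0):
    """Return the posterior (mean, precision) after a weighted batch of rewards."""
    if sample_weight is None:
        sample_weight = [1.0] * len(rewards)           # equal weights by default
    if len(sample_weight) != len(rewards):
        raise ValueError("sample_weight and rewards must have the same length")

    total_weight = sum(sample_weight)
    weighted_sum = sum(w * r for w, r in zip(sample_weight, rewards))

    # Each unit of weight contributes one observation's worth of precision.
    post_precision = prior_precision + noise_precision * total_weight
    post_mean = (prior_precision * prior_mean
                 + noise_precision * weighted_sum) / post_precision
    return post_mean, post_precision


# A reward given weight 2 has the same effect as seeing that reward twice.
once_weighted = weighted_mean_update(0.0, 1.0, [2.0], sample_weight=[2.0])
twice_unweighted = weighted_mean_update(0.0, 1.0, [2.0, 2.0])
```

Under this reading, a weight of zero removes a reward from the update entirely, and fractional weights down-weight rewards that are considered less reliable.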