bayesianbandits.LipschitzContextualAgent

class bayesianbandits.LipschitzContextualAgent(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], arm_featurizer: ArmFeaturizer[TokenType], learner: Learner[Any], batch_reward_function: Callable[[ndarray[tuple[int, ...], dtype[float64]], List[Any]], ndarray[tuple[int, ...], dtype[float64]]] | Callable[[ndarray[tuple[int, ...], dtype[float64]], List[Any], Sized], ndarray[tuple[int, ...], dtype[float64]]] | None = None, random_seed: int | None | Generator = None)

Bases: Generic[TokenType]

Contextual agent with a shared learner and a configurable design matrix.

This is the most general agent in the library. The design matrix, constructed by the arm_featurizer, encodes your assumptions about how arms relate to each other and to the context. By choosing the right design matrix structure, you can express a spectrum of models:

  • One-hot arms only (no context features): recovers Agent (non-contextual bandits).

  • One-hot arms interacted with context (block-diagonal design matrix): recovers ContextualAgent (disjoint bandits, independent parameters per arm).

  • Shared features + arm-specific intercepts: hybrid bandits with cross-arm learning and Bayesian shrinkage toward shared structure (see the hybrid-bandits tutorial).

  • Continuous arm features: Lipschitz-style bandits (the namesake), where nearby arms in feature space share information.
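The first three structures can be sketched with plain NumPy. This is an illustrative construction only, not the library's featurizer; the helper names (one_hot, block_diagonal_row, hybrid_row) are hypothetical and exist only for this sketch.

```python
import numpy as np

n_arms = 3
x = np.array([0.5, 2.0])  # one context with two features


def one_hot(arm: int) -> np.ndarray:
    """Arm indicator only: one parameter (an intercept) per arm."""
    row = np.zeros(n_arms)
    row[arm] = 1.0
    return row


def block_diagonal_row(x: np.ndarray, arm: int) -> np.ndarray:
    """One-hot arm interacted with context: a separate coefficient block per arm."""
    return np.kron(one_hot(arm), x)


def hybrid_row(x: np.ndarray, arm: int) -> np.ndarray:
    """Shared context columns followed by an arm-specific intercept."""
    return np.concatenate([x, one_hot(arm)])


# Rows for arm 1 under each structure (hypothetical helpers, see above).
non_contextual = one_hot(1)            # values: 0, 1, 0
disjoint = block_diagonal_row(x, 1)    # values: 0, 0, 0.5, 2, 0, 0
hybrid = hybrid_row(x, 1)              # values: 0.5, 2, 0, 1, 0
```

In the disjoint row the context only touches arm 1's coefficient block, whereas in the hybrid row the first two columns are shared by every arm.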

Formally, the agent uses a single shared learner that conditions on both context and arm features:

\[\tilde{x}_{a} = \phi(x, a), \qquad r \mid \tilde{x}_{a} \sim p(r \mid \theta, \tilde{x}_{a})\]

where \(\phi\) is the arm featurizer that constructs the design matrix from context \(x\) and arm identity \(a\), and \(\theta\) is the shared parameter vector. At each round, posterior samples for all arms are drawn in a single vectorized call:

\[a^* = \pi\bigl( \{\tilde{\theta} \sim p(\theta \mid \mathcal{D})\}, \; \{\phi(x, a)\}_{a=1}^{K}\bigr)\]
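As a rough illustration of the selection rule above, the sketch below assumes a linear-Gaussian model and a Thompson-sampling policy. The posterior mean, covariance, and design rows are made-up numbers, and none of this is the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_features = 4, 3
# Hypothetical Gaussian posterior over the shared parameter vector theta.
posterior_mean = np.array([0.2, -0.1, 0.5])
posterior_cov = 0.05 * np.eye(n_features)

# Rows are phi(x, a) for a single context x and each arm a = 1..K.
design = rng.normal(size=(n_arms, n_features))

# Draw one theta from the posterior, then score every arm with the same draw.
theta_tilde = rng.multivariate_normal(posterior_mean, posterior_cov)
sampled_rewards = design @ theta_tilde  # shape (n_arms,)

# A Thompson-sampling policy picks the arm with the highest sampled reward.
a_star = int(np.argmax(sampled_rewards))
```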
Parameters:
  • arms (Sequence[Arm[Any, TokenType]]) – Arms to choose from. Arms may have learner=None; the shared learner is set on every arm during initialization.

  • policy (PolicyProtocol[Any, TokenType]) – Policy object for arm selection. All built-in policies (ThompsonSampling, UpperConfidenceBound, EpsilonGreedy) are compatible.

  • arm_featurizer (ArmFeaturizer[TokenType]) – Featurizer that constructs the design matrix from (context, action_tokens) in a single vectorized call. The structure of this matrix encodes assumptions about how arms relate to each other and to the context – see Notes.

  • learner (Learner) – Shared learner instance that will be set on all arms. Because all arms share this object, updates to any arm improve predictions for every arm.

  • batch_reward_function (BatchRewardFunction or ContextAwareBatchRewardFunction or None, default None) –

    Optional function that processes rewards for all arms at once.

    Traditional signature:

    def batch_reward(samples, action_tokens):
        # samples: shape (n_arms, n_contexts, size, ...)
        # action_tokens: list of length n_arms
        return rewards  # shape (n_arms, n_contexts, size)
    

    Context-aware signature:

    def batch_reward(samples, action_tokens, X):
        # X: original context, shape (n_contexts, n_features)
        return rewards  # shape (n_arms, n_contexts, size)
    

    The action_tokens list is ordered to match the first dimension of samples. If None and all arms use the identity reward function, an optimized batch identity is used automatically.

  • random_seed (int, np.random.Generator, or None, default None) – Controls the random number generator shared by the policy and the learner. Pass an int for reproducible results across calls.

See also

ContextualAgent

Independent-learner agent; equivalent to this agent with a block-diagonal design matrix (no parameter sharing).

Agent

Non-contextual (intercept-only) agent; equivalent to this agent with one-hot arm indicators and no context features.

ArmColumnFeaturizer

Default featurizer that appends an arm identifier column to the context matrix.

Notes

Vectorized pull. During pull(), contexts are enriched for all arms in a single featurizer call, followed by a single learner sample call for the entire (n_arms * n_contexts) batch. This yields significant speedups when \(K \gg 100\).
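A minimal sketch of the batching idea, assuming an arm-identifier column is appended to the context (as ArmColumnFeaturizer is described as doing). The row ordering and the (size, n_rows) shape of the sampler output are assumptions made for this illustration, and the random draws stand in for a real learner.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_contexts, n_features, size = 5, 2, 3, 4

X = rng.normal(size=(n_contexts, n_features))
arm_ids = np.arange(n_arms)

# Stack the contexts once per arm and append the arm id as an extra column,
# giving a single (n_arms * n_contexts, n_features + 1) batch.
X_tiled = np.tile(X, (n_arms, 1))
arm_column = np.repeat(arm_ids, n_contexts)[:, np.newaxis]
enriched = np.hstack([X_tiled, arm_column])

# Stand-in for one learner sample call over the whole batch:
# `size` draws per row, laid out as (size, n_arms * n_contexts).
flat_samples = rng.normal(size=(size, n_arms * n_contexts))

# Reshape to the (n_arms, n_contexts, size) layout used by reward functions.
samples = flat_samples.T.reshape(n_arms, n_contexts, size)
```

With this layout, row a * n_contexts + c of the batch belongs to arm a and context c, so a plain reshape recovers the per-arm view without any Python-level loop over arms.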

Selective update. During update(), contexts are enriched only for the selected arm, so the update cost is independent of \(K\).

Design matrix as assumption encoding. The structure of \(\phi(x, a)\) is the mechanism by which you encode domain knowledge about the relationship between arms [3]. A block-diagonal design matrix (one-hot arms interacted with context) yields fully independent parameters per arm – equivalent to ContextualAgent. Adding shared columns (e.g. user features that affect all arms) introduces cross-arm learning: the shared learner pools data across arms for those features while keeping arm-specific effects separate. This creates a “poor man’s hierarchical model” where Bayesian priors automatically shrink arm-specific effects toward the shared structure. See the hybrid-bandits tutorial for a worked example.
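The pooling effect can be illustrated with a small conjugate Gaussian regression in NumPy. This is a sketch under stated assumptions (a N(0, I) prior, known noise variance, simulated data), not the library's learner, and hybrid_design is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_obs = 3, 200


def hybrid_design(x: np.ndarray, arm: np.ndarray) -> np.ndarray:
    """One shared feature column plus a one-hot intercept per arm."""
    return np.hstack([x[:, np.newaxis], np.eye(n_arms)[arm]])


# Simulated data: a shared slope of 2.0; only arms 0 and 1 are ever observed.
x = rng.normal(size=n_obs)
arm = rng.integers(0, 2, size=n_obs)
y = 2.0 * x + np.array([0.5, -0.5, 0.0])[arm] + 0.1 * rng.normal(size=n_obs)

# Conjugate Gaussian posterior mean with prior N(0, I) and noise variance 0.01.
Phi = hybrid_design(x, arm)
noise_precision = 100.0
precision = np.eye(Phi.shape[1]) + noise_precision * Phi.T @ Phi
mean = np.linalg.solve(precision, noise_precision * Phi.T @ y)

shared_slope, intercepts = mean[0], mean[1:]
```

Arm 2 is never observed, so its intercept stays at the prior mean of zero, yet a prediction for arm 2 still uses the slope learned from arms 0 and 1.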

Relationship to other agents. Agent and ContextualAgent are special cases of this agent with particular design matrix structures. This makes LipschitzContextualAgent the most general agent in the library, suitable for any problem where you can describe the arm structure through features.

Name origin. The class name comes from the Lipschitz bandit literature [1] [2], where rewards vary smoothly with continuous arm features. The agent is not limited to that setting – it works equally well with discrete arms and arbitrary feature structures.

References

Examples

Create an agent for product recommendation with 100 products:

>>> import numpy as np
>>> from bayesianbandits import Arm, LipschitzContextualAgent, NormalRegressor, ThompsonSampling
>>> from bayesianbandits import ArmColumnFeaturizer
>>>
>>> # Define action space - product IDs
>>> product_ids = list(range(100))
>>>
>>> # Create arms without learners initially
>>> arms = [Arm(token, learner=None) for token in product_ids]
>>>
>>> # Create agent with shared learner
>>> agent = LipschitzContextualAgent(
...     arms=arms,
...     policy=ThompsonSampling(),
...     arm_featurizer=ArmColumnFeaturizer(column_name='product_id'),
...     learner=NormalRegressor(alpha=1.0, beta=1.0),
...     random_seed=0
... )
>>>
>>> # Use normally - single call handles all arms efficiently
>>> X = np.array([[25, 50000], [35, 75000]])  # age, income
>>> selected_products = agent.pull(X)  # Returns [product_id1, product_id2]
>>>
>>> # Update with observed rewards
>>> for token, context, reward in zip(selected_products, X, [1.0, 0.5]):
...     agent.select_for_update(token).update(np.atleast_2d(context), np.array([reward]))

Using a batch reward function for revenue optimization:

>>> # Pre-compute revenue array for all products (vectorized approach)
>>> n_products = 100
>>> product_revenues = np.random.uniform(0.5, 3.0, n_products)  # Revenue per product
>>>
>>> # Create vectorized batch reward function
>>> def revenue_batch_reward(samples, action_tokens):
...     # Direct numpy indexing - fully vectorized
...     multipliers = product_revenues[action_tokens]
...     # Broadcast to match samples shape: (n_arms, n_contexts, size)
...     return samples * multipliers[:, np.newaxis, np.newaxis]
>>>
>>> # Create agent with batch reward function
>>> agent = LipschitzContextualAgent(
...     arms=arms,
...     policy=ThompsonSampling(),
...     arm_featurizer=ArmColumnFeaturizer(column_name="product_id"),
...     learner=NormalRegressor(alpha=1, beta=1),
...     batch_reward_function=revenue_batch_reward
... )

Using a context-aware batch reward function:

>>> # Context-aware: calculate gross profit from prices, costs, and taxes
>>> # Arms represent different price points
>>> price_points = np.array([9.99, 14.99, 19.99, 24.99, 29.99])
>>> arms = [Arm(i, learner=None) for i in range(len(price_points))]
>>>
>>> def gross_profit_reward(samples, action_tokens, X):
...     # X contains: [customer_value, cost_per_unit, tax_rate]
...     costs = X[:, 1]      # shape: (n_contexts,)
...     tax_rates = X[:, 2]  # shape: (n_contexts,)
...
...     # Get prices for selected arms
...     prices = price_points[action_tokens]  # shape: (n_arms,)
...
...     # Vectorized profit calculation for all (arm, context) pairs
...     # Revenue after tax: price * (1 - tax_rate)
...     # Gross profit: revenue_after_tax - cost
...     revenue_after_tax = prices[:, np.newaxis] * (1 - tax_rates[np.newaxis, :])
...     gross_profit = revenue_after_tax - costs[np.newaxis, :]
...
...     # Apply profit multiplier to samples, clamping negative profits to 0
...     profit_multiplier = np.maximum(gross_profit, 0)
...     return samples * profit_multiplier[:, :, np.newaxis]

__init__(arms: Sequence[Arm[Any, TokenType]], policy: PolicyProtocol[Any, TokenType], arm_featurizer: ArmFeaturizer[TokenType], learner: Learner[Any], batch_reward_function: Callable[[ndarray[tuple[int, ...], dtype[float64]], List[Any]], ndarray[tuple[int, ...], dtype[float64]]] | Callable[[ndarray[tuple[int, ...], dtype[float64]], List[Any], Sized], ndarray[tuple[int, ...], dtype[float64]]] | None = None, random_seed: int | None | Generator = None)

add_arm(arm: Arm[Any, TokenType]) → None

Add an arm to the agent and set the shared learner.

Parameters:

arm (Arm[Any, TokenType]) – Arm to add to the agent.

Raises:

ValueError – If the arm’s action token is already in the agent.

arm(token: TokenType) → Arm[Any, TokenType]

Get an arm by its action token.

Parameters:

token (TokenType) – Action token of the arm to get.

Returns:

Arm with the action token.

Return type:

Arm[Any, TokenType]

Raises:

KeyError – If the arm’s action token is not in the agent.

decay(X: Sized, decay_rate: float | None = None) → None

Decay the shared learner once for the given contexts.

Parameters:
  • X (Sized) – Context matrix to use for decaying.

  • decay_rate (Optional[float], default None) – Decay rate to use. If None, the learner’s default decay rate is used.

Notes

This method enriches contexts with a single arm’s features and applies decay to the shared learner once. This ensures we decay based on the number of contexts, not the number of arms.
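To see why the number of enriched rows matters, the sketch below assumes a decay rule that scales the posterior precision by decay_rate raised to the number of rows passed in. That rule, and all the numbers, are assumptions for illustration; the learner in use defines the actual behaviour.

```python
import numpy as np

n_arms, n_contexts, decay_rate = 100, 10, 0.99
precision = 50.0 * np.eye(3)  # hypothetical posterior precision of the shared learner

# Assumed decay rule: scale the precision by decay_rate ** (number of rows seen).
decayed_once = precision * decay_rate ** n_contexts

# If decay were applied to rows enriched for every arm, the exponent would
# grow with K and the posterior would be forgotten far faster than intended.
decayed_per_arm = precision * decay_rate ** (n_arms * n_contexts)

print(decayed_once[0, 0])     # about 45.2
print(decayed_per_arm[0, 0])  # about 0.002
```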

pull(X: Sized) → List[TokenType]
pull(X: Sized, *, top_k: int) → List[List[TokenType]]

Choose arm(s) and pull based on the context(s).

Parameters:
  • X (Sized) – Context matrix to use for choosing arms.

  • top_k (int, optional) – Number of arms to select per context. If None (default), the single best arm is selected for each context. If specified, the top k arms are selected for each context.

Returns:

  • If top_k is None: list of action tokens, one per context.

  • If top_k is an int: list of lists of action tokens, one inner list per context.

Return type:

List[TokenType] or List[List[TokenType]]

Notes

When top_k is None, arm_to_update is set to the last selected arm. When top_k is specified, arm_to_update is not changed: call select_for_update() explicitly before update() to indicate which arm the feedback belongs to.

The method performs vectorized operations:

  1. Single featurizer call for all arms (major efficiency gain)

  2. Single learner sample call for all arm-context pairs

  3. Efficient reshape and reward function application

  4. Policy selection using standard interface
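The difference between the two return shapes can be illustrated with a hand-written score matrix. Ranking by a fixed score is only a stand-in for the policy, and returning the top_k tokens best-first is an assumption of this sketch rather than documented behaviour.

```python
import numpy as np

tokens = ["a", "b", "c", "d"]
# Hypothetical per-arm scores for 2 contexts, shape (n_arms, n_contexts).
scores = np.array([
    [0.1, 0.9],
    [0.7, 0.2],
    [0.4, 0.8],
    [0.3, 0.5],
])

# Default behaviour: one token per context.
best = [tokens[i] for i in scores.argmax(axis=0)]

# top_k=2: the k highest-scoring tokens per context, best first.
k = 2
order = np.argsort(-scores, axis=0)[:k]  # shape (k, n_contexts)
top_k = [[tokens[i] for i in order[:, c]] for c in range(scores.shape[1])]

print(best)   # ['b', 'a']
print(top_k)  # [['b', 'c'], ['a', 'c']]
```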

remove_arm(token: TokenType) → None

Remove an arm from the agent.

Parameters:

token (TokenType) – Action token of the arm to remove.

Raises:

KeyError – If the arm’s action token is not in the agent.

property rng: Generator

select_for_update(token: TokenType) → Self

Set the arm_to_update and return self for chaining.

Parameters:

token (TokenType) – Action token of the arm to update.

Returns:

Self for chaining.

Return type:

Self

Raises:

KeyError – If the arm’s action token is not in the agent.

update(X: Sized, y: ndarray[tuple[int, ...], dtype[float64]], sample_weight: ndarray[tuple[int, ...], dtype[float64]] | None = None) → None

Update the arm_to_update with the context(s) and the reward(s).

Parameters:
  • X (Sized) – Context matrix to use for updating the arm.

  • y (NDArray[np.float64]) – Reward(s) to use for updating the arm.

  • sample_weight (Optional[NDArray[np.float64]], default None) – Sample weights to use for updating the arm. If None, all samples are weighted equally.

Notes

This method enriches contexts with only the selected arm’s features, then delegates to the policy’s update method, which calls arm.update() on the shared learner.
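A minimal sketch of why the update cost does not depend on the number of arms, assuming an arm-identifier column and a conjugate Gaussian learner with a N(0, I) prior and unit noise variance. Both are stand-ins chosen for illustration, not the library's implementation.

```python
import numpy as np

n_arms, n_features = 1000, 2
X = np.array([[25.0, 1.0], [35.0, 0.0]])  # two contexts
y = np.array([1.0, 0.5])                  # observed rewards
selected_arm = 42

# Enrich only with the selected arm: n_contexts rows, regardless of n_arms.
enriched = np.hstack([X, np.full((len(X), 1), float(selected_arm))])

# Conjugate Gaussian update of the shared parameters.
precision = np.eye(n_features + 1)
eta = np.zeros(n_features + 1)  # precision-weighted mean
precision += enriched.T @ enriched
eta += enriched.T @ y
posterior_mean = np.linalg.solve(precision, eta)
```

The update touches two rows even though the agent holds 1000 arms; n_arms never enters the computation.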