Handling Delayed Rewards

With contextual bandits, you make decisions for many contexts at once and collect rewards later. You serve ads to a batch of users now and learn which ones converted hours later. You recommend products to thousands of visitors and get purchase signals overnight. The API reflects this: pull() and update() are independent calls. There is no requirement that they alternate.

The basic pattern

Pull for a batch of contexts, store the action tokens alongside request IDs, and update later when the rewards arrive.

from bayesianbandits import (
    Arm, ContextualAgent, NormalRegressor, ThompsonSampling,
)
import numpy as np

arms = [
    Arm(f"variant_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
    for i in range(3)
]
agent = ContextualAgent(arms, ThompsonSampling(), random_seed=42)

# Decision time: pull() returns one action token per row of X_batch
X_batch = np.array([[1.0, 2.0]])
actions = agent.pull(X_batch)
# Store (request_id, action, context) in your database

# Later, when rewards arrive
for token, X_row, reward in zip(
    actions,
    X_batch,
    [1.0],  # rewards from your database
):
    agent.select_for_update(token).update(
        np.atleast_2d(X_row), np.atleast_1d(reward)
    )

Nothing about the posterior update depends on how much time has passed since the pull.
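The "store now, join later" step could be sketched as follows. This is a minimal illustration using only the standard library: `DecisionLog`, `record`, and `join_rewards` are hypothetical names, and the in-memory dict stands in for whatever database you actually use. Tokens are treated as opaque values.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Sequence, Tuple


@dataclass
class DecisionLog:
    """Hypothetical in-memory stand-in for a decisions table."""

    _rows: Dict[str, Tuple[Any, Sequence[float]]] = field(default_factory=dict)

    def record(self, request_id: str, token: Any, context: Sequence[float]) -> None:
        # Called at decision time, once per pulled context row.
        self._rows[request_id] = (token, context)

    def join_rewards(
        self, rewards: Iterable[Tuple[str, float]]
    ) -> List[Tuple[Any, Sequence[float], float]]:
        # Called when rewards arrive. Unknown request IDs are skipped, and a
        # matched decision is removed so the same reward is not applied twice.
        joined = []
        for request_id, reward in rewards:
            if request_id in self._rows:
                token, context = self._rows.pop(request_id)
                joined.append((token, context, reward))
        return joined


log = DecisionLog()
log.record("req-1", "variant_0", [1.0, 2.0])
log.record("req-2", "variant_2", [0.5, 0.1])

# Hours later: only req-2 has a reward so far; req-404 was never logged.
ready = log.join_rewards([("req-2", 1.0), ("req-404", 0.0)])
```

Each `(token, context, reward)` triple in `ready` can then be passed to `select_for_update(token).update(...)` as in the loop above.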

Batch updates for the same arm

When multiple rewards for the same arm arrive at once (common with a nightly batch job), group them into a single update() call rather than looping row by row. The learner does one precision-matrix update instead of N sequential ones:

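# Group rewards by arm token.
# rewards_grouped_by_arm is a mapping you build from your reward store:
# {token: {"contexts": [row, ...], "rewards": [r, ...]}}
for token, group in rewards_grouped_by_arm.items():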
    X_batch = np.vstack(group["contexts"])
    y_batch = np.array(group["rewards"])
    agent.select_for_update(token).update(X_batch, y_batch)
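One way to build the `rewards_grouped_by_arm` mapping used above is sketched here. `group_by_arm` is a hypothetical helper, not part of the library, and it assumes tokens are hashable and that rewards arrive as `(token, context_row, reward)` triples.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def group_by_arm(
    rows: Iterable[Tuple[Any, Sequence[float], float]],
) -> Dict[Any, Dict[str, List[Any]]]:
    """Group (token, context_row, reward) triples by arm token."""
    grouped: Dict[Any, Dict[str, List[Any]]] = defaultdict(
        lambda: {"contexts": [], "rewards": []}
    )
    for token, context_row, reward in rows:
        grouped[token]["contexts"].append(context_row)
        grouped[token]["rewards"].append(reward)
    return dict(grouped)


# Made-up rows standing in for the output of a nightly job.
rows = [
    ("variant_0", [1.0, 2.0], 1.0),
    ("variant_1", [0.5, 0.1], 0.0),
    ("variant_0", [0.3, 0.7], 0.0),
]
rewards_grouped_by_arm = group_by_arm(rows)
```

Within each group, contexts and rewards keep their arrival order, so row `i` of the stacked context matrix still lines up with reward `i`.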

The posterior is the same whether you update in one batch or one row at a time: the sufficient statistics are identical either way. (With learning_rate < 1, each row triggers decay, so order matters. See Choosing and Tuning a Decay Rate for why decoupling decay from updates avoids this.)
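The equivalence can be checked on a toy model. The sketch below is a one-feature conjugate Gaussian regression written from scratch, not the library's implementation: the state is the pair (precision, precision times mean), `ALPHA` and `BETA` are assumed prior and noise precisions, and decay is modelled as multiplying both statistics by a factor before each row.

```python
import math
from typing import List, Tuple

ALPHA = 1.0  # prior precision (assumed for this toy model)
BETA = 1.0   # noise precision (assumed for this toy model)

State = Tuple[float, float]  # (precision, precision * mean)
Rows = List[Tuple[float, float]]  # (x, y) pairs


def update_batch(state: State, rows: Rows) -> State:
    """Add the summed sufficient statistics in one step."""
    precision, eta = state
    precision += BETA * sum(x * x for x, _ in rows)
    eta += BETA * sum(x * y for x, y in rows)
    return precision, eta


def update_sequential(state: State, rows: Rows, decay: float = 1.0) -> State:
    """Add one row at a time, applying the decay factor before each row."""
    precision, eta = state
    for x, y in rows:
        precision = decay * precision + BETA * x * x
        eta = decay * eta + BETA * x * y
    return precision, eta


rows = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0)]
prior = (ALPHA, 0.0)

# Without decay, batch and row-by-row updates agree.
batch = update_batch(prior, rows)
sequential = update_sequential(prior, rows)
assert math.isclose(batch[0], sequential[0])
assert math.isclose(batch[1], sequential[1])

# With per-row decay, the same rows in a different order give a different posterior.
forward = update_sequential(prior, rows, decay=0.9)
backward = update_sequential(prior, list(reversed(rows)), decay=0.9)
assert not math.isclose(forward[0], backward[0])
```

The posterior mean in this toy model is `eta / precision`, so matching statistics imply matching posteriors.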

Multiple pulls before any update

You can pull many times before updating. Each pull samples from the current posterior. The only state pull() sets is arm_to_update, which you override with select_for_update() anyway.

Interaction with decay

If you run agent.decay() on a schedule, a delayed observation is still incorporated normally when it arrives. The model has already decayed (widened its uncertainty), and the observation tightens it back. Decay reflects time passing; the update reflects new information.

If decay is aggressive and rewards arrive late, the posterior variance has already grown by the time the observation lands. The observation has more influence on the decayed posterior than it would have on a tighter one, so a single stale reward can pull the model further than you might expect.

# Daily loop
X_today = np.array([[1.0, 2.0]])
actions = agent.pull(X_today)
# ... store decisions ...

# Nightly: decay + update with yesterday's rewards
# yesterdays_rewards: (token, context_row, reward) tuples loaded from your database
agent.decay(np.array([[0.0, 0.0]]), decay_rate=0.99)
for token, X_row, reward in yesterdays_rewards:
    agent.select_for_update(token).update(
        np.atleast_2d(X_row), np.atleast_1d(reward)
    )
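To get a feel for how much extra influence a late reward gains after repeated decay, consider a rough back-of-the-envelope sketch. It assumes a scalar Gaussian posterior whose precision is multiplied by `decay_rate` once per decay step and a single observation with noise precision `beta`. This is a simplification for intuition, not the library's code, and `observation_weight` is a hypothetical helper.

```python
def observation_weight(
    precision: float, decay_rate: float, n_decays: int, beta: float = 1.0
) -> float:
    """Weight a single new observation gets in the updated posterior mean."""
    decayed_precision = precision * decay_rate**n_decays
    return beta / (decayed_precision + beta)


prior_precision = 100.0  # a fairly confident posterior (made-up number)

for decay_rate in (0.99, 0.9):
    for days_late in (0, 7, 30):
        w = observation_weight(prior_precision, decay_rate, days_late)
        print(f"decay_rate={decay_rate} days_late={days_late} weight={w:.3f}")
```

In this model the updated mean is `(1 - w) * old_mean + w * reward`, so the weight grows with every decay step that runs before the reward lands, and it grows faster for smaller decay rates.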

See Choosing and Tuning a Decay Rate for guidance on picking the rate.

Non-contextual agents

Agent works the same way, just without context:

from bayesianbandits import Agent, Arm, NormalRegressor, ThompsonSampling
import numpy as np

arms = [
    Arm(f"slot_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
    for i in range(3)
]
agent = Agent(arms, ThompsonSampling(), random_seed=42)

(token,) = agent.pull()
# later...
agent.select_for_update(token).update(np.array([1.0]))

If something goes wrong

“I updated the wrong arm.” pull() sets arm_to_update to the last selected arm. If you pull again before updating, the previous selection is overwritten. Always use select_for_update(token) explicitly rather than relying on the implicit state from pull().
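The failure mode can be reproduced with a toy stand-in. `ToyAgent` below is not the library's `Agent`. It is a hypothetical class that only mimics the behaviour described here: `pull()` overwrites `arm_to_update`, and `select_for_update()` sets it explicitly.

```python
from typing import List, Optional, Tuple


class ToyAgent:
    """Toy stand-in that mimics the implicit arm_to_update state."""

    def __init__(self) -> None:
        self.arm_to_update: Optional[str] = None
        self.updates: List[Tuple[Optional[str], float]] = []

    def pull(self, token: str) -> str:
        # A real agent chooses the arm itself; here the caller dictates it.
        self.arm_to_update = token
        return token

    def select_for_update(self, token: str) -> "ToyAgent":
        self.arm_to_update = token
        return self

    def update(self, reward: float) -> None:
        # Credits the reward to whichever arm is currently selected.
        self.updates.append((self.arm_to_update, reward))


agent = ToyAgent()
first = agent.pull("slot_0")
second = agent.pull("slot_2")  # overwrites the implicit selection

agent.update(1.0)  # reward was meant for `first`, but lands on slot_2
agent.select_for_update(first).update(1.0)  # explicit token: lands on slot_0
```

After these calls, `agent.updates` holds one reward credited to `slot_2` by accident and one credited to `slot_0` as intended.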