Handling Delayed Rewards
With contextual bandits, you make decisions for many contexts at once
and collect rewards later. You serve ads to a batch of users now and
learn which ones converted hours later. You recommend products to
thousands of visitors and get purchase signals overnight. The API reflects this:
pull() and update() are independent calls. There is no
requirement that they alternate.
The basic pattern
Pull for a batch of contexts, store the action tokens alongside request IDs, update later when rewards arrive.
from bayesianbandits import (
    Arm, ContextualAgent, NormalRegressor, ThompsonSampling,
)
import numpy as np

arms = [
    Arm(f"variant_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
    for i in range(3)
]
agent = ContextualAgent(arms, ThompsonSampling(), random_seed=42)

# Decision time
X_batch = np.array([[1.0, 2.0]])
actions = agent.pull(X_batch)
# Store (request_id, action, context) in your database

# Later, when rewards arrive
for token, X_row, reward in zip(
    actions,
    X_batch,
    [1.0],  # rewards from your database
):
    agent.select_for_update(token).update(
        np.atleast_2d(X_row), np.atleast_1d(reward)
    )
Nothing about the posterior update depends on how much time has passed since the pull.
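The "store, then update later" step can be sketched with a small in-memory stand-in for the database. This is only an illustration: `PendingStore`, `PendingDecision`, `record`, and `resolve` are hypothetical names, not part of the library, and tokens are treated as opaque values. Removing each decision when its reward is resolved means an unknown or already-used request ID raises an error instead of silently updating an arm.

```python
from dataclasses import dataclass
from typing import Any, Dict, Hashable, List, Tuple


@dataclass
class PendingDecision:
    token: Any            # action token returned by pull(), kept opaque here
    context: List[float]  # context row that was used for that pull


class PendingStore:
    """Hypothetical in-memory stand-in for a table of undecided requests."""

    def __init__(self) -> None:
        self._pending: Dict[Hashable, PendingDecision] = {}

    def record(self, request_id: Hashable, token: Any, context: List[float]) -> None:
        # Called at decision time, right after pull()
        self._pending[request_id] = PendingDecision(token, list(context))

    def resolve(self, request_id: Hashable, reward: float) -> Tuple[Any, List[float], float]:
        # Called when the reward arrives; pop() raises KeyError for an
        # unknown request ID or one that was already resolved
        decision = self._pending.pop(request_id)
        return decision.token, decision.context, reward

    def __len__(self) -> int:
        return len(self._pending)


store = PendingStore()
store.record("req-1", "variant_0", [1.0, 2.0])
store.record("req-2", "variant_2", [0.5, 1.5])

# Hours later, a reward for req-2 shows up
token, context, reward = store.resolve("req-2", 1.0)
```

The resolved `(token, context, reward)` triple has the same shape as the values the update loop above iterates over.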
Batch updates for the same arm
When multiple rewards for the same arm arrive at once (common with a
nightly batch job), group them into a single update() call rather
than looping row by row. The learner does one precision-matrix update
instead of N sequential ones:
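# rewards_grouped_by_arm maps each arm token to the contexts and
# rewards collected for it, e.g. {token: {"contexts": [...], "rewards": [...]}}
for token, group in rewards_grouped_by_arm.items():
    X_batch = np.vstack(group["contexts"])
    y_batch = np.array(group["rewards"])
    agent.select_for_update(token).update(X_batch, y_batch)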
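The snippet above takes `rewards_grouped_by_arm` as given. One way to build it, sketched here under the assumption that your nightly job yields `(token, context_row, reward)` records (the `records` list and its values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical records joined from storage: (token, context_row, reward)
records = [
    ("variant_0", [1.0, 2.0], 1.0),
    ("variant_1", [0.5, 1.5], 0.0),
    ("variant_0", [2.0, 0.0], 0.0),
]

# Each token gets its own pair of parallel lists
rewards_grouped_by_arm = defaultdict(lambda: {"contexts": [], "rewards": []})
for token, context, reward in records:
    group = rewards_grouped_by_arm[token]
    group["contexts"].append(context)
    group["rewards"].append(reward)
```

Each group's `"contexts"` list stacks into a 2-D array with one row per observation, aligned index by index with `"rewards"`.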
With learning_rate = 1, the posterior is the same whether you update
in one batch or one row at a time: the sufficient statistics are
identical either way. With learning_rate < 1, each row triggers decay,
so order matters. See Choosing and Tuning a Decay Rate for why
decoupling decay from updates avoids this.
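A short derivation sketch of why batching leaves the result unchanged when there is no decay. It assumes a conjugate Gaussian linear model with a zero-mean prior, prior precision \alpha, and noise precision \beta; that reading of the alpha and beta arguments is an assumption here, not something this page states. For N rows (x_i, y_i) stacked into X and y:

```latex
\begin{aligned}
\Lambda_N &= \alpha I + \beta \sum_{i=1}^{N} x_i x_i^{\top} = \alpha I + \beta X^{\top} X, \\
\eta_N &= \beta \sum_{i=1}^{N} y_i x_i = \beta X^{\top} y, \\
m_N &= \Lambda_N^{-1} \eta_N .
\end{aligned}
```

Both sums come out the same whether they are accumulated one row at a time or in a single batch, so the posterior precision \Lambda_N and mean m_N agree in either case.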
Multiple pulls before any update
You can pull many times before updating. Each pull samples from the
current posterior. The only state pull() sets is
arm_to_update, which you override with select_for_update()
anyway.
Interaction with decay
If you run agent.decay() on a schedule, a delayed observation is
still incorporated normally when it arrives. The model has already
decayed (widened its uncertainty), and the observation tightens it
back. Decay reflects time passing; the update reflects new
information.
If decay is aggressive and rewards arrive late, the posterior variance has already grown by the time the observation lands. The observation has more influence on the decayed posterior than it would have on a tighter one, so a single stale reward can pull the model further than you might expect.
# Daily loop
X_today = np.array([[1.0, 2.0]])
actions = agent.pull(X_today)
# ... store decisions ...

# Nightly: decay + update with yesterday's rewards
# (yesterdays_rewards: iterable of (token, context_row, reward) from storage)
agent.decay(np.array([[0.0, 0.0]]), decay_rate=0.99)
for token, X_row, reward in yesterdays_rewards:
    agent.select_for_update(token).update(
        np.atleast_2d(X_row), np.atleast_1d(reward)
    )
See Choosing and Tuning a Decay Rate for choosing the rate.
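To make the effect of aggressive decay on late rewards concrete, here is a toy calculation in a scalar Gaussian model. It is a sketch under two assumptions that this page does not confirm: each decay call multiplies the posterior precision by the decay rate, and the noise precision is known. The helper names and all numbers are illustrative only. In that model the posterior mean moves toward a new observation by the fraction beta / (precision + beta).

```python
def observation_weight(posterior_precision: float, noise_precision: float) -> float:
    """Fraction of the gap between the current mean and a new observation
    that a single update closes in a scalar Gaussian model."""
    return noise_precision / (posterior_precision + noise_precision)


def decayed_precision(precision: float, decay_rate: float, n_decays: int) -> float:
    # Assumption: every decay call scales the precision by decay_rate
    return precision * decay_rate ** n_decays


precision = 100.0  # a fairly confident posterior
beta = 1.0         # noise precision of one observation

# Reward arrives immediately, before any decay
fresh_weight = observation_weight(precision, beta)

# Same reward arrives after 30 nightly decays at an aggressive rate of 0.9
stale_weight = observation_weight(decayed_precision(precision, 0.9, 30), beta)

print(f"weight without decay: {fresh_weight:.4f}")
print(f"weight after 30 decays: {stale_weight:.4f}")
```

Under these assumptions the late observation carries well over ten times the weight of the immediate one, which is the "pulls the model further than you might expect" effect described above.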
Non-contextual agents
Agent works the same way, just without context:
from bayesianbandits import Agent, Arm, NormalRegressor, ThompsonSampling
import numpy as np

arms = [
    Arm(f"slot_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
    for i in range(3)
]
agent = Agent(arms, ThompsonSampling(), random_seed=42)

# pull() returns a list with one token; unpack it
(token,) = agent.pull()
# later...
agent.select_for_update(token).update(np.array([1.0]))
If something goes wrong
“I updated the wrong arm.”
pull() sets arm_to_update to the last selected arm. If you pull
again before updating, the previous selection is overwritten. Always
use select_for_update(token) explicitly rather than relying on the
implicit state from pull().