Handling Delayed Rewards
========================

With contextual bandits, you make decisions for many contexts at once
and collect rewards later. You serve ads to a batch of users now and
learn which ones converted hours later. You recommend products to
thousands of visitors and get purchase signals overnight. The API reflects this:
``pull()`` and ``update()`` are independent calls. There is no
requirement that they alternate.


The basic pattern
-----------------

Pull for a batch of contexts, store the action tokens alongside
request IDs, update later when rewards arrive.

.. code-block:: python

   from bayesianbandits import (
       Arm, ContextualAgent, NormalRegressor, ThompsonSampling,
   )
   import numpy as np

   arms = [
       Arm(f"variant_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
       for i in range(3)
   ]
   agent = ContextualAgent(arms, ThompsonSampling(), random_seed=42)

   # Decision time
   X_batch = np.array([[1.0, 2.0]])
   actions = agent.pull(X_batch)
   # Store (request_id, action, context) in your database

   # Later, when rewards arrive
   for token, X_row, reward in zip(
       actions,
       X_batch,
       [1.0],  # rewards from your database
   ):
       agent.select_for_update(token).update(
           np.atleast_2d(X_row), np.atleast_1d(reward)
       )

Nothing about the posterior update depends on how much time has passed
since the pull.


Batch updates for the same arm
-------------------------------

When multiple rewards for the same arm arrive at once (common with a
nightly batch job), group them into a single ``update()`` call rather
than looping row by row. The learner does one precision-matrix update
instead of N sequential ones:

.. code-block:: python

   # Group rewards by arm token
   for token, group in rewards_grouped_by_arm.items():
       X_batch = np.vstack(group["contexts"])
       y_batch = np.array(group["rewards"])
       agent.select_for_update(token).update(X_batch, y_batch)

The posterior is the same whether you update in one batch or one row
at a time: the sufficient statistics are identical either way. (With
``learning_rate < 1``, each row triggers decay, so order matters. See
:doc:`decay` for why decoupling decay from updates avoids this.)


Multiple pulls before any update
---------------------------------

You can pull many times before updating. Each pull samples from the
current posterior. The only state ``pull()`` sets is
``arm_to_update``, which you override with ``select_for_update()``
anyway.


Interaction with decay
-----------------------

If you run ``agent.decay()`` on a schedule, a delayed observation is
still incorporated normally when it arrives. The model has already
decayed (widened its uncertainty), and the observation tightens it
back. Decay reflects time passing; the update reflects new
information.

If decay is aggressive and rewards arrive late, the posterior
variance has already grown by the time the observation lands. The
observation has *more* influence on the decayed posterior than it
would have on a tighter one, so a single stale reward can pull the
model further than you might expect.

.. code-block:: python

   # Daily loop
   X_today = np.array([[1.0, 2.0]])
   actions = agent.pull(X_today)
   # ... store decisions ...

   # Nightly: decay + update with yesterday's rewards
   agent.decay(np.array([[0.0, 0.0]]), decay_rate=0.99)
   for token, X_row, reward in yesterdays_rewards:
       agent.select_for_update(token).update(
           np.atleast_2d(X_row), np.atleast_1d(reward)
       )

See :doc:`decay` for choosing the rate.


Non-contextual agents
----------------------

:class:`~bayesianbandits.Agent` works the same way, just without context:

.. code-block:: python

   from bayesianbandits import Agent

   arms = [
       Arm(f"slot_{i}", learner=NormalRegressor(alpha=1.0, beta=1.0))
       for i in range(3)
   ]
   agent = Agent(arms, ThompsonSampling(), random_seed=42)

   (token,) = agent.pull()
   # later...
   agent.select_for_update(token).update(np.array([1.0]))


If something goes wrong
------------------------

**"I updated the wrong arm."**
``pull()`` sets ``arm_to_update`` to the last selected arm. If you pull
again before updating, the previous selection is overwritten. Always
use ``select_for_update(token)`` explicitly rather than relying on the
implicit state from ``pull()``.