Writing Custom Reward Functions
===============================

A reward function separates what the learner models (outcomes) from what
those outcomes are worth (utility).

If your learner already models the quantity you want to maximize (revenue via
``NormalRegressor``, profit via ``NormalInverseGammaRegressor``), the default
identity function is fine and you don't need a custom reward function. See
:doc:`/notebooks/linear-bandits` for how far you can get without one.

.. note::

   ``update()`` always trains on raw outcomes, not transformed rewards. The
   learner models what actually happens; the reward function captures what
   it's worth. You can change your cost structure or utility function without
   retraining.

Map binary outcomes to profit
-----------------------------

A ``BayesianGLM`` with ``link="logit"`` models click-through or conversion
probability. But different arms may have different revenue per conversion and
different costs per impression. The learner trains on raw binary outcomes
(0/1); the reward function turns probability samples into expected profit.

.. code-block:: python

    import numpy as np
    from bayesianbandits import Arm, ContextualAgent, BayesianGLM, ThompsonSampling


    def make_profit_reward(revenue, cost):
        """Expected profit = P(convert) * revenue - cost."""

        def reward(samples):
            return samples * revenue - cost

        return reward


    arms = [
        Arm(
            "campaign_A",
            reward_function=make_profit_reward(revenue=50.0, cost=10.0),
            learner=BayesianGLM(alpha=1.0, link="logit"),
        ),
        Arm(
            "campaign_B",
            reward_function=make_profit_reward(revenue=20.0, cost=2.0),
            learner=BayesianGLM(alpha=1.0, link="logit"),
        ),
        Arm(
            "campaign_C",
            reward_function=make_profit_reward(revenue=100.0, cost=30.0),
            learner=BayesianGLM(alpha=1.0, link="logit"),
        ),
    ]

    agent = ContextualAgent(arms, ThompsonSampling(), random_seed=42)

    X = np.array([[1.0, 0.0, 25.0]])  # user features
    (action,) = agent.pull(X)

    # Update gets the raw binary outcome, NOT the profit
    agent.update(X, np.array([1]))  # conversion happened

``BayesianGLM`` with the logit link already returns probability samples in
[0, 1], so ``samples * revenue - cost`` is all you need.

Use context in the reward function
-----------------------------------

When utility depends on context (user-specific shipping cost, regional tax),
the reward function can accept a second parameter named ``X`` to receive the
context matrix. The parameter **must** be named ``X``: the library inspects
the function signature to detect context-awareness. See
:ref:`troubleshooting <reward-troubleshooting>` if this isn't working.

.. code-block:: python

    import numpy as np
    from bayesianbandits import (
        Arm,
        ContextualAgent,
        NormalRegressor,
        ThompsonSampling,
    )


    def net_profit_reward(samples, X):
        """Subtract per-row cost (last column of X) from gross revenue."""
        cost = X[:, -1]
        return samples - cost


    arms = [
        Arm(
            f"product_{i}",
            reward_function=net_profit_reward,
            learner=NormalRegressor(alpha=1.0, beta=1.0),
        )
        for i in range(3)
    ]

    agent = ContextualAgent(arms, ThompsonSampling(), random_seed=42)

    X = np.array([[5.0, 1.2]])  # features + cost
    (action,) = agent.pull(X)
    agent.update(X, y=np.array([6.5]))  # raw gross revenue

Express non-linear utility
--------------------------

Reward functions don't have to be linear. Here's an asymmetric loss that
penalizes underperformance more than overperformance:

.. code-block:: python

    def asymmetric_loss(samples, target=10.0, penalty=3.0):
        """Penalize undershoot 3x relative to overshoot."""
        diff = samples - target
        return np.where(diff < 0, penalty * diff, diff)

Other examples: threshold utility (``np.maximum(samples - threshold, 0)``),
diminishing returns (``np.log1p(samples)``).

Batch reward functions for shared-learner bandits
-------------------------------------------------

When the learner models one thing (revenue) but you need to transform it
per-arm (multiply by a margin that differs across products), use a batch
reward function. With :class:`~bayesianbandits.LipschitzContextualAgent` and
many arms, processing rewards one arm at a time is inefficient. A batch
reward function handles all arms in a single vectorized call.

Two signatures:

- ``f(samples, action_tokens) -> rewards``
- ``f(samples, action_tokens, X) -> rewards`` (context-aware; third param
  must be named ``X``)

Input ``samples`` shape is ``(n_arms, n_contexts, size)``. Output must match.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    from bayesianbandits import (
        Arm,
        LipschitzContextualAgent,
        NormalRegressor,
        ThompsonSampling,
        ArmColumnFeaturizer,
        LearnerPipeline,
    )

    MARGINS = {
        "product_0": 0.3,
        "product_1": 0.5,
        "product_2": 0.8,
    }


    def margin_reward(samples, action_tokens):
        """Multiply revenue samples by per-product margin."""
        multipliers = np.array([MARGINS[t] for t in action_tokens])
        return samples * multipliers[:, np.newaxis, np.newaxis]


    arm_tokens = list(MARGINS.keys())

    sample_enriched = pd.DataFrame({
        "score": [0.8, 0.6, 0.9] * 3,
        "arm": arm_tokens * 3,
    })

    ct = ColumnTransformer([
        ("num", StandardScaler(), ["score"]),
        ("arm", OneHotEncoder(sparse_output=False), ["arm"]),
    ])
    ct.fit(sample_enriched)

    shared_learner = LearnerPipeline(
        steps=[("preprocess", ct)],
        learner=NormalRegressor(alpha=1.0, beta=1.0),
    )

    arms = [Arm(token) for token in arm_tokens]

    agent = LipschitzContextualAgent(
        arms=arms,
        policy=ThompsonSampling(),
        arm_featurizer=ArmColumnFeaturizer("arm"),
        learner=shared_learner,
        batch_reward_function=margin_reward,
        random_seed=42,
    )

    context = pd.DataFrame({"score": [0.7]})
    (action,) = agent.pull(context)

.. _reward-troubleshooting:

If something goes wrong
-----------------------

**Shape mismatch error from the policy.** Your reward function must return
shape ``(size, n_contexts)``. For most estimators (``NormalRegressor``,
``BayesianGLM``, ``NormalInverseGammaRegressor``, ``GammaRegressor``) the
input is already that shape, so element-wise operations just work.
``DirichletClassifier`` is the exception: its samples are
``(size, n_contexts, n_classes)`` and the reward function must collapse the
class axis (e.g. ``samples[..., 1] * revenue - cost``). For batch reward
functions, the contract is ``(n_arms, n_contexts, size)`` in and out.

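
The shape contracts above can be checked in isolation before a reward
function is wired into an agent. The sketch below uses random NumPy arrays as
stand-ins for learner samples; the array sizes, the margin values, and the
helper ``class_one_profit`` are illustrative assumptions, not part of the
library.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    size, n_contexts, n_classes, n_arms = 4, 2, 3, 3

    # Stand-in for samples from most estimators: (size, n_contexts)
    scalar_samples = rng.uniform(size=(size, n_contexts))
    profit = scalar_samples * 50.0 - 10.0  # element-wise, shape unchanged

    # Stand-in for DirichletClassifier-style samples:
    # (size, n_contexts, n_classes)
    class_samples = rng.dirichlet(np.ones(n_classes), size=(size, n_contexts))


    def class_one_profit(samples, revenue=50.0, cost=10.0):
        """Collapse the class axis by keeping only class 1's probability."""
        return samples[..., 1] * revenue - cost


    collapsed = class_one_profit(class_samples)

    # Stand-in for batch samples: (n_arms, n_contexts, size)
    batch_samples = rng.normal(size=(n_arms, n_contexts, size))
    margins = np.array([0.3, 0.5, 0.8])
    batch_rewards = batch_samples * margins[:, np.newaxis, np.newaxis]

    print(profit.shape, collapsed.shape, batch_rewards.shape)

If any printed shape differs from the contract for its kind of reward
function, the policy is likely to raise the shape mismatch described above.
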

**TypeError: reward() missing 1 required positional argument.** You wrote a
context-aware reward function but named the parameter something other than
``X``. The library checks ``inspect.signature()`` for a parameter literally
named ``X``. If it doesn't find one, it calls your function with ``samples``
only, and your function raises because it expected two arguments. Rename the
parameter to ``X``.

**PicklingError when serializing the agent.** Lambdas, closures, and
factory-produced functions (like ``make_profit_reward`` above) are not
picklable with standard ``pickle``. If you need to serialize agents, use a
callable class with ``__call__`` instead, or use a serialization library that
handles closures (e.g. ``cloudpickle``).

**Reward function has side effects or mutable state.** Thompson sampling
calls the reward function on every ``pull()``. A reward function that mutates
external state (counters, accumulators, caches with size limits) will behave
unpredictably. Keep reward functions stateless.

.. seealso::

   :class:`~bayesianbandits.Arm` for full reward function type signatures.
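
The callable-class alternative mentioned under the PicklingError entry can be
sketched as follows. ``ProfitReward`` is a hypothetical name, not part of
``bayesianbandits``. It mirrors ``make_profit_reward`` but stores its
parameters as instance attributes, so standard ``pickle`` can serialize it,
assuming the class is defined at module level.

.. code-block:: python

    import pickle


    class ProfitReward:
        """Picklable counterpart of make_profit_reward (hypothetical helper)."""

        def __init__(self, revenue, cost):
            self.revenue = revenue
            self.cost = cost

        def __call__(self, samples):
            # Same arithmetic as the closure: P(convert) * revenue - cost
            return samples * self.revenue - self.cost


    reward = ProfitReward(revenue=50.0, cost=10.0)

    # Round-trip through standard pickle; a closure would fail at dumps()
    restored = pickle.loads(pickle.dumps(reward))
    restored_profit = restored(0.5)

An instance such as ``ProfitReward(revenue=50.0, cost=10.0)`` can then be
passed wherever the closure was used as ``reward_function``.
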