Contextual Bandits for Count Data
Let’s look at how UCB can be used to manage the explore-exploit tradeoff for another plausible business problem.
Maximizing Web Storefront Transactions per Week
Imagine we have a web storefront for which we want to maximize transactions per week. To do so, our design team has come up with a new layout that they claim will be more effective, but will it? This is an example of a two-armed bandit, but in contrast to the Bernoulli bandit, we are trying to maximize a count variable.
Mathematically speaking, pulling arm \(k\) yields a count reward of the form \(\textrm{Reward}_k \sim \textrm{Poisson}(\lambda_k)\), where \(\lambda_k\) is the mean number of transactions per week under that layout.
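Before building the full simulator, here is a quick, illustrative look at what such weekly rewards could look like. The rates 20 and 25 are the ones we will give our simulator below; the random seed and the 13-week horizon (one quarter, as in the experiment later) are just choices for this sketch.

import numpy as np

rng = np.random.default_rng(0)

# Thirteen weeks of simulated transaction counts at each hypothetical weekly rate.
status_quo_weeks = rng.poisson(lam=20, size=13)
new_layout_weeks = rng.poisson(lam=25, size=13)

print(status_quo_weeks.mean(), new_layout_weeks.mean())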
Simulating the Problem
We’ll create a black-box VisitorOracle class that represents the transaction information we get from our storefront at the end of each week. We’ll again intentionally set it up such that the new action has a small but real lift over the status quo.
[1]:
from typing import List

import numpy as np


class VisitorOracle:
    """Black-box simulator of weekly transaction counts for each layout."""

    def __init__(self, n: float, m: float):
        self.n = n  # mean weekly transactions under the status quo layout
        self.m = m  # mean weekly transactions under the proposed layout
        self.rewards: List[int] = []

    def status_quo_action(self):
        self.rewards.append(np.random.poisson(self.n))

    def new_proposal_action(self):
        self.rewards.append(np.random.poisson(self.m))


two_armed_bandit = VisitorOracle(20, 25)
Setting Up the Learner
bayesianbandits makes a GammaRegressor class available to perform conjugate-prior Bayesian inference on count data. This time, we’ll use a fairly informative prior: presumably, we have plenty of knowledge about historical transaction data for our storefront, so we’d be able to use that to pick a reasonable prior.
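As a reminder, the Gamma prior is convenient here because it is conjugate to the Poisson likelihood: a Gamma(\(\alpha\), \(\beta\)) prior on the rate, combined with observed counts \(x_1, \ldots, x_n\), gives a Gamma(\(\alpha + \sum_i x_i\), \(\beta + n\)) posterior. The sketch below shows that textbook update with made-up weekly counts; it illustrates the math rather than GammaRegressor's internals, and the Gamma(20, 1) prior (prior mean \(\alpha / \beta = 20\)) matches the parameters we pass to GammaRegressor below.

import numpy as np

# Informative Gamma prior on the weekly transaction rate (prior mean alpha / beta = 20).
alpha_prior, beta_prior = 20.0, 1.0

# Hypothetical observed weekly transaction counts, made up for this illustration.
observed = np.array([22, 19, 24])

# Standard Gamma-Poisson conjugate update.
alpha_post = alpha_prior + observed.sum()
beta_post = beta_prior + len(observed)

print(alpha_post / beta_post)  # posterior mean of the weekly rate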
First, we define our action structure.
[2]:
from enum import Enum


class VisitorActions(Enum):
    STATUS_QUO = 0
    NEW_PROPOSAL = 1

    def take_action(self, visitor: VisitorOracle):
        # Dispatch to the oracle method that matches this action.
        if self == VisitorActions.STATUS_QUO:
            visitor.status_quo_action()
        elif self == VisitorActions.NEW_PROPOSAL:
            visitor.new_proposal_action()
And our Agent instance. This time, we’ll be using the UpperConfidenceBound policy, which uses the upper bound of a credible interval to pick which arm to pull. Somewhat arbitrarily, we’ll pick a one-sided 84% credible interval, which corresponds roughly to a \(\mu + \sigma\) interval, given a normally distributed posterior.
[3]:
from bayesianbandits import Agent, Arm, GammaRegressor, UpperConfidenceBound
arms = [
    Arm(VisitorActions.STATUS_QUO, learner=GammaRegressor(alpha=20, beta=1)),
    Arm(VisitorActions.NEW_PROPOSAL, learner=GammaRegressor(alpha=20, beta=1)),
]
agent = Agent(arms=arms, policy=UpperConfidenceBound(0.84))
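To make the upper-confidence-bound idea concrete, here is a minimal, illustrative sketch: score each arm by the 0.84 quantile of a Gamma posterior over its Poisson rate, and pull the arm with the highest score. The posterior parameters below are invented for the example, and this is a sketch of the idea rather than the library's implementation.

from scipy.stats import gamma

# Hypothetical Gamma posteriors (shape alpha, rate beta) over each arm's Poisson rate.
posteriors = {
    "status quo": (20.0, 1.0),
    "new proposal": (25.0, 1.0),
}

# Upper bound of a one-sided 84% credible interval for each arm.
ucb = {
    arm: gamma.ppf(0.84, a=alpha, scale=1.0 / beta)
    for arm, (alpha, beta) in posteriors.items()
}

# Pull the arm with the highest upper confidence bound.
best_arm = max(ucb, key=ucb.get)
print(ucb, best_arm)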
Now, let’s simulate some learning. We’ll say we want to run this experiment for a quarter, or 13 weeks.
[4]:
# One quarter of weekly pulls: choose an arm, take the action, then update with the observed reward.
for _ in range(13):
    action_token, = agent.pull()
    action_token.take_action(two_armed_bandit)
    agent.update(np.array([two_armed_bandit.rewards[-1]]))
Indeed, we see that our agent has identified the new proposal arm as the better option and spent more time pulling it.
[5]:
import matplotlib.pyplot as plt

plt.hist(
    agent.arms[0].sample(np.array([[1]]), size=5000),
    alpha=0.8,
    label="status quo arm",
    density=True,
    stacked=True,
    bins=100,
)
plt.hist(
    agent.arms[1].sample(np.array([[1]]), size=5000),
    alpha=0.8,
    label="new proposal arm",
    density=True,
    stacked=True,
    bins=100,
)
plt.xlabel("Mean Reward")
plt.ylabel("Frequency")
plt.legend()
plt.show()
As happy as we might be about the above results, our happiness is dashed when our marketing team comes to us with brand-new market research suggesting that our customers actually fall into two major demographics, each of which may react differently to our proposed website layouts. This is an example of a contextual multi-armed bandit problem: in addition to being presented with a choice, we receive some side information (context) that we’d like to incorporate into our decision making.
Fortunately, bayesianbandits can handle contextual bandits with the ContextualAgent class.
[6]:
from bayesianbandits import ContextualAgent
contextual_bandit = VisitorOracle(20, 25)
context_arms = [
    Arm(VisitorActions.STATUS_QUO, learner=GammaRegressor(alpha=20, beta=1)),
    Arm(VisitorActions.NEW_PROPOSAL, learner=GammaRegressor(alpha=20, beta=1)),
]
context_aware_agent = ContextualAgent(arms=context_arms, policy=UpperConfidenceBound(0.84))
This time, let’s say that the X = 1 demographic has a positive reaction to our proposed layout, but the X = 0 demographic reacts negatively. By giving the context information to the agent during the pull and update phases, it will learn which action works best in each context.
Let’s simulate another year of data.
[7]:
# One year of weekly pulls, each with a randomly drawn demographic context.
for _ in range(52):
    X = np.atleast_2d(np.random.randint(0, 2))
    if X == 1:
        # The X = 1 demographic responds well to the proposed layout.
        contextual_bandit.n = 20
        contextual_bandit.m = 25
    else:
        # The X = 0 demographic responds poorly to the proposed layout.
        contextual_bandit.n = 20
        contextual_bandit.m = 18
    action, = context_aware_agent.pull(X)
    action.take_action(contextual_bandit)
    context_aware_agent.update(X, np.atleast_1d(contextual_bandit.rewards[-1]))
By plotting what our agent has learned about each arm, we can see that it has correctly identified that in context 0, the status quo arm is most rewarding, while in context 1, the new proposal arm is most rewarding.
[8]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

ax[0].hist(
    context_aware_agent.arms[0].sample(np.array([[0]]), size=5000),
    alpha=0.8,
    label="status quo arm",
    density=True,
    stacked=True,
    bins=100,
)
ax[0].hist(
    context_aware_agent.arms[1].sample(np.array([[0]]), size=5000),
    alpha=0.8,
    label="new proposal arm",
    density=True,
    stacked=True,
    bins=100,
)
ax[0].set_xlabel("Mean Reward")
ax[0].set_ylabel("Frequency")
ax[0].set_title("Context 0")
ax[0].legend()

ax[1].hist(
    context_aware_agent.arms[0].sample(np.array([[1]]), size=5000),
    alpha=0.8,
    label="status quo arm",
    density=True,
    stacked=True,
    bins=100,
)
ax[1].hist(
    context_aware_agent.arms[1].sample(np.array([[1]]), size=5000),
    alpha=0.8,
    label="new proposal arm",
    density=True,
    stacked=True,
    bins=100,
)
ax[1].set_xlabel("Mean Reward")
ax[1].set_title("Context 1")
plt.show()