Delayed and non-stationary multi-armed bandit problems

The traditional setup for multi-armed bandits is that the reward is given immediately after the action is taken. However, in many real-world applications, the reward is delayed. For example, in dynamic pricing, the reward is given after the customer makes the purchase, which can be days or weeks after the price is set. The problem becomes more complicated if the bandit is non-stationary, i.e., the reward distribution changes over time. In this notebook, we will show how to use bayesianbandits to approach this family of problems.

How can we implement a geographic pricing strategy?

Suppose we are a company that sells a product in multiple cities. We want to maximize our profit by setting the price for each city. The demand for the product is different in each city, and it changes over time. We want to learn the demand curve for each city and set the price accordingly. The reward is the profit: the price minus the unit cost, times the quantity sold. The cost is the same for all cities, so we’ll treat it as zero for now, and we’ll make the simplifying assumption that supply is infinite.

Simulating the problem

We will simulate the problem using the GeographicPricing class.

[1]:
import numpy as np

np.random.seed(2)


class GeographicPricing:
    def __init__(self, num_locations: int):
        # Randomly generate demand parameters for each location
        self.demand_intercepts = np.random.uniform(low=50, high=100, size=num_locations)
        self.demand_slopes = np.random.uniform(low=-2, high=-0.1, size=num_locations)

        self.num_locations = num_locations

    def demand(self, location: int, price: int) -> float:
        # Linear demand function
        return max(
            self.demand_intercepts[location] + self.demand_slopes[location] * price, 0
        )

    def margin(self, location: int, price: int) -> float:
        # Margin is price times demand, as we're assuming zero marginal cost
        return price * self.demand(location, price)

    def adjust_demand(self):
        # Randomly adjust demand parameters for all locations
        adjustment_factor_intercept = np.random.uniform(
            low=-0.1, high=0.1, size=self.num_locations
        )
        adjustment_factor_slope = np.random.uniform(
            low=-0.05, high=0.05, size=self.num_locations
        )
        self.demand_intercepts += adjustment_factor_intercept * self.demand_intercepts
        self.demand_slopes += adjustment_factor_slope * self.demand_slopes

    def reset_demand(self):
        # Reset demand parameters to original values
        self.demand_intercepts = np.random.uniform(
            low=50, high=100, size=self.num_locations
        )
        self.demand_slopes = np.random.uniform(
            low=-2, high=-0.1, size=self.num_locations
        )


oracle = GeographicPricing(10)
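Before we build the bandit, it’s worth knowing what “optimal” looks like in this simulator. For a linear demand curve q = a + b * p (with b < 0), the margin p * q is maximized at p* = -a / (2 * b). As a quick sanity check (information the agent never gets to see), we can compute that continuous optimum for each city directly from the oracle’s parameters:

# Revenue-maximizing price per city under the linear demand model.
# The agent never observes these values; they are only for sanity-checking
# the simulator and interpreting which discrete price should win later.
continuous_optimum = -oracle.demand_intercepts / (2 * oracle.demand_slopes)
print(continuous_optimum.round(2))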

We’ll again use Thompson sampling as the bandit algorithm. We’ll have a global intercept, as well as an intercept for each location. Combined with the regularization induced by the NormalInverseGammaRegressor’s prior, this creates a sort of “poor man’s hierarchical model” in which each location’s intercept is pulled towards the global intercept. Let’s say we know from historical data that we make about 1000 dollars per market per day, so we’ll set the prior mean of the global intercept to 1000 and all other prior means to 0.
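To make that concrete, each context row we pass to the agent will be a leading 1 for the global intercept followed by a one-hot encoding of the city, and the prior mean vector puts 1000 on the global intercept and 0 on every city offset. A minimal sketch of one such row and its prior prediction:

# Context row for city 3: a leading 1 for the global intercept,
# then a one-hot indicator for the city.
num_cities = 10
context_row = np.append(1.0, np.eye(num_cities)[3])  # shape (11,)

# Prior mean: the historical ~1000 dollars/day baseline on the global
# intercept, zero on each city offset, so every city starts at the baseline.
prior_mean = np.array([1000] + [0] * num_cities)
print(context_row @ prior_mean)  # 1000.0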

First, we’ll define our action space. To keep things simple, we’ll consider a discrete action space with 5 possible prices:

[2]:
from enum import Enum


class Price(Enum):
    price_5 = 5
    price_8 = 8
    price_11 = 11
    price_14 = 14
    price_17 = 17

    def take_action(self, location: int) -> float:
        return oracle.margin(location, self.value)

Next, we’ll create our ContextualAgent.

[3]:
from bayesianbandits import NormalInverseGammaRegressor, Arm, ContextualAgent, ThompsonSampling

policy = ThompsonSampling()

arms = [
    Arm(
        price,
        learner=NormalInverseGammaRegressor(mu=np.array([1000] + [0] * 10)),
    )
    for price in Price
]

agent = ContextualAgent(arms=arms, policy=policy, random_seed=111)

For this first simulation, we’ll assume demand is constant, so our measure of success will be how quickly the agent learns the true demand curve for each city. This is where the delay comes in: we only get to update our agent once per day, so we have to set prices for all 10 cities at once and then gather the rewards at the end of the day.

In this simulation, the reward for any city-price combination is constant over time, so computing regret is easy: for each city, we take the margin at the best price and subtract the margin at the price we actually chose.

[4]:
# Context matrix: a global-intercept column of ones followed by a
# one-hot encoding of each of the 10 cities
one_hot_encoding = np.append(np.ones((10, 1)), np.eye(10), axis=-1)

regret: list[float] = []

# Best achievable margin for each city (the reward of the best price)
best_price_per_city = np.array(
    [
        [oracle.margin(location, price.value) for price in Price]
        for location in range(10)
    ]
).max(axis=1)

for iteration in range(365):
    # Pick prices for each location
    decisions = agent.pull(one_hot_encoding)

    # End of the day, collect rewards
    rewards = [token.take_action(location) for location, token in enumerate(decisions)]
    regret.append((best_price_per_city - np.array(rewards)).sum())

    # Update the agent
    for token, reward, context_row in zip(decisions, rewards, one_hot_encoding):
        agent.select_for_update(token).update(
            np.atleast_2d(context_row), np.atleast_1d(reward)
        )

We can see that the agent achieves sublinear regret quite quickly.

[5]:
import matplotlib.pyplot as plt

plt.plot(np.cumsum(regret))
plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
[5]:
Text(0, 0.5, 'Cumulative Regret')
../_images/notebooks_delayed-reward_9_1.png

For the demand parameters we generated, the $17 price turns out to be optimal for every city. We can see that the agent’s learned expected reward per city at the $17 price is very close to the true maximum reward.

[6]:
# Learned expected reward at the $17 price: global intercept plus each city's offset
agent.arm(Price.price_17).learner.coef_[0] + agent.arm(Price.price_17).learner.coef_[1:]
[6]:
array([ 983.33767094,  585.4084048 ,  813.26291113,  923.87281284,
        730.97890218,  983.68946315,  914.73714568, 1069.20621592,
        991.20988766,  543.58047382])
[7]:
best_price_per_city
[7]:
array([ 983.66025439,  584.58922102,  813.11095419,  924.02978011,
        730.588562  ,  984.01212752,  914.86917212, 1069.7657688 ,
        991.55338423,  542.53651526])
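To quantify “very close”, we can look at the largest absolute gap between the learned expected rewards at the $17 price and the true best margins:

# Largest absolute gap between the agent's learned expected reward at $17
# and the true best margin across the 10 cities.
learned_17 = (
    agent.arm(Price.price_17).learner.coef_[0]
    + agent.arm(Price.price_17).learner.coef_[1:]
)
print(np.abs(learned_17 - best_price_per_city).max())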

While that was a fun simulation, it’s not very realistic: in reality, demand changes over time. Let’s simulate that by making the demand parameters for each city follow a random walk. This is still fairly unrealistic, but it lets us see how the agent performs when demand is non-stationary.

[8]:
agent = ContextualAgent(
    arms=[
        # Fresh arms with the same priors as before
        Arm(
            price,
            learner=NormalInverseGammaRegressor(mu=np.array([1000] + [0] * 10)),
        )
        for price in Price
    ],
    policy=policy,
    random_seed=111,
)
np.random.seed(2)
one_hot_encoding = np.append(np.ones((10, 1)), np.eye(10), axis=-1)

regret: list[float] = []

for iteration in range(1000):
    # Start of each week, randomly adjust demand
    if iteration % 7 == 0:
        oracle.adjust_demand()

    # Recompute best price per city
    best_price_per_city = np.array(
        [
            [oracle.margin(location, price.value) for price in Price]
            for location in range(10)
        ]
    ).max(axis=1)

    # Start of the day, adjust prices
    decisions = agent.pull(one_hot_encoding)

    # End of the day, collect rewards
    rewards = [token.take_action(location) for location, token in enumerate(decisions)]
    regret.append((best_price_per_city - np.array(rewards)).sum())

    # Update the agent
    for token, reward, context_row in zip(decisions, rewards, one_hot_encoding):
        agent.select_for_update(token).update(
            np.atleast_2d(context_row), np.atleast_1d(reward)
        )

Now we see that the agent’s performance is much worse, because it cannot adapt to the changing demand. As the plot below shows, the agent begins to learn the true demand curves, but by the time demand drifts it has already settled on a very narrow posterior. That posterior is now out of date, so the agent is slow to adapt.

[9]:
import matplotlib.pyplot as plt

plt.plot(np.cumsum(regret))
plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
[9]:
Text(0, 0.5, 'Cumulative Regret')
../_images/notebooks_delayed-reward_16_1.png
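One way to see this overconfidence directly is to draw posterior samples from an arm and look at how tightly they cluster. Here is a minimal sketch, assuming the learner’s sample method (the same draw Thompson sampling uses internally) can be called directly:

# Sample the expected reward for city 0 at the $17 price many times.
# A small standard deviation relative to the mean indicates a very narrow
# posterior -- the agent is confident, even though demand has drifted.
# (Assumes learner.sample(X, size) is available for direct inspection.)
context_city_0 = np.atleast_2d(one_hot_encoding[0])
samples = agent.arm(Price.price_17).learner.sample(context_city_0, size=500)
print(samples.mean(), samples.std())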

To counteract this, we can use the agent’s decay method to increase the variance of our posterior distributions, encouraging the agent to maintain a baseline level of exploration. The choice of decay_rate is somewhat arbitrary: it is essentially an exponential rate at which the variance of the posterior distributions is inflated. We can see that this helps the agent adapt to the changing demand. However, decaying too often can prevent the agent from ever learning how to maximize the reward.

[10]:
np.random.seed(2)
oracle.reset_demand()
agent = ContextualAgent(
    arms=[
        # Fresh arms with the same priors as before
        Arm(
            price,
            learner=NormalInverseGammaRegressor(mu=np.array([1000] + [0] * 10)),
        )
        for price in Price
    ],
    policy=policy,
    random_seed=111,
)

one_hot_encoding = np.append(np.ones((10, 1)), np.eye(10), axis=-1)

regret: list[float] = []

for iteration in range(1000):
    # Start of each week, randomly adjust demand
    if iteration % 7 == 0:
        oracle.adjust_demand()

    # Recompute best price per city
    best_price_per_city = np.array(
        [
            [oracle.margin(location, price.value) for price in Price]
            for location in range(10)
        ]
    ).max(axis=1)

    # Start of the day, adjust prices
    decisions = agent.pull(one_hot_encoding)

    # End of the day, collect rewards
    rewards = [token.take_action(location) for location, token in enumerate(decisions)]
    regret.append((best_price_per_city - np.array(rewards)).sum())

    # Update the agent
    for token, reward, context_row in zip(decisions, rewards, one_hot_encoding):
        agent.select_for_update(token).update(
            np.atleast_2d(context_row), np.atleast_1d(reward)
        )

    # End of each day, decay the learners so their posteriors don't become
    # too narrow to adapt to drifting demand
    agent.decay(one_hot_encoding[0], decay_rate=0.98)

In this case, the agent is able to keep up with the changing demand, but that does not necessarily result in lower cumulative regret: because the agent explores more, it also chooses suboptimal prices more often. When using decay, we must be careful to treat decay_rate as a hyperparameter and tune it accordingly; a sketch of such a sweep follows the plot below.

[11]:
import matplotlib.pyplot as plt

plt.plot(np.cumsum(regret))
plt.xlabel("Round")
plt.ylabel("Cumulative Regret")
[11]:
Text(0, 0.5, 'Cumulative Regret')
../_images/notebooks_delayed-reward_20_1.png
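As a starting point for that tuning, we could wrap the non-stationary simulation in a helper and compare total regret across a few candidate decay rates. Here is a minimal sketch; the helper name run_simulation and the candidate rates are ours, not part of the library:

def run_simulation(decay_rate: float, rounds: int = 1000) -> float:
    # Re-run the non-stationary pricing simulation and return total regret.
    np.random.seed(2)
    oracle.reset_demand()
    agent = ContextualAgent(
        arms=[
            # Same arms and priors as above
            Arm(
                price,
                learner=NormalInverseGammaRegressor(mu=np.array([1000] + [0] * 10)),
            )
            for price in Price
        ],
        policy=ThompsonSampling(),
        random_seed=111,
    )
    contexts = np.append(np.ones((10, 1)), np.eye(10), axis=-1)
    total_regret = 0.0

    for iteration in range(rounds):
        if iteration % 7 == 0:
            oracle.adjust_demand()
        best_margin = np.array(
            [[oracle.margin(loc, p.value) for p in Price] for loc in range(10)]
        ).max(axis=1)
        decisions = agent.pull(contexts)
        rewards = [token.take_action(loc) for loc, token in enumerate(decisions)]
        total_regret += (best_margin - np.array(rewards)).sum()
        for token, reward, row in zip(decisions, rewards, contexts):
            agent.select_for_update(token).update(
                np.atleast_2d(row), np.atleast_1d(reward)
            )
        agent.decay(contexts[0], decay_rate=decay_rate)

    return total_regret


for rate in [0.90, 0.95, 0.98, 0.99]:
    print(rate, run_simulation(rate))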