Thompson sampling in Python¶
This notebook provides a step-by-step guide to implementing basic Thompson sampling in Python.
Using this implementation, we will then run simulations to test the theoretical results from class.
We will use the numpy package for all our numerical calculations.
import numpy as np
Let's go over some preliminaries to familiarize ourselves with the required commands.
Assume we have three treatment arms, 0, 1, and 2, and two possible outcomes, 0 and 1.
To get started, define two example vectors of treatment values and outcome values.
We just use these for illustration, and to develop our implementation.
treatments = np.array([0,1,0,1,0,2])
outcomes = np.array([0,0,0,1,0,1])
Which observations have treatment value 1?
np.equal(treatments, 1)
array([False, True, False, True, False, False])
What were the outcomes for these observations?
outcomes[np.equal(treatments, 1)]
array([0, 1])
What is the total number of successes (outcome 1) for these observations?
np.sum(outcomes[np.equal(treatments, 1)])
1
What is the total number of failures (outcome 0) for these observations?
np.sum(1 - outcomes[np.equal(treatments, 1)])
1
Now count total successes and total failures for each treatment arm.
The following uses Python's elegant list comprehension syntax, instead of an explicit loop over the k treatment arms.
k = 3
successes = np.array([np.sum( outcomes[np.equal(treatments, d)]) for d in range(k)])
failures = np.array([np.sum(1 - outcomes[np.equal(treatments, d)]) for d in range(k)])
print(successes)
print(failures)
[0 1 1]
[3 1 0]
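For comparison, here is a sketch of the same computation with an explicit loop; it is equivalent, just more verbose. (The variable names successes_loop and failures_loop are ours, chosen to avoid overwriting the arrays above.)
successes_loop = np.zeros(k, dtype = int)
failures_loop = np.zeros(k, dtype = int)
for d in range(k):
    arm_outcomes = outcomes[np.equal(treatments, d)]  # outcomes of units assigned to arm d
    successes_loop[d] = np.sum(arm_outcomes)          # number of 1s for arm d
    failures_loop[d] = np.sum(1 - arm_outcomes)       # number of 0s for arm d
print(successes_loop)
print(failures_loop)
This should print the same [0 1 1] and [3 1 0] as above.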
Generate a random draw from the beta distribution with parameters 1 and 2.
np.random.beta(1, 2)
0.20135263034038475
Assume a uniform prior for each treatment arm.
Then the true mean of each arm has a posterior distribution which is a Beta distribution with parameters 1 + successes and 1 + failures.
For example, treatment 2 in the above example has 1 success and 0 failures, so we get a Beta(2,1) posterior.
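As a quick sanity check of this formula, we can draw many samples from the Beta(2,1) posterior and compare their average to the theoretical posterior mean $2/(2+1) = 2/3$. (This cell is purely illustrative; the number of draws is arbitrary.)
posterior_samples = np.random.beta(2, 1, size = 100000)  # many draws from the Beta(2,1) posterior
print(posterior_samples.mean())  # should be close to 2/3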
We can sample one draw from the posterior for each arm as follows, since numpy functions are vectorized.
posterior_draw = np.random.beta(1 + successes, 1 + failures)
print(posterior_draw)
[0.19611936 0.42983356 0.3486667 ]
Which of these draws is the largest? (Remember that Python indices start at 0.)
np.argmax(posterior_draw)
1
Now we are ready to put everything together into one function.
This function takes vectors of treatment values and outcomes, and returns a treatment value according to Thompson sampling.
def thompson(outcomes, treatments, k):
    '''Given a vector of binary outcomes and discrete treatments,
    return a treatment value according to Thompson sampling.'''
    # Count successes (1s) and failures (0s) for each of the k arms.
    successes = np.array([np.sum(outcomes[np.equal(treatments, d)]) for d in range(k)])
    failures = np.array([np.sum(1 - outcomes[np.equal(treatments, d)]) for d in range(k)])
    # Draw once from each arm's Beta posterior; assign the arm with the largest draw.
    posterior_draw = np.random.beta(1 + successes, 1 + failures)
    return np.argmax(posterior_draw)
Now let's test this function.
Every time we call it, it returns a treatment value.
This is the treatment that we should assign to the next unit, according to Thompson sampling.
thompson(outcomes, treatments, k)
1
Let's do that again. Thompson sampling returns a random treatment, so the answer might be different each time.
thompson(outcomes, treatments, k)
0
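To see this randomness more systematically, we can call the function many times and tally how often each arm gets assigned; arm 2 should come up most often, since its Beta(2,1) posterior has the highest mean. (This is just a sketch for illustration; the exact counts will vary from run to run.)
assignments = np.array([thompson(outcomes, treatments, k) for _ in range(1000)])
print(np.bincount(assignments, minlength = k))  # assignment counts for arms 0, 1, 2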
Code for simulations¶
The function that we just built is all you need to run an experiment using Thompson sampling.
But to simulate how well Thompson sampling performs, we need to do more work.
Let us first build a function which simulates outcomes according to the parameter vector $\theta$, and assigns treatment according to Thompson sampling.
We can simulate outcomes using np.random.binomial.
The following does so for an example value of $\theta$.
theta = np.array([.2,.5,.8])
treatment = 2
np.random.binomial(1, theta[treatment])
1
We can pre-allocate memory by using np.empty. This is good practice for faster computation.
(Note that np.empty does not initialize the array; the values you see are whatever happened to be in memory, so do not rely on them being 0.)
np.empty(10, dtype = np.int8)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int8)
To simulate a trajectory of treatments and outcomes, we need to loop over time periods t going from 1 to T.
In each time period, the thompson function uses the trajectory generated thus far to give a new treatment assignment.
The outcome is then simulated using this assigned treatment.
The following function returns trajectories of treatment, outcome, and expected outcome for each time period, stored in a matrix.
def simulate_thompson_path(theta, T):
    '''Given a vector of expected potential outcomes,
    simulate a path of outcomes and treatments according to Thompson sampling.'''
    k = len(theta)
    treatment = np.empty(T, dtype = np.int8)
    outcome = np.empty(T, dtype = np.int8)
    for t in range(T):
        # Assign treatment based on the trajectory observed so far (periods 0, ..., t-1).
        treatment[t] = thompson(outcome[:t], treatment[:t], k)
        # Simulate the outcome of the assigned treatment.
        outcome[t] = np.random.binomial(1, theta[treatment[t]])
    return np.array([treatment, outcome, theta[treatment]])
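One optional note: if you want the simulated path below to be exactly reproducible across runs, you can seed numpy's random number generator first. (The seed value is arbitrary, and this only seeds the current process, so it does not control the parallel simulations further down.)
np.random.seed(42)  # arbitrary seed, purely for reproducibility; remove for fresh randomness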
Let's test this function.
theta = np.array([.2,.5,.8])
T = 10
print(simulate_thompson_path(theta, T))
[[0.  2.  2.  2.  2.  2.  2.  1.  1.  1. ]
 [0.  1.  1.  1.  1.  1.  1.  1.  1.  0. ]
 [0.2 0.8 0.8 0.8 0.8 0.8 0.8 0.5 0.5 0.5]]
Now we want to set up simulations where we evaluate the performance of Thompson sampling.
To do so, we want to run the simulation many times, from scratch, and to store the results.
Any time you have to do the same computation many times, parallel processing is a good idea to make things run faster.
Parallel processing is easy in Python with the multiprocess package.
(Unlike the standard library's multiprocessing, multiprocess uses dill for serialization, which is what lets us pass a lambda to Pool.map below.)
from multiprocess import Pool
The following function uses Pool to repeatedly (reps times) run simulate_thompson_path.
It stores the results in the list paths, and then returns the average across simulations.
Recall that we want to maximize the average of outcomes. If the algorithm works well, this average should go up over time.
def average_thompson_path(theta, T, reps = 5000):
    '''Simulate reps replicate Thompson paths and average them.'''
    with Pool() as pool:
        # Run the simulation reps times in parallel.
        paths = pool.map(lambda _: simulate_thompson_path(theta, T),
                         range(reps))
    # Average treatment, outcome, and expected outcome across replications, period by period.
    return np.array(paths).mean(axis = 0)
Let's try it, for a small number of reps (just so it runs quickly).
average_thompson_path(theta, 80, reps = 100)
array([[1.1  , 1.06 , 1.32 , 1.5  , 1.39 , 1.34 , 1.58 , 1.56 , 1.41 , 1.56 , 1.41 , 1.83 , 1.82 , 1.81 , 1.74 , 1.91 , 1.97 , 1.55 , 1.84 , 1.9  , 1.59 , 1.59 , 1.98 , 1.84 , 1.92 , 1.75 , 1.67 , 1.44 , 1.84 , 1.75 , 1.84 , 1.91 , 1.83 , 1.59 , 1.67 , 1.84 , 1.83 , 1.84 , 1.68 , 1.82 , 1.76 , 1.6  , 1.83 , 2.   , 1.76 , 1.83 , 1.92 , 2.   , 2.   , 2.   , 1.76 , 1.99 , 1.99 , 2.   , 1.76 , 1.99 , 1.84 , 2.   , 1.76 , 2.   , 2.   , 2.   , 2.   , 1.99 , 1.92 , 1.84 , 1.84 , 2.   , 1.84 , 1.92 , 1.52 , 1.99 , 2.   , 1.67 , 2.   , 1.92 , 1.99 , 2.   , 1.84 , 2.   ],
       [0.65 , 0.57 , 0.74 , 0.82 , 0.67 , 0.74 , 0.83 , 0.57 , 0.75 , 0.73 , 0.68 , 0.82 , 0.92 , 0.83 , 0.66 , 0.83 , 0.73 , 0.5  , 0.68 , 0.75 , 0.6  , 0.74 , 0.91 , 0.91 , 0.67 , 0.76 , 0.76 , 0.67 , 0.75 , 0.84 , 0.68 , 0.68 , 0.59 , 0.83 , 0.75 , 0.67 , 0.91 , 1.   , 0.92 , 0.59 , 0.83 , 0.68 , 0.73 , 0.91 , 0.76 , 0.75 , 0.68 , 0.6  , 0.89 , 0.68 , 0.83 , 0.83 , 0.76 , 0.66 , 1.   , 0.99 , 0.59 , 0.76 , 0.67 , 1.   , 0.83 , 0.75 , 0.84 , 0.92 , 0.75 , 0.59 , 0.83 , 0.9  , 0.84 , 0.75 , 0.58 , 0.84 , 0.6  , 0.83 , 0.82 , 0.83 , 0.9  , 0.68 , 0.92 , 0.91 ],
       [0.53 , 0.518, 0.596, 0.65 , 0.617, 0.602, 0.674, 0.668, 0.623, 0.668, 0.623, 0.749, 0.746, 0.743, 0.722, 0.773, 0.791, 0.665, 0.752, 0.77 , 0.677, 0.677, 0.794, 0.752, 0.776, 0.725, 0.701, 0.632, 0.752, 0.725, 0.752, 0.773, 0.749, 0.677, 0.701, 0.752, 0.749, 0.752, 0.704, 0.746, 0.728, 0.68 , 0.749, 0.8  , 0.728, 0.749, 0.776, 0.8  , 0.8  , 0.8  , 0.728, 0.797, 0.797, 0.8  , 0.728, 0.797, 0.752, 0.8  , 0.728, 0.8  , 0.8  , 0.8  , 0.8  , 0.797, 0.776, 0.752, 0.752, 0.8  , 0.752, 0.776, 0.656, 0.797, 0.8  , 0.701, 0.8  , 0.776, 0.797, 0.8  , 0.752, 0.8  ]])
This is a bit hard to read.
Let us instead plot the simulation output, using the Python library matplotlib.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
The following function plots average regret across simulations, for each time period $t$.
Recall that regret in each period is defined as the difference between:
- the expected outcome for the best treatment, max(theta), and
- the expected outcome for the chosen treatment, which we stored in avg_thompson[2].
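For example, in the very first period every arm has the same Beta(1,1) posterior, so Thompson sampling assigns each of the three arms with probability 1/3. With $\theta = (0.2, 0.5, 0.8)$, the expected regret in that period is therefore $0.8 - (0.2 + 0.5 + 0.8)/3 = 0.3$, which is roughly where the regret curve should start.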
def plot_average_thompson_path(avg_thompson, theta, T):
    '''Plot average regret per period, given averaged simulation output.'''
    # Per-period regret: the best arm's expected outcome minus
    # the average expected outcome of the chosen arms.
    regret = max(theta) - avg_thompson[2]
    plt.plot(range(T), regret)
    plt.ylim(bottom = 0)
    plt.xlabel('Time $t$')
    plt.ylabel('Regret')
    plt.title(f'Expected regret of Thompson sampling for parameter vector {theta}')
Generate plots¶
Now we are ready to see how well Thompson sampling works for different values of $\theta$, over time.
T = 80
theta = np.array([.2,.5,.8])
plot_average_thompson_path(average_thompson_path(theta, T),
                           theta, T)
theta = np.array([.3,.4,.5,.6,.7])
plot_average_thompson_path(average_thompson_path(theta, T),
                           theta, T)
Regret as a function of $\theta$¶
In the theory lecture, we discussed how bandit problems are easy when the gap between arms is either large (the arms are easy to tell apart) or small (choosing the wrong arm costs little), and hardest for intermediate gaps.
Let's see how that plays out in simulations, for $T=200$.
In the following we hold $\theta_1$ fixed at 0.5, and vary the value of $\theta_2$.
For each value of $\theta_2$, we run average_thompson_path.
We then plot average per-period regret (cumulative regret divided by $T$) as a function of $\theta_2$.
This function takes a while to run, even on a fast computer.
def cumulative_average_regret(T):
    '''Plot average per-period regret of Thompson sampling as a function
    of theta_2, holding theta_1 fixed at .5.'''
    theta2s = np.linspace(0, 1, 21)
    # For each theta_2: per-period regret is the best arm's mean minus
    # the average expected outcome of the chosen arms (row 2 of the averaged paths).
    paths = [np.max([.5, theta2]) - average_thompson_path(np.array([.5, theta2]), T)[2]
             for theta2 in theta2s]
    regret_function = np.array([np.mean(path) for path in paths])
    plt.plot(theta2s, regret_function)
    plt.ylim(bottom = 0)
    plt.xlabel('$\\theta_2$')
    plt.ylabel('Regret')
    plt.title('Expected regret of Thompson sampling for different $\\theta_2$')
cumulative_average_regret(200)