Case Study: A/B Tests

A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other. A/B testing is very much like hypothesis testing with the following hypotheses:

  • Null Hypothesis: The new version is no better, or even worse, than the old version
  • Alternative Hypothesis: The new version is better than the old version

If we fail to reject the null hypothesis, the results would suggest keeping the old version. If we reject the null hypothesis, the results would suggest launching the change. These tests can be used for a wide variety of changes, from large feature additions to small adjustments in color, to see which change maximizes your metric.
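In terms of a concrete metric such as the click through rate (CTR), which is the metric used in the code below, one common way to write these hypotheses is:

$$H_0: CTR_{new} - CTR_{old} \leq 0 \qquad\qquad H_1: CTR_{new} - CTR_{old} > 0$$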

A/B testing also has its drawbacks. It can help you compare two options, but it can’t tell you about an option you haven’t considered. It can also produce biased results when tested on existing users, due to factors like change aversion and novelty effect.

  • Change Aversion: Existing users may give an unfair advantage to the old version, simply because they are unhappy with change, even if it’s ultimately for the better.
  • Novelty Effect: Existing users may give an unfair advantage to the new version, because they’re excited or drawn to the change, even if it isn’t any better in the long run.

 

# get actions from control group
control_df = df.query('group == "control"')

# compute click through rate
control_ctr = control_df.query('action == "click"').count()[0] / control_df.query('action == "view"').count()[0]

# view click through rate
control_ctr

# get actions from experiment group
experiment_df = df.query('group == "experiment"')

# compute click through rate
experiment_ctr = experiment_df.query('action == "click"').count()[0] / experiment_df.query('action == "view"').count()[0]

# view click through rate
experiment_ctr

# compute observed difference in click through rates
obs_diff = experiment_ctr - control_ctr
obs_diff

# simulate sampling distribution for difference in proportions, or CTRs
diffs = []
for _ in range(10000):
    b_samp = df.sample(df.shape[0], replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('action == "click"').count()[0] / control_df.query('action == "view"').count()[0]
    experiment_ctr = experiment_df.query('action == "click"').count()[0] / experiment_df.query('action == "view"').count()[0]
    diffs.append(experiment_ctr - control_ctr)

# convert to numpy
diffs = np.array(diffs)

# plot sampling distribution
plt.hist(diffs);

# simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# plot null distribution and line at our observed difference
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red');

# compute p-value
(null_vals > obs_diff).mean()

 

Let’s recap the steps we took to analyze the results of this A/B test.

  1. We computed the observed difference in the metric, click through rate, between the control and experiment groups.
  2. We simulated the sampling distribution for the difference in proportions (or difference in click through rates).
  3. We used this sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.
  4. We computed the p-value by finding the proportion of values in the null distribution that were greater than our observed difference.
  5. We used this p-value to determine the statistical significance of our observed difference (a decision-rule sketch follows below).
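As a minimal sketch of step 5, assuming a significance level of 0.05 (the threshold itself is not specified above), the decision rule could look like this:

# assumed Type I error threshold -- not specified in the original analysis
alpha = 0.05

# p-value from the cell above
p_value = (null_vals > obs_diff).mean()

if p_value < alpha:
    print('Reject the null: the new version appears to improve the click through rate.')
else:
    print('Fail to reject the null: keep the old version.')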

 

Four metrics are considered in this case study; the code below works through the first two, and a hedged sketch for the last two follows this list:

  1. Enrollment Rate: Click through rate for the Enroll button on the course overview page
  2. Average Reading Duration: Average number of seconds spent on the course overview page
  3. Average Classroom Time: Average number of days spent in the classroom for students enrolled in the course
  4. Completion Rate: Course completion rate for students enrolled in the course
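For metrics 3 and 4, the same comparison pattern applies to a classroom dataset. As a hedged sketch (using the imports in the next cell, and assuming a hypothetical classroom_actions.csv with a total_days column and a boolean completed column — these names are illustrative, not from the original material):

# hypothetical file and column names -- adjust to the actual classroom dataset
df_classroom = pd.read_csv('classroom_actions.csv')

# Average Classroom Time: difference in mean days spent in the classroom
control_days = df_classroom.query('group == "control"')['total_days'].mean()
experiment_days = df_classroom.query('group == "experiment"')['total_days'].mean()
experiment_days - control_days

# Completion Rate: difference in the proportion of enrolled students who complete
control_comp = df_classroom.query('group == "control"')['completed'].mean()
experiment_comp = df_classroom.query('group == "experiment"')['completed'].mean()
experiment_comp - control_comp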

 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

df = pd.read_csv('course_page_actions.csv')
df.head()

# Get dataframe with all records from control group
control_df = df.query('group == "control"')

# Compute click through rate for control group
control_ctr = control_df.query('action == "enroll"').count()[0] / control_df.query('action == "view"').count()[0]

# Display click through rate
control_ctr

# Get dataframe with all records from experiment group
experiment_df = df.query('group == "experiment"')

# Compute click through rate for experiment group
experiment_ctr = experiment_df.query('action == "enroll"').count()[0] / experiment_df.query('action == "view"').count()[0]

# Display click through rate
experiment_ctr

# Compute the observed difference in click through rates
obs_diff = experiment_ctr - control_ctr

# Display observed difference
obs_diff

# Create a sampling distribution of the difference in proportions
# with bootstrapping
diffs = []
size = df.shape[0]
for _ in range(10000):
    b_samp = df.sample(size, replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('action == "enroll"').count()[0] / control_df.query('action == "view"').count()[0]
    experiment_ctr = experiment_df.query('action == "enroll"').count()[0] / experiment_df.query('action == "view"').count()[0]
    diffs.append(experiment_ctr - control_ctr)

# Convert to numpy array
diffs = np.array(diffs)

# Plot sampling distribution
plt.hist(diffs);

# Simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# Plot the null distribution
plt.hist(null_vals);

# Plot observed statistic with the null distribution
plt.hist(null_vals);
plt.axvline(obs_diff, c='red');

# Compute p-value
(null_vals > obs_diff).mean()

 

views = df.query('action == "view"')

reading_times = views.groupby(['id', 'group'])['duration'].mean()

reading_times = reading_times.reset_index()

reading_times.head()

 

# compute average reading durations for each group
control_mean = df.query('group == "control"').duration.mean()
experiment_mean = df.query('group == "experiment"').duration.mean()
control_mean, experiment_mean


# compute observed difference in means
obs_diff = experiment_mean - control_mean
obs_diff

# simulate sampling distribution for the difference in means
diffs = []
for _ in range(10000):
    b_samp = df.sample(df.shape[0], replace=True)
    control_mean = b_samp.query('group == "control"').duration.mean()
    experiment_mean = b_samp.query('group == "experiment"').duration.mean()
    diffs.append(experiment_mean - control_mean)

# convert to numpy array
diffs = np.array(diffs)

# plot sampling distribution
plt.hist(diffs);

# simulate the distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# plot null distribution
plt.hist(null_vals);

# plot null distribution and where our observed statistic falls
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red');

# compute p-value
(null_vals > obs_diff).mean()

  1. We computed the observed difference in the metric, average reading duration, between the control and experiment groups.
  2. We simulated the sampling distribution for the difference in means (or average reading durations).
  3. We used this sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.
  4. We computed the p-value by finding the proportion of values in the null distribution that were greater than our observed difference.
  5. We used this p-value to determine the statistical significance of our observed difference.

 

The more metrics you evaluate, the more likely you are to observe significant differences just by chance – similar to what you saw in previous lessons with multiple tests. Luckily, this multiple comparisons problem can be handled in several ways.
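The simplest option is a Bonferroni correction, which divides the overall Type I error rate by the number of tests. As a sketch, assuming an overall rate of 0.05 and one test per metric above:

# assumed overall Type I error rate
alpha = 0.05

# one test per metric: enrollment rate, reading duration, classroom time, completion rate
num_tests = 4

# each individual p-value is now compared against this stricter threshold
bonferroni_alpha = alpha / num_tests
bonferroni_alpha  # 0.0125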

 

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. These are less conservative and take this correlation into account.
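As one illustration, here is a minimal sketch of the Holm-Bonferroni step-down procedure (this implementation is illustrative, not taken from the course material):

def holm_bonferroni(p_values, alpha=0.05):
    # compare the i-th smallest p-value against alpha / (m - i), stopping at the first failure
    p_values = np.asarray(p_values)
    m = len(p_values)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p_values)):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values fail as well
    return reject

# e.g. one p-value per metric
holm_bonferroni([0.012, 0.049, 0.003, 0.210])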

If you do choose to use a less conservative method, just make sure the assumptions of that method are truly met in your situation, and that you’re not just trying to cheat on a p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm your company’s performance in the long run.

 

Difficulties in A/B Testing

As you saw in the scenarios above, there are many factors to consider when designing an A/B test and drawing conclusions based on its results. To conclude, here are some common ones to consider.

  • Novelty effect and change aversion when existing users first experience a change
  • Sufficient traffic and conversions to have significant and repeatable results
  • Best metric choice for making the ultimate decision (e.g., measuring revenue vs. clicks)
  • Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events
  • Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)
  • Consistency among test subjects in the control and experiment group (imbalance in the population represented in each group can lead to situations like Simpson’s Paradox; a numeric sketch follows this list)
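To make the last point concrete, here is a small made-up illustration of Simpson's Paradox (all numbers below are invented for this sketch):

# made-up (clicks, views) per device type for each group
control = {'desktop': (200, 800), 'mobile': (20, 200)}      # 25% and 10% CTR
experiment = {'desktop': (60, 200), 'mobile': (120, 800)}   # 30% and 15% CTR

# the experiment wins within every segment...
for device in control:
    c_clicks, c_views = control[device]
    e_clicks, e_views = experiment[device]
    print(device, c_clicks / c_views, e_clicks / e_views)

# ...yet loses overall, because its traffic is skewed toward the low-CTR mobile segment
print(sum(c for c, v in control.values()) / sum(v for c, v in control.values()))        # 0.22
print(sum(c for c, v in experiment.values()) / sum(v for c, v in experiment.values()))  # 0.18

Splitting traffic so that both groups see the same mix of users avoids this kind of reversal.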

 

 
