A/B tests are used to test changes on a web page by running an experiment where a **control group** sees the old version, while the **experiment group** sees the new version. A **metric** is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other. A/B testing is very much like hypothesis testing with the following hypotheses:

- **Null Hypothesis:** The new version is no better, or even worse, than the old version
- **Alternative Hypothesis:** The new version is better than the old version

If we fail to reject the null hypothesis, the results would suggest keeping the old version. If we reject the null hypothesis, the results would suggest launching the change. These tests can be used for a wide variety of changes, from large feature additions to small adjustments in color, to see which change improves your metric the most.
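Stated in symbols, the one-sided hypotheses above can be written as follows. The names $m_{old}$ and $m_{new}$ are notation introduced here for the chosen metric under the old and new versions; they do not appear in the original lesson.

```latex
\begin{aligned}
H_0 &: m_{new} - m_{old} \le 0 \\
H_1 &: m_{new} - m_{old} > 0
\end{aligned}
```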

A/B testing also has its drawbacks. It can help you compare two options, but it can’t tell you about an option you haven’t considered. It can also produce biased results when tested on existing users, due to factors like change aversion and the novelty effect.

- **Change Aversion:** Existing users may give an unfair advantage to the old version, simply because they are unhappy with change, even if it’s ultimately for the better.
- **Novelty Effect:** Existing users may give an unfair advantage to the new version, because they’re excited or drawn to the change, even if it isn’t any better in the long run.

# imports used in this example
import numpy as np
import matplotlib.pyplot as plt

# df is assumed to already hold the logged page actions,
# with a 'group' column and an 'action' column

# get actions from control group
control_df = df.query('group == "control"')

# compute click through rate
control_ctr = control_df.query('action == "click"').shape[0] / control_df.query('action == "view"').shape[0]

# view click through rate
control_ctr

# get actions from experiment group
experiment_df = df.query('group == "experiment"')

# compute click through rate
experiment_ctr = experiment_df.query('action == "click"').shape[0] / experiment_df.query('action == "view"').shape[0]

# view click through rate
experiment_ctr

# compute observed difference in click through rate
obs_diff = experiment_ctr - control_ctr
obs_diff

# simulate sampling distribution for difference in proportions, or CTRs
diffs = []
for _ in range(10000):
    # bootstrap: resample the full data set with replacement
    b_samp = df.sample(df.shape[0], replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('action == "click"').shape[0] / control_df.query('action == "view"').shape[0]
    experiment_ctr = experiment_df.query('action == "click"').shape[0] / experiment_df.query('action == "view"').shape[0]
    diffs.append(experiment_ctr - control_ctr)

# convert to numpy
diffs = np.array(diffs)

# plot sampling distribution
plt.hist(diffs);

# simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# plot null distribution and line at our observed difference
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red');

# compute p-value
(null_vals > obs_diff).mean()

Let’s recap the steps we took to analyze the results of this A/B test.

- We computed the **observed difference** in the metric, click through rate, between the control and experiment groups.
- We simulated the **sampling distribution** for the difference in proportions (or difference in click through rates).
- We used this sampling distribution to simulate the **distribution under the null** hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.
- We computed the **p-value** by finding the proportion of values in the null distribution that were greater than our observed difference.
- We used this p-value to determine the **statistical significance** of our observed difference.
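The recap above can also be condensed into one reusable function. The following is only a sketch under stated assumptions: the helper names (`ctr`, `ctr_diff`, `ab_test_ctr`, `toy_log`) and the toy action log are hypothetical and not part of the original notebook, and the bootstrap uses 1,000 resamples instead of 10,000 to keep it quick.

```python
import numpy as np
import pandas as pd


def ctr(frame):
    """Click through rate: number of click rows divided by number of view rows."""
    clicks = (frame["action"] == "click").sum()
    views = (frame["action"] == "view").sum()
    return clicks / views


def ctr_diff(frame):
    """Experiment CTR minus control CTR."""
    experiment = frame[frame["group"] == "experiment"]
    control = frame[frame["group"] == "control"]
    return ctr(experiment) - ctr(control)


def ab_test_ctr(frame, n_boot=1000, seed=42):
    """Return (observed difference, one-sided p-value) following the recap steps."""
    rng = np.random.default_rng(seed)

    # 1. observed difference in the metric
    obs_diff = ctr_diff(frame)

    # 2. bootstrap the sampling distribution of the difference
    n_rows = len(frame)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        rows = rng.integers(0, n_rows, size=n_rows)  # row positions drawn with replacement
        diffs[i] = ctr_diff(frame.iloc[rows])

    # 3. normal distribution centered at 0 with the same spread and size
    null_vals = rng.normal(0, diffs.std(), diffs.size)

    # 4. share of null values greater than the observed difference
    p_value = (null_vals > obs_diff).mean()
    return obs_diff, p_value


def toy_log(group, n_views, n_clicks):
    """Hypothetical action log for one group (made-up counts, illustration only)."""
    actions = ["view"] * n_views + ["click"] * n_clicks
    return pd.DataFrame({"group": group, "action": actions})


toy_df = pd.concat(
    [toy_log("control", 1000, 100), toy_log("experiment", 1000, 150)],
    ignore_index=True,
)
obs_diff, p_value = ab_test_ctr(toy_df)
print(f"observed difference: {obs_diff:.3f}, p-value: {p_value:.4f}")
```

Drawing row positions with NumPy and selecting them with `iloc` is one way to resample with replacement; `df.sample(..., replace=True)` as used in the walkthrough does the same job.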

- **Enrollment Rate:** Click through rate for the *Enroll* button on the course overview page
- **Average Reading Duration:** Average number of seconds spent on the course overview page
- **Average Classroom Time:** Average number of days spent in the classroom for students enrolled in the course
- **Completion Rate:** Course completion rate for students enrolled in the course

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

df = pd.read_csv('course_page_actions.csv')
df.head()

# Get dataframe with all records from control group
control_df = df.query('group == "control"')

# Compute click through rate for control group
control_ctr = control_df.query('action == "enroll"').shape[0] / control_df.query('action == "view"').shape[0]

# Display click through rate
control_ctr

# Get dataframe with all records from experiment group
experiment_df = df.query('group == "experiment"')

# Compute click through rate for experiment group
experiment_ctr = experiment_df.query('action == "enroll"').shape[0] / experiment_df.query('action == "view"').shape[0]

# Display click through rate
experiment_ctr

# Compute the observed difference in click through rates
obs_diff = experiment_ctr - control_ctr

# Display observed difference
obs_diff

# Create a sampling distribution of the difference in proportions
# with bootstrapping
diffs = []
size = df.shape[0]
for _ in range(10000):
    # Resample the full data set with replacement
    b_samp = df.sample(size, replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('action == "enroll"').shape[0] / control_df.query('action == "view"').shape[0]
    experiment_ctr = experiment_df.query('action == "enroll"').shape[0] / experiment_df.query('action == "view"').shape[0]
    diffs.append(experiment_ctr - control_ctr)

# Convert to numpy array
diffs = np.array(diffs)

# Plot sampling distribution
plt.hist(diffs);

# Simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# Plot the null distribution
plt.hist(null_vals);

# Plot observed statistic with the null distribution
plt.hist(null_vals);
plt.axvline(obs_diff, c='red')

# Compute p-value
(null_vals > obs_diff).mean()

# keep only the page views, then average each user's viewing duration
views = df.query('action == "view"')
reading_times = views.groupby(['id', 'group'])['duration'].mean()
reading_times = reading_times.reset_index()
reading_times.head()

# compute average reading durations for each group
control_mean = df.query('group == "control"').duration.mean()
experiment_mean = df.query('group == "experiment"').duration.mean()
control_mean, experiment_mean


# compute observed difference in means
obs_diff = experiment_mean - control_mean
obs_diff

# simulate sampling distribution for the difference in means
diffs = []
for _ in range(10000):
    # bootstrap: resample the full data set with replacement
    b_samp = df.sample(df.shape[0], replace=True)
    control_mean = b_samp.query('group == "control"').duration.mean()
    experiment_mean = b_samp.query('group == "experiment"').duration.mean()
    diffs.append(experiment_mean - control_mean)

# convert to numpy array
diffs = np.array(diffs)

# plot sampling distribution
plt.hist(diffs);

# simulate the distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)

# plot null distribution
plt.hist(null_vals);

# plot null distribution and where our observed statistic falls
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red');

# compute p-value
(null_vals > obs_diff).mean()

- We computed the **observed difference** in the metric, average reading duration, between the control and experiment groups.
- We simulated the **sampling distribution** for the difference in means (or average reading durations).
- We used this sampling distribution to simulate the **distribution under the null** hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.
- We computed the **p-value** by finding the proportion of values in the null distribution that were greater than our observed difference.
- We used this p-value to determine the **statistical significance** of our observed difference.

The more metrics you evaluate, the more likely you are to observe significant differences just by chance – similar to what you saw in previous lessons with multiple tests. Luckily, this multiple comparisons problem can be handled in several ways.
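To see how quickly this adds up, consider a simplified case that is an assumption made here for illustration: $m$ independent tests, each run at significance level $\alpha$, with every null hypothesis true. The chance of at least one false positive is then:

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^m
\qquad\Rightarrow\qquad
1 - (1 - 0.05)^4 \approx 0.185
```

So with four metrics tested at $\alpha = 0.05$, the chance of at least one spurious "significant" result is roughly 19% rather than 5%. Real metrics are often correlated, so the exact figure will differ.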

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as the closed testing procedure, Boole-Bonferroni bound, and the Holm-Bonferroni method. These are less conservative, and some of them can take this correlation into account.

If you do choose to use a less conservative method, just make sure the assumptions of that method are truly met in your situation, and that you’re not just trying to cheat on a p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm your company’s performance in the long run.
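As a rough illustration of how two of the corrections named above differ, here is a minimal sketch of the Bonferroni and Holm-Bonferroni decision rules. The function names and the four p-values are hypothetical, chosen only to show a case where the two methods disagree; they are not results from the course data.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null whose p-value is at most alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down rule.

    Visit p-values from smallest to largest, compare the k-th smallest
    (k = 0, 1, ...) with alpha / (m - k), and stop at the first failure.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


# hypothetical p-values, one per metric (made up for illustration)
example_p_values = [0.01, 0.015, 0.03, 0.20]

print("Bonferroni:", bonferroni_reject(example_p_values))
print("Holm:      ", holm_reject(example_p_values))
```

With these made-up inputs, Bonferroni compares every p-value with 0.05 / 4 = 0.0125, while Holm relaxes the threshold after each rejection, which is why it can reject more hypotheses at the same overall error rate.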

# Difficulties in A/B Testing

As you saw in the scenarios above, there are many factors to consider when designing an A/B test and drawing conclusions based on its results. To conclude, here are some common ones to consider.

- Novelty effect and change aversion when existing users first experience a change
- Sufficient traffic and conversions to have significant and repeatable results
- Best metric choice for making the ultimate decision (e.g., measuring revenue vs. clicks)
- Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events
- Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)
- Consistency among test subjects in the control and experiment group (imbalance in the population represented in each group can lead to situations like Simpson’s Paradox)
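
The last point can be made concrete with a small, entirely made-up example of Simpson's Paradox. The segment names and counts below are hypothetical: the control version converts better within each user segment, yet the experiment version looks better overall because the two groups contain very different mixes of new and returning users.

```python
import pandas as pd

# hypothetical counts, chosen only to produce the paradox
counts = pd.DataFrame({
    "group": ["control", "control", "experiment", "experiment"],
    "segment": ["new", "returning", "new", "returning"],
    "views": [100, 1000, 1000, 100],
    "conversions": [90, 300, 800, 20],
})

# conversion rate within each segment, one column per group
counts["rate"] = counts["conversions"] / counts["views"]
by_segment = counts.pivot(index="segment", columns="group", values="rate")

# conversion rate after pooling the segments together
totals = counts.groupby("group")[["views", "conversions"]].sum()
overall = totals["conversions"] / totals["views"]

print(by_segment)
print(overall)
```

Checking that the segment mix is comparable across groups before pooling results is one way to catch this kind of imbalance.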