Hypothesis Testing

Rules for setting up the null and alternative hypotheses:

  1. The H_0 is assumed to be true before you collect any data.
  2. The H_0 usually states there is no effect or that two groups are equal.
  3. The H_0 and H_1 are competing, non-overlapping hypotheses.
  4. H_1 is what we would like to prove to be true.
  5. H_0 contains an equal sign of some kind – either =, \leq, or \geq.
  6. H_1 contains the opposite of the null – either \neq, >, or <.

You saw that the statement “Innocent until proven guilty” suggests the following hypotheses:

H_0: Innocent

H_1: Guilty

We can relate this to the idea that “innocent” is true before we collect any data. Then the alternative must be a competing, non-overlapping hypothesis. Hence, the alternative hypothesis is that an individual is guilty.

 

Because we wanted to test if a new page was better than an existing page, we set that up in the alternative. Two indicators are that the null should hold the equality, and the statement we would like to be true should be in the alternative. Therefore, it would look like this:

H_0: \mu_1 \leq \mu_2

H_1: \mu_1 > \mu_2

Here \mu_1 represents the population mean return from the new page. Similarly, \mu_2 represents the population mean return from the old page.

Depending on your question of interest, you would change your null and alternative hypotheses to match.
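For example, if you only wanted to know whether the new page performs differently from the old page (in either direction), the two-sided setup would be:

H_0: \mu_1 = \mu_2

H_1: \mu_1 \neq \mu_2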

 

Type I Errors

Type I errors have the following features:

  1. You should set up your null and alternative hypotheses so that the worse of the two errors is the type I error.
  2. They are denoted by the symbol \alpha.
  3. The definition of a type I error is: deciding the alternative (H_1) is true when the null (H_0) is actually true.
  4. Type I errors are often called false positives.

Type II Errors

  1. They are denoted by the symbol \beta.
  2. The definition of a type II error is: deciding the null (H_0) is true when the alternative (H_1) is actually true.
  3. Type II errors are often called false negatives.

In the most extreme case, we can always choose one hypothesis (say, always choosing the null) to ensure that a particular error never occurs (never a type I error, assuming we always choose the null). More generally, though, with a single set of data, decreasing your chance of one type of error increases the chance of the other error occurring.
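To see this tradeoff concretely, here is a minimal simulation sketch. The one-sided z-test setup, true mean, and sample size are hypothetical choices for illustration only: lowering \alpha raises the bar for rejecting the null, which increases \beta.

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Hypothetical setup: H_0: mu <= 0 vs H_1: mu > 0, sigma = 1 known,
# and the true mean is actually 0.5, so every failure to reject is a type II error.
true_mu, n, sims = 0.5, 30, 10000

for alpha in [0.10, 0.05, 0.01]:
    crit = norm.ppf(1 - alpha) / np.sqrt(n)           # rejection cutoff for the sample mean under H_0
    sample_means = np.random.normal(true_mu, 1 / np.sqrt(n), sims)
    beta = (sample_means < crit).mean()               # proportion of times we fail to reject
    print(alpha, round(beta, 3))                      # smaller alpha -> larger beta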

 

Parachute Example

This example lets you see one of the most extreme cases of errors that might be committed in hypothesis testing. With a type I error, an individual died (a parachute predicted to open actually failed). With a type II error, you lost 30 dollars (a working parachute was discarded).

In the hypothesis tests you build in the upcoming lessons, you will be able to choose a type I error threshold, and your hypothesis tests will be created to minimize the type II errors after ensuring the type I error rate is met.

 

You are always performing hypothesis tests on population parameters, never on statistics. Statistics are values that you already have from the data, so it does not make sense to perform hypothesis tests on these values.

Common hypothesis tests include:

  1. Testing a population mean (One sample t-test).
  2. Testing the difference in means (Two sample t-test).
  3. Testing the difference before and after some treatment on the same individual (Paired t-test).
  4. Testing a population proportion (One sample z-test).
  5. Testing the difference between population proportions (Two sample z-test).
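If you do want the classical versions of these tests rather than a simulation approach, scipy and statsmodels provide them. Here is a hedged sketch with made-up data (the arrays and counts below are hypothetical stand-ins, not the course dataset):

import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)
group1 = np.random.normal(68, 3, 50)   # hypothetical heights
group2 = np.random.normal(67, 3, 50)

stats.ttest_1samp(group1, popmean=67.6)             # 1. one sample t-test
stats.ttest_ind(group1, group2)                     # 2. two sample t-test
stats.ttest_rel(group1, group2)                     # 3. paired t-test (same individuals measured twice)
proportions_ztest(count=45, nobs=100, value=0.5)    # 4. one sample z-test for a proportion
proportions_ztest(count=[45, 60], nobs=[100, 110])  # 5. two sample z-test for proportions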

You can use one of these sites to provide a t-table or z-table to support one of the above approaches: t-table or z-table.

There are literally hundreds of different hypothesis tests! However, instead of memorizing how to perform all of them, you can find the statistic(s) that best estimate the parameter(s) you are interested in, then bootstrap to simulate the sampling distribution. You can then use your sampling distribution to assist in choosing the appropriate hypothesis.
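For example, the snippet below marks off a 95% bootstrap confidence interval; it assumes `means` already holds your bootstrapped sample means and that numpy and matplotlib are imported as in the later cells.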

 

low, high = np.percentile(means, 2.5), np.percentile(means, 97.5)  # cutoffs for a 95% interval

plt.axvline(x=low, color='r', linewidth=2)

plt.axvline(x=high, color='r', linewidth=2)

 

Simulating From the Null Hypothesis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('../data/coffee_dataset.csv')
sample_data = full_data.sample(200)

1. If you were interested in whether the average height for coffee drinkers is the same as for non-coffee drinkers, what would the null and alternative be? Place them in the cell below, and use your answer to answer the first quiz question below.

Since there is no directional component associated with this statement, a not-equal-to alternative seems most reasonable.

H_0: \mu_{coff} - \mu_{no} = 0

H_1: \mu_{coff} - \mu_{no} \neq 0

\mu_{coff} and \mu_{no} are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

2. If you were interested in whether the average height for coffee drinkers is less than that of non-coffee drinkers, what would the null and alternative be? Place them in the cell below, and use your answer to answer the second quiz question below.

In this case, there is a directional component to the question – that the average height for coffee drinkers is less than that of non-coffee drinkers. Below is one way you could write the null and alternative. Since the mean for coffee drinkers is listed first, the alternative suggests that this difference is negative.

H_0: \mu_{coff} - \mu_{no} \geq 0

H_1: \mu_{coff} - \mu_{no} < 0

\mu_{coff} and \mu_{no} are the population mean heights for coffee drinkers and non-coffee drinkers, respectively.

3. For 10,000 iterations: bootstrap the sample data, calculate the mean height for coffee drinkers and non-coffee drinkers, and calculate the difference in means for each sample. You will want to have three arrays at the end of the iterations – one for each mean and one for the difference in means. Use the results of your sampling distribution to answer the third quiz question below.

nocoff_means, coff_means, diffs = [], [], []

for _ in range(10000):
    bootsamp = sample_data.sample(200, replace=True)
    coff_mean = bootsamp[bootsamp['drinks_coffee'] == True]['height'].mean()
    nocoff_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    # append the bootstrapped means and their difference
    coff_means.append(coff_mean)
    nocoff_means.append(nocoff_mean)
    diffs.append(coff_mean - nocoff_mean)

np.std(nocoff_means) # the standard deviation of the sampling distribution for nocoff

np.std(coff_means) # the standard deviation of the sampling distribution for coff

np.std(diffs) # the standard deviation for the sampling distribution for difference in means

plt.hist(nocoff_means, alpha = 0.5);
plt.hist(coff_means, alpha = 0.5); # They look pretty normal to me!

plt.hist(diffs, alpha = 0.5); # again normal – this is by the central limit theorem

4. Now, use your sampling distribution for the difference in means and the docs to simulate what you would expect if your sampling distribution were centered on zero. Also, calculate the observed sample mean difference in sample_data. Use your solutions to answer the last questions in the quiz below.

 We would expect the sampling distribution to be normal by the Central Limit Theorem, and we know the standard deviation of the sampling distribution of the difference in means from the previous question, so we can use this to simulate draws from the sampling distribution under the null hypothesis. If there is truly no difference, then the difference between the means should be zero.

null_vals = np.random.normal(0, np.std(diffs), 10000) # Here are 10000 draws from the sampling distribution under the null

plt.hist(null_vals); #Here is the sampling distribution of the difference under the null

Notice the standard deviation of the sampling distribution for the difference in means is larger than either of the individual standard deviations. It turns out that the standard deviation of the difference is the square root of the sum of the variances of the individual sampling distributions. And the mean of n draws has a standard deviation equal to the standard deviation of the original draws divided by the square root of the sample size.
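In symbols, for independent sampling distributions:

SD(\bar{x}_{coff} - \bar{x}_{no}) = \sqrt{SD(\bar{x}_{coff})^2 + SD(\bar{x}_{no})^2}

SD(\bar{x}) = \sigma / \sqrt{n}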

 

What Is A P-value Anyway?

The definition of a p-value is the probability of observing your statistic (or one more extreme in favor of the alternative) if the null hypothesis is true.

In this video, you learned exactly how to calculate this value. The “more extreme in favor of the alternative” portion of this statement determines the shading associated with your p-value.

Therefore, you have the following cases:

  1. If your alternative hypothesis says the parameter is greater than some value, you shade the upper (right) tail of the sampling distribution to obtain your p-value.
  2. If your alternative hypothesis says the parameter is less than some value, you shade the lower (left) tail.
  3. If your alternative hypothesis says the parameter is not equal to some value, you shade both tails.

You could integrate the sampling distribution to obtain the area for each of these p-values. Alternatively, you will be simulating to obtain these proportions in the next concepts.

 

 

sample_mean = sample_df.height.mean()  # observed statistic from the sample data

(null_vals > sample_mean).mean()  # p-value for H_1: parameter greater than the null value

(null_vals < sample_mean).mean()  # p-value for H_1: parameter less than the null value

null_mean = 70

# two-sided p-value: proportion of null values more extreme in either direction
# (this form assumes the observed mean fell below the null value)
(null_vals < sample_mean).mean() + (null_vals > null_mean + (null_mean - sample_mean)).mean()

low = sample_mean

high = null_mean + (null_mean - sample_mean)

plt.hist(null_vals);

plt.axvline(x=low, color='r', linewidth=2)

plt.axvline(x=high, color='r', linewidth=2)

There are a lot of moving parts in these videos. Let’s highlight the process:

  1. Simulate the values of your statistic that are possible from the null.
  2. Calculate the value of the statistic you actually obtained in your data.
  3. Compare your statistic to the values from the null.
  4. Calculate the proportion of null values that are considered extreme based on your alternative.

 

The p-value is the probability of getting our statistic or a more extreme value if the null is true.

Therefore, small p-values suggest our null is not true. Rather, our statistic is likely to have come from a different distribution than the null.

When the p-value is large, our statistic is consistent with what we would expect from the null hypothesis. Therefore, we do not have evidence to reject the null.

By comparing our p-value to our type I error threshold (\alpha), we can make our decision about which hypothesis we will choose.

pval \leq \alpha \Rightarrow Reject H_0

pval > \alpha \Rightarrow Fail to Reject H_0
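In code, this decision is a single comparison; a minimal sketch, assuming `pval` has already been computed and \alpha was chosen before looking at the data:

alpha = 0.05  # type I error threshold, chosen in advance

if pval <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")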

 

The word accept is one that is avoided when making statements regarding the null and alternative. You are not stating that one of the hypotheses is true. Rather, you are making a decision based on the likelihood of your data coming from the null hypothesis with regard to your type I error threshold.

Therefore, the wording used in conclusions of hypothesis testing includes: “We reject the null hypothesis” or “We fail to reject the null hypothesis.” This lends itself to the idea that you start with the null hypothesis true by default, and “choosing” the null at the end of the test would have been the choice even if no data were collected.

Drawing Conclusions – Calculating Errors

import numpy as np
import pandas as pd

jud_data = pd.read_csv('../data/judicial_dataset_predictions.csv')
par_data = pd.read_csv('../data/parachute_dataset.csv')

jud_data.head()

par_data.head()

1. Above, you can see the actual and predicted columns for each of the datasets. Using the jud_data, find the proportion of errors for the dataset, and furthermore, the percentage of errors of each type. Use the results to answer the questions in quiz 1 below.

Hint for quiz: an error is any time the prediction doesn’t match an actual value. Additionally, there are Type I and Type II errors to think about. We also know we can drive one type of error to zero, at the cost of maximizing the other. If we predict all individuals as innocent, how many of the guilty are incorrectly labeled? Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?

jud_data[jud_data['actual'] != jud_data['predicted']].shape[0]/jud_data.shape[0] # proportion of errors

jud_data.query("actual == 'innocent' and predicted == 'guilty'").count()[0]/jud_data.shape[0] # Type I error rate

jud_data.query("actual == 'guilty' and predicted == 'innocent'").count()[0]/jud_data.shape[0] # Type II error rate

2. Above, you can see the actual and predicted columns for each of the datasets. Using the par_data, find the proportion of errors for the dataset, and furthermore, the percentage of errors of each type. Use the results to answer the questions in quiz 2 below.

par_data[par_data['actual'] != par_data['predicted']].shape[0]/par_data.shape[0] # proportion of errors

par_data.query("actual == 'fails' and predicted == 'opens'").count()[0]/par_data.shape[0] # Type I error rate

par_data.query("actual == 'opens' and predicted == 'fails'").count()[0]/par_data.shape[0] # Type II error rate

 

One of the most important aspects of interpreting any statistical results (and one that is frequently overlooked) is assuring that your sample is truly representative of your population of interest.

Particularly with the way data are collected today in the age of computers, response bias is important to keep in mind. In the 2016 U.S. election, polls conducted by many news media differed substantially from the actual results. You can read about how response bias played a role here.

Hypothesis Testing vs. Machine Learning

With large sample sizes, hypothesis testing flags even the smallest of findings as statistically significant. However, these findings might not be practically significant at all.

For example, imagine you find that statistically more people prefer beverage 1 to beverage 2 in a study of more than one million people. Based on this, you decide to open a shop to sell beverage 1. You then find out that beverage 1 is only more popular than beverage 2 by 0.0002% (a statistically significant amount given your large sample size). Practically, maybe you should have opened a store that sold both.

Hypothesis testing takes an aggregate approach towards the conclusions made based on data, as these tests are aimed at understanding population parameters (which are aggregate population values).

Alternatively, machine learning techniques take an individual approach towards making conclusions, as they attempt to predict an outcome for each specific data point.

In the final lessons of this class, you will learn about two of the most fundamental machine learning approaches used in practice: linear and logistic regression.

 

When performing more than one hypothesis test, your type I error compounds. In order to correct for this, a common technique is called the Bonferroni correction. This correction is very conservative, but says that your new type I error rate should be the error rate you actually want divided by the number of tests you are performing.

Therefore, if you would like to hold a type I error rate of 1% across all 20 of your hypothesis tests, the Bonferroni-corrected rate would be 0.01/20 = 0.0005. This would be the new rate to compare to the p-value for each of the 20 tests to make your decision.
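As a quick sketch (the p-values below are hypothetical):

n_tests = 20
alpha = 0.01
bonf_alpha = alpha / n_tests            # 0.01 / 20 = 0.0005

pvals = [0.0001, 0.003, 0.02, 0.45]     # hypothetical p-values from 4 of the tests
[p <= bonf_alpha for p in pvals]        # only the first would lead to rejecting its null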

Other Techniques

Additional techniques to protect against compounding type I errors include:

  1. Tukey correction
  2. Q-values

 

A two-sided hypothesis test (that is, a test involving a \neq in the alternative) leads to the same conclusions as a confidence interval as long as:

1 - CI = \alpha

For example, a 95% confidence interval will draw the same conclusions as a hypothesis test with a type I error rate of 0.05 in terms of which hypothesis to choose, because:

1 - 0.95 = 0.05

assuming that the alternative hypothesis is two-sided.
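A minimal sketch of this equivalence, assuming `means` holds bootstrapped sample means and 67.60 is the hypothesized null value (as in the notebook below):

import numpy as np

null_mean = 67.60                       # hypothesized value under H_0
low, high = np.percentile(means, 2.5), np.percentile(means, 97.5)  # 95% interval

# a two-sided test at alpha = 0.05 rejects H_0 exactly when the
# hypothesized value falls outside the 95% confidence interval
reject_null = not (low <= null_mean <= high)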

Video on effect size here.

The Impact of Large Sample Sizes

When we increase our sample size, even the smallest of differences may seem significant.

To illustrate this point, work through this notebook and the quiz questions that follow below.

Start by reading in the libraries and data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')

1. In this case, imagine we are interested in testing if the mean height of all individuals in full_data is equal to 67.60 inches. First, use quiz 1 below to identify the null and alternative hypotheses for these cases.

$$H_0: \mu = 67.60$$

$$H_1: \mu \neq 67.60$$

2. What is the population mean height? What is the standard deviation of the population heights? Create a sample set of data using the code below. What is the sample mean height? Simulate the sampling distribution for the mean of five values to see the shape and plot a histogram. What is the standard deviation of the sampling distribution of the mean of five draws? Use quiz 2 below to assure your answers are correct.

# population height mean and standard deviation
full_data.height.mean(), full_data.height.std()

sample1 = full_data.sample(5)
sample1.height.mean()

sampling_dist_mean5 = []

for _ in range(10000):
    bootstrap_sample = sample1.sample(5, replace=True)
    bootstrap_mean = bootstrap_sample.height.mean()
    sampling_dist_mean5.append(bootstrap_mean)

plt.hist(sampling_dist_mean5);

std_sampling_dist = np.std(sampling_dist_mean5)
std_sampling_dist

null_mean = 67.60  
# this is another way to compute the standard deviation of the sampling distribution theoretically  
std_sampling_dist = full_data.height.std()/np.sqrt(5)  
num_sims = 10000

null_sims = np.random.normal(null_mean, std_sampling_dist, num_sims)  
low_ext = (null_mean - (sample1.height.mean() - null_mean))  
high_ext = sample1.height.mean()  

(null_sims > high_ext).mean() + (null_sims < low_ext).mean()

3. Using the null and alternative set up in question 1 and the results of your sampling distribution in question 2, simulate the mean values you would expect from the null hypothesis. Use these simulated values to determine a p-value to make a decision about your null and alternative hypotheses. Check your solution using quiz 3 and quiz 4 below.

Hint: Use the numpy documentation here to assist with your solution.

null_mean = 67.60
null_vals = np.random.normal(null_mean, std_sampling_dist, 10000)

plt.hist(null_vals);

# where our sample mean falls on null distribution
plt.axvline(x=sample1.height.mean(), color='red');

# for a two sided hypothesis, we want to look at anything
# more extreme from the null in both directions
obs_mean = sample1.height.mean()

prob_more_extreme_high = (null_vals > obs_mean).mean()
prob_more_extreme_low = (null_vals < null_mean - (obs_mean - null_mean)).mean()

pval = prob_more_extreme_low + prob_more_extreme_high
pval

# let’s see where our sample mean falls on the null distribution
lower_bound = null_mean - (obs_mean - null_mean)
upper_bound = obs_mean

plt.hist(null_vals);
plt.axvline(x=lower_bound, color='red');
plt.axvline(x=upper_bound, color='red');

4. Now, imagine you received the same sample mean you calculated from the sample in question 2 above, but with a sample of 1000. What would the new standard deviation be for your sampling distribution for the mean of 1000 values? Additionally, what would your new p-value be for choosing between the null and alternative hypotheses you set up? Simulate the sampling distribution for the mean of 1000 values to see the shape and plot a histogram. Use your solutions here to answer the second to last quiz question below.

# get standard deviation for a sample size of 1000
sample2 = full_data.sample(1000)
sampling_dist_mean1000 = []
for _ in range(10000):
    bootstrap_sample = sample2.sample(1000, replace=True)
    bootstrap_mean = bootstrap_sample.height.mean()
    sampling_dist_mean1000.append(bootstrap_mean)

std_sampling_dist1000 = np.std(sampling_dist_mean1000)
std_sampling_dist1000

null_vals = np.random.normal(null_mean, std_sampling_dist1000, 10000)

plt.hist(null_vals);
plt.axvline(x=lower_bound, color='red');
plt.axvline(x=upper_bound, color='red');

# for a two sided hypothesis, we want to look at anything
# more extreme from the null in both directions

prob_more_extreme_low = (null_vals < lower_bound).mean()
prob_more_extreme_high = (upper_bound < null_vals).mean()

pval = prob_more_extreme_low + prob_more_extreme_high
pval

 

Multiple Tests

In this notebook, you will work with a dataset similar to the judicial dataset you worked with before. However, instead of working with decisions that have already been made, you are provided with a p-value associated with each individual.

Use the questions in the notebook and the dataset to answer the questions at the bottom of this page.

Here is a glimpse of the data you will be working with:

import numpy as np
import pandas as pd

df = pd.read_csv('judicial_dataset_pvalues.csv')
df.head()

1. Remember back to the null and alternative hypotheses for this example. Use that information to determine the answer for Quiz 1 and Quiz 2 below.

A p-value is the probability of observing your data or more extreme data if the null is true. A Type I error is choosing the alternative when the null is actually true, and vice versa for a Type II error. Therefore, deciding an individual is guilty when they are actually innocent is a Type I error. The alpha level is a threshold for the proportion of the time you are willing to commit a Type I error.

2. If we consider each individual as a single hypothesis test, find the conservative Bonferroni corrected alpha level we should use to maintain a 5% type I error rate.

bonf_alpha = 0.05/df.shape[0]
bonf_alpha

3. What is the proportion of type I errors made if the correction isn’t used? How about if it is used?

Use your answers to find the solution to Quiz 3 below.

In order to find the proportion of type I errors made without the correction, we need to find all those who are actually innocent but have p-values less than 0.05.

df.query("actual == 'innocent' and pvalue < 0.05").count()[0]/df.shape[0] # if the correction is not used

df.query("actual == 'innocent' and pvalue < @bonf_alpha").count()[0]/df.shape[0] # if the correction is used

4. Think about how hypothesis tests can be used, and why this example wouldn’t exactly work in terms of being able to use hypothesis testing in this way. Check your answer with Quiz 4 below.

This is looking at individuals, which is more the aim of machine learning techniques. Hypothesis testing and confidence intervals are for population parameters. Therefore, they are not meant to tell us about individual cases, and we wouldn’t obtain p-values for individuals in this way. We could get probabilities, but that isn’t the same as the probabilities associated with the relationship to sampling distributions that you have seen in these lessons.

 

Recap

Wow! That was a ton. You learned:

  1. How to set up hypothesis tests. You learned the null hypothesis is what we assume to be true before we collect any data, and the alternative is usually what we want to try and prove to be true.
  2. You learned about Type I and Type II errors. You learned that hypotheses should be set up so that the worse of the two errors is the Type I error, which is associated with choosing the alternative when the null hypothesis is actually true.
  3. You learned that p-values are the probability of observing your data or something more extreme in favor of the alternative given the null hypothesis is true. You learned that using a confidence interval from the bootstrapping samples, you can essentially make the same decisions as in hypothesis testing (without all of the confusion of p-values).
  4. You learned how to make decisions based on p-values. That is, if the p-value is less than your Type I error threshold, then you have evidence to reject the null and choose the alternative. Otherwise, you fail to reject the null hypothesis.
  5. You learned that when sample sizes are really large, everything appears statistically significant (that is you end up rejecting essentially every null), but these results may not be practically significant.
  6. You learned that when performing multiple hypothesis tests, your errors will compound. Therefore, using some sort of correction to maintain your true Type I error rate is important. A simple, but very conservative approach is to use what is known as a Bonferroni correction, which says you should just divide your \alpha level (or Type I error threshold) by the number of tests performed.