Descriptive Statistics

Descriptive statistics is about describing our collected data using the measures discussed throughout this lesson: measures of center, measures of spread, the shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.


Inferential Statistics

Inferential statistics is about using our collected data to draw conclusions about a larger population. Performing inferential statistics well requires that we take a sample that accurately represents our population of interest.

A common way to collect data is via a survey. However, surveys can be extremely biased depending on which questions are asked and how they are asked. This is a topic you should think about when tackling the first project.

We looked at specific examples that allowed us to identify the following (a short code sketch follows the list):

  1. Population – our entire group of interest.
  2. Parameter – a numeric summary about a population.
  3. Sample – a subset of the population.
  4. Statistic – a numeric summary about a sample.
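
To make these four terms concrete, here is a minimal sketch in Python (the numbers are hypothetical, not data from the lesson):

import numpy as np

np.random.seed(42)

# Hypothetical population: heights of 10,000 people – our entire group of interest
population = np.random.normal(70, 5, 10000)
mu = population.mean()                     # parameter: numeric summary of the population

sample = np.random.choice(population, 50)  # sample: a subset of the population
x_bar = sample.mean()                      # statistic: numeric summary of the sample

print(mu, x_bar)  # the statistic should land near the parameter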

 

Sampling Distributions Introduction

In order to gain a bit more comfort with this idea of sampling distributions, let’s do some practice in Python.

Below is an array that represents the students we saw in the previous videos, where 1 represents the students that drink coffee, and 0 represents the students that do not drink coffee.

import numpy as np
np.random.seed(42)

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

p = students.mean()  # proportion of coffee drinkers in the population

sample = np.random.choice(students, 5)  # one sample of 5 students
sample.mean()  # one sample proportion

# Simulate the sampling distribution of the proportion with 5 draws
sample_props = []
for _ in range(100000):
    sample = np.random.choice(students, 5)
    sample_props.append(sample.mean())
np.mean(sample_props)  # the sampling distribution is centered on p

np.var(students), np.std(students)  # spread of the original data

np.var(sample_props), np.std(sample_props)  # spread of the sampling distribution

p = students.mean()
p*(1-p) # This matches the variance of the original 21 draws

p*(1-p)/5 # Matches the variance for the sampling distribution of the proportion with 5 draws

# Simulate your 20 draws
sample_props20 = []
for _ in range(100000):
    sample = np.random.choice(students, 20)
    sample_props20.append(sample.mean())

# Compare your variance values as computed in 6 and 8,
# but with your sample of 20 values

np.var(sample_props20), np.std(sample_props20)  # Both are smaller; the variance is p(1-p)/20 now
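
As a quick check (an addition to the notebook, not part of the original), you can print the simulated and theoretical variances side by side; with p ≈ 0.714, p(1-p) ≈ 0.204:

print(np.var(sample_props), p * (1 - p) / 5)     # both ~0.0408
print(np.var(sample_props20), p * (1 - p) / 20)  # both ~0.0102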

 

Sampling Distributions Notes

We have already learned some really valuable ideas about sampling distributions:


First, we defined a sampling distribution as the distribution of a statistic.

This is fundamental – I cannot stress enough the importance of this idea. We simulated the creation of sampling distributions in the previous IPython notebook for samples of size 5 and size 20, which is something you will do more than once in the upcoming concepts and lessons.


Second, we found out some interesting facts about sampling distributions that will be reiterated later in this lesson. We found that for proportions (and also means, as a proportion is just the mean of a set of 1 and 0 values), the following characteristics hold.

  1. The sampling distribution is centered on the original parameter value.
  2. The variance of the sampling distribution decreases as the sample size increases. Specifically, the variance of the sampling distribution equals the variance of the original data divided by the sample size. This is always true for the variance of a sample mean!

In notation, we say that if we have a random variable X with variance \sigma^2, then the distribution of \bar{X} (the sampling distribution of the sample mean) has a variance of \frac{\sigma^2}{n}.
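
A quick simulation makes this concrete (a minimal sketch with hypothetical values, not part of the lesson’s notebooks):

import numpy as np

np.random.seed(42)
population = np.random.normal(50, 10, 100000)  # variance sigma^2 = 100

# Sampling distribution of the mean for samples of size n = 25
sample_means = [np.random.choice(population, 25).mean() for _ in range(10000)]

print(np.var(population))    # ~100
print(np.var(sample_means))  # ~100 / 25 = 4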

 

As you saw in this video, we commonly use Greek symbols for parameters and lowercase Roman letters for the corresponding statistics. Sometimes in the literature, you might also see the same Greek symbol with a “hat” over it to indicate that it is an estimate of the corresponding parameter.

Below is a table that provides some of the most common parameters and corresponding statistics, as shown in the video.

Remember that all parameters pertain to a population, while all statistics pertain to a sample.

Parameter | Statistic | Description
\mu | \bar{x} | The mean of a dataset
\pi | p | The mean of a dataset with only 0 and 1 values – a proportion
\mu_1 - \mu_2 | \bar{x}_1 - \bar{x}_2 | The difference in means
\pi_1 - \pi_2 | p_1 - p_2 | The difference in proportions
\beta | b | A regression coefficient – frequently used with subscripts
\sigma | s | The standard deviation
\sigma^2 | s^2 | The variance
\rho | r | The correlation coefficient

 

Two important mathematical theorems for working with sampling distributions include:

  1. Law of Large Numbers
  2. Central Limit Theorem

The Law of Large Numbers says that as our sample size increases, the sample mean gets closer to the population mean. But how did we determine that the sample mean would estimate the population mean in the first place? How would we identify another relationship between parameter and statistic like this in the future?


Three of the most common ways are with the following estimation techniques:

  1. Maximum Likelihood Estimation
  2. Method of Moments Estimation
  3. Bayesian Estimation

Though these techniques are beyond the scope of this course, they should be well understood by data scientists who may need to estimate a value that isn’t as common as a mean or variance; using one of these methods to determine a “best estimate” would be a necessity.

 

Law of Large Numbers Example

Use the dataset below, stored in pop_data, to complete the quiz questions that follow.

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

pop_data = np.random.gamma(1,100,3000)
plt.hist(pop_data);

len(pop_data) #3000

pop_data.mean() #100.359787

sample = np.random.choice(pop_data, 5)
np.mean(sample) #27.6858

sample2 = np.random.choice(pop_data, 20)
np.mean(sample2) #163.3701

sample3 = np.random.choice(pop_data, 100)
np.mean(sample3) #119.5507
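
The Law of Large Numbers pattern is easier to see if you loop over increasing sample sizes (a small extension, not part of the original notebook):

# Sample means drift toward pop_data.mean() (~100.36) as n grows
for n in (5, 20, 100, 1000, 3000):
    sample = np.random.choice(pop_data, n)
    print(n, sample.mean())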

 

The Central Limit Theorem states that with a large enough sample size, the sampling distribution of the mean will be approximately normally distributed.

The Central Limit Theorem actually applies to these well-known statistics:

  1. Sample means (\bar{x})
  2. Sample proportions (p)
  3. Difference in sample means (\bar{x}_1 - \bar{x}_2)
  4. Difference in sample proportions (p_1 - p_2)

It applies to some additional statistics as well, but it doesn’t apply to all statistics! You will see more on this toward the end of this lesson.

 

Central Limit Theorem

Work through the notebook and use the variables you create to answer the quiz questions that follow.

Run the cell below to get started.

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

pop_data = np.random.gamma(1,100,3000)
plt.hist(pop_data);

1. In order to create the sampling distribution for the average of 3 draws of this distribution, follow these steps:

a. Use numpy’s random.choice to simulate 3 draws from the pop_data array.

b. Compute the mean of these 3 draws.

c. Write a loop to simulate this process 10,000 times, and store each mean into an array called means_size_3.

d. Plot a histogram of your sample means.

e. Use means_size_3 and pop_data to answer the quiz questions below.

means_size_3 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 3)
    means_size_3.append(sample.mean())

plt.hist(means_size_3);

# Repeat with samples of size 100 – this histogram looks much closer to normal
means_size_100 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 100)
    means_size_100.append(sample.mean())

plt.hist(means_size_100);

Central Limit Theorem – Part III

You saw how the Central Limit Theorem worked for the sample mean in the earlier concept. However, let’s consider another example to see a case where the Central Limit Theorem doesn’t work…

Work through the notebook and use the variables you create to answer the quiz questions that follow.

Run the cell below to get started.

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

pop_data = np.random.gamma(1,100,3000)
plt.hist(pop_data);

1. In order to create the sampling distribution for the variance of 100 draws of this distribution, follow these steps:

a. Use numpy’s random.choice to simulate 100 draws from the pop_data array.

b. Compute the variance of these 100 draws.

c. Write a loop to simulate this process 10,000 times, and store each variance into an array called var_size_100.

d. Plot a histogram of your sample variances.

e. Use var_size_100 and pop_data to answer the quiz questions below.

var_size_100 = []
for _ in range(10000):
    sample = np.random.choice(pop_data, 100)
    var_size_100.append(sample.var())

plt.hist(var_size_100);  # right-skewed rather than normal – the CLT doesn’t hold here

 

Bootstrapping is sampling with replacement. Using random.choice in Python actually samples this way: the probability of choosing any value in our set stays the same regardless of how many times it has already been chosen. Flipping a coin and rolling a die are kind of like bootstrap sampling as well, since rolling a 6 on one roll doesn’t make a 6 any less likely later.

 

You have actually been bootstrapping to create sampling distributions in earlier parts of this lesson, but this can be extended to a bigger idea.

It turns out we can do a pretty good job of finding where a parameter is by using a sampling distribution created by bootstrapping from only a sample. This will be covered in depth in the next lessons.
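
As a preview, here is a minimal sketch of that idea (an illustration, not the lesson’s notebook):

import numpy as np

np.random.seed(42)
pop_data = np.random.gamma(1, 100, 3000)
sample = np.random.choice(pop_data, 100)  # pretend this sample is all we have

# Resample from the sample itself, with replacement, to build a
# sampling distribution for the mean without touching the population
boot_means = [np.random.choice(sample, sample.size, replace=True).mean()
              for _ in range(10000)]

print(np.percentile(boot_means, [2.5, 97.5]))  # a rough range for the population mean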

As shown previously, three of the most common estimation techniques for finding “good statistics” are beyond the scope of this course. Still, they are techniques that should be well understood by data scientists who may need to estimate a value that isn’t as common as a mean or variance; using one of these methods to determine a “best estimate” would be a necessity.

import numpy as np
np.random.seed(42)

die_vals = np.array([1,2,3,4,5,6])

# Sampling with replacement: 20 draws from only 6 values is no problem
np.random.choice(die_vals, size=20)

# Sampling without replacement cannot draw 20 values from a set of 6
np.random.choice(die_vals, replace=False, size=20)

ValueError: Cannot take a larger sample than population when 'replace=False'


Two helpful links:

  • You can learn more about Bradley Efron here.
  • Additional notes on why bootstrapping works as a technique for inference can be found here.

 

Recap

In this lesson, you have learned a ton! You learned:


Sampling Distributions

  • A sampling distribution is the distribution of a statistic (any statistic).
  • There are two very important mathematical theorems related to sampling distributions: the Law of Large Numbers and the Central Limit Theorem.
  • The Law of Large Numbers states that as the sample size increases, the sample mean gets closer to the population mean. In general, if a statistic is a “good” estimate of a parameter, it will approach the parameter as the sample size grows.
  • The Central Limit Theorem states that with a large enough sample size, the sample mean will follow a normal distribution; it turns out this is true for more than just the sample mean.

Bootstrapping

  • Bootstrapping is a technique where we sample from a group with replacement.
  • We can use bootstrapping to simulate the creation of sampling distributions, which you did many times in this lesson.
  • By bootstrapping and then calculating repeated values of our statistics, we can gain an understanding of the sampling distribution of our statistics.

 

We can use bootstrapping and sampling distributions to build confidence intervals for our parameters of interest.

By finding the statistic that best estimates our parameter(s) of interest (say, the sample mean to estimate the population mean, or the difference in sample means to estimate the difference in population means), we can easily build confidence intervals for the parameter of interest.
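
For instance, a bootstrap confidence interval for a difference in means might be built like this (a sketch with hypothetical data):

import numpy as np

np.random.seed(42)

# Hypothetical samples from two groups
group1 = np.random.normal(100, 15, 200)
group2 = np.random.normal(95, 15, 200)

diffs = []
for _ in range(10000):
    boot1 = np.random.choice(group1, group1.size, replace=True)
    boot2 = np.random.choice(group2, group2.size, replace=True)
    diffs.append(boot1.mean() - boot2.mean())

# The middle 95% of bootstrapped differences is a 95% confidence interval
print(np.percentile(diffs, 2.5), np.percentile(diffs, 97.5))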