In this lesson, you will:

  1. Identify Regression Applications
  2. Learn How Regression Works
  3. Apply Regression to Problems Using Python

Machine Learning is frequently split into supervised and unsupervised learning. Regression, which you will be learning about in this lesson (and its extensions in later lessons), is an example of supervised machine learning.

In supervised machine learning, you are interested in predicting a label for your data. Common examples include predicting whether a transaction is fraudulent, which customers will buy a product, or home values in an area.

In unsupervised machine learning, you are interested in clustering data that isn’t already labeled. These algorithms are covered in more detail in the Machine Learning Engineer Nanodegree, but we will not be going into their details in this course.

In simple linear regression, we compare two quantitative variables to one another.

The response variable is what you want to predict, while the explanatory variable is the variable you use to predict the response. A common way to visualize the relationship between two variables in linear regression is using a scatterplot. You will see more on this in the concepts ahead.

Scatter plots

Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the correlation coefficient, commonly denoted by r.

Though there are a few different ways to measure correlation between two variables, the most common way is with Pearson’s correlation coefficient. Pearson’s correlation coefficient provides the:

  1. Strength
  2. Direction

of a linear relationship.

Spearman’s correlation coefficient does not measure linear relationships specifically, and it might be more appropriate for certain cases of associating two variables.
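As a quick illustration, both coefficients are available in scipy.stats. Here is a minimal sketch, assuming two small NumPy arrays of hypothetical data:

import numpy as np
from scipy import stats

# hypothetical data: y increases roughly linearly with x
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

pearson_r, _ = stats.pearsonr(x, y)    # strength/direction of a linear relationship
spearman_r, _ = stats.spearmanr(x, y)  # strength/direction of a monotonic relationship

print(pearson_r, spearman_r)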

Correlation Coefficients

Correlation coefficients provide a measure of the strength and direction of a linear relationship.

We can tell the direction based on whether the correlation is positive or negative.

A rule of thumb for judging the strength:

Strong: 0.7 \leq |r| \leq 1.0
Moderate: 0.3 \leq |r| < 0.7
Weak: 0.0 \leq |r| < 0.3

Calculation of the Correlation Coefficient

r = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}

It can also be calculated in Excel and other spreadsheet applications using CORREL(col1, col2), where col1 and col2 are the two columns you are looking to compare to one another.
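The same value can be computed in Python with pandas. Here is a minimal sketch, using a DataFrame with hypothetical area and price values:

import pandas as pd

df = pd.DataFrame({'area': [1200, 1500, 1800, 2100, 2500],
                   'price': [200000, 255000, 300000, 360000, 420000]})

# pandas computes Pearson's r by default
r = df['area'].corr(df['price'])
print(r)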


A line is commonly identified by an intercept and a slope.

The intercept is defined as the predicted value of the response when the x-variable is zero.

The slope is defined as the predicted change in the response for every one unit increase in the x-variable.

We notate the line in linear regression in the following way:

\hat{y} = b_0 + b_1x_1

where

\hat{y} is the predicted value of the response from the line.

b_0 is the intercept.

b_1 is the slope.

x_1 is the explanatory variable.

y is an actual response value for a data point in our dataset (not a prediction from our line).
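For example, with hypothetical values b_0 = 12.5 and b_1 = 3.2, the fitted line would be

\hat{y} = 12.5 + 3.2x_1

which predicts a response of 12.5 when x_1 is zero, and a predicted increase of 3.2 in the response for every one unit increase in x_1.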

The main algorithm used to find the best fit line is called the least-squares algorithm, which finds the line that minimizes \sum\limits_{i=1}^n(y_i - \hat{y}_i)^2.

There are other ways we might choose a “best” line, but this algorithm tends to do a good job in many scenarios.

How Do We Determine The Line of Best Fit?

You saw in the last video that in regression we are interested in minimizing the following function:

\sum\limits_{i=1}^n(y_i - \hat{y}_i)^2

It turns out that, in order to minimize this function, there are closed-form equations that provide the intercept and slope we should use.

Given a set of data points (x_i, y_i), in order to compute the slope and intercept, we need to compute the following:

\bar{x} = \frac{1}{n}\sum x_i

\bar{y} = \frac{1}{n}\sum y_i

s_y = \sqrt{\frac{1}{n-1}\sum(y_i - \bar{y})^2}

s_x = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}

r = \frac{\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}

b_1 = r\frac{s_y}{s_x}

b_0 = \bar{y} - b_1\bar{x}
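To make these formulas concrete, here is a minimal sketch in NumPy that computes each quantity in turn (the data values are hypothetical):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_x = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))   # sample standard deviation of x
s_y = np.sqrt(((y - y_bar) ** 2).sum() / (n - 1))   # sample standard deviation of y

# Pearson's correlation coefficient
r = ((x - x_bar) * (y - y_bar)).sum() / (
    np.sqrt(((x - x_bar) ** 2).sum()) * np.sqrt(((y - y_bar) ** 2).sum()))

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept
print(b0, b1)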

But Before You Get Carried Away…

Though you are now fully capable of carrying out these steps by hand, in the age of computers it doesn’t really make sense to do so. Instead, using computers allows us to focus on interpreting and acting on the output. If you want to see a step-by-step walkthrough of this in Excel, you can find that here. In the rest of this lesson, you will get some practice doing this in Python.

 

Running pip install statsmodels will install the library on your machine.

import pandas as pd
import numpy as np
import statsmodels.api as sm

# df is assumed to be a DataFrame with 'price' and 'area' columns
df['intercept'] = 1  # column of ones so the model fits an intercept

lm = sm.OLS(df['price'], df[['intercept', 'area']])
results = lm.fit()
results.summary()

Here is a post on the need for an intercept in nearly all cases of regression. Again, there are very few cases where you do not need to include an intercept in a linear model.
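As an aside, statsmodels also provides add_constant, which saves you from adding the column of ones manually. A sketch, assuming the same df with price and area columns as above:

import statsmodels.api as sm

X = sm.add_constant(df['area'])          # adds a 'const' column of ones
results = sm.OLS(df['price'], X).fit()
results.summary()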

 

We can perform hypothesis tests for the coefficients in our linear models using Python (and other software). These tests help us determine if there is a statistically significant linear relationship between a particular variable and the response. The hypothesis test for the intercept isn’t useful in most cases.

However, the hypothesis test for each x-variable tests whether the population slope is equal to zero against the alternative that it differs from zero. Therefore, if the slope is different from zero (the alternative is true), we have evidence that the x-variable attached to that coefficient has a statistically significant linear relationship with the response. This in turn suggests that the x-variable should help us predict the response (or at least do better than not having it in the model).
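These test results appear in the summary output, but you can also pull them out programmatically. A minimal sketch, assuming results is the fitted OLS model from above:

# estimated coefficients (intercept and slope)
print(results.params)

# p-values for the test of each coefficient equaling zero
print(results.pvalues)

# 95% confidence intervals for the coefficients
print(results.conf_int(alpha=0.05))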

 

In simple linear regression, the R-squared value is the square of the correlation coefficient.

A common definition of R-squared is the amount of variability in the response variable that can be explained by the x-variable in our model. In general, the closer this value is to 1, the better our model fits the data.

Many feel that R-squared isn’t a great measure (which is possibly true), but I would argue that cross-validation can assist us in validating any measure that helps us understand the fit of a model to our data. Here, you can find one post on why an individual doesn’t care for R-squared.
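You can verify the relationship between r and R-squared directly. A sketch, assuming the housing df and fitted results from the earlier code:

r = df['area'].corr(df['price'])   # Pearson's correlation coefficient
print(r ** 2)                      # square of the correlation...
print(results.rsquared)            # ...matches the model's R-squared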

 

Housing Analysis

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('./house_price_area_only.csv')
df.head()

1. Use the documentation here and the statsmodels library to fit a linear model to predict price based on area. Obtain a summary of the results, and use them to answer the following quiz questions. Don’t forget to add an intercept.

df['intercept'] = 1

lm = sm.OLS(df['price'], df[['intercept', 'area']])
results = lm.fit()
results.summary()

 

Regression Carats vs. Price

import numpy as np
import pandas as pd
import statsmodels.api as sms
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./carats.csv', header=None)
df.columns = ['carats', 'price']
df.head()

1. Similar to the last notebook, fit a simple linear regression model to predict price based on the weight of a diamond. Use your results to answer the first question below. Don’t forget to add an intercept.

df['intercept'] = 1

lm = sms.OLS(df['price'], df[['intercept', 'carats']])
results = lm.fit()
results.summary()

2. Use scatter to create a scatterplot of the relationship between price and weight. Then use the scatterplot and the output from your regression model to answer the second quiz question below.

plt.scatter(df['carats'], df['price']);
plt.xlabel('Carats');
plt.ylabel('Price');
plt.title('Price vs. Carats');

 

HomesVCrime

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2+
import matplotlib.pyplot as plt
%matplotlib inline

boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df['CrimePerCapita'] = df2.iloc[:, 0]  # first column is the per-capita crime rate
df.head()

The Boston housing data is a built-in dataset in the sklearn Python library. You will be using two of the variables from this dataset, which are stored in df: the median home price (in thousands of dollars) and the crime per capita in the area of the home, shown above.

1. Use this dataframe to fit a linear model to predict the home price based on the crime rate. Use your output to answer the first quiz below. Don’t forget an intercept.

df['intercept'] = 1

lm = sm.OLS(df['MedianHomePrice'], df[['intercept', 'CrimePerCapita']])
results = lm.fit()
results.summary()

2. Plot the relationship between the crime rate and median home price below. Use your plot and the results from the first question as necessary to answer the remaining quiz questions below.

plt.scatter(df['CrimePerCapita'], df['MedianHomePrice']);
plt.xlabel('Crime/Capita');
plt.ylabel('Median Home Price');
plt.title('Median Home Price vs. CrimePerCapita');

 

## To show the line that was fit I used the following code from
## https://plot.ly/matplotlib/linear-fits/
## It isn't the greatest fit... but it isn't awful either

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

y = df['MedianHomePrice']
x = df['CrimePerCapita']

# fit a simple linear regression with scipy
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# evaluate the fitted line over the observed range of crime rates
xi = np.arange(0, 100)
line = slope * xi + intercept

plt.plot(x, y, 'o', xi, line);
plt.xlabel('Crime/Capita');
plt.ylabel('Median Home Price');
plt.title('Median Home Price vs. CrimePerCapita');

 

Recap

In this lesson, you learned about simple linear regression. The topics in this lesson included:

  1. Simple linear regression, which is about building a line that models the relationship between two quantitative variables.
  2. Correlation coefficients, a measure that can inform you about the strength and direction of a linear relationship.
  3. Scatterplots, the most common way to visualize simple linear regression.
  4. The intercept and slope that define a line, which you found using the statsmodels library in Python.
  5. The interpretations of the slope, intercept, and R-squared values.

Up Next

In the next lesson, you will extend your knowledge from simple linear regression to multiple linear regression.
