Fitting Logistic Regression

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('./fraud_dataset.csv')
df.head()

1. As you can see, there are two columns that need to be changed to dummy variables. Replace each of the current columns with its dummy version: use 1 for weekday and True, and 0 otherwise. Use the first quiz to answer a few questions about the dataset.

# dummy for day: 1 = weekday, 0 = weekend
df['weekday'] = pd.get_dummies(df['day'])['weekday']
# encode fraud as 1/0, then drop the redundant complement column
df[['not_fraud', 'fraud']] = pd.get_dummies(df['fraud'])
df = df.drop('not_fraud', axis=1)

print(df['fraud'].mean())      # proportion of fraudulent transactions
print(df['weekday'].mean())    # proportion of weekday transactions
print(df.groupby('fraud').mean()['duration'])  # average duration by fraud status

2. Now that you have dummy variables, fit a logistic regression model to predict whether a transaction is fraud using both day and duration. Don't forget an intercept! Use the second quiz below to make sure you fit the model correctly.

df['intercept'] = 1
log_mod = sm.Logit(df['fraud'], df[['intercept', 'weekday', 'duration']])
results = log_mod.fit()
results.summary()

 

# exponentiate the coefficients to read them as multiplicative changes in the odds of fraud
np.exp(-1.4637), np.exp(2.5465)

# for a negative coefficient, the reciprocal of the exponentiated value is easier to interpret
1/np.exp(-1.4637)
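Equivalently, you can exponentiate every fitted coefficient in one step instead of typing the values by hand; a minimal sketch using the results object fit above:

np.exp(results.params)      # exponentiate all coefficients at once

1/np.exp(results.params)    # reciprocals, easier to read for negative coefficients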

 

Interpreting Results of Logistic Regression

In this notebook (and the quizzes), you will get some practice with interpreting the coefficients in logistic regression. What you saw in the previous video should help you work through this notebook.

The dataset contains four variables: admit, gre, gpa, and prestige:

  • admit is a binary variable. It indicates whether a candidate was admitted into UCLA (admit = 1) or not (admit = 0).
  • gre is the GRE score. GRE stands for Graduate Record Examination.
  • gpa stands for Grade Point Average.
  • prestige is the prestige of an applicant's alma mater (the school attended before applying), with 1 being the highest (most prestigious) and 4 the lowest (least prestigious).

To start, let’s read in the necessary libraries and data.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("./admissions.csv")
df.head()

There are a few different ways you might choose to work with the prestige column in this dataset. Here, we want the change from prestige 1 to prestige 2 to be able to affect the acceptance rate differently than the change from prestige 3 to prestige 4, so prestige should be treated as categorical rather than quantitative.

  1. With the above idea in place, create the dummy variables needed to treat prestige as a categorical variable rather than a quantitative one, then answer quiz 1 below.

df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])

df.head()

df['prestige'].astype(str).value_counts()   # number of applicants in each prestige tier

2. Now, fit a logistic regression model to predict if an individual is admitted using gre, gpa, and prestige with a baseline prestige value of 1. Use the results to answer quizzes 2 and 3 below. Don't forget an intercept.

df['intercept'] = 1

# prest_1 is omitted from the predictors, making prestige 1 the baseline
logit_mod = sm.Logit(df['admit'], df[['intercept', 'gre', 'gpa', 'prest_2', 'prest_3', 'prest_4']])
results = logit_mod.fit()
results.summary()

# exponentiate the coefficients so they read as multiplicative changes in the odds of admission
np.exp(results.params)

# `_` holds the previous cell's output, so this gives the reciprocals
1/_

df.groupby('prestige').mean()['admit']   # acceptance rate for each prestige level

 

Udacity’s Machine Learning Course is a great way to get started with the ideas that connect with these two lessons, as well as demonstrate some more advanced techniques.

The Machine Learning Nanodegree is another great step following this program.

 

Recall: True Positive / (True Positive + False Negative). Out of all the items that are truly positive, how many were correctly classified as positive? Or simply, how many positive items were 'recalled' from the dataset?

Precision: True Positive / (True Positive + False Positive). Out of all the items labeled as positive, how many truly belong to the positive class?
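To make the formulas concrete, here is a minimal sketch computing both metrics by hand from a 2x2 confusion matrix in sklearn's layout (actual labels in rows, predicted in columns); the cell values are illustrative and happen to match the admissions example later in this lesson:

import numpy as np

cm = np.array([[23, 1],
               [14, 2]])
tn, fp, fn, tp = cm.ravel()   # sklearn's cell order for a binary problem

recall = tp / (tp + fn)       # 2 / (2 + 14) = 0.125
precision = tp / (tp + fp)    # 2 / (2 + 1) ≈ 0.667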

 

Model Diagnostics in Python

In this notebook, you will try out some of the model diagnostics you saw from Sebastian, but in your case there are only two classes: admitted or not admitted.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(42)

df = pd.read_csv('../data/admissions.csv')
df.head()

1. Change prestige to dummy variable columns that are added to df. Then divide your data into training and test data. Make your test set 20% of the data, and use a random state of 0. Your response should be the admit column. Here are the docs, which you can also find with a quick Google search if you get stuck.

df[['prest_1', 'prest_2', 'prest_3', 'prest_4']] = pd.get_dummies(df['prestige'])

# drop the original prestige column and prest_1, which serves as the baseline
X = df.drop(['admit', 'prestige', 'prest_1'], axis=1)
y = df['admit']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

2. Now use sklearn's LogisticRegression to fit a logistic model using gre, gpa, and 3 of your prestige dummy variables. For now, fit the logistic regression model without changing any of the hyperparameters.

The usual steps are:

  • Instantiate
  • Fit (on train)
  • Predict (on test)
  • Score (compare predict to test)

As a first score, obtain the confusion matrix. Then answer the first question below about how well your model performed on the test data.

log_mod = LogisticRegression()
log_mod.fit(X_train, y_train)
preds = log_mod.predict(X_test)
confusion_matrix(y_test, preds)

3. Now, try out a few additional metrics: precision, recall, and accuracy are all popular metrics, which you saw with Sebastian. You could compute these directly from the confusion matrix, but you can also use the built-in functions in sklearn.

Another very popular pair of metrics is the ROC curve and AUC. These actually use the predicted probabilities from the logistic regression model, not just the labels. This is also a great resource for understanding ROC curves and AUC.

Try out these metrics to answer the second quiz question below. The ROC plot is also provided below. The ideal case is for the curve to shoot all the way to the upper left-hand corner. Again, these are discussed in more detail in the Machine Learning Udacity program.

precision_score(y_test, preds)

recall_score(y_test, preds)

accuracy_score(y_test, preds)

from ggplot import *
from sklearn.metrics import roc_curve, auc
%matplotlib inline

# use the predicted probabilities of the positive class, not the labels
preds = log_mod.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, preds)

df = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
ggplot(df, aes(x='fpr', y='tpr')) +\
    geom_line() +\
    geom_abline(linetype='dashed')
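Since auc was imported above, the curve can be reduced to a single summary number in one line:

auc(fpr, tpr)   # 0.5 is no better than chance; 1.0 is a perfect classifier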

Note that the Wikipedia layout shown at the end of the video is actually the opposite of sklearn's output. In sklearn's confusion matrix, the predicted labels run across the columns, and the actual labels run across the rows. Therefore:

              Predicted
               0     1
Actual   0    23     1
         1    14     2
  • There are 23 non-admitted students that we predicted to be non-admitted.
  • There are 14 admitted students that we predicted to be non-admitted.
  • There is 1 non-admitted student that we predicted to be admitted.
  • There are 2 admitted students that we predicted to be admitted.
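One way to avoid mixing up the cells is to unpack them explicitly. A minimal sketch; note that preds was overwritten with probabilities above, so the label predictions are recomputed first:

label_preds = log_mod.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, label_preds).ravel()   # sklearn returns tn, fp, fn, tp
print(tn, fp, fn, tp)   # matches the table above: 23 1 14 2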

Why Train-Test Split And Additional Documentation

Here is the documentation for sklearn's logistic regression. Additionally, here is the documentation for working with confusion matrices.

In this screencast, you created a train and a test dataset, a very popular practice in machine learning. A great paper on this topic is provided here. In general, it is useful to split your data into training and testing data to ensure your model can predict well not only on the data it was fit to, but also on data that the model has never seen before. Showing that the model performs well on test data gives assurance that it will do well in future use cases, whether that be future students, future transactions, or any other future predictions you might want to make.
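As a quick illustration, a fitted sklearn model can be scored on both splits; a large gap between the two accuracies would suggest the model does not generalize (a minimal sketch reusing log_mod from above):

print(log_mod.score(X_train, y_train))   # mean accuracy on the training data
print(log_mod.score(X_test, y_test))     # mean accuracy on data the model has never seen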

 

Recap

This concludes the practical statistics content! Much of what you saw in these last two lessons on Multiple Linear Regression and Logistic Regression begins to move towards more of a Data Science view of the world, beyond what most Data Analysts perform on a day-to-day basis. However, I hope you enjoyed some of the challenges in these two lessons.

These lessons on Multiple Linear Regression and Logistic Regression were just a first glimpse of two methods that are part of supervised machine learning. You can learn more from the free Udacity course, or gain project reviews and access to the Udacity community as part of the journey through the Machine Learning Nanodegree.

In this lesson, we looked at Logistic Regression. You learned:

  1. How to use Python to perform logistic regression to predict binary response values in both statsmodels and sklearn.
  2. How to interpret coefficients from logistic regression output in statsmodels.
  3. How to assess how well your model is performing using a variety of metrics.
  4. How to assess model fit in Python.

You have come a long way! Congrats, and good luck with the project!

The notebook solutions and data for sampling distributions through logistic regression lessons are available at the bottom of this page.