Blog Posts

Simple Linear Regression

In this lesson, you will:

- Identify Regression Applications
- Learn How Regression Works
- Apply Regression to Problems Using Python

Machine learning is frequently split into supervised and unsupervised learning. Regression, which you will be learning about in this lesson (and its extensions in later lessons), is an example of supervised machine learning. In supervised machine learning, you are interested in predicting a label for your data. Commonly, you might want to predict fraud, customers that will buy a product, or home values in an area. In unsupervised machine learning, you are interested in clustering data together that isn't already labeled. This is covered in more detail in the Machine Learning Engineer Nanodegree. However, we will not be going into the details of these algorithms in this course.…

As a taste of what's ahead, here is a minimal sketch of fitting a simple linear regression in Python with scikit-learn. The file name and column names ('house_prices.csv', 'area', 'price') are hypothetical, not from the lesson:

import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical dataset: predict home price from square footage
df = pd.read_csv('house_prices.csv')
X = df[['area']]  # features must be 2-D for scikit-learn
y = df['price']   # the label we want to predict

model = LinearRegression()
model.fit(X, y)

# fitted line: price = intercept + coef * area
print(model.intercept_, model.coef_)

# predicted price for a 1500 sq ft home
print(model.predict(pd.DataFrame({'area': [1500]})))
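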

Continue Reading

Case Study: A/B Tests

A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other. A/B testing is very much like hypothesis testing, with the following hypotheses:

- Null Hypothesis: The new version is no better, or even worse, than the old version
- Alternative Hypothesis: The new version is better than the old version

If we fail to reject the null hypothesis, the results would suggest keeping the old version. If we reject the null hypothesis, the results would suggest launching…
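To make this concrete, here is a minimal sketch of simulating the null hypothesis for a difference in conversion rates. All the counts are made up for illustration:

import numpy as np

# hypothetical observed results: conversions out of visitors for each group
old_conv, old_n = 250, 5000  # control (old version)
new_conv, new_n = 290, 5000  # experiment (new version)

obs_diff = new_conv / new_n - old_conv / old_n

# under the null, both groups share one common conversion rate
null_rate = (old_conv + new_conv) / (old_n + new_n)

# simulate 10,000 differences in rate assuming the null is true
null_diffs = (np.random.binomial(new_n, null_rate, 10000) / new_n
              - np.random.binomial(old_n, null_rate, 10000) / old_n)

# p-value: how often the null produces a difference at least as large as observed
print((null_diffs >= obs_diff).mean())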

Continue Reading

Hypothesis Testing

Rules for setting up null and alternative hypotheses:

- The H₀ is true before you collect any data.
- The H₀ usually states there is no effect or that two groups are equal.
- The H₀ and H₁ are competing, non-overlapping hypotheses.
- H₁ is what we would like to prove to be true.
- H₀ contains an equal sign of some kind: either =, ≤, or ≥.
- H₁ contains the opposite of the null: either ≠, >, or <.

You saw that the statement "innocent until proven guilty" suggests the following hypotheses:

- H₀: Innocent
- H₁: Guilty

We can relate this to the idea that "innocent" is true before we collect any data. Then the alternative must be a competing, non-overlapping hypothesis. Hence, the alternative hypothesis is that an individual is guilty. Because…
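A minimal sketch of these rules in action, using simulated data and a one-sample t-test (scipy and the 3-second threshold are my own illustration, not from the post):

import numpy as np
from scipy import stats

# hypothetical data: 40 page-load times (seconds) after a site change
np.random.seed(42)
load_times = np.random.normal(2.9, 0.5, 40)

# H0: mean load time >= 3.0 (no improvement); note H0 holds the equality
# H1: mean load time < 3.0 (the claim we would like to support)
t_stat, p_two_sided = stats.ttest_1samp(load_times, 3.0)

# ttest_1samp is two-sided, so convert to the one-sided p-value for H1
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)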

Continue Reading

Confidence Intervals – Udacity

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)
full_data = pd.read_csv('../data/coffee_dataset.csv')
sample_data = full_data.sample(200)

diffs = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace=True)
    coff_mean = bootsamp[bootsamp['drinks_coffee'] == True]['height'].mean()
    nocoff_mean = bootsamp[bootsamp['drinks_coffee'] == False]['height'].mean()
    diffs.append(coff_mean - nocoff_mean)

np.percentile(diffs, 0.5), np.percentile(diffs, 99.5)
# statistical evidence coffee drinkers are on average taller

diffs_age = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace=True)
    under21_mean = bootsamp[bootsamp['age'] == '<21']['height'].mean()
    over21_mean = bootsamp[bootsamp['age'] != '<21']['height'].mean()
    diffs_age.append(over21_mean - under21_mean)

np.percentile(diffs_age, 0.5), np.percentile(diffs_age, 99.5)
# statistical evidence that over 21s are on average taller

diffs_coff_under21 = []
for _ in range(10000):
    bootsamp = sample_data.sample(200, replace=True)
    under21_coff_mean = bootsamp.query("age == '<21'…
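Not part of the excerpt, but a natural sanity check: continuing from the snippet above (reusing diffs, np, and plt), you could plot the bootstrap distribution with its 99% interval cutoffs:

# visualize the first bootstrap distribution with the 99% interval bounds
low, high = np.percentile(diffs, 0.5), np.percentile(diffs, 99.5)
plt.hist(diffs, bins=30)
plt.axvline(x=low, color='red', linewidth=2)   # lower cutoff
plt.axvline(x=high, color='red', linewidth=2)  # upper cutoff
plt.xlabel('bootstrapped difference in mean height');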

Continue Reading

Statistics – Udacity

Descriptive Statistics

Descriptive statistics is about describing our collected data using the measures discussed throughout this lesson: measures of center, measures of spread, the shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.

Inferential Statistics

Inferential statistics is about using our collected data to draw conclusions about a larger population. Performing inferential statistics well requires that we take a sample that accurately represents our population of interest. A common way to collect data is via a survey. However, surveys may be extremely biased depending on the types of questions that are asked, and the way the questions are asked. This is a topic you should think about when tackling the first project. We…
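A minimal sketch of those descriptive measures in pandas, on a made-up sample:

import pandas as pd

# hypothetical survey sample
df = pd.DataFrame({'age': [23, 35, 31, 44, 29, 52, 38, 27]})

print(df['age'].mean(), df['age'].median())                 # measures of center
print(df['age'].std(), df['age'].max() - df['age'].min())   # measures of spread
print(df['age'].describe())                                 # summary in one call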

Continue Reading

Probability – Udacity

Probability

Here you learned some fundamental rules of probability. Using notation, we could say that the outcome of a coin flip could either be T or H for the event that the coin flips tails or heads, respectively. Then the following rules are true:

- P(H) = 0.5
- 1 − P(H) = P(not H) = 0.5, where "not H" is the event of anything other than heads. Since there are only two possible outcomes, we have that P(not H) = P(T) = 0.5. In later concepts, you will see this written with the notation ¬H.

Across multiple coin flips, the probability of seeing n heads is P(H)^n. This is because these events are independent. We can get two generic rules from this:

1. The probability of any event must be between…
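You can check the P(H)^n rule by simulation; a minimal sketch (the trial count is arbitrary):

import numpy as np

np.random.seed(42)
# one million trials of 3 coin flips each; 0 = tails, 1 = heads
flips = np.random.randint(2, size=(1_000_000, 3))

# proportion of trials where all 3 flips are heads; approaches 0.5 ** 3
print((flips.sum(axis=1) == 3).mean())
print(0.5 ** 3)  # 0.125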

Continue Reading

Project Notes – Udacity

Jupyter notebook shortcuts: b for new cell, x to delete cell, m to convert to markdown

# axis=1 so that the column names are being used
# before dropping, you could chart it to learn what is missing;
# make sure it's not different from the general dataframe
df.hist(figsize=(10, 8))
df[df.Age.isnull()].hist(figsize=(10, 8))
df.fillna(df.mean(), inplace=True)

# was fare associated with survival?
# create masks for rows that survived and rows that died
survived = df.Survived == True
died = df.Survived == False
df.Fare[survived].mean()
df.Fare[died].mean()

df.Fare[survived].hist(alpha=0.5, bins=20, label='survived')
df.Fare[died].hist(alpha=0.5, bins=20, label='died');
# semicolon on last line so that text doesn't pop up
# alpha to make it more transparent
# bins is the number of columns

df.groupby('Pclass').Survived.mean().plot(kind='bar');

df.Age[survived].hist(alpha=0.5, bins=20, label='survived')
df.Age[died].hist(alpha=0.5, bins=20, label='died');

df.groupby('Sex').Survived.mean().plot(kind='bar');
df.groupby('Sex')['Pclass'].value_counts()
df.query('Sex == "female"')['Fare'].median(), df.query('Sex == "male"')['Fare'].median()
df.groupby(['Pclass', 'Sex']).Survived.mean().plot(kind='bar');

df.SibSp[survived].value_counts().plot(kind='bar', alpha=0.5, color='blue', label='survived')
df.SibSp[died].value_counts().plot(kind='bar', alpha=0.5,…

Continue Reading

Data Analysis Process – Case Study 2 – Udacity

Cleaning Column Labels

1. Drop extraneous columns

Drop features that aren't consistent (not present in both datasets) or aren't relevant to our questions. Use pandas' drop function.

2. Rename Columns

Change the "Sales Area" column label in the 2008 dataset to "Cert Region" for consistency. Rename all column labels to replace spaces with underscores and convert everything to lowercase. (Underscores can be much easier to work with in Python than spaces. For example, having spaces wouldn't allow you to use df.column_name instead of df['column_name'] to select columns or use query(). Being consistent with lowercase and underscores also helps make column names easy to remember.) A sketch of this step appears after the snippet below.

# load datasets
import pandas as pd
df_08 = pd.read_csv('all_alpha_08.csv')
df_18 = pd.read_csv('all_alpha_18.csv')

# view 2008 dataset
df_08.head(1)

# view 2018 dataset
df_18.head(1)

Drop…
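A minimal sketch of how step 2 could look (one approach; the case study's exact code may differ):

# make the 2008 label consistent with 2018 before standardizing
df_08 = df_08.rename(columns={'Sales Area': 'Cert Region'})

# lowercase all labels and replace spaces with underscores, in both datasets
df_08.columns = df_08.columns.str.lower().str.replace(' ', '_')
df_18.columns = df_18.columns.str.lower().str.replace(' ', '_')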

Continue Reading

Data Analysis Process – Case Study 1 – Udacity

Appending Data

# import numpy and pandas
import numpy as np
import pandas as pd

# load red and white wine datasets
red_df = pd.read_csv('winequality-red.csv', sep=';')
white_df = pd.read_csv('winequality-white.csv', sep=';')

red_df.rename(columns={'total_sulfur-dioxide': 'total_sulfur_dioxide'}, inplace=True)

Create Color Columns

Create two arrays as long as the number of rows in the red and white dataframes that repeat the value "red" or "white." NumPy offers a really easy way to do this. Here's the documentation for NumPy's repeat function. Take a look and try it yourself.

# create color array for red dataframe
color_red = np.repeat('red', red_df.shape[0])

# create color array for white dataframe
color_white = np.repeat('white', white_df.shape[0])

Add arrays to the red and white dataframes. Do this by setting a new column called 'color' to the appropriate…
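A minimal sketch of the assignment and append step, using pd.concat (DataFrame.append has since been deprecated, so the original post may show it differently):

# attach the arrays as a new 'color' column on each dataframe
red_df['color'] = color_red
white_df['color'] = color_white

# stack the two dataframes into one combined wine dataset
wine_df = pd.concat([red_df, white_df], ignore_index=True)
wine_df.head()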

Continue Reading

Plotting with Pandas – Udacity

import pandas as pd
%matplotlib inline

df_census = pd.read_csv('census_income_data.csv')
df_census.info()
df_census.hist(figsize=(8, 8));
df_census['age'].hist()
df_census['age'].plot(kind='hist');
df_census['education'].value_counts()  # aggregates counts for each unique value in a column
df_census['education'].value_counts().plot(kind='bar')
df_census['education'].value_counts().plot(kind='pie', figsize=(8, 8));

df_cancer = pd.read_csv('cancer_data_edited.csv')
pd.plotting.scatter_matrix(df_cancer, figsize=(15, 15));
df_cancer.plot(x='compactness', y='concavity', kind='scatter');
df_cancer['concave_points'].plot(kind='box');

import pandas as pd
df = pd.read_csv('cancer_data_edited.csv')
df.head()

df_m = df[df['diagnosis'] == 'M']
df_m.head()

mask = df['diagnosis'] == 'M'
df_m = df[mask]
df_m

df_m['area'].describe()
df_b = df[df['diagnosis'] == 'B']
df_b['area'].describe()

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(df_b['area'], alpha=0.5, label='benign')
ax.hist(df_m['area'], alpha=0.5, label='malignant')
ax.set_title('Distribution of Benign and Malignant Tumor Areas')
ax.set_xlabel('Area')
ax.set_ylabel('Count')
ax.legend(loc='upper right')
plt.show()

Exploring Data with Visuals Quiz

# imports and load data
import pandas as pd
%…

Continue Reading