Appending Data

# import numpy and pandas
import numpy as np
import pandas as pd

# load red and white wine datasets
red_df = pd.read_csv(‘winequality-red.csv’, sep=’;’)
white_df = pd.read_csv(‘winequality-white.csv’, sep=’;’)

red_df.rename(columns={‘total_sulfur-dioxide’:’total_sulfur_dioxide’}, inplace=True)

Create Color Columns

Create two arrays as long as the number of rows in the red and white dataframes that repeat the value “red” or “white.” NumPy offers really easy way to do this. Here’s the documentation for NumPy’s repeat function. Take a look and try it yourself.

# create color array for red dataframe
color_red = np.repeat(‘red’, red_df.shape[0])

# create color array for white dataframe
color_white = np.repeat(‘white’, white_df.shape[0]

Add arrays to the red and white dataframes. Do this by setting a new column called ‘color’ to the appropriate array. The cell below does this for the red dataframe.

red_df[‘color’] = color_red
red_df.head()

Do the same for the white dataframe and use head() to confirm the change.

white_df[‘color’] = color_white
white_df.head()

Combine DataFrames with Append

Check the documentation for Pandas’ append function and see if you can use this to figure out how to combine the dataframes. (Bonus: Why aren’t we using the merge method to combine the dataframes?) If you don’t get it, I’ll show you how afterwards. Make sure to save your work in this notebook! You’ll come back to this later.

# append dataframes
wine_df = red_df.append(white_df)

# view dataframe to check for success
wine_df.head()

Save Combined Dataset

Save your newly combined dataframe as winequality_edited.csv. Remember, set index=False to avoid saving with an unnamed column!

wine_df.to_csv(‘winequality_edited.csv’, index=False)

 

Drawing Conclusions Using Groupby

In the notebook below, you’re going to investigate two questions about this data using Pandas’ groupby function. Here are tips for answering each question:

Q1: Is a certain type of wine (red or white) associated with higher quality?

For this question, compare the average quality of red wine with the average quality of white wine with groupby. To do this group by color and then find the mean quality of each group.

Q2: What level of acidity (pH value) receives the highest average rating?

This question is more tricky because unlike color, which has clear categories you can group by (red and white) pH is a quantitative variable without clear categories. However, there is a simple fix to this. You can create a categorical variable from a quantitative variable by creating your own categories. Pandas’ cut function let’s you “cut” data in groups. Using this, create a new column called acidity_levels with these categories:

Acidity Levels:

  1. High: Lowest 25% of pH values
  2. Moderately High: 25% – 50% of pH values
  3. Medium: 50% – 75% of pH values
  4. Low: 75% – max pH value

Here, the data is being split at the 25th, 50th, and 75th percentile. Remember, you can get these numbers with Pandas’ describe()! After you create these four categories, you’ll be able to use groupby to get the mean quality rating for each acidity level.

# Find the mean quality of each wine type (red and white) with groupby
df.groupby(‘color’).mean()

# View the min, 25%, 50%, 75%, max pH values with Pandas describe
df.describe()

# Bin edges that will be used to “cut” the data into groups
bin_edges = [2.72, 3.11,3.21, 3.32, 4.01] # Fill in this list with five values you just found

# Labels for the four acidity level groups
bin_names = [ ‘high’, ‘moderately high’ , ‘medium’, ‘low’ ] # Name each acidity level category

# Creates acidity_levels column
df[‘acidity_levels’] = pd.cut(df[‘pH’], bin_edges, labels=bin_names)

# Checks for successful creation of this column
df.head()

# Find the mean quality of each acidity level with groupby
df.groupby(‘acidity_levels’).mean()

# Save changes for the next section
df.to_csv(‘winequality_edited.csv’, index=False)

Another useful function that we’re going to use is Pandas’ query function.

In the previous lesson, we selected rows in a dataframe by indexing with a mask. Here are those same examples, along with equivalent statements that use query().

# selecting malignant records in cancer data
df_m = df[df['diagnosis'] == 'M']
df_m = df.query('diagnosis == "M"')

# selecting records of people making over $50K
df_a = df[df['income'] == ' >50K']
df_a = df.query('income == " >50K"')

Drawing Conclusions Using Query

In the notebook below, you’re going to investigate two questions about this data using Pandas’ query function. Here are tips for answering each question:

Q1: Do wines with higher alcoholic content receive better ratings?

To answer this question, use query to create two groups of wine samples:

  1. Low alcohol (samples with an alcohol content less than the median)
  2. High alcohol (samples with an alcohol content greater than or equal to the median)

Then, find the mean quality rating of each group.

Q2: Do sweeter wines (more residual sugar) receive better ratings?

Similarly, use the median to split the samples into two groups by residual sugar and find the mean quality rating of each group.

# get the median amount of alcohol content
df.describe()

# select samples with alcohol content less than the median
low_alcohol = df[df[‘alcohol’] < 10.491801]

# select samples with alcohol content greater than or equal to the median
high_alcohol = df[df[‘alcohol’] >= 10.491801]

# ensure these queries included each sample exactly once
num_samples = df.shape[0]
num_samples == low_alcohol[‘quality’].count() + high_alcohol[‘quality’].count() # should be True

# get mean quality rating for the low alcohol and high alcohol groups
bin_edges = [8,10.3,14.9]
bin_names = [‘low’, ‘high’]
df[‘alcohol_levels’] = pd.cut(df[‘alcohol’], bin_edges, labels=bin_names)
df.groupby(‘alcohol_levels’).mean()

import maplotlib.pyplot as plt
import seaborn as sns

https://seaborn.pydata.org/examples/index.html

Creating a Bar Chart Using Matplotlib

import matplotlib.pyplot as plt
% matplotlib inline
There are two required arguments in pyplot’s bar function: the x-coordinates of the bars, and the heights of the bars.

plt.bar([1, 2, 3], [224, 620, 425]);
You can specify the x tick labels using pyplot’s xticks function, or by specifying another parameter in the bar function. The two cells below accomplish the same thing.

# plot bars
plt.bar([1, 2, 3], [224, 620, 425])

# specify x coordinates of tick labels and their labels
plt.xticks([1, 2, 3], [‘a’, ‘b’, ‘c’]);

# plot bars with x tick labels
plt.bar([1, 2, 3], [224, 620, 425], tick_label=[‘a’, ‘b’, ‘c’]);
Set the title and label axes like this.

plt.bar([1, 2, 3], [224, 620, 425], tick_label=[‘a’, ‘b’, ‘c’])
plt.title(‘Some Title’)
plt.xlabel(‘Some X Label’)
plt.ylabel(‘Some Y Label’);

Plotting with Matplotlib

Use Matplotlib to create bar charts that visualize the conclusions you made with groupby and query.

# Import necessary packages and load `winequality_edited.csv`
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline

df = pd.read_csv(‘winequality_edited.csv’)
df.info()
df.head()

#1: Do wines with higher alcoholic content receive better ratings?

Create a bar chart with one bar for low alcohol and one bar for high alcohol wine samples. This first one is filled out for you.

# Use query to select each group and get its mean quality
median = df[‘alcohol’].median()
low = df.query(‘alcohol < {}’.format(median))
high = df.query(‘alcohol >= {}’.format(median))

mean_quality_low = low[‘quality’].mean()
mean_quality_high = high[‘quality’].mean()

# Create a bar chart with proper labels
locations = [1, 2]
heights = [mean_quality_low, mean_quality_high]
labels = [‘Low’, ‘High’]
plt.bar(locations, heights, tick_label=labels)
plt.title(‘Average Quality Ratings by Alcohol Content’)
plt.xlabel(‘Alcohol Content’)
plt.ylabel(‘Average Quality Rating’);

#2: Do sweeter wines receive higher ratings?

Create a bar chart with one bar for low residual sugar and one bar for high residual sugar wine samples.

# Use query to select each group and get its mean quality
median = df[‘residual_sugar’].median()
lowsugar = df.query(‘residual_sugar < {}’.format(median))
highsugar = df.query(‘residual_sugar >= {}’.format(median))

num_samples = df.shape[0]
num_samples == lowsugar[‘quality’].count() + highsugar[‘quality’].count() # should be True

mean_lowsugar = lowsugar[‘quality’].mean()
mean_highsugar = highsugar[‘quality’].mean()

# Create a bar chart with proper labels
locations = [1, 2]
heights = [mean_lowsugar, mean_highsugar]
labels = [‘Low Sugar’, ‘High Sugar’]
plt.bar(locations, heights, tick_label=labels)
plt.title (‘Average Quality Ratings by Sugar Level’)
plt.xlabel(‘Sugar Content’)
plt.ylabel(‘Average Quality Rating’);

#3: What level of acidity receives the highest average rating?

Create a bar chart with a bar for each of the four acidity levels.

Plotting Wine Type and Quality with Matplotlib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline
import seaborn as sns
sns.set_style(‘darkgrid’)

wine_df = pd.read_csv(‘winequality_edited.csv’)

Create arrays for red bar heights white bar heights

Remember, there’s a bar for each combination of color and quality rating. Each bar’s height is based on the proportion of samples of that color with that quality rating.

  1. Red bar proportions = counts for each quality rating / total # of red samples
  2. White bar proportions = counts for each quality rating / total # of white samples

# get counts for each rating and color
color_counts = wine_df.groupby([‘color’, ‘quality’]).count()[‘pH’]
color_counts

# get total counts for each color
color_totals = wine_df.groupby(‘color’).count()[‘pH’]
color_totals

# get proportions by dividing red rating counts by total # of red samples
red_proportions = color_counts[‘red’] / color_totals[‘red’]
red_proportions

# get proportions by dividing white rating counts by total # of white samples
white_proportions = color_counts[‘white’] / color_totals[‘white’]
white_proportions

Plot proportions on a bar chart

Set the x coordinate location for each rating group and and width of each bar.

ind = np.arange(len(red_proportions)) # the x locations for the groups
width = 0.35 # the width of the bars

Now let’s create the plot.

# plot bars
red_bars = plt.bar(ind, red_proportions, width, color=’r’, alpha=.7, label=’Red Wine’)
white_bars = plt.bar(ind + width, white_proportions, width, color=’w’, alpha=.7, label=’White Wine’)

# title and labels
plt.ylabel(‘Proportion’)
plt.xlabel(‘Quality’)
plt.title(‘Proportion by Wine Color and Quality’)
locations = ind + width / 2 # xtick locations
labels = [‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’] # xtick labels
plt.xticks(locations, labels)

# legend
plt.legend()

Oh, that didn’t work because we’re missing a red wine value for a the 9 rating. Even though this number is a 0, we need it for our plot. Run the last two cells after running the cell below.

red_proportions[‘9′] = 0
red_proportions

ind = np.arange(len(red_proportions)) # the x locations for the groups
width = 0.35 # the width of the bars

# plot bars
red_bars = plt.bar(ind, red_proportions, width, color=’r’, alpha=.7, label=’Red Wine’)
white_bars = plt.bar(ind + width, white_proportions, width, color=’w’, alpha=.7, label=’White Wine’)

# title and labels
plt.ylabel(‘Proportion’)
plt.xlabel(‘Quality’)
plt.title(‘Proportion by Wine Color and Quality’)
locations = ind + width / 2 # xtick locations
labels = [‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’] # xtick labels
plt.xticks(locations, labels)

# legend
plt.legend()

 

 

%d bloggers like this: