Power Plant

df_powerplant = pd.read_csv('powerplant_data.csv')
labels = ['Temperature', 'Exhaust Vacuum', 'Ambient Pressure', 'Relative Humidity', 'Net hourly electrical energy output']
df_powerplant = pd.read_csv('powerplant_data.csv', header=0, names=labels)
df_powerplant.head()

df_powerplant.to_csv('powerplant_data_edited.csv', index=False)
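
Why pass header=0 together with names? Here is a minimal sketch using an in-memory CSV instead of the real file. The two-column data and the short column names AT and V are made up for illustration. With header=0 the existing header row is replaced by the new labels; without it, Pandas treats the old header row as data.

```python
import io
import pandas as pd

# Made-up stand-in for the power plant file (hypothetical column names and values)
csv_text = "AT,V\n14.96,41.76\n25.18,62.96\n"
labels = ['Temperature', 'Exhaust Vacuum']

# header=0 says "row 0 is a header", and names replaces it with our labels
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=labels)

# Without header=0, passing names makes Pandas keep the old header row as data
kept = pd.read_csv(io.StringIO(csv_text), names=labels)

print(replaced.shape)  # two data rows
print(kept.shape)      # three rows, because the old header became a data row
```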

 

Assessing and Building Intuition

Once you have your data loaded into dataframes, Pandas makes a quick investigation of the data really easy. Let’s explore some helpful methods for assessing and building intuition about a dataset. We can use the cancer data from before to help us.

import pandas as pd

df = pd.read_csv('cancer_data.csv')
df.head()

# this returns a tuple of the dimensions of the dataframe
df.shape

# this returns the datatypes of the columns
df.dtypes

# although the datatype for diagnosis appears to be object, further
# investigation shows it’s a string
type(df['diagnosis'][0])

Pandas actually stores pointers to strings in dataframes and series, which is why object instead of str appears as the datatype. Understanding this is not essential for data analysis – just know that strings will appear as objects in Pandas.
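
To see this without the cancer file, here is a small sketch on a made-up dataframe. The column reports a Pandas dtype, while each value inside it is a plain Python str. Note that newer Pandas releases may report a dedicated string dtype instead of object, so the sketch prints the dtype rather than assuming it.

```python
import pandas as pd

# Toy stand-in for the cancer data; the values are made up
toy = pd.DataFrame({'id': [1, 2, 3], 'diagnosis': ['M', 'B', 'B']})

# dtype of the whole column (object in most versions, a string dtype in newer ones)
column_dtype = toy['diagnosis'].dtype

# type of a single value pulled out of the column
value_type = type(toy['diagnosis'][0])

print(column_dtype)
print(value_type)
```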

# this displays a concise summary of the dataframe,
# including the number of non-null values in each column
df.info()

# this returns the number of unique values in each column
df.nunique()

# this returns useful descriptive statistics for each column of data
df.describe()

# this returns the first few lines in our dataframe
# by default, it returns the first five
df.head()

# although, you can specify however many rows you’d like returned
df.head(20)

# same thing applies to `.tail()` which returns the last few rows
df.tail(2)
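
If you want to try these methods without the cancer file, here is a small sketch on a made-up dataframe with one missing value. It also uses count(), which is not shown above, to get the non-null counts that info() prints.

```python
import pandas as pd

# Toy data with one missing radius; all values are made up
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'diagnosis': ['M', 'B', 'B', 'M'],
    'radius': [17.9, 13.5, None, 20.1],
})

shape = toy.shape              # (rows, columns)
unique_counts = toy.nunique()  # missing values are not counted as a unique value
non_null = toy.count()         # non-null values per column

print(shape)
print(unique_counts)
print(non_null)
```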

Indexing and Selecting Data in Pandas

Let’s separate this dataframe into three new dataframes – one for each metric (mean, standard error, and maximum). To get the data for each dataframe, we need to select the id and diagnosis columns, as well as the ten columns for that metric.

# View the index number and label for each column
for i, v in enumerate(df.columns):
    print(i, v)

We can select data using loc and iloc. loc uses labels of rows or columns to select data, while iloc uses the index numbers. We’ll use these to index the dataframe below.

# select all the columns from 'id' to the last mean column
df_means = df.loc[:, 'id':'fractal_dimension_mean']
df_means.head()

# repeat the step above using index numbers
df_means = df.iloc[:,:11]
df_means.head()
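
One detail that is easy to miss in the two selections above: a loc slice includes its end label, while an iloc slice stops before its end position. Here is a sketch on a made-up dataframe. The column names only mimic the cancer data, and radius_SE is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature frame with made-up values
toy = pd.DataFrame({
    'id': [101, 102],
    'diagnosis': ['M', 'B'],
    'radius_mean': [17.9, 13.5],
    'texture_mean': [10.4, 14.4],
    'radius_SE': [1.1, 0.3],
})

# Label slice: both 'id' and 'texture_mean' are included
by_label = toy.loc[:, 'id':'texture_mean']

# Position slice: position 4 is excluded, so we get columns 0 through 3
by_position = toy.iloc[:, :4]

print(list(by_label.columns))
print(by_label.equals(by_position))
```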

Let’s save the dataframe of means for later.

df_means.to_csv('cancer_data_means.csv', index=False)

Cleaning Example

import pandas as pd

df = pd.read_csv('product_view_data.csv')

df

df.head()

df.info()

Replace NaN with mean

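mean = df['view_duration'].mean()

df['view_duration'] = df['view_duration'].fillna(mean)

or

# fill in place on the dataframe itself; calling fillna(inplace=True) on a
# selected column may not update df in newer Pandas versions
df.fillna({'view_duration': mean}, inplace=True)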
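
Here is a small sketch of the same fill on made-up numbers, so you can check the result by hand. mean() skips NaN by default, so the fill value is the average of the values that are present.

```python
import pandas as pd

# Made-up durations with two missing entries
toy = pd.DataFrame({'view_duration': [10.0, float('nan'), 30.0, float('nan')]})

# NaN is skipped, so this is (10 + 30) / 2
mean = toy['view_duration'].mean()

# Replace every NaN with that mean
toy['view_duration'] = toy['view_duration'].fillna(mean)

print(mean)
print(toy['view_duration'].tolist())
```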

Find Duplicates

sum(df.duplicated())

# when entire line is duplicated

df.drop_duplicates(inplace=True)
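
A quick sketch on made-up rows shows what counts as a duplicate. duplicated() flags a row only when an identical row appeared earlier, so the first copy is kept. The column names here are hypothetical.

```python
import pandas as pd

# Rows 0 and 2 are identical; row 3 shares a user but not a product
toy = pd.DataFrame({
    'user': ['a', 'b', 'a', 'a'],
    'product': ['x', 'y', 'x', 'z'],
})

# Count rows that repeat an earlier row in every column
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row
deduped = toy.drop_duplicates()

print(n_dupes)
print(len(deduped))
```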

Convert to datetime

df['timestamp'] = pd.to_datetime(df['timestamp'])
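
Why bother converting? Once the column holds datetimes, the .dt accessor lets you pull out parts such as the hour. A small sketch with made-up timestamps:

```python
import pandas as pd

# Made-up timestamps stored as plain strings
toy = pd.DataFrame({'timestamp': ['2017-06-01 08:15:00', '2017-06-02 21:40:00']})

# Convert the strings to datetime values
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# Date parts are now available through .dt
hours = toy['timestamp'].dt.hour.tolist()

print(toy['timestamp'].dtype)
print(hours)
```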

Renaming Columns

Since we also previously changed our dataset to only include means of tumor features, the “_mean” at the end of each feature seems unnecessary. It just takes extra time to type in our analysis later. Let’s come up with a list of new labels to assign to our columns.

# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if '_mean' in col:
        new_labels.append(col[:-5])  # exclude last 5 characters ("_mean")
    else:
        new_labels.append(col)

# new labels for our columns
new_labels

# assign new labels to columns in dataframe
df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()

# save this for later
df.to_csv('cancer_data_edited.csv', index=False)
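
The loop above checks whether "_mean" appears anywhere in a column name, which works for this dataset. As an alternative sketch, not the approach used above, you could strip the suffix only when the name ends with it and pass that function to rename. The helper name strip_mean_suffix and the column radius_mean_SE are hypothetical, chosen to show a name that contains "_mean" without ending in it.

```python
import pandas as pd

# Empty frame with made-up column names; only the labels matter here
toy = pd.DataFrame(columns=['id', 'diagnosis', 'radius_mean', 'radius_mean_SE'])

def strip_mean_suffix(col):
    """Drop a trailing '_mean' and leave every other name untouched."""
    suffix = '_mean'
    return col[:-len(suffix)] if col.endswith(suffix) else col

# rename applies the function to each column label and returns a new frame
renamed = toy.rename(columns=strip_mean_suffix)

print(list(renamed.columns))
```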

Plotting with Pandas

import pandas as pd

%matplotlib inline

df_census = pd.read_csv('census_income_data.csv')

df_census.info()

df_census.hist(figsize=(8, 8));

df_census['age'].hist()

df_census['age'].plot(kind='hist');

df_census['education'].value_counts().plot(kind='bar')

df_census['education'].value_counts()

df_census['education'].value_counts().plot(kind='pie', figsize=(8, 8));
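
It helps to know what value_counts() hands to the plot. It returns a series of counts indexed by category and sorted from most to least frequent, which is why the bars come out in that order. A sketch on made-up education values:

```python
import pandas as pd

# Made-up stand-in for the education column
education = pd.Series(
    ['HS-grad', 'Bachelors', 'HS-grad', 'Masters', 'HS-grad'],
    name='education',
)

# Counts per category, largest first
counts = education.value_counts()

print(counts)
```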

 

df_cancer = pd.read_csv('cancer_data_edited.csv')

pd.plotting.scatter_matrix(df_cancer, figsize=(15, 15));

df_cancer.plot(x='compactness', y='concavity', kind='scatter');

df_cancer['concave_points'].plot(kind='box');

 

import pandas as pd

df = pd.read_csv('cancer_data_edited.csv')

df.head()

df_m = df[df['diagnosis'] == 'M']

df_m.head()

mask = df['diagnosis'] == 'M'

df_m = df[mask]

df_m

df_m['area'].describe()

df_b = df[df['diagnosis'] == 'B']

df_b['area'].describe()
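
Here is the same masking idea on a made-up dataframe, small enough to check by hand. The ~ operator inverts a mask, which gives the benign rows here only because this toy data has exactly two diagnosis values.

```python
import pandas as pd

# Made-up diagnoses and areas
toy = pd.DataFrame({
    'diagnosis': ['M', 'B', 'M', 'B'],
    'area': [1000.0, 500.0, 1200.0, 400.0],
})

# Boolean series: True where the row is malignant
mask = toy['diagnosis'] == 'M'

df_m = toy[mask]    # rows where the mask is True
df_b = toy[~mask]   # rows where the mask is False

mean_m = df_m['area'].mean()
mean_b = df_b['area'].mean()

print(mean_m, mean_b)
```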

 

# total sales for the last month
df.iloc[196:, 1:].sum()
df.iloc[196:, 1:].sum().plot(kind='bar')

# average sales
# numeric_only=True skips the non-numeric week column
df.mean(numeric_only=True)
df.mean(numeric_only=True).plot(kind='pie')

# sales on march 13, 2016
df[df['week'] == '2016-03-13']
sales = df[df['week'] == '2016-03-13']
sales.iloc[0, 1:].plot(kind='bar')

# worst week for store C
df[df['storeC'] == df['storeC'].min()]

# total sales during most recent 3 month period
df.iloc[187:200].sum()

last_three_months = df[df['week'] >= '2017-12-01']
last_three_months.iloc[:, 1:].sum() # exclude sum of week column

last_three_months = df[df['week'] >= '2017-12-01']
last_three_months.iloc[:, 1:].sum().plot(kind='pie')
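
If the week column holds plain strings, the >= filter above compares text. That still gives the right answer for dates written as YYYY-MM-DD, because that format sorts the same way alphabetically and chronologically. A sturdier variant, sketched here on made-up sales numbers, converts the column to datetime first and compares against a Timestamp.

```python
import pandas as pd

# Made-up weekly sales; store names follow the pattern used above
toy = pd.DataFrame({
    'week': ['2017-11-19', '2017-12-03', '2018-01-07'],
    'storeA': [100, 200, 300],
    'storeB': [10, 20, 30],
})

# Convert the strings so comparisons are true date comparisons
toy['week'] = pd.to_datetime(toy['week'])

# Keep weeks on or after December 1, 2017
recent = toy[toy['week'] >= pd.Timestamp('2017-12-01')]

# Sum every column except week
totals = recent.iloc[:, 1:].sum()

print(totals)
```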

 

EDA with Visuals

Create visualizations to answer the quiz questions below this notebook.

wine_df['fixed_acidity'].plot(kind='hist')

wine_df.plot(x='fixed_acidity', y='quality', kind='scatter')

Drawing Conclusions Using Groupby

# Load `winequality_edited.csv`
import pandas as pd
df = pd.read_csv('winequality_edited.csv')

Is a certain type of wine associated with higher quality?

# Find the mean quality of each wine type (red and white) with groupby
df.groupby('color').mean()
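
The call above returns the mean of every numeric column for each color. If you only care about quality, you can select that column after grouping. A sketch on made-up wine data:

```python
import pandas as pd

# Made-up wines; the real file has many more columns and rows
toy = pd.DataFrame({
    'color': ['red', 'white', 'red', 'white'],
    'quality': [5, 6, 6, 7],
    'pH': [3.5, 3.1, 3.3, 3.2],
})

# Group the rows by color, then average only the quality column
mean_quality = toy.groupby('color')['quality'].mean()

print(mean_quality)
```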
