b for new cell

x to delete cell

m to markdown

 

#axis=1 so that the labels are treated as column names (for example, when dropping columns instead of rows)

#before dropping, chart the rows with missing values to learn what is missing and to check that they do not differ from the dataframe as a whole

#figsize height must be greater than 0; the 8 here is an assumed value
df.hist(figsize=(10,8))

df[df.Age.isnull()].hist(figsize=(10,8))

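#numeric_only=True so that text columns (Name, Sex, ...) are skipped when computing the means
df.fillna(df.mean(numeric_only=True), inplace=True)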
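
The check-then-fill step above can be sketched on a tiny made-up frame. This is only an illustration: the values and the names missing_age, fare_all, fare_missing and df_filled are hypothetical, and it returns a new frame rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Titanic dataframe (values are made up)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 40.0],
    "Fare": [7.25, 8.05, 71.28, 8.46, 26.55],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Rows where Age is missing, to compare against the whole dataframe
missing_age = df[df.Age.isnull()]
fare_all = df.Fare.mean()
fare_missing = missing_age.Fare.mean()

# Fill numeric gaps with each column's mean; numeric_only skips the text column
df_filled = df.fillna(df.mean(numeric_only=True))
```

If fare_missing is far from fare_all, the rows with missing ages may not be a random sample, which is worth noting before filling them with the mean.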

Was fare associated with survival?

Create masks for the rows of passengers who survived and the rows of passengers who died.

survived = df.Survived == True

died = df.Survived == False

df.Fare[survived].mean()

df.Fare[died].mean()
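
A minimal sketch of the mask approach on made-up numbers, with groupby as a cross-check that gives the same two means in one call. The data and the names mean_survived, mean_died and by_group are hypothetical.

```python
import pandas as pd

# Hypothetical sample; Survived is coded 1/0 as in the Titanic data
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0],
    "Fare": [80.0, 10.0, 40.0, 20.0, 30.0],
})

# Boolean masks: one True/False value per row
survived = df.Survived == True
died = df.Survived == False

mean_survived = df.Fare[survived].mean()
mean_died = df.Fare[died].mean()

# Same comparison without masks
by_group = df.groupby("Survived").Fare.mean()
```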

df.Fare[survived].hist(alpha=0.5, bins=20, label='survived')

df.Fare[died].hist(alpha=0.5, bins=20, label='died');

#semicolon on the last line so that the text output doesn't pop up

#alpha makes the bars more transparent, so overlapping histograms stay visible

#bins is the number of bars (intervals) in the histogram

 

df.groupby('Pclass').Survived.mean().plot(kind='bar');

 

df.Age[survived].hist(alpha=0.5, bins=20, label='survived')

df.Age[died].hist(alpha=0.5, bins=20, label='died');

 

df.groupby('Sex').Survived.mean().plot(kind='bar');

df.groupby('Sex')['Pclass'].value_counts()

df.query('Sex == "female"')['Fare'].median(), df.query('Sex == "male"')['Fare'].median()

df.groupby(['Pclass', 'Sex']).Survived.mean().plot(kind='bar');

 

df.SibSp[survived].value_counts().plot(kind='bar', alpha=0.5, color='blue', label='survived')

df.SibSp[died].value_counts().plot(kind='bar', alpha=0.5, color='orange', label='died')

plt.legend();

 

df.Parch[survived].value_counts().plot(kind='bar', alpha=0.5, color='blue', label='survived')

df.Parch[died].value_counts().plot(kind='bar', alpha=0.5, color='orange', label='died')

plt.legend();

 

df.Embarked[survived].value_counts().plot(kind='bar', alpha=0.5, color='blue', label='survived')

df.Embarked[died].value_counts().plot(kind='bar', alpha=0.5, color='orange', label='died')

plt.legend();

 

but use matplotlib instead!
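
One possible reason to use matplotlib directly: value_counts() orders each series by frequency, so two overlaid pandas bar plots can end up with their categories in different positions. A rough sketch, assuming made-up data and a shared category order; the Agg backend line is only there so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample of the Titanic columns used here
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Embarked": ["S", "S", "C", "Q", "S", "C"],
})
survived = df.Survived == 1
died = df.Survived == 0

# One shared, sorted category order so both sets of bars line up
categories = sorted(df.Embarked.dropna().unique())
survived_counts = df.Embarked[survived].value_counts().reindex(categories, fill_value=0)
died_counts = df.Embarked[died].value_counts().reindex(categories, fill_value=0)

# Draw the two groups side by side around each tick position
x = np.arange(len(categories))
width = 0.4
fig, ax = plt.subplots()
ax.bar(x - width / 2, survived_counts.values, width, color="blue", alpha=0.5, label="survived")
ax.bar(x + width / 2, died_counts.values, width, color="orange", alpha=0.5, label="died")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("passengers")
ax.legend()
```

The same pattern works for SibSp and Parch by swapping the column name.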

 

df.isnull().any()
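
df.isnull().any() only says whether a column has gaps. A small sketch, on made-up data, of also getting the count and share of missing values per column; the names has_missing, missing_counts and missing_share are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Fare": [7.25, 8.05, 71.28, 8.46],
    "Embarked": ["S", None, "C", "S"],
})

has_missing = df.isnull().any()      # True/False per column
missing_counts = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()   # fraction of rows missing per column
```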


The Hardest Thing In Data Science

https://gallery.azure.ai/Experiment/Methods-for-handling-missing-values-1

The Limitations of the Data in Predictive Analytics

Exploration Phase

The project's visualizations are varied and show multiple comparisons and trends. Relevant statistics are computed throughout the analysis when an inference is made about the data.

At least two kinds of plots should be created as part of the explorations.

Required: At least two kinds of plots should be created as part of the exploration. Currently the notebook only has bar plots.
Suggestion: I would encourage you to use the Seaborn library and try more varied visualizations, such as histograms and scatter plots.

Conclusions Phase

The results of the analysis are presented such that any limitations are clear. The analysis does not state or imply that one change causes another based solely on a correlation.

Required: The conclusion is missing the limitations of the dataset. You need to state what your future work or potential areas to explore could be. Were there any shortcomings or factors limiting your analysis? All of this should be clearly stated.

Communication

Reasoning is provided for each analysis decision, plot, and statistical summary.

Required: Before every plot or analysis decision, explain what you want to achieve. After every plot or analysis decision, explain what the outcome was, what your inferences and observations are, and what you conclude from the plot, analysis decision, or statistical summary.
The results presented in the conclusion should be the reasoning and observations drawn from the plots.

Binary Variables should be given descriptive labeling

  • Required: Although binary representations are great for analysis, they are not the best for visualizations that need to be easily readable. Please convert them to proper descriptions. For example, Gender might be stored as 1 and 0, but the values really mean “Male” and “Female”. You’ve done this for most of your other plots, so it should be no issue to do it for this one.
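
A minimal sketch of one way to address this, assuming Survived is stored as 1/0. The Outcome column name and the label text are hypothetical choices, and the data is made up.

```python
import pandas as pd

# Hypothetical sample; Survived is the binary column to relabel
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0],
    "Sex": ["female", "male", "female", "male", "female"],
})

# Map the binary codes to readable labels in a new column,
# keeping the original 1/0 column for calculations
df["Outcome"] = df.Survived.map({0: "Died", 1: "Survived"})

# Grouping on the labelled column gives readable index labels for plots
outcome_counts = df.groupby("Outcome").size()
```

Plotting outcome_counts with kind='bar' then shows "Died" and "Survived" on the axis instead of 0 and 1.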