Statistics & Python Refresher
In the previous chapter we covered a number of statistics concepts that are worth remembering. We'll briefly review them here, but if these concepts are new to you, please review Lesson 5.
Average (mean): The sum of all of the values divided by the number of values. The mean is a useful tool for quickly summarizing your data. However, the mean gives an incomplete picture of what your distribution actually looks like. Specifically, it gives a sense of where the center of the data is, but not the spread of the data (how widely the values range).
Range (spread): The distance between the largest number in a set and the smallest number.
Standard Deviation: The spread of the data can be quantified by something called the “standard deviation.” For now, the details of how this is calculated don’t matter. The intuition is that for each value, we calculate how far it is from the mean and then combine all those differences into a single number (the standard deviation). Therefore, if many of the values are far from the mean, we get a large standard deviation, meaning that the spread of the values is wide. Alternatively, if most of the values are close to the mean, we get a small standard deviation, meaning that the spread is narrow.
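To make the three definitions above concrete, here is a minimal sketch using numpy. The two data sets and the helper function summarize are made up for illustration; they are not part of the lesson's data. Both sets have the same mean, so only the range and the standard deviation reveal that one is far more spread out than the other. Note that numpy.std computes the population standard deviation by default, which is an assumption about which version you want.

```python
import numpy  # statistics functions

# Two made-up data sets with the same mean but different spreads
narrow = numpy.array([9, 10, 10, 11])
wide = numpy.array([1, 5, 15, 19])

def summarize(values):
    """Return the mean, range, and standard deviation of a set of values."""
    mean = numpy.mean(values)                            # center of the data
    value_range = numpy.max(values) - numpy.min(values)  # largest minus smallest
    std_dev = numpy.std(values)                          # population standard deviation
    return mean, value_range, std_dev

narrow_mean, narrow_range, narrow_std = summarize(narrow)
wide_mean, wide_range, wide_std = summarize(wide)

print("narrow:", narrow_mean, narrow_range, narrow_std)
print("wide:  ", wide_mean, wide_range, wide_std)
```

The means match, but the second data set has both a larger range and a larger standard deviation, which is exactly the information the mean alone leaves out.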
Python Packages & Importing Files
Instead of writing our own code to deal with large data sets, we will use what we refer to as a Python package. Packages are collections of code someone else has already written to perform complicated tasks that you can then use so you don’t have to reinvent the wheel! There are many packages available to serve all sorts of applications and needs.
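As a small illustration of not reinventing the wheel, the sketch below computes a mean twice: once with a hand-written function and once with code that already exists. It uses the statistics module from Python's standard library as a stand-in for a package, and the function my_mean and the heights values are hypothetical, chosen only for this example.

```python
import statistics  # standard-library module, standing in for a package here

heights = [150, 160, 170, 180]  # made-up values for illustration

# Option 1: write the calculation ourselves
def my_mean(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)

# Option 2: let existing, already-tested code do the work
print(my_mean(heights))
print(statistics.mean(heights))
```

Both approaches give the same answer, but the second is a single line and does not need to be written or debugged by you.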
Often we need to do more than use packages: we also need to import data from files, which may come from other computers. The syntax for loading a sample file called HumanHeightWeightData.csv is shown below.
import pandas # work with big data
import numpy # statistics functions
# Load data
human_data = pandas.read_csv('HumanHeightWeightData.csv')
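After loading a file, it is usually worth checking that the data arrived as expected. The sketch below shows one way to do that. Because the contents of HumanHeightWeightData.csv are not shown here, it reads a small in-memory stand-in instead; the column names Height and Weight and all of the values are assumptions made up for illustration, so substitute the real column names from your own file.

```python
import io
import pandas  # work with big data

# Stand-in for a CSV file; the column names and values are made up
csv_text = io.StringIO(
    "Height,Weight\n"
    "65.8,112.9\n"
    "71.5,136.5\n"
    "69.4,153.0\n"
)
sample_data = pandas.read_csv(csv_text)

print(sample_data.shape)             # (number of rows, number of columns)
print(sample_data.head())            # first few rows of the table
print(sample_data["Height"].mean())  # mean of a single column
```

With a real file, you would pass the file name to pandas.read_csv, as in the lesson's code, and run the same checks on the result.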