Statistics Refresher
Outliers
So far, we have seen some pretty symmetrical distributions, where all the data points were centered around the mean of 0. What if we had a distribution where most of the data points were still around a mean of 0, but we had one additional data point far away from this mean? We can visualize this case using the following lines of code:
# Import necessary packages for data visualization
import numpy # this package allows us to use some statistical functions
import matplotlib.pyplot as plt # this package allows us to graph data
# Create 100 data points drawn from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 10, plus one additional outlier data point equal to 200
# Note that it is not important if you perfectly understand this line of code right now
d = list(numpy.random.normal(loc = 0, scale = 10, size = 100)) + [200]
# Plot these data as a histogram with 20 bins
fig4 = plt.figure()
plt.hist(d, bins = 20)
fig4.savefig('Histogram[0,10]+Outlier.png')
This produces the following histogram, located in the tab titled "Histogram[0,10]+Outlier.png":
Notice that there is one data point that sits far away from the rest of the data (at 200 on the x-axis). We call data points like this, which do not fit the pattern of the expected distribution, "outliers." Verify this yourself using the terminal below.
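One quick way to check this without a plot is to look at the largest values in the data. The snippet below is a minimal sketch: the fixed seed and the variable names `largest_three` and `gap` are assumptions added here for reproducibility and illustration, not part of the original code.

```python
import numpy

# Assumption: fix the random seed so the numbers are the same on every run
numpy.random.seed(0)

# Same construction as above: 100 normally distributed points plus one outlier at 200
d = list(numpy.random.normal(loc=0, scale=10, size=100)) + [200]

# Rank the data and inspect the three largest values
largest_three = sorted(d)[-3:]
print("Three largest values:", largest_three)

# The gap between the largest and second-largest value shows how isolated the outlier is
gap = largest_three[-1] - largest_three[-2]
print("Gap between the two largest values:", gap)
```

The largest value is the outlier at 200, and the gap to the next-largest value is far bigger than the standard deviation of 10 used to generate the rest of the data.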
Medians
In our outlier example, the mean of the expected distribution is 0, but if we calculate the mean of the data including the outlier, we get a non-zero answer: the single value of 200 on its own adds 200/101, or about 2, to the mean of the 101 data points. One statistic that is much less influenced by the outlier is the median. The median is calculated by ranking all of the data from the smallest value to the largest value and then selecting the middle value (with an even number of data points, the two middle values are averaged). If we calculate the median for our data with the outlier, it will typically be much closer to the intuitive value of 0.
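The ranking procedure described above can be sketched in a few lines of Python. This is an illustrative example, not the tutorial's own code: the helper `median_by_ranking` and the small hand-picked data set `small` are hypothetical names introduced here so the result is easy to check by hand. `numpy.mean` and `numpy.median` are the standard NumPy functions.

```python
import numpy

def median_by_ranking(values):
    """Hypothetical helper: rank the values, then pick the middle one."""
    ranked = sorted(values)          # rank from smallest to largest
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]        # odd count: a single middle value
    # even count: average the two middle values
    return (ranked[middle - 1] + ranked[middle]) / 2

# Assumption: a small symmetric data set centered on 0, plus one outlier at 200
small = [-5, -3, -1, 1, 3, 5] + [200]

print("Mean:  ", numpy.mean(small))           # pulled far from 0 by the outlier
print("Median:", median_by_ranking(small))    # stays near the center of the data
print("NumPy median:", numpy.median(small))   # built-in equivalent
```

Here the outlier drags the mean up to 200/7 (about 28.6), while the median is 1, close to the center of the remaining values. The same two calls, `numpy.mean(d)` and `numpy.median(d)`, can be applied to the list `d` from the histogram example.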