Statistics Refresher

Outliers

So far, we have seen some pretty symmetrical distributions, where all the data points were centered around the mean of 0. What if we had a distribution where most of the data points were still around a mean of 0, but we had one additional data point far away from this mean? We can visualize this case using the following lines of code:

# Import necessary packages for data visualization
import numpy # this package allows us to use some statistical functions
import matplotlib.pyplot as plt # this package allows us to graph data

# Create a histogram of data with a mean of 0 and a standard deviation of 10, plus one additional outlier data point
fig4 = plt.figure()
# Create a set of data points: 100 (size) drawn with a mean (loc) of 0 and a standard deviation (scale) of 10, but with one additional outlier data point equal to 200
# Note that it is not important if you perfectly understand this line of code right now
d = list(numpy.random.normal(loc=0, scale=10, size=100)) + [200]
plt.hist(d, bins=20) # Plot these data
fig4.savefig('Histogram[0,10]+Outlier.png')

This produces the following histogram, located in the tab titled "Histogram[0,10]+Outlier.png":

Histogram[0,10]+Outlier.png

Notice that there is one data point that sits far away from the rest of the data (at 200 on the x-axis). Data points like this, which do not fit the pattern of the expected distribution, are called "outliers." Verify this yourself using the terminal below.
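One way to check this without looking at the plot is to compare the largest value in the data with the second-largest one. The sketch below is only an illustration: the fixed seed and the variable names (rng, second_largest, largest, gap) are assumptions added here so the result is repeatable, and they are not part of the original code.

```python
import numpy

# Assumption: a fixed seed so that every run produces the same random data
rng = numpy.random.default_rng(0)

# Same recipe as above: 100 points around a mean of 0, plus one outlier at 200
d = list(rng.normal(loc=0, scale=10, size=100)) + [200]

# Sort the data and pick out the two largest values
second_largest, largest = sorted(d)[-2:]

# The gap between them shows how isolated the largest point is
gap = largest - second_largest

print("Largest value:", largest)
print("Second-largest value:", second_largest)
print("Gap between them:", gap)
```

If the largest value is 200 and the gap to the next value is far bigger than the standard deviation of 10, that point does not follow the pattern of the rest of the data.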

Medians

In our outlier example, the mean of the expected distribution is 0, but the single outlier pulls the mean of the full data set upward: one value of 200 among 101 data points shifts the mean by roughly 200/101 ≈ 2. One statistic that is far less influenced by the outlier is the median. The median is calculated by ranking all of the data from the smallest value to the largest value and then selecting the middle value (or, when there is an even number of data points, the average of the two middle values). If we calculate the median for our data with the outlier, it barely changes, so it typically stays closer to the intuitive value of 0.
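To see the difference in practice, we can compute both statistics with and without the outlier. This is a minimal sketch using numpy.mean and numpy.median; the fixed seed and the variable names (base, with_outlier, mean_shift, median_shift) are assumptions introduced for illustration, so the exact numbers will differ from data generated without a seed.

```python
import numpy

# Assumption: a fixed seed so that every run produces the same random data
rng = numpy.random.default_rng(0)

# 100 data points with a mean of 0 and a standard deviation of 10
base = rng.normal(loc=0, scale=10, size=100)

# The same data with one additional outlier data point equal to 200
with_outlier = numpy.append(base, 200)

# Mean and median before and after adding the outlier
mean_without = numpy.mean(base)
mean_with = numpy.mean(with_outlier)
median_without = numpy.median(base)
median_with = numpy.median(with_outlier)

# How far each statistic moved because of the single outlier
mean_shift = abs(mean_with - mean_without)
median_shift = abs(median_with - median_without)

print("Mean without / with outlier:", mean_without, mean_with)
print("Median without / with outlier:", median_without, median_with)
print("Shift in mean:", mean_shift)
print("Shift in median:", median_shift)
```

The shift in the mean should come out close to 2, while the median moves only by about half the distance between the two middle values of the original data.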
