
Big Data









Summary Statistics

We already discussed the average (mean) earlier, which is the sum of all of the values divided by the number of values. The mean can be a useful tool for quickly summarizing your data. However, the mean gives an incomplete picture of what your distribution actually looks like. Specifically, it gives a sense of where the center of the data is, but not the spread of the data (how widely the values range).

Why is range important? Imagine living in a climate where the mean temperature year-round is 65 degrees Fahrenheit, and it’s almost never below 60 or above 70 degrees. Compare that to living in a climate where the mean temperature year-round is 65 degrees, but it can go as low as 0 or as high as 100. Though the means are the same, these climates are obviously very different! 
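To make the climate comparison concrete, here is a minimal sketch in Python. The monthly temperatures are made-up values chosen only so that both lists average 65 degrees, and the helper name summarize is hypothetical; neither comes from real weather data.

```python
import numpy

# Hypothetical monthly temperatures in degrees Fahrenheit (illustrative values only)
mild_climate = [60, 62, 64, 66, 68, 70, 70, 68, 66, 64, 62, 60]
extreme_climate = [0, 30, 50, 65, 80, 100, 100, 95, 85, 75, 60, 40]

def summarize(values):
    """Return the mean, minimum, and maximum of a list of numbers."""
    return numpy.mean(values), min(values), max(values)

for name, temps in [("mild", mild_climate), ("extreme", extreme_climate)]:
    mean, low, high = summarize(temps)
    print(name, "mean:", mean, "lowest:", low, "highest:", high)
```

Both lists report the same mean, but the lowest and highest values show how differently the two climates behave.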


We can create histograms that share the same mean (0) but look very different using nearly identical code. The code below produces one histogram; the code for the other two differs only in the value of the scale parameter, set through SD (equal to 1, 5, or 0.1 in each graph, respectively).

import numpy  # statistics functions
import matplotlib.pyplot as plt  # plotting

# Create a histogram with mean MEAN and standard deviation SD
MEAN = 0  # every histogram in this section has a mean of 0
SD = 1  # Change this value (1, 5, or 0.1) to see the impact of different spreads
fig1 = plt.figure()
# Draw 1000 random values from a normal distribution and plot them
plt.hist(numpy.random.normal(loc=MEAN, scale=SD, size=1000))
plt.xlim(-20, 20)  # set x-axis range to be -20 to 20
fig1.savefig('Histogram_Mean_SD.png')

The difference in spread can be quantified by something called the “standard deviation.” For now, the details of how this is calculated don’t matter. The intuition is that for each value, we calculate how far it is from the mean and then combine all those differences into a single number (the standard deviation). Therefore, if many of the values are far from the mean, we get a large standard deviation, meaning that the spread of the values is wide. Alternatively, if most of the values are close to the mean, we get a small standard deviation, meaning that the spread is narrow.
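For readers who want to see the intuition in code, the sketch below shows one common way to combine the differences: square each distance from the mean, average the squares, and take the square root. This is the "population" form of the standard deviation and is offered as an illustration, not as the only definition; the function name standard_deviation is hypothetical. The result is compared with numpy's built-in numpy.std, which uses the same form by default.

```python
import numpy

def standard_deviation(values):
    """Population standard deviation: root of the mean squared distance from the mean."""
    values = numpy.asarray(values, dtype=float)
    mean = values.mean()
    distances = values - mean  # how far each value is from the mean
    return numpy.sqrt(numpy.mean(distances ** 2))

# Same mean (0), three different spreads, as in the histograms above
for sd in (1, 5, 0.1):
    sample = numpy.random.normal(loc=0, scale=sd, size=1000)
    print("scale:", sd,
          "our estimate:", standard_deviation(sample),
          "numpy.std:", numpy.std(sample))
```

Because the samples are random, the printed estimates will vary from run to run, but each should land close to the scale value used to generate the sample.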
