If you work with lots of data points that share the same attributes, but the values are all over the place, a box plot could be your best friend. What is a box plot? It is a data visualization that shows you the distribution of your data for a given attribute. A box plot is sometimes called a box-and-whisker plot because the lines running out to the min and max look like whiskers attached to the body, which is the box representing the middle 50% of the data. If you have lots of data, you probably want to quickly know how the data looks as a whole, and this is the perfect visualization for that task. Let me explain…
Below we have a box plot (thank you netuitive.com for the picture). Any set of numeric data has a minimum and a maximum value. These are represented by the ends of the whiskers. The box itself represents the middle 50% of the data, with the median value shown as the line inside the box. The span from the minimum to the lower edge of the box covers the lowest quarter of the data, and the span from the upper edge of the box to the maximum covers the highest quarter. Any values plotted beyond the whiskers are considered outliers.
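If you want to see the numbers behind the picture, here is a minimal sketch in Python. A couple of assumptions on my part: the function name `box_plot_summary` is made up for illustration, the finish times are invented placeholder values in minutes, and outliers are flagged with the common "1.5 times the interquartile range" rule, which many charting tools use by default but which yours may not.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Return the numbers a box plot draws for one set of values.

    Assumption: outliers are points farther than `whisker` times the
    interquartile range (IQR) from the box. Charting tools may differ.
    """
    data = sorted(values)
    # Quartiles: lower edge of the box, median line, upper edge of the box.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up finish times in minutes, with one very slow race.
times = [130, 131, 132, 133, 134, 135, 136, 137, 160]
print(box_plot_summary(times))
```

With these placeholder values the box runs from 132 to 136, the median line sits at 134, and the 160-minute race is drawn as a lone dot beyond the upper whisker.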
A lot of information in one graphic, huh? That’s why it’s one of my favorites!
Let’s take an example I came up with for Data Visualization. I downloaded some data around running events for the Summer Olympics. I was curious as to what the marathon times look like comparing women and men. Here is the first box plot comparing the data:
We can see that the top quartile of marathon times covers a slightly wider range for women than for men, but overall the two distributions look about the same. Recall that the dots are outliers in the data over the years.
I was also curious about specific years, broken down by gender. Note that women did not have the option to run the marathon at the Olympics until 1984, so there is no data for them in the years before that. If we take a look at the men first, we can see yearly breakdowns. (I should note here that the data file only contained the gold, silver, and bronze medalists, so not all times are included. It also changes the visualization a bit because there should really be more than 3 data points per year for this graph to “work”, but I thought the data was different and interesting.)
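To show why three data points per year stretch the idea of a box plot, here is a small sketch. The function name `three_point_box` and the three times are placeholders I made up, not values from the data file, and the quartiles use Python's "inclusive" interpolation, which may not match what your charting tool does.

```python
from statistics import quantiles


def three_point_box(times):
    """Box plot numbers for one year's three medalists (times in minutes)."""
    gold, silver, bronze = sorted(times)
    # With only three values, the quartiles are interpolated between
    # neighboring finishers rather than being real observations.
    q1, median, q3 = quantiles([gold, silver, bronze], n=4, method="inclusive")
    return {"min": gold, "q1": q1, "median": median, "q3": q3, "max": bronze}


# Hypothetical medalist times, NOT taken from the Olympic data set.
print(three_point_box([130.0, 131.0, 134.0]))
```

Here the minimum, median, and maximum are simply gold, silver, and bronze, while the box edges (130.5 and 132.5) are just the halfway points between neighboring medalists. The box still shows how spread out the podium was, but it is not summarizing a real distribution.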
We can see that in some years, the finish times were much more spread out than in others. We can also see that over time, the marathon times have gotten faster and the top 3 finishers cross the line closer to each other. It is odd that although the general trend for finish times is down, there are a couple of spikes in the data. I was curious if this had anything to do with temperature and/or elevation, so I added both to my marathon data (thank you, Google) and plotted them along the bottom by year. (Note that I could not get temperature data all the way back, so I started where I had data to work with in the set. Also, to get both temperature and elevation on the same graph, I enabled a second Y axis for the elevation data.)
I had originally noticed that 1968 was a strange year because it had quite a jump. We can see below that the spike in finish times is likely due to the elevation: 7400’!
Filtering to the women’s results for the years they ran the marathon, we can see there is a spike for 1992, but our available data does not show a possible reason. (Of course, I researched this… I saw that the marathon started at 6:30 PM in 1992. As a marathoner myself, I can’t imagine trying to stay rested and relaxed, planning how to eat during the day, and starting a race in the EVENING. Most of our races start before 7 AM!)
This next one has nothing to do with box plots, but because I think maps are neat, I wanted to plot the temperature for each Olympics and the average time for the marathon. I split it up by gender to see the results. I found it interesting to see visually that the men’s time in 1992 was faster than the women’s.
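If you would like to compute those averages yourself, here is a rough sketch of grouping times by year and gender. The function name `average_time_by_year_and_gender`, the `(year, gender, minutes)` record layout, and the numbers are all assumptions for illustration; the real data file may be organized differently.

```python
from collections import defaultdict
from statistics import mean


def average_time_by_year_and_gender(records):
    """Average finish time in minutes for each (year, gender) pair."""
    groups = defaultdict(list)
    for year, gender, minutes in records:
        groups[(year, gender)].append(minutes)
    return {key: mean(times) for key, times in sorted(groups.items())}


# Placeholder records, NOT real Olympic results.
records = [
    (1992, "M", 133.0), (1992, "M", 134.0), (1992, "M", 135.0),
    (1992, "W", 152.0), (1992, "W", 153.0), (1992, "W", 154.0),
]
print(average_time_by_year_and_gender(records))
```

Each average can then be paired with that year's host city and temperature when building the map.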
So, to take one from Sheldon’s “Fun with Flags”, I hope you learned a little in my “Fun with Box Plots” today.
PS – I have also included the data set if you want to see other track distance stats, too: OlympicTrackStats