Friday, November 27, 2015

Spurious Correlations

There is a relatively new book out called Spurious Correlations by Tyler Vigen. It focuses on data sets that are clearly unrelated yet correlate very well. Some are fun, like this one about the connection between Nicolas Cage films and drowning.

Others almost seem plausible, like this one correlating revenue generated by arcades with computer science doctorates.
The interesting thing is that all of these are generated by a computer program that scrapes the internet for data and then checks which data sets correlate with each other. You can hear Tyler talking about it here.
But before there was a book there was (and still is) the site. The site has gone through a couple of incarnations, and its current form is a lot cleaner. I think the original site is a little nicer for one reason, though: it gives the table of data along with the graph. On the current site the graphs look nicer, but to get the actual values you have to hover over each point to reveal them.

Classroom Connections

So what can we use this for? At the very least, we can use it to discuss the nature of correlation vs causation and the misuse of correlation by the media, politicians, etc. There is actually a nice little TEDx talk about this very thing:

So just looking through the already created graphs is one thing that you could do. But there is an awesome feature built into the site that allows you to Discover a Correlation. Here you have access to all the data sets he has scraped from the net and can use them to find your own spurious correlation. You start by choosing the first variable you want to work with. To do that, pick a topic and then click View Variables. You will then see all the datasets relating to that topic (for the graph below, I chose Miscellaneous). Choose the dataset you want to use as your first variable and then click Correlate (I chose Staple Sales). Then you get a list of all the datasets that have a strong correlation with the one you chose, each shown with its correlation coefficient. Pick your favourite and then click on Chart (I chose Age of Academy Awards Best Actress). That creates the graph and gives you a permalink that will have the table of values and other correlation info.

Now what I can do is take that table of values and do some analysis on it. For example, I imported it into Fathom and created the line of best fit, but you could do any other analysis that you would normally do for two variable data. So I would have your students find the most outrageous correlated variables and then do the analysis.
And if you like some of the graphs on the new site but you want the tables of data, you can use this same method to build the graph and get the table of values that way, rather than hovering over each point to get the values. So for our Nicolas Cage data above, here is the link to the raw dataset. Note that if you don't like the black background on those graphs, you can click on Rechart and it will give you a printer friendly version.
So have fun finding your spurious correlations. BTW, thanks to Mark Esping for reminding me of this site.
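If you don't have Fathom handy, the same line of best fit and correlation coefficient can be worked out in a few lines of Python. This is only a rough sketch: the function name `fit_line` is my own, and the numbers in `x_values` and `y_values` are made-up placeholders, not values from the site. Paste in the two columns from your own permalink table instead.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r

# Placeholder numbers only (NOT data from the site); replace with your own table.
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = fit_line(x_values, y_values)
print(f"y = {intercept:.3f} + {slope:.3f}x, r = {r:.3f}")
```

Students can compare the r value they get here with the correlation coefficient the site reports for the same pair of variables.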


Main site:
Original site:
Build your own:

Saturday, November 21, 2015

Anscombe's Quartet

Anscombe's Quartet is a collection of four two-variable data sets that share a particularly interesting property.

Upon examination, the first three sets have the same x values, but other than that the y values all seem random. The interesting part begins when you start to do some numerical analysis on them. Start with some simple single variable calculations.
  • Mean of each x set = 9
  • Mean of each y set = 7.50
  • Variance of each x set = 11
  • Variance of each y set = 4.122-4.128
So, almost identical. And then if you take that a step further you can do the two variable analysis on each set and get the following:
  • Correlation of each set = 0.816
  • Line of best fit for each set: y = 3 + 0.5x
So with all that analysis done, you might get the impression that these are pretty much just different versions of the same data. But then when you graph them you get something entirely different:
So you really see that they are very different sets of data. The lesson here is that your data cannot be fully described by numerical analysis or graphical analysis alone; both are necessary.
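If you want to check the summary statistics listed above without a spreadsheet, here is a rough Python sketch. The data values are the standard published ones from Anscombe's 1973 paper, typed in by me rather than copied from the spreadsheet linked in this post, so it is worth checking them against that file. The names `QUARTET`, `summarize` and `SUMMARIES` are my own choices.

```python
from math import sqrt
from statistics import mean, variance

# Standard published values for Anscombe's quartet (transcribed by hand;
# verify against the linked spreadsheet before using in class).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Single variable and two variable summary statistics for one data set."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": x_bar,
        "mean_y": y_bar,
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "var_y": variance(ys),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": y_bar - slope * x_bar,
    }

SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(
        f"Set {name}: mean x = {s['mean_x']:.2f}, mean y = {s['mean_y']:.2f}, "
        f"var x = {s['var_x']:.3f}, var y = {s['var_y']:.3f}, r = {s['r']:.3f}, "
        f"line: y = {s['intercept']:.2f} + {s['slope']:.3f}x"
    )
```

The printed rows come out nearly identical for all four sets, which is exactly why the graphs are needed to tell them apart.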

Classroom Connections

So how do you use this in class? This set is best used with students who have already done both single variable and two variable analysis. It is a great set for tying together many of the concepts of data analysis.

One thing that you can do is use this Desmos Activity Builder that walks students through the analysis. Keep in mind that students should be familiar with calculating mean and variance via a spreadsheet. They should also be familiar with using Desmos in terms of graphing functions and doing linear regression.

To analyze the data, you (or your students) can use this Google Spreadsheet or CODAP file.

Do you have ideas of any leading questions you would ask students? Do you have ways that you could use this dataset with students? Leave your ideas in the comment section.


The Data: Google Sheet CODAP
The Activity: