Found Data: 2015

Monday, December 21, 2015

How much would you pay for a $50 Gift Card?

How much would you pay for a gift card on eBay? Perhaps I should back up a bit. Maybe for Christmas someone gets me a Tiffany's gift card. I will likely not be going to Tiffany's any time soon (don't tell my wife), so that gift card is not worth much to me. But it may be worth something to someone else. So, being an enterprising person, I put it up for auction on eBay. I wouldn't expect to sell it for more than what the gift card is worth (you would think). So the question then is, what percent of the actual value of the card will I be able to sell it for? Well, years ago the crew at Freakonomics shared this data set of 100 gift cards and what they sold for on eBay. The data is almost 10 years old, but it still turns out to be a fairly rich data set.

The Analysis

The attributes in this set are the card type (Best Buy, iTunes, etc.), the value of the card, how much it sold for, the shipping cost, how many bids it had, the feedback rating of the seller, the sale price (including shipping) as a percentage of the card's value, the average percentage per card type and the actual link to the auction. That means there are a large number of things you can analyse. For single variable work you could find measures of central tendency for the entire set or individually for each type of card. Or just choose your type of single variable graph and create it for the whole group or by card type.

Or you could do some two variable analysis to see the connection between the value of the card and the sale price (for either the whole group or by card type).

And because the data exists, you could even do some comparisons of the average percentage that a card gets.
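
As a rough sketch of the percentage calculation described above, the snippet below works out what share of face value each card fetched (winning bid plus shipping, as in the data set) and averages it by card type. The rows, the tuple layout and the function names are made up for illustration; they are not values or column names from the Freakonomics spreadsheet.

```python
from collections import defaultdict

# Hypothetical rows: (card type, face value, winning bid, shipping cost).
# These numbers are invented for illustration only.
auctions = [
    ("Best Buy", 50.00, 44.00, 1.00),
    ("Best Buy", 100.00, 91.00, 0.00),
    ("iTunes", 25.00, 20.00, 1.25),
    ("iTunes", 50.00, 41.50, 1.00),
]


def percent_of_value(value, sold, shipping):
    """Total paid (bid plus shipping) as a percentage of the card's face value."""
    return 100.0 * (sold + shipping) / value


def average_percent_by_type(rows):
    """Mean percentage of face value for each card type."""
    groups = defaultdict(list)
    for card_type, value, sold, shipping in rows:
        groups[card_type].append(percent_of_value(value, sold, shipping))
    return {card_type: sum(p) / len(p) for card_type, p in groups.items()}


averages = average_percent_by_type(auctions)
for card_type, avg in sorted(averages.items()):
    print(f"{card_type}: {avg:.1f}% of face value")
```

With the real spreadsheet you would read the rows from the downloaded file instead of typing them in, and any card with a percentage over 100 is one that sold for more than it was worth.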

Sample Questions

Identify the outliers for each card type (Value, sold etc) and suggest why they might be outliers
Identify the spread for the Value of each card type. Why might some cards have smaller spreads than others?
How does the linear regression compare for different types of cards?
Are there any cards that were sold for more than they were worth? What might cause someone to pay more for a card than what it is worth?
Why might some cards have a higher average sale rate?

Download the Data

The original Spreadsheet (Excel, Google Sheets)
Fathom (Data, With Graphs)
CODAP file

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Thursday, December 17, 2015

Movie Data

Given that, as I type this, the new Star Wars movie is coming out this week, it seems like a perfect time to highlight some places to go to get data about movies. Kids (and most humans) love movies, so why not find some data that kids will be more engaged in exploring? As it turns out, there are a few really great places to get real time data on movies. I'm going to focus on two.

Box Office Mojo

The first one is http://www.boxofficemojo.com/. There is a lot of data that you can choose from and it is almost real time. For example, you can click on Daily and it will give the summary of total domestic (US) ticket sales for each day. Or at the top, if you click the daily summary, you will get the top movies of the day and how much they made (among other things, right down to the dollar). You can even drill down and click on the movie name to get things like how many theatres it is in. One of the other neat things is that they have "Showdowns" of movies and do comparisons like this one between Interstellar, Gravity and The Martian. But by far the coolest thing is the all time chart, which gives the records for a huge number of metrics.

The Numbers

The second site I like is http://www.the-numbers.com/. Here you can get some of the same stats, like the box office info from any day of any year, but also stuff on DVD sales as well as how bankable a star is. It even has a special Report Builder page where you can generate your own report with the info you want. But for me, by far the best part is their movie budgets page, where you can get the all time list of movies by production budget (over 5000 of them) or the top 20 movies that were most profitable.

The Analysis

There is so much that you can do with this data that you could probably pick any topic and find something to report on. But let me highlight a few of my favourite things to do. For example, take the daily movie data from Box Office Mojo (Fathom, Fathom Sol, Google Sheet). At the low end you could create histograms, dot plots and box plots, and compare measures of central tendency. At the higher end you can have students look for outliers or compare what happens day to day.

That daily data was a summary. You can also take the daily data from The Numbers (Fathom, Fathom Sol, Google Sheet). My favourite thing to do, after the single variable analysis of the amount of money, is a two variable analysis of how the money compares to the number of theatres each movie was in. Then see if any of the movies might get lost in that data, like The Big Short, which hardly played in any theatres but had the most tickets sold per theatre. Or notice that In the Heart of the Sea is doing better than expected and The Peanuts Movie is doing worse than expected.

Another of my favourite things is to look at how movies did compared to what it cost to make them. There is a lot of info on this on The Numbers, and one of my favourite examples is The Blair Witch Project, a movie that only cost $60,000 to make yet had a worldwide total gross of almost $250 million. You can get the daily numbers for any movie like this and, in this case, see that it started out in one theatre and did well, then expanded to about 30 theatres and did well, and then finally got a much wider distribution and blew up.

That is just a small amount of what you could do with this data, especially if you use the full set from The Numbers (Fathom, Google Sheets).
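
To make the "lost in the data" idea concrete, here is a minimal sketch that divides each movie's daily gross by its theatre count and picks out the highest per-theatre figure. The titles and numbers are placeholders, not figures from The Numbers, and it uses dollars per theatre as a stand-in for tickets per theatre.

```python
# Hypothetical daily rows: (title, gross in dollars, number of theatres).
# Invented values for illustration only.
daily = [
    ("Wide Release A", 4_000_000, 4000),
    ("Wide Release B", 1_500_000, 3000),
    ("Limited Release C", 240_000, 8),
]


def per_theatre(rows):
    """Average gross per theatre for each title."""
    return {title: gross / theatres for title, gross, theatres in rows}


averages = per_theatre(daily)
best = max(averages, key=averages.get)

for title, avg in averages.items():
    print(f"{title}: ${avg:,.0f} per theatre")
print("Highest per-theatre average:", best)
```

A movie in only a handful of theatres can sit near the bottom of the total gross list and still top this one, which is exactly the kind of story students can go looking for.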

Sample Questions

What I usually do with these sites is ask something more general. I introduce them and then just ask "What story does this data tell? Use graphs and calculations to tell your story."
Another thing I ask is to look at the all time list and use a site like http://natoonline.org/data/ticket-price/ to put everything in today's dollars. They can check their answers on the Box Office Mojo summary page where they show that Gone With the Wind, adjusted for inflation, would have grossed over $1.7 billion domestically (there is no worldwide data). Or even look at the story that they tell about adjusted data. The dataset on movie ticket prices alone is pretty good for analysis.
For the younger grades you could make bar graphs or circle graphs about their favourite movie franchise, like Harry Potter (Google Sheets, Google Sheets with Graphs).
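
The ticket price adjustment in the second question can be sketched as follows. The idea, as I understand it, is to estimate the number of tickets sold by dividing the gross by that year's average ticket price, and then reprice those tickets at today's average. The function name and the numbers in the example are placeholders, not actual box office or ticket price data.

```python
def adjust_gross(gross, price_then, price_now):
    """Estimate a gross in today's dollars using average ticket prices."""
    if price_then <= 0:
        raise ValueError("price_then must be positive")
    estimated_tickets = gross / price_then
    return estimated_tickets * price_now


# Placeholder values: a $100 million gross when tickets averaged $4.00,
# repriced at an $8.00 average ticket.
adjusted = adjust_gross(gross=100_000_000, price_then=4.00, price_now=8.00)
print(f"${adjusted:,.0f}")
```

Students can then compare their own adjusted figures with the ones published on the summary page and discuss why they might differ.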

Other Movie Resources

The FiveThirtyEight.com site often does stories on movies, and there is a great podcast about the problems with the movie rating sites and how they handle data. Read and listen about it here and here. And of course there are the famous movie quotes as visualizations.

Download the Data

Of course go to The Numbers and Box Office Mojo at any time to get the most up to date data on movies. All the files I analyzed here can be found in this folder. Note that all of these files were generated BEFORE Star Wars: The Force Awakens came out so it will be interesting to see how it changes the data.

Let me know if you used these data sets or if you have suggestions of what to do with them beyond this. Or if you created a lesson based on this data, share it below.

Friday, December 11, 2015

Reddit Discussions

It's no secret that I am a big fan of fivethirtyeight.com. They do some great statistical analysis of sports, entertainment and politics. They also have some interactive data sections where they take a topic and let you get the data on it. Take, for example, this one on the site Reddit.com. This is a pretty thriving community of Internet users who participate in discussion boards on a large range of topics (some inappropriate). FiveThirtyEight has scraped the site, counted the usage of certain keywords and lets you match them up against each other. Take, for example, the usage of Batman, Superman or Spiderman over the last 8 years or so (and 1.7 billion comments).

When you go to http://projects.fivethirtyeight.com/reddit-ngram/ it will immediately choose a few keywords at random. There are many choices and you can click Shuffle to get a new set (BEWARE that some of the search terms are swears, so I wouldn't click that in class), but you can also just type in any keywords that you want to compare. This is similar to Google Trends but just for Reddit.

The Analysis

On any graph you can drag the sliders on the zoom bar to zoom into any place on the graph. You can also adjust the smoothing, which changes how many days each averaged point covers. The graphs are essentially broken line graphs, and you can get the data for any set by clicking on the Download the Data button. The values in that CSV file represent the percent of the total number of comments that the word or phrase accounts for in the given time period.
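
If students ask what the smoothing control is doing, one way to think about it is as a moving average. The site does not say exactly how it smooths, so the trailing window below is an assumption, and the percentages are invented rather than taken from a downloaded CSV.

```python
def moving_average(values, window):
    """Trailing moving average; early points use however many values exist so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented daily percentages with a one-day spike in the middle.
daily_percent = [0.10, 0.10, 0.40, 0.10, 0.10]
print(moving_average(daily_percent, 3))
```

A larger window flattens spikes more, which is worth pointing out when students try to line a spike up with a news event.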

Sample Questions

Identify any trends in the data.
Identify why there might be spikes in the data. That is, what was happening in the news at that time that might cause people to use those words?

Download the Data

Website: http://projects.fivethirtyeight.com/reddit-ngram/
CSVs can be downloaded from any data set.

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Boy Band Data

Finding data that might be interesting to students and will let you do some mathematical analysis is sometimes hard. But thanks to fivethirtyeight.com we have lots of examples. This one takes the lyrics of boy bands and summarizes them. That is, it lists the top 20 one, two, three and four word phrases. They have done the work of collecting the data and now we can make some graphs of it.

http://fivethirtyeight.com/datalab/90s-boy-band-lyrics-theyre-all-about-you/

The Analysis

I have transferred their data to a Google Sheet so that we can do some analysis. Because the data is a summary of discrete information, the appropriate graphs are bar graphs. If you need to know how to make a bar graph with Google Sheets you can try this video.

Though this is not a particularly rich data set, it is good for having students make bar graphs with technology, and they can look for trends in the phrases (maybe how the progression of na na's goes).
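
For anyone curious how counts like these could be produced, here is a minimal sketch that counts one, two, three or four word phrases in a block of text. This is not FiveThirtyEight's method, just one simple way to do it; the sample lyric is made up and punctuation is not handled.

```python
from collections import Counter


def top_phrases(lyrics, n, k=20):
    """Return the k most common n-word phrases as (phrase, count) pairs."""
    words = lyrics.lower().split()
    phrases = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(phrases).most_common(k)


# Made-up sample text, not real lyrics.
sample = "na na na na hey hey na na"

print(top_phrases(sample, 1, 2))
print(top_phrases(sample, 2, 1))
```

Running it for n from 1 to 4 shows why the counts drop as the phrases get longer, which ties in with the sample questions below.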

Sample Questions

Which phrases continue as the number of words increase?
Which phrases make sense in being in the top 20?
How does the count of each phrase change as the number of words increases?
Write some sample lyrics in a possible new popular boy band song
What are some inferences you can make about this data?

Download the Data

Website: http://fivethirtyeight.com/datalab/90s-boy-band-lyrics-theyre-all-about-you/
Google Sheet (with Graphs)

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, December 4, 2015

Smoking and Cancer

For many years I used the Data & Story Library (DASL). There are some great data sets there, but for some reason the data on the site is currently unavailable. Since they are unavailable, I thought I would share some of my favs.

The Analysis

Probably my favourite is the Smoking and Cancer story. This is a great data set for talking about correlation. The data gives the average number of cigarettes smoked in each US state and then the rates of bladder cancer, lung cancer, kidney cancer and leukaemia for each state. So at the very least you can have students create the graphs of each of the afflictions vs the number of cigarettes smoked. When you do, you get the following graphs:

The thing I like the most about this is that when you do that you see that bladder cancer has the strongest correlation, which is not intuitive. But in the above graphs you will notice that the scales are all different. The graph below shows the same graphs but all with the same scale. Here you see that even though bladder cancer may have as strong a correlation with smoking, there really isn't much of a relationship (i.e. no matter how many cigarettes are smoked, the rate of bladder cancer barely changes). And since the other two have low or no correlation, you can see that smoking has the largest connection to lung cancer.

So it's a good lesson about correlation and why it is important to scale the axes similarly when comparing data.
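
The difference between a strong correlation and a strong relationship can also be shown numerically. The sketch below computes the correlation coefficient and the least squares slope for two invented data sets (not the DASL values): one that is tightly linear but nearly flat, and one that is noisier but much steeper.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Invented values for illustration only.
cigs = [15, 20, 25, 30, 35, 40]
rate_a = [3.0, 3.1, 3.2, 3.3, 3.4, 3.5]  # tight but nearly flat
rate_b = [12, 20, 15, 24, 19, 27]        # noisier but much steeper

r_a, slope_a = pearson_r(cigs, rate_a), slope(cigs, rate_a)
r_b, slope_b = pearson_r(cigs, rate_b), slope(cigs, rate_b)

print(f"A: r = {r_a:.3f}, slope = {slope_a:.3f}")
print(f"B: r = {r_b:.3f}, slope = {slope_b:.3f}")
```

Set A has the higher correlation, yet its rate barely moves as the cigarette count climbs, which is the same effect the common scale makes visible in the graphs.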

Sample Questions

Which pairs of data appear to have a connection to each other?
What does each of the numbers represent in each equation?
Which of the scatter plots indicate that there is a relationship between the data?
Use your least squares equations to predict what the death rate would be for each relationship if the Cig value was 10 or 50. How confident can you be of each prediction?

Download the Data

Fathom (Data) (Solution)
Google Spreadsheet
CODAP file

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, November 27, 2015

Spurious Correlations

There is a relatively new book out called Spurious Correlations by Tyler Vigen. This book focuses on data sets that are clearly unrelated yet correlate very well with each other. Some are fun, like this one about the connection between Nicolas Cage films and drowning.

While others almost seem plausible like this one correlating revenue generated by arcades and computer science doctorates.

The interesting thing is that all of these are generated by a computer program that scrapes the internet for data and then sees if they are compatible for correlating. You can hear Tyler talking about it here.

But before there was a book there was (and still is) the site http://tylervigen.com/spurious-correlations. The site has gone through a couple of incarnations, but its current form is a lot cleaner. I think the original site is a little nicer for one reason, though: it gives the table of data along with the graph. With the current site the graphs look nicer, but to get the actual values you have to hover over each point to reveal them.

Classroom Connections

So what can we use this for? At the very least, we can use it to discuss the nature of correlation vs causation and the misuse of correlation by media, politicians, etc. There is actually a nice little TEDx talk about this very thing:

So just looking through the already created graphs is one thing that you could do. But there is an awesome feature built into the site that allows you to Discover a Correlation. Here you have access to all the data sets he has scraped from the net and can use them to find your own spurious correlation. You start by choosing the first variable you want to work with. To do that, you first pick a topic and then click View Variables. You will then see all the datasets relating to that topic (for the below graph, I chose Miscellaneous). Choose the dataset you want to use as your first variable and then click Correlate (I chose Staple Sales). Then you get a list of all the datasets that have a strong correlation with the one you chose. (Note that as you view these variables you will see the correlation coefficient.) Pick your favourite and then click on Chart (I chose Age of Academy Awards Best Actress). That creates the graph and gives you a permalink that will have the table of values and other correlation info.

Now what I can do is take that table of values and do some analysis on it. For example, I imported that into Fathom and created the line of best fit, but you could do any other analysis that you would normally do for two variable data. So I would have your students find the most outrageous correlated variables and then do the analysis.

And if you like some of the graphs seen on the new site but you want the tables of data, you can use this same method to build the graph and get the table of values that way, rather than highlighting each point to get the values. So for our Nicolas Cage data above, here is the link to the raw dataset. Note that if you don't like that those graphs have a black background, you can click on Rechart and it will give you a printer friendly version.
So have fun finding your spurious correlations. BTW, thanks to Mark Esping for reminding me of this site.

Resources

Main site: http://tylervigen.com/spurious-correlations
Original site: http://tylervigen.com/old-version.html
Build your own: http://tylervigen.com/discover

Saturday, November 21, 2015

Anscombe's Quartet

Anscombe's Quartet is a group of four two variable data sets that share a particularly interesting property.

Upon examination the first three sets have the same x values but other than that the y values all seem random. But the interesting thing starts when you start to do some numerical analysis on them. Just start with some simple single variable calculations.

Mean of each x set = 9
Mean of each y set = 7.50
Variance of each x set = 11
Variance of each y set = 4.122 to 4.128

So, almost identical. And then if you take that a step further you can do the two variable analysis on each set and get the following:

Correlation of each set = 0.816
Line of best fit for each set y = 3 + 0.5x
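
The summary values above can be reproduced with a short script. This is a minimal sketch using only the standard library; the numbers are the standard published values for the quartet as I know them, typed in by hand, so it is worth checking them against the spreadsheet linked in this post before relying on them. The variances are sample variances (dividing by n - 1).

```python
from math import sqrt

# The first three sets share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def summarize(xs, ys):
    """Means, sample variances, correlation and least squares line for one set."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(f"{name}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"var_x={s['var_x']:.2f} var_y={s['var_y']:.3f} r={s['r']:.3f} "
          f"y = {s['intercept']:.2f} + {s['slope']:.2f}x")
```

Students who are comfortable in a spreadsheet can do the same calculations there; the point is that all four rows of output come out nearly the same.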

So with all that analysis done, you might get the impression that these are pretty much just different aspects of the same sets of data. But then when you graph them you get something entirely different:

So you really see that they are very different sets of data. The lesson here is that your data cannot be fully described with either numerical or graphical analysis but really both are necessary.

Classroom Connections

So how do you use this in class? This set is really best used for students who have had both single variable and two variable analysis. It really is a great set for tying together many of the concepts of data analysis.

One thing that you can do is use this Desmos Activity Builder that walks students through the analysis. Keep in mind that students should be familiar with calculating mean and variance via a spreadsheet. They should also be familiar with using Desmos in terms of graphing functions and doing linear regression.

To analyze the data you (or your students) can use this Google Spreadsheet or CODAP file.

Do you have ideas of any leading questions you would ask students? Do you have ways that you could use this dataset with students? Leave your ideas in the comment section.

Resources

Info: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
The Data: Google Sheet CODAP
The Activity: https://teacher.desmos.com/activitybuilder/custom/56364b6f58d09115172b6a3c
