Found Data

Tuesday, January 26, 2016

Magazines

A while back I started doing this activity with my students on the first day. For homework I would tell them to go home and find two magazines, get their prices and the number of pages, and count the number of pages with ads on them. Once they brought that in, we would combine all the data into one set. I got the idea from browsing through an Oprah magazine and being shocked at how many pages I had to turn in order to get to a page that had actual content on it. Eventually I automated the process by using a Google Form to collect the data. And by adding another criterion (the type of magazine), this actually turns into a pretty rich data set.

The Analysis

Certainly with this data set you can do any number of calculations (average, standard deviation, correlation etc) but I like to use it to create a need to move from single variable analysis to two variable analysis. For example, the magazine in the current set with the highest number of ad pages is In Style with 380 ad pages (which is definitely an outlier).

This seems outrageous and the hope is that this will intrigue the students into asking questions. And perhaps they will also realize that it's the magazine with the largest number of total pages. That then presents a need to do a different type of analysis (a two variable scatter plot). And when you do that analysis you will see that although 380 ad pages is proportionally a little high for a magazine with 620 total pages, it is not so outrageous.
This is a good data set for just looking at the basic stuff (creating bar graphs, histograms, box plots, scatterplots, measuring central tendency, determining correlations, finding least squares lines etc).
Another thing you can do is look at the breakdown of magazine popularity (in your class or with this data set) by type of magazine. Breaking the data up by type gives students an opportunity to compare graphs. When students compare graphs, an important skill to have them demonstrate is making sure the sizes and scales of the graphs are similar. This data set can help facilitate that.
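
To make the jump from one variable to two concrete, here is a minimal Python sketch. Only the In Style row (620 total pages, 380 ad pages) comes from the data set described above; the other rows are invented placeholders, and the names `magazines`, `ad_share` and `least_squares` are my own. It computes the share of pages with ads and fits a least squares line of ad pages against total pages, which is also one way to approach the 120-page question in the sample questions.

```python
# Each tuple is (title, total_pages, ad_pages).
# Only the In Style row comes from the post; the rest are made-up placeholders.
magazines = [
    ("In Style", 620, 380),
    ("Placeholder A", 100, 45),
    ("Placeholder B", 150, 70),
    ("Placeholder C", 200, 110),
    ("Placeholder D", 80, 30),
]

def ad_share(total_pages, ad_pages):
    """Fraction of a magazine's pages that carry ads."""
    return ad_pages / total_pages

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

totals = [m[1] for m in magazines]
ads = [m[2] for m in magazines]
slope, intercept = least_squares(totals, ads)

for title, total, ad in magazines:
    print(f"{title}: {ad_share(total, ad):.0%} of pages have ads")

# Predicted ad pages for a 120-page magazine, using the fitted line.
print(f"Predicted ad pages for 120 total pages: {slope * 120 + intercept:.1f}")
```

With real class data you would replace the placeholder rows with the rows from the Google Sheet. Looking at the share rather than the raw count is what makes the In Style number look less extreme.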

Sample Questions

Create histograms of each of the numerical attributes and plot the mean and median on each graph. Describe each histogram as skewed right, left or symmetrical and justify your answers.
Compare the graphs of total pages to ad pages.
What proportion of magazines would be Sports & Entertainment in the average household?
What type of distribution would the number of ad pages be described as? Justify your answer.
Are there any outliers in the number of ad pages? Do the outliers change if you consider the type of magazine instead of the whole group?
Is the number of total pages (or ad pages) in the magazine correlated with the price of the magazine?
If a magazine were to have 120 pages, how many of them would you expect to have ads? Is this number different if you consider the type of magazine instead of all the magazines in the group?

Download the Data

You (or your students) can add to the existing data set using this form. The current data can then be found on this Google Sheet.
Fathom file (with graphs)

CODAP file

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Saturday, January 23, 2016

Trending Data

I have known about all of these trending search engines and thought they were quaint, but recently I have actually seen some examples of uses that make me believe they may be worth more, and worth talking about in a senior Data Management class. For example I saw this one from @NateSilver538:

Level of attention to Trump already so high that Palin isn't increasing searches for him much. But huge for Palin! pic.twitter.com/BSIEwVj8P6
— Nate Silver (@NateSilver538) January 19, 2016

Another example is from the Science Friday Podcast talking about tracking "hate" through Google searches. Listen below:
The trending site used in both of those cases was Google Trends, which has been around for a while. Basically you put in the search terms you wish to compare and it shows how often they were searched on Google. For example the Superbowl is coming up in a couple of weeks so if you search "Superbowl", it shouldn't be surprising that we get a periodic pattern:

Once you have one search term, you can add others. For example, let's see how popular Christmas is compared to the Superbowl:

Another place to look for trending terms is Twitter. And the site Hashtags.org gives analytics. Here you enter a hashtag and get the last 24 hours of Twitter traffic for that hashtag (at least in the free version). You can't do a comparison of hashtags but you can search any hashtag you wish. However you could highlight

Another place you can get trend data is Quantcast.com. This site does analytics on website traffic in general.

You can get detailed analytics for free from any of the sites that are listed as directly measured.

The Analysis

With most of the trending sites there is not much analysis to be done, but we often hear about topics "trending", so these sites can be used to bring something concrete to class. Some simple analysis can be done with the Quantcast site, though: just import the table of sites and you can do work on histograms and even bar graphs.

Sample Questions

Find a trending topic on Twitter or Google. Verify the data using one of the trending analytic sites. Compare to a similar topic.
How does the traffic of the top 10 most popular sites compare to the next 10?
Are there any outliers in the set of most popular sites?

Download the Data

Website: https://www.google.ca/trends/
Website: https://www.hashtags.org/
Website: https://www.quantcast.com/top-sites
Quantcast data (Sheets, Sheets with graphs, Fathom, Fathom with Graphs, CODAP)

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, January 15, 2016

Where are the Rey Star Wars Toys?

This comes from a FiveThirtyEight post looking at the distribution of toys from the new Star Wars film. It is a simple data set that can be made into a bar graph, and one where students might be interested in the data itself. And it seems like maybe the scarcity of Rey toys was not accidental.

The Analysis

There is not much analysis for students to do here. They can create the bar graph and then answer some questions about it. The point here is that the data set itself is what is interesting for students. Students could also make a pie graph from the data since it represents 100% of the data. One of the good things this data set can do is help show why pie graphs aren't that good for analysis: the values are so close to each other that, just looking at the pie slices without the percents showing, it is hard to tell which is bigger.

Most statisticians agree that, for the most part, pie graphs are not very informative. Yet we see them all the time. For example, look at the two representations to the right. The bar graph and pie graph show the same information but the pie graph is only useful for specific analysis if the percentages are actually shown. Otherwise it would be hard to determine the relative sizes of the pieces of pie and thus the relative weights of each type of toy. The problem becomes even worse when you use a 3D pie graph (so often used on news shows), where without the percents you cannot tell the difference in size between many of the slices. Of course the pie graph looks nicer, though.

Sample Questions

By what percentage does the number of Kylo Ren toys surpass the number of BB-8 toys?
Which type of graph would be better for this data, bar or circle? Justify your choice.
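
For the first question, students often stumble on which number goes in the denominator. Here is a small Python sketch of the calculation. The toy counts are hypothetical stand-ins, not the FiveThirtyEight figures, and `percent_more` is just a name I chose.

```python
def percent_more(a, b):
    """Percentage by which a exceeds b, measured relative to b."""
    return (a - b) / b * 100

# Hypothetical toy counts, NOT the numbers from the FiveThirtyEight post.
kylo_ren_toys = 30
bb8_toys = 24

print(f"Kylo Ren has {percent_more(kylo_ren_toys, bb8_toys):.1f}% more toys than BB-8")
print(f"BB-8 compared to Kylo Ren: {percent_more(bb8_toys, kylo_ren_toys):.1f}%")
```

The two printed values differ in size because the base of the comparison changes, which is worth a short class discussion on its own.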

Download the Data

Google Sheets (with graphs)
The original post
http://fivethirtyeight.com/features/wheresrey-the-star-wars-heroine-is-featured-in-fewer-toys-than-all-the-new-dudes/

Wednesday, January 6, 2016

Earthquake Database

Last week friends of mine felt a 4.8 magnitude earthquake on Vancouver Island. So it seems like a perfect time to post some resources on data about earthquakes. As it turns out, depending on the magnitude, there are a lot of earthquakes that happen worldwide each year. And we can get that data, almost in real time, from any number of earthquake databases. I like the one that the US Geological Survey provides. It lets you set a few options and search earthquakes based on those options. The default result is a map that shows the results of your search.

The Analysis

Once you choose which options to use, then you have to get the data. I suggest that you originally limit your searches to those over magnitude 6 if you are looking at an extended time period (in 2015 there were over 140). If you play around with the magnitude threshold then you could get a huge amount of data (which you may or may not want). For example, if you drop the threshold to 4.5 there are over 6800 earthquakes found from 2015.

Once you get the data, you can just click the Download button on the top left to choose a CSV file that can be imported into any spreadsheet or Fathom. The obvious analysis here is a single variable set of the Magnitude (they call it mag in the data set). So you could do any number of histograms, box plots, dot plots etc as well as measures of central tendency and standard deviation. It's a really good data set for having students go through all the basic calculations needed when doing a single variable analysis.

Depending on when you get your data you will get outliers.

Usually the data will come out skewed to the right, as most of the quakes are typically at the low end (this is regardless of what you choose as your threshold).
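
If you would rather check the single variable work outside of a spreadsheet or Fathom, the following Python sketch shows one way to do it. The magnitudes in `sample_csv` are made up for illustration; only the column name `mag` comes from the download described above. The outlier test is the common 1.5 × IQR rule, which is an assumption on my part, and quartile conventions differ a little between tools, so the values may not match Fathom or a textbook method exactly.

```python
import csv
import io
import statistics

# A tiny stand-in for the downloaded CSV. The magnitudes are made up;
# only the column name "mag" comes from the export described in the post.
sample_csv = """time,mag
2015-01-01T00:00:00Z,6.0
2015-02-01T00:00:00Z,6.1
2015-03-01T00:00:00Z,6.1
2015-04-01T00:00:00Z,6.3
2015-05-01T00:00:00Z,6.4
2015-06-01T00:00:00Z,6.7
2015-07-01T00:00:00Z,7.0
2015-08-01T00:00:00Z,8.3
"""

def read_magnitudes(csv_file):
    """Collect the 'mag' column as floats, skipping blank entries."""
    return [float(row["mag"]) for row in csv.DictReader(csv_file) if row["mag"]]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

mags = read_magnitudes(io.StringIO(sample_csv))
print("five number summary:", five_number_summary(mags))
print("mean:", statistics.mean(mags), "median:", statistics.median(mags))
# A mean above the median is one common sign of a right skew.
print("right skewed?", statistics.mean(mags) > statistics.median(mags))
print("outliers:", iqr_outliers(mags))
```

To use a real download, replace `io.StringIO(sample_csv)` with an opened file, for example `open("query.csv", newline="")`, where the file name is whatever you saved the export as.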

You can also do a neat "heat map" by choosing Map in CODAP and dragging something like the Magnitude onto the middle of the graph so it appears as a colour spectrum. This can be done in Fathom by plotting the Longitude and Latitude (and thus getting a map) onto the regular graph.

Here's a quick video on getting this data from the database into CODAP to use the Mapping feature:

Sample Questions

Determine the measures of central tendency for the magnitude of the earthquakes
Determine the five number summary for the magnitude of the earthquakes
Which earthquake(s) were the most extreme? Were they outliers?
How are the measures of central tendency affected if you remove the outlier(s) when looking at the magnitude of the earthquakes?
Determine whether the data for the magnitude of the earthquakes is skewed to the right or left.

Other Earthquake Data

If students are trying to do something more with their earthquake data (like analyze it and then make sense of it) they might try getting more info at IRIS (Incorporated Research Institutions for Seismology). There they have some of the same data and more, plus other info that might be relevant. Thanks to @frankmcgowa for that one.

Download the data

Website: http://earthquake.usgs.gov/earthquakes/search/
Sample data for 2015 (Google Sheets, Fathom, Fathom with graphs, CODAP)

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Monday, December 21, 2015

How much would you pay for a $50 Gift Card?

How much would you pay for a gift card on eBay? Perhaps I should back up a bit. Maybe for Christmas someone gets me a Tiffany's gift card. I will likely not be going to Tiffany's any time soon (don't tell my wife). So that gift card is not worth much to me. But it may be worth something to someone else. So being an enterprising person, I put it up for auction on eBay. I wouldn't expect to sell it for more than what the gift card is worth (you would think). So the question then is, what percent of the actual value of the card will I be able to sell it for? Well, years ago the crew at Freakonomics shared this data set of 100 gift cards and what they sold for on eBay. The data is almost 10 years old but it still turns out that this is a fairly rich data set.

The Analysis

So the attributes in this set are the card type (Best Buy, iTunes etc), the value of the card, how much it sold for, the shipping costs, the number of bids, the feedback rating of the seller, the percentage of the sale (including the shipping), the average percentage per card and the actual link of the auction. That means there are a large number of things you can analyse. For single variable stuff you could find measures of central tendency for the entire set or individually for each type of card. Or just choose your type of single variable graph and create it for the whole group or by card type.

Or you could do some two variable analysis to see the connection between the value of the card and the sale price (for either the whole group or by card type).

And because the data exists, you could even do some comparisons of the average percentage that a card gets.
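
As a sketch of that last comparison, here is how the percentage and the per-type average could be computed in Python. The auction rows are hypothetical, and I am assuming that "including the shipping" means the sale price plus shipping divided by the card's value; check that against the spreadsheet's own percentage column before relying on it.

```python
from collections import defaultdict

# Hypothetical auctions: (card_type, face_value, sold_for, shipping).
# These are illustrative rows, not values from the Freakonomics data set.
auctions = [
    ("Best Buy", 50.0, 45.0, 1.0),
    ("Best Buy", 100.0, 92.0, 0.0),
    ("iTunes", 25.0, 20.0, 0.5),
    ("iTunes", 50.0, 41.0, 0.0),
]

def percent_of_value(face_value, sold_for, shipping):
    """Buyer's total cost (sale price plus shipping) as a percent of face value."""
    return (sold_for + shipping) / face_value * 100

# Group the percentages by card type, then average each group.
by_type = defaultdict(list)
for card_type, face_value, sold_for, shipping in auctions:
    by_type[card_type].append(percent_of_value(face_value, sold_for, shipping))

average_percent = {t: sum(p) / len(p) for t, p in by_type.items()}
for card_type, avg in average_percent.items():
    print(f"{card_type}: {avg:.1f}% of face value on average")
```

The same grouping step works for any of the other attributes, such as comparing the number of bids by card type.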

Sample Questions

Identify the outliers for each card type (Value, sold etc) and suggest why they might be outliers
Identify the spread for the Value of each card type. Why might some cards have smaller spreads than others?
How does the linear regression compare for different types of cards?
Are there any cards that were sold for more than they were worth? What might cause someone to pay more for a card than what it is worth?
Why might some cards have a higher average sale rate?

Download the Data

The original Spreadsheet (Excel, Google Sheets)
Fathom (Data, With Graphs)
CODAP file

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Thursday, December 17, 2015

Movie Data

Given that, as I type this, the new Star Wars movie is coming out this week, it seems like a perfect time to highlight some places to go get data about movies. There are a pile of places to go. And kids (and most humans) love movies, so why not find some data that kids will be more engaged to explore? As it turns out there are a few really great places to get real time data on movies. I'm going to focus on two.

Box Office Mojo

The first one is http://www.boxofficemojo.com/. There is a lot of data that you can choose from and it is almost real time. For example you can click on Daily and it will give the summary of total domestic (US) ticket sales for each day. Or at the top if you click the daily summary you will get the top movies of the day and how much they made (among other things, right down to the dollar). You can even drill down and click on the movie name to get things like how many theatres it is in. One of the other neat things is they have "Showdowns" of movies and do comparisons like this one of Interstellar, Gravity and The Martian. But by far the coolest thing is the all time chart which gives the records for a huge number of metrics.

The Numbers

The second site I like is http://www.the-numbers.com/. Here you can get some of the same stats, like the box office info from any day of any year, but also stuff on DVD sales as well as how bankable a star is. And it even has a special Report Builder page where you can generate your own report with the info you want. But for me, by far, the best part is their movie budgets page where you can get the all time list of movies by production budget (over 5000 of them) or the top 20 movies that were most profitable.

The Analysis

There is so much that you can do with this data that you could probably pick off any topic and find something to report on. But let me highlight a few of my favourite things to do. Take, for example, the daily movie data from Box Office Mojo (Fathom, Fathom Sol, Google Sheet). At the low end you could create histograms, dot plots and box plots, and compare measures of central tendency. At the higher end you can have students look for outliers or compare what happens day to day.

That daily data was a summary. You can also take the daily data from The Numbers (Fathom, Fathom Sol, Google Sheet). My favourite thing to do, after the single variable analysis of the amount of money, is to look at the two variable analysis of how the money compares to the number of theatres each movie was in. Then see if any of the movies might get lost in that data, like The Big Short, which hardly played in any theatres but had the most tickets sold per theatre. Or notice that In the Heart of the Sea is doing better than expected and The Peanuts Movie is doing worse than expected.

Another of my favourite things is to look at how movies did compared to what it cost to make them. There is a lot of info on this on The Numbers and one of my favourite examples is that of The Blair Witch Project, a movie that only cost $60,000 to make yet had a worldwide total gross of almost $250 million. You can get the daily numbers for any movie like this and in this case see that it started out in one theatre and did well, then expanded to about 30 theatres and did well, and then finally got a much wider distribution and blew up.
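
Both of those comparisons come down to a simple ratio, sketched here in Python. The Blair Witch figures are the rounded ones quoted above ($250 million is an upper bound, since the gross was "almost" that). The one-day rows are hypothetical, and I use gross per theatre as a stand-in for tickets per theatre since the daily tables report dollars.

```python
def return_multiple(gross, budget):
    """How many times over a movie earned back its production budget."""
    return gross / budget

def gross_per_theatre(gross, theatres):
    """Average box office dollars per theatre for one day."""
    return gross / theatres

# Rounded figures from the post; the true gross was slightly under this.
blair_witch_budget = 60_000
blair_witch_gross = 250_000_000
multiple = return_multiple(blair_witch_gross, blair_witch_budget)
print(f"Blair Witch earned roughly {multiple:,.0f} times its budget")

# Hypothetical one-day figures: (title, daily_gross, theatres).
daily = [
    ("Wide Release", 2_000_000, 3_000),
    ("Limited Release", 200_000, 8),
]
for title, gross, theatres in daily:
    print(f"{title}: ${gross_per_theatre(gross, theatres):,.0f} per theatre")
```

A limited release can top the per-theatre list while sitting near the bottom of the total gross list, which is exactly how a movie gets lost in the raw data.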

That is just a small amount of what you could do with this data, especially if you use the full set from The Numbers (Fathom, Google Sheets).

Sample Questions

What I usually do with these sites is ask something more general. I introduce them and then just ask "What story does this data tell? Use graphs and calculations to tell your story."
Another thing I ask is to look at the all time list and use a site like http://natoonline.org/data/ticket-price/ to put everything in today's dollars. They can check their answers on the Box Office Mojo summary page where they show that Gone With the Wind, adjusted for inflation, would have grossed over $1.7 billion domestically (there is no worldwide data). Or even look at the story that they tell about adjusted data. The dataset on movie ticket prices alone is pretty good for analysis.
For the younger grades you could make bar graphs or circle graphs about their favourite movie franchise, for example, like Harry Potter (Google Sheets, Google Sheets with Graphs)
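
For the ticket price question, one simple approach (an assumption on my part, not necessarily the exact method any site uses) is to estimate the number of tickets sold from the average ticket price in the release year, then re-price those tickets at today's average. The numbers below are hypothetical; students would look up the real average prices from the ticket price table.

```python
def adjust_gross(gross, avg_price_then, avg_price_now):
    """Estimate tickets sold, then re-price them at today's average ticket price."""
    estimated_tickets = gross / avg_price_then
    return estimated_tickets * avg_price_now

# Hypothetical numbers, for illustration only.
original_gross = 10_000_000
price_in_release_year = 2.50
price_today = 8.00

adjusted = adjust_gross(original_gross, price_in_release_year, price_today)
print(f"Adjusted gross: ${adjusted:,.0f}")
```

Students can then compare their adjusted figures with the ones on the adjusted all time chart and discuss why they might not match exactly.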

Other Movie Resources

The FiveThirtyEight.com site often does stories on movies and there is a great podcast about the problems with the movie rating sites and how they handle data. Read and listen about it here and here. And of course there are the famous movie quotes as visualizations.

Download the Data

Of course go to The Numbers and Box Office Mojo at any time to get the most up to date data on movies. All the files I analyzed here can be found in this folder. Note that all of these files were generated BEFORE Star Wars: The Force Awakens came out so it will be interesting to see how it changes the data.

Let me know if you used these data sets or if you have suggestions of what to do with them beyond this. Or if you created a lesson based on this data, share it below.

Friday, December 11, 2015

Reddit Discussions

It's no secret that I am a big fan of fivethirtyeight.com. They do some great statistical analysis of sports, entertainment and politics. They also have some interactive data sections where they take a topic and let you get the data on it. Take for example this one on the information site Reddit.com. This is a pretty thriving community of Internet users who participate in discussion boards on a wide range of topics (some inappropriate). That being said, FiveThirtyEight has scraped the site, counted the usage of certain key words and lets you match them up against each other. Take, for example, the usage of Batman, Superman or Spiderman over the last 8 years or so (and 1.7 billion comments).

When you go to http://projects.fivethirtyeight.com/reddit-ngram/ it will immediately randomly choose a few keywords. There are many choices and you can click Shuffle to get a new set (BEWARE that some of the search terms are swears so I wouldn't click that in class). You can also just type in any keywords that you want to compare. This is similar to Google Trends but just for Reddit.

The Analysis

On any graph you can drag the sliders on the zoom bar to zoom into any place on the graph. You can also adjust the smoothing, which changes how many days the averaging covers. The graphs made are essentially broken line graphs and you can get the data for any set by clicking on the Download the Data button. The values in that CSV file represent the percent of the total number of comments that that word or phrase accounts for in the given time period.
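
The site does not spell out exactly how its smoothing is calculated, so the following Python sketch assumes a plain trailing moving average over a chosen number of days. The daily percentages are hypothetical, and `moving_average` is my own helper rather than anything from the site.

```python
def moving_average(values, window):
    """Trailing moving average.

    Each output is the mean of the current value and up to window - 1
    earlier values, so the first few points average whatever is available.
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Hypothetical daily percentages of comments mentioning a keyword.
daily_percent = [0.010, 0.012, 0.050, 0.011, 0.009, 0.010, 0.013]

for raw, smooth in zip(daily_percent, moving_average(daily_percent, 3)):
    print(f"raw {raw:.3f}  smoothed {smooth:.3f}")
```

A larger window flattens the one-day spike more, which is a nice way to show students why a heavily smoothed graph can hide the very events they are asked to explain.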

Sample Questions

Identify any trends in the data.
Identify why there might be spikes in the data. That is, what was happening in the news at that time that might cause people to use those words?

Download the Data

Website: http://projects.fivethirtyeight.com/reddit-ngram/
CSVs can be downloaded from any data set.

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.
