Friday, June 9, 2017

Five Thirty Eight's Pile of Data

UPDATE: Now even more of their data is available and easier to get at on, you guessed it, their data site:

I have always found it tough to find interesting data sets, especially ones that are not contrived. At Five Thirty Eight they are constantly looking at the world through data. Their primary posts tend to be about politics or sports but often they have posts on pop culture and other items. For example, recently they had a post titled "Why Classic Rock Isn't What It Used to Be". In that post they analyzed over 37000 plays of classic rock songs spanning decades. And not only have they done the work, they've made all of the raw data available: all 37673 pieces in a CSV file.

Downloading the Data

They have a GitHub site where they make the raw data available for many of their stories. They have a lot of data-related stories, and although most of them are not on this site, there are almost 100 that are. For example, you could look at the article about how deadly it is to be an Avenger and see that the article doesn't have any graphs, but there is a bunch of data that you could use to make a histogram or something similar with the categorical data.

Or if you are a Bob Ross fan (real or ironic) then you can get the data they analyzed on the paintings he created for his show. Here's the article, but on the GitHub site you get the raw data plus, as an added bonus for you code jockeys, the Python script that they used to create the data set. Most of the data sets have a link to the original article.
Note that when you see the CSV file listed, you can't just right click and download the file. That will just get you the script used to get the data. To get the actual data, click the CSV link and then copy the data from the table that appears.
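
If copying from the table gets tedious, here is one possible shortcut, offered as a rough sketch rather than anything from Five Thirty Eight. GitHub serves the plain text of a file from a "raw" address that mirrors the address of the file's page. The helper name and the example address below are made up for illustration; swap in the address of the CSV page you are actually looking at.

```python
def to_raw_url(blob_url: str) -> str:
    """Turn the address of a GitHub file page into its raw-text address."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError("not a GitHub file-page address")
    # Split "owner/repo" from "branch/path/to/file" around the "/blob/" part.
    owner_repo, branch_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


# Hypothetical file location, for illustration only.
page = "https://github.com/example-user/example-repo/blob/master/some-folder/some-file.csv"
print(to_raw_url(page))
```

Opening the resulting address in a browser should show just the comma separated text, which can then be saved or pasted into a spreadsheet.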

Some other interesting sets are on Fandango's movie ratings, or the connections between the actors in the movie Love Actually or their data on the popularity of unisex names.

One small warning. This is raw data and in a few cases really raw. For example, the data set about the number of times someone cursed or bled out in a Quentin Tarantino movie is very cool but totally inappropriate for a classroom (there are 1895 pieces of data in this set).

Check them all out on the sites:

Saturday, September 17, 2016

Collecting Data from Pokemon Go

It's the beginning of the school year now and the dust is starting to settle from the summer's obsession with Pokemon Go. So why not try to leverage that obsession by having students collect some data. The data comes in the form of how many times each Pokemon was seen and caught by each user. I got the idea for this set of data from this post from @lesliefarooq where she pointed out that with each Pokemon caught, when you look in the Pokedex, there is data about how many times each Pokemon was both seen and caught. At first glance this is a simple data set but it turns out there is a lot you could do with it.

So what I was able to do was start to collect some of that data by using a Google Form to generate two types of graphs. The first was a graph of the most often seen Pokemon (no surprise to players what the top three were). The second graph was the linear relationship between the number of caught and the number seen. What follows are the ways that you can either use my data or collect your own with your students.


So the first thing you need to do is get the data. Once in the game, tap on the Pokeball at the bottom of the screen, then the Pokedex, and then tap on any Pokemon that shows up. On that Pokemon's screen you will find the data to collect: the Pokemon number, the name (optional; to make entry into the form quicker, I only required the number), how many were seen, how many were caught and finally the type of Pokemon. Swiping left or right will cycle between Pokemon so you can collect the data faster. So if you have students that have been playing the game, they can collect the data there. You might want them to collect it manually, or they can use this form to add to my data electronically, or you can make a copy of this form to create your own class set.

Once you have the data, the first thing that you can have students do is create a bar graph of their most popular Pokemon like @lesliefarooq did. What I did was take that a step further. Since I collected the data via a Google Form, I used a bit of spreadsheet wizardry to tally up the total number of Pokemon of each type seen across all the data. You can see that in my data sheet, where I have added some columns to the right of where the data is collected. The nice thing about this is that as more people add their data to my form, it will continue to update the totals. So with this data you can do some of the same things that @lesliefarooq did and ask students about their most popular Pokemon and compare to the graphic that shows how popular or rare each Pokemon is.

But the nice thing about this data is that you can now use the connection between the sightings and catches to connect to linear relationships. It's not a perfectly linear relationship but it will have a very strong correlation.

NOTE: In the actual game, players will collect Pokemon in two ways. The main way is by having them appear and then catching them by throwing Pokeballs at them. Most Pokemon will be caught this way. The second way is to hatch eggs. And the only way to hatch an egg is to physically walk 2km, 5km or 10km (that is one of the physical activities that the game promotes). The Pokemon that hatch are often rarer ones that you will never see "in the wild", so these will always be seen once and caught once. This means that if you do any linear regression, you will have a large number of data points at (1, 1), and that will skew your regression, making it look stronger. So I suggest removing any of those data points. In the set that I give as a sample, I have already done that (see below).
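
If you would rather do that clean-up in code than in a spreadsheet, here is a minimal sketch. The (seen, caught) pairs are made up for illustration and are not taken from the collected data, and the function name is my own. It drops every (1, 1) point and then fits an ordinary least-squares line to what is left.

```python
from math import sqrt

# Made-up (seen, caught) pairs for illustration only; not the collected data.
records = [(1, 1), (1, 1), (1, 1), (12, 9), (30, 22), (7, 5), (45, 31), (3, 2)]

# Drop the (1, 1) points, which mostly come from hatched eggs.
wild = [pair for pair in records if pair != (1, 1)]


def fit_line(xs, ys):
    """Least-squares slope, intercept and correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)


seen = [s for s, _ in wild]
caught = [c for _, c in wild]
slope, intercept, r = fit_line(seen, caught)

print(f"kept {len(wild)} of {len(records)} points")
print(f"caught = {slope:.2f} * seen + {intercept:.2f}, r = {r:.3f}")
```

Running the fit once with and once without the (1, 1) points is a quick way for students to see how much those points change the result.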

So this data set will be good for introductory linear relations with interpolation and extrapolation, but what I have also done is extract some of the data into smaller sets. When we collected the data we also asked about the Pokemon number and Pokemon type, so we can start to use that info. For example, we can break up the big set into smaller sets, each corresponding to a different Pokemon. To facilitate that, I have created both a Fathom file and a Desmos Activity with these smaller sets (try it out here). The Desmos file, as it is set up, would be good for beginners when it comes to interpolation and extrapolation but it could be augmented for further exploration of lines of best fit. The Fathom file would be good for comparing lines of best fit for the data sets. In the original data set you can also compare the types of Pokemon.

Sample Questions

  • How does your top 20 most popular Pokemon compare to the top 20 of the larger set?
  • How does the number of each type of Pokemon compare to each other?
  • Which Pokemon has the highest number of average catches?
  • Which Pokemon is easier to catch, based on the data?
  • How does the linearity of the data relate to how easy the Pokemon could be caught?
  • Which type of Pokemon is easier to catch? Which one has the largest correlation?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Wednesday, July 27, 2016

Is Levelling Up in Pokemon Go Exponential?

Unless you have been living under a rock over the last few weeks, you've probably heard of Pokemon Go. If you are not aware, the general premise is that you wander your neighbourhood (physically) with the app open. The app is linked to GPS and Google Maps, so as you walk around you see your streets, but overlaid on those streets are various Pokemon characters to capture. Along the way you collect points by visiting PokeStops (to also collect items) and PokeGyms (to also have battles), and you "Level Up" by accumulating experience (XP) points. As you increase your level, the number of points needed to go to the next level also increases. But how? Is it linear, quadratic, exponential or something else? Well, get the data and have your students decide.


As far as anyone knows (right now) there are only 40 levels. To move past the 1st level you need to accumulate 1000 pts but by level 40 you need five million. So the question might be "How does the number of points change as you go from level to level?".

As players are in the game, they will level up. What they will see is the number of points needed to get to the next level (not the total number of points accumulated). The first 15 levels can be seen to the right. The middle column shows the total number of points at the beginning of each level (constructed from the points needed to level up for each level). The rightmost column indicates how many points are needed in each level to get to the next level (this is what players would actually see). It is essentially the 1st difference of the total points. But to clarify, players never see the total number of XP in the game. It was just constructed here because that is usually what we would be graphing. So to keep your street cred with the kids, you may want to only refer to the XP needed at each level and construct the total (like I did) for mathematical purposes.

Regardless, this is one of the first places you can have students do some analysis. By looking at the points needed to level up you can see that as you go from level to level, the number of points needed goes up 1000 pts per level until level 11, where it starts to stabilize for a few levels.

As you look at all the levels there are a couple of ways you can look at it. By plotting all 40 levels you can see that an exponential model is almost a perfect fit, with a geometric progression of a little more than 25% each time you level up, though not exactly. A different view comes from putting the levels in groups of 5. Doing this shows that as you go up levels you need significantly more XP points to get to the next group of levels.

But a closer look at the data shows that the first 11 levels have a constant 2nd difference and thus are quadratic. And then the next few levels have constant first differences and thus go up linearly. After that the increases are not as consistent. 
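
Here is a minimal sketch of that difference check for the early levels. The XP values are rebuilt from the description above (1000 points to clear level 1, going up by 1000 per level) rather than copied from the table, so treat them as an assumption and check them against the downloaded data.

```python
from itertools import accumulate

# XP needed to clear levels 1 to 10, reconstructed from the description
# (1000 at level 1, rising by 1000 per level). An assumption, not the posted table.
xp_needed = [1000 * level for level in range(1, 11)]

# Total XP at the start of levels 1 to 11: a running sum that starts at 0.
total_xp = [0] + list(accumulate(xp_needed))


def differences(values):
    """Differences between consecutive terms of a sequence."""
    return [b - a for a, b in zip(values, values[1:])]


first = differences(total_xp)   # the XP needed at each level
second = differences(first)     # constant when the totals are quadratic

print(total_xp)
print(first)
print(second)
```

Under that assumption the second differences are all 1000, and the total at the start of level n works out to 500n(n - 1), which students can verify against the running sums.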

So there are many places in the curriculum that this data set can relate to. On the simple end you can look at it as a non linear data set. Or you can just focus on the first few levels and keep it quadratic or contrast that with the linear portion. The fact that we are talking about discrete levels means that you can think about this in terms of sequences and series. So take from it what you need. Below are some possible prompts you can use with students and the entire set can be downloaded from this Google Doc for easy consumption.

Sample Questions

  • If it took you one day to get to level 5, how long would it take you to get to level 10? Level 15? Level 40?
  • What type of relationship exists between the points for each level in the first 10 levels? 15 levels? all levels?
  • Do the levels follow a constant sequence?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Monday, June 6, 2016

Electric Car Rebates

So this article came across my Facebook feed a while back and I thought it was a great potential source of data for discussion at many levels.
It certainly captured my attention as an Ontario resident but a closer look showed that there was potentially a lot of data to be analyzed. The data is about the Ontario Electric Vehicle Incentive program and the above article was inspired by this news release but in the article they were able to get more specific data about number of vehicles of each style (which is not released).


Students are encouraged to look critically at the original article and perhaps talk about how the title and some of the information given is used to incite a reaction.
For example, even though they gave the overall numbers of almost 4800 people getting around $39 million in rebates, they focused on just the rebates for the most expensive cars, which total about 2% of the people and rebate value. They do mention, but don't highlight, that about 25% of those rebates went to one vehicle, the Chevrolet Volt.
But looking at the ministry website you can see a nice data set about which cars get which rebates (as well as info about how the program changed once it was pointed out that super expensive luxury cars were getting rebates).
I was able to get this table out and clean it up as well as add the approximate value of each car to the list (it's approximate because I had to go and search each out on the web so I might have been a bit lazy when it came to options) and now it is good for some simple analysis.
On the "low hanging fruit" end you can create the bar graph of the number of models for each company. Personally, I wouldn't have guessed GM to be at the top. But you can also create a histogram of the actual rebate to look at the distribution (or perhaps look at the box plot or dot plot). Lastly you could look at whether there is a connection with the price of the car and how big the rebate is.
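
For anyone who prefers code to a spreadsheet, the two counting questions can be sketched like this. The rows are placeholders with invented makers, prices and rebates, not values from the cleaned-up table, so paste in the real rows before drawing any conclusions.

```python
from collections import Counter

# Placeholder rows: (manufacturer, approximate price in $, rebate in $).
# All values are invented for illustration.
rows = [
    ("Maker A", 32000, 8500),
    ("Maker A", 41000, 8500),
    ("Maker B", 38000, 7000),
    ("Maker C", 95000, 3000),
    ("Maker A", 45000, 5000),
    ("Maker B", 36000, 8500),
]

# Number of models per manufacturer (the heights for the bar graph).
models_per_maker = Counter(maker for maker, _, _ in rows)

# How often each rebate value occurs (the mode is the most common rebate).
rebate_counts = Counter(rebate for _, _, rebate in rows)

print(models_per_maker.most_common())
print(rebate_counts.most_common(1))
```

The same counts give the heights for the bar graph of models per company and the answer to the "most common rebate" question.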

Sample Questions

  • Which manufacturer has the most electric models?
  • What is the most common rebate value?
  • Does the rebate get bigger (in general) as the price of the car increases?
  • If you were going to purchase an electric vehicle, which one would benefit the most/least from the rebate program?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Tuesday, May 24, 2016

Gas Prices in Ontario

A friend, Michael Lieff, pointed out this nice set of data. It is the price of gas in several Ontario cities going as far back as 1990. This is an interesting data set as the price of gas, in general, increases but you can see that that wasn't always the case (only a few of the cities are shown below).


When you go to this website you have several options for prices and you can download a year of data at a time (with a CSV as an option). The obvious choice is regular gasoline but you might want to consider things like comparing regular gas to alternative fuels like propane. For example in this case, you can see that, in general, propane also has risen in price over time but where gasoline seems to fluctuate similarly regardless of the city, propane seems to be more volatile depending on location.

Because of the sheer number of data points possible (you can get a weekly average for the last 25 years for several cities if you want), you may wish to stick to yearly values. Another option is to use some of the weekly values to talk about the dangers of extrapolation.
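
If you start from the weekly CSV files and want yearly values, the averaging can be sketched as below. The column names (date, city, price), the city label and the prices are all assumptions made up for the example. The real download may be laid out differently, so adjust the names to match.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in an assumed layout; the real file's columns may differ.
sample = """date,city,price
2015-01-05,City A,90.0
2015-01-12,City A,92.0
2016-01-04,City A,95.0
2016-01-11,City A,97.0
"""


def yearly_averages(csv_text):
    """Average the weekly prices within each calendar year."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum of prices, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = int(row["date"][:4])          # assumes dates start with YYYY
        totals[year][0] += float(row["price"])
        totals[year][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}


print(yearly_averages(sample))
```

To use it on a downloaded file, read the file's text and pass it to yearly_averages in place of the sample.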

Download the Data

I have also taken the liberty of downloading all of the data for gasoline (all 25 years of it) in weekly, monthly and yearly form, as well as the yearly propane data. You can get it on this Google Sheet (note the tabs) or just the gas prices on Fathom.

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, May 13, 2016

The Data and Story Library - DASL

DASL (pronounced "dazzle"), the Data and Story Library, is an awesome database of data sets that are specifically meant to help teach topics in statistics. They are all real sets and are all categorized by topic/subject (e.g. automotive, food, health, sports etc.) and mathematical method (e.g. boxplots, mean, outliers, regression, scatterplots etc.). So theoretically if you wanted to find a set of data that could be used to help teach a specific topic you could search for, say, "correlation".
These are some great data sets to get through the mechanical nature of statistics. It's not very current data but it's great for practicing statistical methods.
For the longest time this set of data was not available but just recently it was hosted by Data Description Inc. so now we have access to it again.


There are far too many sets to talk about analysis, but when the site was down I blogged about one of my favourite sets, on Smoking and Cancer. Take a look at that post to get a sense of the data. When you get to any data set, to see the actual data file, click on the Datafile Name.

This will show you the text file of the data with the download link at the top of the page.
From that point you can do the analysis. Each data set will have a detailed description of each variable and a short story and sample analysis of each set.
There are many data sets on this site for every statistical topic and on a range of subjects. One thing you might have your students do is just explore on this site and find data sets that can be used to exemplify a particular statistical concept.

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Saturday, March 5, 2016

Speed Data

A few weeks ago I saw this Tweet
I used to have some data kicking around my computer but I did a quick Google search and found that Car & Driver was a huge source of this type of data. And I love that you can get some of the data with their original handwritten data sheets. BTW, here is @MJFenton's finished activity
And the teacher version.

The Analysis

Let's start with the data set from the above post. You can certainly do the Desmos Need for Speed activity. The analysis in terms of determining a function is a little intense (i.e. not a standard function model). You can see some of the more exact analysis via the two links in the tweet below.
But if you didn't want to go too deep you could just use it to talk about non linear relationships or you could use it to talk about rates of change as speed data comes up a lot in calculus.
I have also found more data sets from different cars and you can see how they compare to each other on this Desmos file.

Download the Data

There actually is a lot of data that can be found on the Car & Driver site. Many of the cars in this link have data sheets (you really have to search around on each page to find the data sheet). But I have downloaded a few of them (seen in the Desmos file above) and created a Google Sheet for each so you can copy and paste the data wherever you want.
Porsche Spyder Data Sheet Google Sheet
Dodge Challenger Data Sheet Google Sheet
Chevy Camaro Data Sheet Google Sheet
Cadillac CTS Data Sheet Google Sheet
Chevy Malibu Data Sheet Google Sheet
Honda Fit Data Sheet Google Sheet
All Google Sheets

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.