Friday, January 4, 2019

Highest Grossing Concert Tours

Concerts are a multi-billion-dollar industry now, so why not use some concert data to do some statistical analysis? This data comes from the Wikipedia page on the same subject. On that page the data is broken up into the top 20 all-time highest grossing concert tours (ordered by gross unadjusted for inflation), followed by the top grossing tours for each decade from the 80s to the present. There is data on the decade rank, the gross and inflation-adjusted gross, the number of shows, attendance and other attributes.

Analysis

You can start with some categorical analysis by just looking at who made the list in each decade. This data spans four decades, so kids might not be into who was big in the 80s, but if you highlight the biggest acts of the last decade you can still see that more than half of them were artists who were around in the 80s (with U2 being #1), and that U2, Guns N' Roses and The Rolling Stones (twice) were in the top 5 of all time (inflation adjusted).

For more numerical analysis you could pick any of the attributes to do some single variable analysis, whether that be measures of central tendency, distributions or histograms. There are many choices.

When you create some box plots you will find that some of the attributes have outliers. In particular, I think it's interesting that the outliers for the money attributes are different from the outliers for the number of shows. This might lead you to explore things like the Average Gross and compare it to the total money and the number of shows.
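The outlier check that box plots use can also be done numerically with the 1.5 × IQR rule. Here is a minimal sketch in Python, using made-up grosses rather than the actual Wikipedia figures:

```python
import statistics

# Made-up tour grosses in millions USD (illustrative only, not the
# actual Wikipedia figures).
grosses = [1500, 584, 558, 458, 411, 389, 362, 345, 316, 300]

# Quartiles; Q1 and Q3 bound the middle 50% of the data.
q1, _, q3 = statistics.quantiles(grosses, n=4)
iqr = q3 - q1

# The standard 1.5 * IQR fences that box plots use to flag outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [g for g in grosses if g < low_fence or g > high_fence]
print(outliers)  # only the runaway top tour is flagged
```

Running the same check on the money columns and on the number-of-shows column is a quick way to confirm that they flag different tours.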

This might lead you to some double variable analysis. There aren't many strong relationships, so you could use this data to talk about relationships with poor correlations. Technically there is one strong relationship: the one between the Gross and the Inflation Adjusted Gross. This would be expected, as one is derived directly from the other. One thing I like, however, is that it's not a perfect relationship. That is, whoever adjusted for inflation did so using a different rate for each year (to make it more realistic, presumably).

Sample Questions


  • Which artist made the most money (overall or per concert)?
  • Which decade made the most money (adjusted for inflation)?
  • Which artists are outliers the most often?
  • Calculate the mean and median for each of the numeric attributes. What do these values suggest about the distributions?

Downloads



Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Saturday, November 10, 2018

Notre Dame University - "The Shirt"

Guest Post - by Michael Lieff (@virgonomic)


Every year for the last 15 years, my neighbour, who is a die-hard Fighting Irish fan, has planned a driving trip to Notre Dame University near South Bend, Indiana. I attended for the first time in 2017 and again in 2018. After a travel day, the first stop on the campus tour is the bookstore. In the lobby, they have a table with one style of short- and long-sleeve t-shirts. In 2017 "the shirt" was navy and it didn't really grab me.


However, in 2018 the shirt was kelly green which drew me in, as green is my favourite colour. I read the price tag and learned that "the shirt" is a student initiative and the proceeds go back into student activities and assistance. At $18 USD it was a no-brainer.

Once I had my shirt, I visited the URL on the price tag. There is a link to a timeline that shows the shirt design from every year, and more importantly, the number of shirts sold, the team's record and the shirt manufacturer. Found data! Even more interesting is that there is no data for number sold for the years 1994-1996.

Analysis

The first question that came to my mind is: how many shirts did they sell from 1994-1996? Because of this gap, the dataset is a really nice example for exploring interpolation and extrapolation. I figured the trend would be linear and that the line of best fit would give a pretty reasonable prediction. Upon visualization, it definitely isn't that cut-and-dried.

There are some interesting things going on here. The number of shirts sold dropped fairly significantly from 1993 to 1997. It also skyrocketed in 2002 and then plummeted in 2004. Possible reasons for this would make for an interesting discussion.

Drilling a bit deeper, the next question that came to mind is: Are more shirts sold in seasons where the team is winning?

It doesn't appear so, but I will let you 'do the math'.

Sample Questions

In terms of analysis, the following questions could be asked:
  • Is the trend linear or is a curve a better model?
  • Can you interpolate the number of shirts sold in 1994-1996 where there is missing data? Extrapolate the number sold in 2018 or beyond?
  • What are the mean, median and mode number sold?
  • Does the number of shirts sold correlate with the team’s wins that season?

Download the Data

 Let us know if you use this dataset or have any suggestions for things to do with it beyond this.

Monday, November 5, 2018

2018 NFL Salaries

We have a local NFL player who went to high school at one of the schools I support. Luke Willson was until recently on the Seattle Seahawks and is currently on our local Detroit Lions. In conversation, a coworker wondered what his salary was. The Internet provides: not only his salary, but the salary of every one of the almost 1800 players in the league (who knew there were so many?).

And when you have such a large data set, I think you should analyze it. It's not a particularly deep topic, but it's a good data set for talking about the mean, median, skewing and outliers. The numbers themselves aren't super interesting, but the context may capture the interest of some of your students enough to do basic single variable analysis. The data includes each player's name, salary, position, team and overall rank, and I added the team rank. There are 32 teams and a bit over 50 players per team.

Analysis


Certainly some things you can do are create some graphs. The first types that come to mind are a dot plot, a box plot and a histogram. In this case the dot plot and box plot are provided by CODAP while the histogram comes from Google Sheets. You can see from the dot plot that the mean and median are quite separated (which we would expect from the skewing) and that there are a large number of outliers.
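The mean/median gap is easy to demonstrate numerically. A sketch with invented salaries (not the real 2018 figures): a handful of stars drag the mean well above the median.

```python
import statistics

# Invented salaries (USD): many players near the league minimum,
# a few stars at the top. Not the real 2018 NFL figures.
salaries = [480000] * 6 + [790000, 915000, 2500000, 7000000, 22000000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
print(mean > median)  # right skew pulls the mean above the median
```

This is the numeric answer to "besides the way it looks, what confirms the skew?": in a right-skewed distribution the mean sits well above the median.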

Since we were talking about Luke Willson, we could certainly ask how his salary compares to other NFL players (he's 455th), to other players on his team (he's 18th of 56) or even to other players at the same position (21st of about 126 tight ends, and above the mean tight end salary).

Sample Questions

  • Determine the mean, median and standard deviation for the salaries attribute.
  • Which team has the highest mean salary? median salary?
  • Choose any player; how do they compare to the league, their team and their position?
  • Besides the way it looks, what confirms that this data is skewed to the right?
  • Which team has the highest number of outliers?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, October 26, 2018

Walnut Crushing World Record (with 3 Act Task)

Check out this video (thanks to @ddmeyer for pointing this one out).

So the guy crushes walnuts with his head and what we get is a linear relationship. There are a few things here. First off, there is a 3 Act task, which I modelled off of @Gfletchy's similar task for rope jumping. Secondly, I timed how long it took for each walnut to get crushed and collected the splits in a file (if you are interested, I slowed the video down by 50% and then used an online timer to get the splits). So now you can do some analysis. It's not a particularly interesting data set but it might give a fun context for looking at linear relationships.

3 Act Task

Act 1 - Watch the movie
How many walnuts will he be able to crush with his head in 60 seconds? Make an estimate.
Write an estimate you know is too high. Write an estimate you know is too low.

Act 2a - Before you show this ask students what information they would like to have.

Act 2b - Show this video for information with more accessible math

Act 2c - Show this video for information with even more accessible math

Act 3 - Show this video to reveal the answer.

Analysis

I guess the question that most comes to my mind (after "does he have a headache?") is: is he crushing the walnuts at a constant rate? Careful observation might find a couple of spots where he hesitates a bit, and you might want to discuss whether that shows up in the data. But is the data linear? Looking at the graph you can see that for the most part it is, though the rate is slightly faster at the beginning and slightly slower at the end; each section, on its own, seems pretty linear.
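One simple way to test the constant-rate question is to look at the gaps between successive crush times. A sketch with made-up splits (the real ones come from timing the video):

```python
# Made-up cumulative crush times in seconds, standing in for the
# splits timed from the video.
times = [0.4, 0.8, 1.3, 1.7, 2.2, 2.8, 3.5, 4.2, 5.0, 5.9]

# At a perfectly constant rate, the gap between successive crushes
# would stay the same throughout.
gaps = [b - a for a, b in zip(times, times[1:])]
early = sum(gaps[:4]) / 4   # average gap over the first few crushes
late  = sum(gaps[-4:]) / 4  # average gap over the last few crushes
print(early < late)  # a growing gap means he is slowing down
```

The same comparison on the real splits would show whether the faster start and slower finish visible on the graph hold up numerically.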

Another thing you might want to discuss is whether it should be Time vs Walnuts or Walnuts vs Time. Since rates are usually per unit of time, it probably makes sense to do Walnuts vs Time, but you could argue that the total time depends on the number of walnuts, or that the total number of walnuts you could crush depends on how much time you have. Note that the easiest way to swap the axes in a Google spreadsheet is by changing the position of the columns, so to do that I just copied the Time column to both sides of the Walnuts column.

Sample Questions

Besides the above questions you could certainly ask:
  • What's the line of best fit?
  • What's the correlation?
  • How many walnuts do you think he could crush if it were two minutes? 10 minutes?
  • Is there a better fit than linear?
  • How many nuts would he have cracked if he kept at the same pace as the first 10 seconds?
  • If you only saw the first 5 seconds, what would be your prediction of the number crushed in 1 minute?
  • Can you tell, on the graph, when he hesitated?
  • What if he had kept the pace he finished with throughout the whole minute; how many nuts would he have cracked then? (I think the previous record was 281.)

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Sunday, August 26, 2018

Using the CODAP Online Statistics Software for Simple Analysis

So for years I have been a user of Fathom, a dynamic statistical software package that has been available to teachers and students, free, here in Ontario. However, the software has not been updated over time and currently won't even run on a relatively recently purchased Mac. Not to fear: some of the creators of Fathom have come together to create the Common Online Data Analysis Platform (CODAP).

And because it was created by the people who gave us Fathom, it has a lot of similarities in style and function. It's not exactly the same, but the biggest advantage is that it resides online, so you can assign data for students to analyze and they can do so on any platform (probably not very easily on a small-screen phone, but still technically possible).
For simple analysis, it does almost all the same things that Fathom did: categorical and numerical analysis, mean & median, dot plots, scatter plots, linear regression, moveable lines, sum of squares, box plots, outliers and more. Some things it doesn't do (yet) are bar graphs (though it makes the equivalent with dot plots) and histograms (though this may become an added feature). You can watch how easy it is to do some of this simple analysis in the video below. If you want to play along with the video, here is the file that I used.

Once you know how to use the app, getting the data to your students is the next step. My preference is to have a pre-made CODAP file available for upload to CODAP. You can upload a file directly from any computer or from a Google Drive; my preference is Google Drive. I have taken the liberty of converting many of the data sets on this blog to CODAP files. I have tagged all of them with the CODAP label here (also seen on the right side of the blog) and collected all the CODAP files in this folder. Alternatively, you can upload your own data as a .csv file, though it does not seem like you can do that directly from a Google Drive. So I would stick to creating the CODAP files and sharing those with your students (either on Google Drive or a local network drive). Either way, if you use any of these files, I would download them from this blog and then upload them to your preferred place.

At the risk of being redundant, here is a list of the past posts that I have done the conversion for; future posts will also have CODAP versions included.
Anscombe's Quartet
Smoking and Cancer
Movie Data
How Much Would you Pay for a $50 Gift Card?
Earthquake Data
Trending Data
Magazines
Speed Data
Electric Car Rebates
Is Levelling Up in Pokemon Go Exponential
Collecting Data from Pokemon Go

Don't forget to look at the CODAP site for lots of great resources: more data sets, tutorials, FAQs and, even though we haven't talked about them here, simulations. Or just look at the Educator Resources page.

Download the Data

All the Posts
Folder of CODAP files


Friday, June 9, 2017

Five Thirty Eight's Pile of Data


UPDATE: Now even more of their data is available, and easier to get at, on (you guessed it) their data site: https://data.fivethirtyeight.com/

I have always found it tough to find interesting data sets, especially ones that are not contrived. At Five Thirty Eight they are constantly looking at the world through data. Their primary posts tend to be about politics or sports, but they often have posts on pop culture and other topics. For example, they recently had a post titled "Why Classic Rock Isn't What it used to be". In that post they analyzed over 37000 plays of classic rock songs spanning decades. And not only have they done the work, they've made all of the raw data available: all 37673 plays in a CSV file.

Downloading the Data

So basically they have a GitHub site where they make much of the raw data available for many of their stories. They have a lot of data-related stories, and although most of them are not on the site, almost 100 are. So, for example, you could look at the article about how deadly it is to be an Avenger and see that while the article doesn't have any graphs, there is a bunch of data where you could make a histogram or do something with the categorical data.

Or if you are a Bob Ross fan (real or ironic), you can get the data they analyzed on the paintings he created for his show. Here's the article, but on the GitHub site you get the raw data plus, as an added bonus for you code jockeys, the Python script they used to create the data set. Most sets have a link to the original article.
Note that when you see a CSV file listed, you can't just right-click and download the file; that will just get you the script used to get the data. To get the actual data, click the CSV link and then copy the data from the table that appears.

Some other interesting sets are on Fandango's movie ratings, or the connections between the actors in the movie Love Actually or their data on the popularity of unisex names.

One small warning: this is raw data, and in a few cases really raw. For example, the data set about the number of times someone cursed or bled out in a Quentin Tarantino movie is very cool but totally inappropriate for a classroom (there are 1895 pieces of data in this set).

Check them all out on the sites:
https://data.fivethirtyeight.com/
https://github.com/fivethirtyeight/data

Saturday, September 17, 2016

Collecting Data from Pokemon Go

It's the beginning of the school year now and the dust is starting to settle from the summer's obsession with Pokemon Go. So why not try to leverage that obsession by having students collect some data? The data comes in the form of how many times each Pokemon was seen and caught by each user. I got the idea from this post from @lesliefarooq, where she pointed out that for each Pokemon caught, when you look in the Pokedex, there is data about how many times that Pokemon was both seen and caught. At first glance this is a simple data set, but it turns out there is a lot you can do with it.

So what I was able to do was start collecting some of that data using a Google Form, and then generate two types of graphs. The first was a graph of the most often seen Pokemon (no surprise to players what the top three were). The second was the linear relationship between the number caught and the number seen. What follows are the ways you can either use my data or collect your own with your students.

Analysis

So the first thing you need to do is get the data. Once in the game, tap on the Pokeball at the bottom of the screen, then the Pokedex, and then tap on any Pokemon that shows up. On each Pokemon's screen you can collect the Pokemon number, the name (optional; to make entry into the form quicker, I only required the number), how many they saw, how many they caught and finally the type of Pokemon. Swiping left or right will cycle between Pokemon so you can collect the data faster. So if you have students who have been playing the game, they can collect the data there. You might want them to collect it manually, or they can use this form to add to my data electronically, or you can make a copy of this form to create your own class set.

Once you have the data, the first thing you can have students do is create a bar graph of their most popular Pokemon, like @lesliefarooq did. I took that a step further. Since I collected the data via a Google Form, I used a bit of spreadsheet wizardry to tally up the total number of Pokemon of each type seen across all the data. You can see that in my data sheet, where I have added some columns to the right of where the data is collected. The nice thing about this is that as more people add their data to my form, it will continue to update the totals. So with this data you can do some of the same things that @lesliefarooq did and ask students about their most popular Pokemon, then compare to the graphic that shows how popular or rare each Pokemon is.
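The spreadsheet tally has a simple programmatic equivalent: group the rows by type and sum the sightings. A sketch using made-up form rows:

```python
from collections import Counter

# Made-up form rows: (pokemon_number, type, seen, caught), standing in
# for the Google Form responses.
rows = [
    (16, "Normal", 250, 180),
    (19, "Normal", 300, 240),
    (41, "Poison", 90, 60),
    (129, "Water", 220, 200),
    (60, "Water", 80, 55),
]

# Total sightings per type -- the same tally the spreadsheet formulas produce.
seen_by_type = Counter()
for _, ptype, seen, _ in rows:
    seen_by_type[ptype] += seen

print(seen_by_type.most_common(1))  # the most frequently seen type
```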

But the nice thing about this data is that you can now use the connection between sightings and catches to introduce linear relationships. It's not a perfectly linear relationship, but it will have a very strong correlation.

NOTE: In the actual game, players collect Pokemon in two ways. The main way is by having them appear and then catching them by throwing Pokeballs at them; most Pokemon are caught this way. The second way is to hatch eggs, and the only way to hatch an egg is to physically walk 2 km, 5 km or 10 km (one of the physical activities that the game promotes). Eggs often hatch into rarer Pokemon that you will never see "in the wild", so these will always be seen once and caught once. This means that if you do any linear regression, you will have a large number of data points at (1, 1), and they will skew your regression, making the correlation look stronger than it really is. So I suggest removing those data points. In the set that I give as a sample, I have already done that (see below).

So this data set will be good for introductory linear relations with interpolation and extrapolation, but I have also extracted some of the data into smaller sets. Because we also asked for the Pokemon number and Pokemon type when we collected the data, we can start to use that info. For example, we can break the big set up into smaller sets, each corresponding to a different Pokemon. To facilitate that, I have created both a Fathom file and a Desmos Activity with these smaller sets (try it out here). The Desmos file, as it is set up, would be good for beginners when it comes to interpolation and extrapolation, but it could be augmented for further exploration of lines of best fit. The Fathom file would be good for comparing lines of best fit across the data sets. In the original data set you can also compare the types of Pokemon.

Sample Questions

  • How do your top 20 most popular Pokemon compare to the top 20 of the larger set?
  • How does the number of each type of Pokemon compare to each other?
  • Which Pokemon has the highest average number of catches?
  • Which Pokemon is easier to catch, based on the data?
  • How does the linearity of the data relate to how easy the Pokemon could be caught?
  • Which type of Pokemon is easier to catch? Which one has the largest correlation?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.