Showing posts with label standard deviation. Show all posts

Sunday, May 16, 2021

Star Wars Data via Kaggle

Another repository of freely available data is called Kaggle.  "Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time." I like this repository because it seems to be easily searchable and there are a lot of data sets so you should be able to find one that is on an interesting topic for your students without too much trouble. 

To showcase the repository, I'm choosing a data set suggested to me by @virgonomic: data from the Star Wars franchise. And actually, it's several data sets. 

Analysis 

There are several CSV files: one each on characters, species, planets, starships and vehicles. Now, you are not going to be doing any groundbreaking statistical work here, as the context of these data sets is pretty niche to die-hard Star Wars fans. Like, I'm not sure who will care that the Bantha-II cargo skiff has a one-day supply of consumables. Nonetheless, these are good data sets to be used for basic stats (finding mean, standard deviation, correlation etc.). You can definitely find many attributes that are categorical as well. One thing I did notice is that with most of the sets there were always one or two things that could be used to talk about outliers, like Jabba the Hutt in the characters data set or the rotational period of planets in the planets data set.


Sample Questions

  • When you consider the length of a vehicle compared to the number of crew it holds, are there any outliers?
  • What is the standard deviation of the _______ attribute in the _______ data set?
  • Find your favourite character. Pick an attribute and describe how your character compares to the others. 
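
If you (or your students) would rather check for outliers with a bit of code instead of software like Fathom or CODAP, here's a quick Python sketch using the standard 1.5 × IQR box-plot rule. The heights are made-up stand-ins, not the actual Kaggle values:

```python
import statistics

# Hypothetical character heights in cm -- illustrative values only,
# not the real numbers from the Kaggle characters file
heights = [172, 167, 96, 202, 150, 178, 165, 183, 182, 188, 228, 180, 66]

def iqr_outliers(values):
    """Flag values outside 1.5 * IQR of the quartiles (the usual box-plot rule)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print("mean:", statistics.mean(heights))
print("standard deviation:", statistics.stdev(heights))
print("outliers:", iqr_outliers(heights))
```

The same function works on any numeric column you pull out of the CSVs.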

BONUS data: Though this is not from this data set, it was recently Star Wars day and someone posted this infographic comparing the number of lines each character spoke and what words they spoke the most in the original trilogy. 


Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, March 22, 2019

Hip Hop Vocabulary

This post originally came out in 2014 (before this blog was created) and so I hadn't thought about it for a while. Then I saw a post by Dane Ehlert on his When Math Happens blog and was not only reminded of it but noticed that the original post had been updated in look and with new data. Basically, they take a pile of hip hop artists and count how many unique words each uses in their first 35,000 lyrics.

Analysis

When you go to the site, the visualization (above) is interactive: you can search for artists and explore the graphic. This is neat, but on this blog we typically want to do some mathematical analysis. They have other representations, like this one that looks like a histogram, but for our purposes we would like some numbers.

 
So if you look way down in the post, they do have a Google Sheet with the number of unique words for each of the over 160 artists. It's not a particularly robust data set, but we can do some simple
analysis: histograms, averages, box plots and other single variable analysis. I don't think there is anything particularly mathematically interesting in the data, but this is data that might be interesting for students, so it could be used to practice some standard single variable analysis techniques (central tendency, standard deviation, distributions, dot plots, box plots, histograms etc.)

Sample Questions

  • Who are the outliers in this data set?
  • Which decade has the most verbose rappers?
  • How does your favourite rapper compare to the most/least verbose rapper?
  • Take a look at some of the questions Dane was asking in his post for some more open questions.
  • What does the data in the original post say about the amount of words used in different types of music?
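
As a rough sketch of that single variable analysis, here's some Python that computes the central tendency numbers and bins the counts into histogram classes. The artist names and word counts are invented placeholders, not figures from the Google Sheet:

```python
import statistics
from collections import Counter

# Hypothetical unique-word counts for a handful of artists -- stand-ins
# for the 160+ values in the Google Sheet, not the real numbers
unique_words = {
    "Artist A": 3500, "Artist B": 4200, "Artist C": 5900,
    "Artist D": 2800, "Artist E": 6400, "Artist F": 4100,
}

values = list(unique_words.values())
print("mean:", statistics.mean(values))
print("median:", statistics.median(values))
print("stdev:", statistics.stdev(values))

# Bin the counts into 1000-word-wide histogram classes and sketch them
bins = Counter((v // 1000) * 1000 for v in values)
for lo in sorted(bins):
    print(f"{lo}-{lo + 999}: {'#' * bins[lo]}")
```

With the full sheet, the text histogram starts to show the shape of the distribution.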

Downloads 


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, January 4, 2019

Highest Grossing Concert Tours

Concerts are a multi-billion-dollar industry now, so why not use some concert data to do some statistical analysis? This data comes from the Wikipedia page on the same subject. On the page, the data is broken up into the top 20 all-time highest grossing concert tours (ordered by figures not adjusted for inflation). Then it has the top grossing tours for each decade from the 80s until the present. There is data on the decade rank, gross and inflation-adjusted gross, the number of shows, attendance and other attributes.

Analysis

You can start with some categorical analysis by just looking at who made the list each decade. This data runs for four decades, so kids might not be into who was big in the 80s, but if you highlight the biggest acts of the last decade you can still see that more than half of them were artists that were around in the 80s (with U2 being #1), and U2, Guns N' Roses and The Rolling Stones (twice) were in the top 5 of all time (inflation adjusted).

For more numerical analysis, you could pick any of the data sets to do some single variable analysis, whether it be central tendency, distributions, or histograms. There are many choices.

When you create some box plots you will find that some of the data sets have outliers. In particular, I think it's interesting that the outliers when dealing with the money are different from the outliers when dealing with the number of shows. This might lead you to explore things like the Average Gross and compare it to the money and number of shows.

This might lead you to do some double variable analysis. Though there aren't any strong relationships, you could use this to maybe talk about relationships with poor correlations. Technically there is one strong relationship: the one between the Gross and the inflation-adjusted gross. This would be expected, as one relates directly to the other. One thing that I like about this, however, is that it's not a perfect relationship. That is, whoever adjusted for inflation did so using a different rate for each year (to make it more realistic, presumably).
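
If you want to put a number on that Gross vs. inflation-adjusted gross relationship, a Pearson correlation coefficient does the job. This Python sketch computes it from first principles on made-up (gross, adjusted gross) pairs in millions, not the actual Wikipedia figures:

```python
import math

# Hypothetical (gross, inflation-adjusted gross) pairs in $ millions --
# illustrative stand-ins for the Wikipedia table
tours = [(736, 939), (584, 736), (558, 624), (459, 564), (408, 459)]

def pearson_r(pairs):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(pairs)
    xs, ys = [p[0] for p in pairs], [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Strong but not perfect, which matches the year-by-year inflation adjustment
print(f"r = {pearson_r(tours):.3f}")
```

You can reuse the same function on the weaker pairings (say, gross vs. number of shows) to talk about poor correlations.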

Sample Questions


  • Which artist made the most (overall or per concert)?
  • Which decade made the most money (adjusted for inflation)?
  • Which artists are outliers the most often?
  • Calculate the mean and median for each of the numeric attributes. How do these values suggest something about the distributions?

Downloads



Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Sunday, August 26, 2018

Using the CODAP Online Statistics Software for Simple Analysis

So for years I have been a user of Fathom. Fathom is a dynamic statistical software package that has been available for teachers and students, free, here in Ontario. However, the software itself has not been updated over time and currently won't even run on a relatively recently purchased Mac. Not to fear, some of the creators of Fathom have come together to create the Common Online Data Analysis Platform (CODAP).

And because it was created by the people who gave us Fathom, it has a lot of similarities in style and function. It's not exactly the same, but the biggest advantage is that it resides online, so you can assign data for students to analyze and they can do so on any platform (probably not very easily on a small-screen phone, but still technically possible).
But for simple analysis, it does almost all the same things that Fathom did: categorical and numerical analysis, mean & median, dot plots, scatter plots, linear regression, movable lines, sum of squares, box plots, outliers and more. Some things it doesn't do (yet) are bar graphs (though it makes the equivalent with dot plots) and histograms (though this may become an added feature). You can watch how easy it is to do some of those things dealing with simple analysis in the video seen below. If you want to play along with the video, here is the file that I used.

Once you know how to use the app, getting the data to your students is the next step. My preference is to have a pre-made CODAP file available for upload to CODAP. You can upload a file directly from any computer or from a Google Drive; my preference is to do so from a Google Drive. I have taken the liberty of converting many of the data sets on this blog to CODAP files. I have tagged all of them with the CODAP label here (also seen on the right side of the blog), or I have collected all the CODAP files in this folder. Alternatively, you can upload your own data in a .csv file, though it does not seem like you can do this directly from a Google Drive. So I would stick to creating the CODAP files and sharing those with your students (either on Google Drive or a local network drive). Either way, if you use any of these files, I would download them from this blog and then upload them to your preferred place.

And being redundant, here is a list of the past posts that I have done the conversion for and future posts will also have CODAP versions included.
Anscombe's Quartet
Smoking and Cancer
Movie Data
How Much Would you Pay for a $50 Gift Card?
Earthquake Data
Trending Data
Magazines
Speed Data
Electric Car Rebates
Is Levelling Up in Pokemon Go Exponential
Collecting Data from Pokemon Go

Don't forget to look at the CODAP site for lots of great resources: more data sets, tutorials, FAQs and, even though we haven't talked about them here, simulations. Or just look at the Educator Resources page.

Download the Data

All the Posts
Folder of CODAP files


Friday, May 13, 2016

The Data and Story Library - DASL

DASL (pronounced "dazzle"), the Data and Story Library, is an awesome database of data sets that are specifically meant to help teach topics in statistics. They are all real sets and are all categorized by topic/subject (eg automotive, food, health, sports etc) and mathematical method (eg boxplots, mean, outliers, regression, scatterplots etc). So, theoretically, if you wanted to find a set of data that could be used to help teach a specific topic, you could search for, say, "correlation".
These are some great data sets to get through the mechanical nature of statistics. It's not very current data but it's great for practicing statistical methods.
For the longest time this set of data was not available, but just recently it was hosted by Data Description Inc., so now we have access to it again.

Analysis

There are far too many sets to talk about analysis, but when the site was down I blogged about one of my favourite sets on Smoking and Cancer. Take a look at that post to get a sense of the data. When you get to any data set, to see the actual data file, click on the Datafile Name.

This will show you the text file of the data with the download link at the top of the page.
From that point you can do the analysis. Each data set will have a detailed description of each variable, a short story and a sample analysis.
There are many data sets on this site for every statistical topic and on a range of subjects. One thing you might have your students do is just explore this site and find data sets that can be used to exemplify a particular statistical concept.

Download the Data


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Tuesday, January 26, 2016

Magazines

A while back I started doing this activity with my students on the first day. For homework I would tell them to go home, find two magazines, record their prices and number of pages, and count the number of pages with ads on them. Once they brought that in, we would combine all the data into one set. I got the idea from browsing through an Oprah magazine and being shocked at how many pages I had to turn in order to get to a page that had actual content on it. Eventually I automated the process by using a Google Form to collect the data. And by adding another criterion (the type of magazine), this actually turns into a pretty rich data set.

The Analysis

Certainly with this data set you can do any number of things pertaining to calculations (average, standard deviation, correlation etc.), but I liked to use it to create a need to move from single variable analysis to two variable analysis. For example, the magazine in the current set with the highest number of ad pages is In Style, with 380 ad pages (which is definitely an outlier).
This seems outrageous, and the hope is that this will intrigue the students into asking questions. Perhaps they will also realize that it's the magazine with the largest number of total pages, and that then presents a need to do a different type of analysis (a two variable scatter plot). When you do that analysis, you will see that 380 ad pages, while proportionally a little high for a magazine with 620 total pages, is not so outrageous.
This is a good data set to just look at the basic stuff (creating bar graphs, histograms, box plots, scatter plots, measuring central tendency, determining correlations, finding least squares lines etc.).
Another thing you can do is look at the popularity of magazines (in your class or with this data set) broken up by type of magazine. By breaking it up into types, you give students an opportunity to compare graphs. When students compare graphs, an important skill to have them demonstrate is making sure the sizes and scales of the graphs are similar. This data set can help facilitate that.

Sample Questions

  • Create histograms of each of the numerical attributes and plot the mean and median on each graph. Describe each histogram as skewed right, left or symmetrical and justify your answers
  • Compare the graphs of total pages to ad pages
  • What proportion of magazines would be Sports & Entertainment in the average household?
  • What type of distribution would the number of ad pages be described as? Justify your answer.
  • Are there any outliers in the number of ad pages? Do the outliers change if you consider the type of magazine instead of the whole group?
  • Is the number of total pages (or ad pages) in the magazine correlated with the price of the magazine?
  • If a magazine were to have 120 pages, how many of them would you expect to have ads? Is this number different if you consider the type of magazine instead of all the magazines in the group?
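
For that last question, a least-squares line is the usual tool for predicting ad pages from total pages. Here's a minimal Python sketch on invented (total pages, ad pages) points, not the actual class data:

```python
# Least-squares line predicting ad pages from total pages, on a few
# made-up (total_pages, ad_pages) points -- not a real class data set
data = [(620, 380), (240, 90), (150, 40), (300, 130), (180, 55), (410, 210)]

def least_squares(points):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

m, b = least_squares(data)
print(f"predicted ad pages for a 120-page magazine: {m * 120 + b:.0f}")
```

Fitting the line separately for each magazine type lets students see whether the prediction changes by type.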

Download the Data


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Wednesday, January 6, 2016

Earthquake Database

Last week friends of mine felt a 4.8 magnitude earthquake on Vancouver Island, so it seems like a perfect time to post some resources on data about earthquakes. As it turns out, depending on the magnitude, there are a lot of earthquakes that happen worldwide each year. And we can get that data, almost in realtime, from any number of earthquake databases. I like the one that the US Geological Survey provides. It lets you set a few options and search earthquakes based on those options. The default is then a map that shows the result of your search.

The Analysis

Once you choose which options to use, then you have to get the data. I suggest that you limit your searches originally to those over magnitude 6 if you are looking at an extended time period (in 2015 there were over 140). If you play around with the magnitude threshold then you could get a huge amount of data (which you may or may not want). For example, if you drop that threshold to 4.5 there are over 6800 earthquakes found for 2015.

Once you get the data, you can just click the Download button on the top left to choose a CSV file that can be imported into any spreadsheet or Fathom. The obvious analysis here is a single variable set of the Magnitude (they call it mag in the data set). So you could do any number of histograms, box plots, dot plots etc as well as measures of central tendency and standard deviation. It's a really good data set for having students go through all the basic calculations needed when doing a single variable analysis.

Depending on when you get your data you will get outliers.

Usually the data will come out skewed to the right, as most of the quakes are typically at the low end (this is regardless of what you choose as your threshold).
You can also make a neat "heat map" by choosing Map in CODAP and dragging something like the Magnitude onto the middle of the graph so it appears as a colour spectrum. This can be done in Fathom by plotting the Longitude and Latitude (and thus getting a map) on a regular graph.


Here's a quick video on getting this data from the database into CODAP to use the Mapping feature:


Sample Questions

  • Determine the measures of central tendency for the magnitude of the earthquakes
  • Determine the five number summary for the magnitude of the earthquakes
  • Which earthquake(s) were the most extreme? Were they outliers?
  • How are the measures of central tendency affected if you remove the outlier(s) when looking at the magnitude of the earthquakes?
  • Determine whether the data for the magnitude of the earthquakes is skewed to the right or left.
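
The five-number summary and the skew check from the questions above take only a few lines of Python. The magnitudes below are illustrative stand-ins, not an actual USGS download:

```python
import statistics

# Hypothetical magnitudes from a filtered download (threshold 6.0) --
# illustrative values only, not real USGS rows
mags = [6.0, 6.1, 6.1, 6.2, 6.2, 6.3, 6.4, 6.5, 6.7, 7.0, 7.8, 8.3]

def five_number_summary(values):
    """(min, Q1, median, Q3, max) -- the numbers behind a box plot."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

print("five-number summary:", five_number_summary(mags))
# A mean pulled above the median hints at a right skew
print("skewed right?", statistics.mean(mags) > statistics.median(mags))
```

With a real CSV from the database, you would run the same two functions on the mag column.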

Other Earthquake Data

If students are trying to do something more with their earthquake data (like analyze and then make sense of it), they might try getting more info at IRIS (Incorporated Research Institutions for Seismology). There they have some of the same data and more, plus other info that might be relevant. Thanks to @frankmcgowa for that one.

Download the data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Monday, December 21, 2015

How much would you pay for a $50 Gift Card?

How much would you pay for a gift card on eBay? Perhaps I should back up a bit. Maybe for Christmas someone gets me a Tiffany's gift card. I will likely not be going to Tiffany's any time soon (don't tell my wife), so that gift card is not worth much to me. But it may be worth something to someone else. So, being an enterprising person, I put it up for auction on eBay. I wouldn't expect to sell it for more than what the gift card is worth (you would think). So the question then is: what percent of the actual value of the card will I be able to sell it for? Well, years ago the crew at Freakonomics shared this data set of 100 gift cards and what they sold for on eBay. The data is almost 10 years old, but it still turns out that this is a fairly rich data set.

The Analysis

So the attributes in this set are the card type (Best Buy, iTunes etc), the value of the card, how much it sold for, the shipping costs, how many bids it had, the feedback rating of the seller, the percentage of the sale (including the shipping), the average percentage per card and the actual link of the auction. That means there is a large number of things you can analyse. For single variable stuff you could find measures of central tendency for the entire set or individually for each type of card, or just choose your type of single variable graph and create it for the whole group or by card type.
Or you could do some double variable analysis, comparing to see the connection between the value of the card and the sale price (for either the whole group or by card type).
And because the data exists, you could even do some comparisons of the average percentage that each card type gets.

Sample Questions

  • Identify the outliers for each card type (Value, sold etc) and suggest why they might be outliers
  • Identify the spread for the Value of each card type. Why might some cards have smaller spreads than others?
  • How does the linear regression compare for different types of cards?
  • Are there any cards that were sold for more than they were worth? What might cause someone to pay more for a card than what it is worth?
  • Why might some cards have a higher average sale rate?
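
That last question, about average sale rates, comes down to grouping the percentage of face value recovered by card type. Here's a Python sketch with invented auction rows, not actual rows from the Freakonomics set:

```python
from collections import defaultdict

# Hypothetical auctions: (card type, face value, sale price incl. shipping) --
# made-up stand-ins for the Freakonomics data
auctions = [
    ("Best Buy", 50, 46.00), ("Best Buy", 25, 22.50),
    ("iTunes", 50, 43.00), ("iTunes", 15, 12.00),
    ("Tiffany", 50, 38.50), ("Tiffany", 100, 74.00),
]

# Collect the percentage of face value recovered, grouped by card type
totals = defaultdict(list)
for card_type, value, sold in auctions:
    totals[card_type].append(sold / value * 100)

for card_type, pcts in totals.items():
    print(f"{card_type}: {sum(pcts) / len(pcts):.1f}% of face value")
```

Running this on the real set would let students compare which store's cards hold their value best.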

Other Stories

This data came out of a story originally about why companies love gift cards (and the page of supporting data for the article). As it turns out, they actually tend to be like free money. This is because so often people don't use up all of their gift cards and then forget about them. I think part of that is because we are required to know exactly how much is left on a gift card in order to use it. They actually show the data (on pg 65) for Best Buy on how much extra money they made because of unused gift cards (spoiler alert: it was $43 million).

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.