Showing posts with label repository of data. Show all posts

Sunday, May 16, 2021

Star Wars Data via Kaggle

Another repository of freely available data is called Kaggle. "Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time." I like this repository because it seems to be easily searchable, and there are so many data sets that you should be able to find one on a topic that interests your students without too much trouble.

To showcase a data set, I'm choosing one suggested to me by @virgonomic, with data from the Star Wars franchise. Actually, it's several data sets.

Analysis 

There are five CSV files, one each on characters, species, planets, starships and vehicles. You are not going to be doing any groundbreaking statistical work here, as the context of these data sets is pretty niche and mostly of interest to die-hard Star Wars fans. I'm not sure who will care that the Bantha-II cargo skiff has a one-day supply of consumables. Nonetheless, these are good data sets for basic stats (finding mean, standard deviation, correlation etc). You can definitely find many categorical attributes as well. One thing I did notice is that in most of the sets there were always one or two things that could be used to talk about outliers, like Jabba the Hutt in the characters data set or the rotational period of planets in the planets data set.
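As a rough sketch of the basic stats mentioned above, the Python below finds a mean and standard deviation and flags outliers with the common 1.5 × IQR rule. The numbers are invented placeholder values, not taken from the Kaggle files, and the function name is my own; swap in a column from whichever CSV you download.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Placeholder masses, invented for illustration only (not real data).
masses = [50, 60, 65, 70, 75, 80, 85, 90, 1000]

print("mean:", round(statistics.mean(masses), 1))
print("standard deviation:", round(statistics.stdev(masses), 1))
print("outliers:", iqr_outliers(masses))
```

Running it with and without the extreme value is a quick way to show students how much a single outlier moves the mean and standard deviation.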


Sample Questions

  • When you consider the length of a vehicle compared to the number of crew it holds, are there any outliers?
  • What is the standard deviation of the _______ attribute in the _______ data set?
  • Find your favourite character. Pick an attribute and describe how your character compares to the others.

BONUS data: Though this is not from this data set, it was recently Star Wars day and someone posted this infographic comparing the number of lines each character spoke and what words they spoke the most in the original trilogy. 


Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Saturday, May 15, 2021

The Big Bang Theory Ratings & Viewership via Data.World

Data.World is a great site for data sets, and they all seem to be freely downloadable once you create an account. The site does charge, but the paid side seems to be for people who use data commercially. Members upload all kinds of data sets and you can search through them.

To show that, I've taken a sample data set about the TV show The Big Bang Theory. It was a great show, and it doesn't matter if you didn't watch it when it first aired, because you can probably find an episode on TV at just about any time of the day. For this data set, two databases (Wikipedia and IMDb) were scraped for information like ratings, viewership, plot lines and more, and the results are housed at data.world.

Analysis

There are several attributes in this data set (including episode descriptions and titles) but you probably want to stick to the numerical ones. You can do single-variable analysis of the number of viewers, the votes and the ratings, as well as some two-variable analysis. I like the single-variable analysis because you can split the data by season and analyze each season separately.
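Here is a minimal sketch of that season-by-season split in Python. The column names (`season`, `viewers_millions`) and the rows are assumptions made up for illustration; check the headers in the file you actually download.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Tiny stand-in for the downloaded file; column names and values are assumed.
SAMPLE = """season,viewers_millions
1,9.0
1,8.0
2,10.0
2,11.0
2,12.0
"""

def mean_by_season(csv_text, value_column):
    """Group rows by the 'season' column and average value_column per season."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["season"]].append(float(row[value_column]))
    return {season: mean(values) for season, values in groups.items()}

averages = mean_by_season(SAMPLE, "viewers_millions")
for season, avg in averages.items():
    print(f"Season {season}: {avg:.2f} million viewers on average")
```

The same function works for votes or ratings by passing a different column name.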

Sample Questions

  • Which season had the highest average viewership?
  • Is there a connection between the rating and number of votes?
  • Which season(s) had the most popular episodes?
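For the question about ratings and votes, one way to put a number on the "connection" is the Pearson correlation coefficient. This is a sketch with invented placeholder values, not the real episode data, and the function name is my own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and each variable's sum of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Placeholder ratings and vote counts, invented for illustration only.
ratings = [7.9, 8.1, 8.4, 8.6, 9.0]
votes = [2100, 2300, 2250, 2800, 3500]

print("r =", round(pearson_r(ratings, votes), 3))
```

A value near 1 or -1 suggests a strong linear relationship and a value near 0 suggests little linear relationship; a scatter plot is still worth making before trusting the number.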

Downloads 


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.



Friday, June 9, 2017

Five Thirty Eight's Pile of Data


UPDATE: Now even more of their data is available and easier to get at, you guessed it, their data site: https://data.fivethirtyeight.com/

I have always found it tough to find interesting data sets, especially ones that are not contrived. At Five Thirty Eight they are constantly looking at the world through data. Their primary posts tend to be about politics or sports, but they often have posts on pop culture and other topics. For example, they recently had a post titled "Why Classic Rock Isn't What It Used To Be". In that post they analyzed over 37,000 plays of classic rock songs spanning decades. And not only have they done the work, they've made all of the raw data available: all 37,673 pieces in a CSV file.

Downloading the Data

They have a GitHub site where they make the raw data available for many of their stories. They publish a lot of data-related stories, and although most of them are not on this site, almost 100 are. For example, you could look at the article about how deadly it is to be an Avenger and see that the article doesn't have any graphs, but there is a bunch of data you could use to make a histogram or do something else with the categorical data.
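As a sketch of what "a histogram with the categorical data" could look like, the Python below tallies a categorical column and prints a simple text bar chart. The category labels and values are invented placeholders, not the real column contents.

```python
from collections import Counter

def text_bar_chart(labels):
    """Count each category and return one 'label | ### (n)' line per category."""
    counts = Counter(labels)
    width = max(len(label) for label in counts)
    return [
        f"{label.ljust(width)} | {'#' * n} ({n})"
        for label, n in counts.most_common()
    ]

# Placeholder categories, invented for illustration only.
status = ["YES", "NO", "NO", "YES", "NO", "UNKNOWN", "NO"]

for line in text_bar_chart(status):
    print(line)
```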

Or, if you are a Bob Ross fan (real or ironic), you can get the data they analyzed on the paintings he created for his show. Here's the article, but on the GitHub site you get the raw data plus, as an added bonus for you code jockeys, the Python script they used to create the data set. Most of the data sets link to the original article.
Note that when you see the CSV file listed, you can't just right-click and download the file. That will just get you the script used to get the data. To get the actual data, click the CSV link and then copy the data from the table that appears.
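For readers comfortable with code, here is an alternative sketch: GitHub serves plain file contents from raw.githubusercontent.com, so a short Python script can read a CSV directly. The branch name `master` is an assumption, and the function names are my own; check the branch and the folder and file names against the repository before using it.

```python
import csv
import io
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com/fivethirtyeight/data"

def raw_csv_url(path, branch="master"):
    """Build the raw-file URL for a path inside the repository."""
    return f"{RAW_BASE}/{branch}/{path.lstrip('/')}"

def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_csv(path, branch="master"):
    """Download one CSV file from the repository and parse it (needs network)."""
    with urllib.request.urlopen(raw_csv_url(path, branch)) as response:
        return parse_csv(response.read().decode("utf-8"))
```

Usage would look like `rows = fetch_csv("some-folder/some-file.csv")`, where the path is a hypothetical placeholder for the folder and file name shown on the GitHub page.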

Some other interesting sets cover Fandango's movie ratings, the connections between the actors in the movie Love Actually, and the popularity of unisex names.

One small warning: this is raw data, and in a few cases really raw. For example, the data set about the number of times someone cursed or bled out in a Quentin Tarantino movie is very cool but totally inappropriate for a classroom (there are 1895 pieces of data in this set).

Check them all out on the sites:
https://data.fivethirtyeight.com/
https://github.com/fivethirtyeight/data

Friday, May 13, 2016

The Data and Story Library - DASL

DASL (pronounced "dazzle"), the Data and Story Library, is an awesome database of data sets that are specifically meant to help teach topics in statistics. They are all real sets and are all categorized by topic/subject (eg automotive, food, health, sports etc) and mathematical method (eg boxplots, mean, outliers, regression, scatterplots etc). So theoretically, if you wanted to find a set of data that could be used to help teach a specific topic, you could search for, say, "correlation".
These are some great data sets for getting through the mechanical side of statistics. The data is not very current, but it's great for practicing statistical methods.
For the longest time this set of data was not available, but just recently it was hosted by Data Description Inc., so now we have access to it again.

Analysis

There are far too many sets to talk about analysis for each one, but when the site was down I blogged about one of my favourite sets, on Smoking and Cancer. Take a look at that post to get a sense of the data. When you get to any data set, click on the Datafile Name to see the actual data file.

This will show you the text file of the data with the download link at the top of the page.
From that point you can do the analysis. Each data set has a detailed description of each variable, a short story and a sample analysis.
There are many data sets on this site for every statistical topic and on a range of subjects. One thing you might have your students do is just explore on this site and find data sets that can be used to exemplify a particular statistical concept.

Download the Data


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.