Showing posts with label linear regression. Show all posts

Tuesday, March 26, 2019

Mining the Meta Data in your iTunes Library

If you (or your students) use iTunes to keep track of your music, it turns out to be a rich source of data that might be interesting for your students to analyze. I find that students are more interested in looking at data when it is their own. In this case, every song in iTunes (and really, on any platform) has a pile of metadata associated with it. That metadata includes things like song name, artist name and album name, but also numerical values like song length, file size, number of plays etc. So you could have your students get the data from their own library and analyze it.

Getting the data from iTunes is pretty easy. Once in iTunes, click on Songs to get the info for all of their music, or click on a favourite playlist to get just that playlist's data. Then click on File, then Library, then Export Playlist. It will then save a .txt file to the folder of your choice. That .txt file will need a bit of cleaning up, but not much. I suggest importing it into Excel or Google Sheets to clean it up. If you are doing the work in that spreadsheet (or uploading to Desmos) then you're all set. If you plan on importing it into CODAP, save the data as a .csv file (I noticed that even though you should be able to import a .txt file into CODAP, the format of this one doesn't seem to work, so you have to convert it to a .csv).
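If you'd rather script the conversion than clean it up in a spreadsheet, here is a rough Python sketch of the tab-to-comma conversion. The two-song sample text is made up for illustration; also note that on some systems the iTunes export is UTF-16 encoded, so you may need to decode the real file accordingly when reading it.

```python
import csv
import io

def itunes_txt_to_csv(txt_text):
    """Convert iTunes tab-delimited export text to CSV text.

    Assumes one header row and tab separators, which is how the
    Export Playlist .txt file is laid out.
    """
    rows = csv.reader(io.StringIO(txt_text), delimiter="\t")
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()

# Hypothetical two-song export for illustration
sample = "Name\tArtist\tTime\nSong A\tBand X\t215\nSong B\tBand Y\t187\n"
print(itunes_txt_to_csv(sample))
```

The resulting text can be saved with a .csv extension and uploaded straight into CODAP.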

Analysis

Though the data itself is not wildly interesting, you can certainly use it to cover topics like mean, median, standard deviation, and other single variable measures, and maybe have students compare values from their playlists with other students'. Note that the song times are in seconds, so if a histogram is created, bin widths of 30s or 60s are probably appropriate (let students figure this out).
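For students working in Python rather than a spreadsheet, the single variable measures and the 30-second binning might look like this sketch (the song lengths below are made up, not from a real library):

```python
import statistics

# Hypothetical song lengths in seconds
times = [215, 187, 240, 198, 305, 262, 221, 178, 289, 244]

print("mean:", statistics.mean(times))
print("median:", statistics.median(times))
print("stdev:", statistics.stdev(times))

# Tally songs into 30-second bins for a histogram
bin_width = 30
bins = {}
for t in times:
    lo = (t // bin_width) * bin_width  # lower edge of the bin
    bins[lo] = bins.get(lo, 0) + 1
for lo in sorted(bins):
    print(f"{lo}-{lo + bin_width - 1}s: {'#' * bins[lo]}")
```

Changing bin_width to 60 is a quick way for students to see how bin choice changes the picture.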

One thing that I think is interesting is that you would expect a very strong (if not perfect) relationship between the length of a song and its file size. But as you can see, there seem to be several different relationships. This is due to the bit rate of the file compression. So you might be able to have a conversation about what bit rate is and how it relates to the compression of the file. The lower the bit rate, the smaller the file size (for songs of the same length). So you could talk about why you would want a lower or higher bit rate (hint: a lower bit rate means poorer sound quality but a smaller file size, so there is a trade-off). In CODAP you can create separate graphs of the bit rate data and the scatter plot of size vs time, then highlight parts of the data to show the different relationships. You could also hide or show data based on the bit rate to do more specific analysis by isolating just the data from one bit rate.
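The underlying arithmetic is worth making explicit for students: at a constant bit rate, file size is roughly bit rate times time. A sketch (assuming 1 MB = 1000 kB for simplicity):

```python
def file_size_mb(bitrate_kbps, seconds):
    """Approximate size of a constant-bit-rate audio file.

    Bit rate (kilobits per second) x time gives kilobits;
    divide by 8 for kilobytes and by 1000 for megabytes.
    """
    kilobits = bitrate_kbps * seconds
    return kilobits / 8 / 1000

# The same 4-minute song at three common bit rates
for rate in (128, 192, 256):
    print(rate, "kbps ->", round(file_size_mb(rate, 240), 2), "MB")
```

Each bit rate gives its own line through the origin in the size vs time scatter plot, which is exactly the fan of relationships students see in their data.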

Sample Questions

  • Choose three numerical attributes from your data and determine the mean, median and SD of each. Graph each attribute using an appropriate representation.
  • Which genre of music has the highest average song length?
  • Which song was played the most?
  • Which decade has the most songs?
  • Which song was skipped the most?
  • Determine the relationship between the size of a file and how long the song is for different bit rates. 
  • You have only 50 MB of space left on your device. How many minutes of music could you store using all of the remaining space? (Note that answers will vary based on the bit rate.)
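The last question above is the size arithmetic run backwards: convert the space to kilobits and divide by the bit rate. A sketch, again assuming 1 MB = 1000 kB:

```python
def minutes_of_music(space_mb, bitrate_kbps):
    """How many minutes of audio fit in space_mb at a given bit rate."""
    kilobits = space_mb * 1000 * 8   # MB -> kB -> kilobits
    return kilobits / bitrate_kbps / 60

for rate in (128, 192, 256):
    print(f"{rate} kbps: {minutes_of_music(50, rate):.1f} min")
```

Students should notice the answer doubles when the bit rate halves, which is the trade-off from the discussion above in numbers.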

Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Sunday, February 24, 2019

Skipping World Record

A few months back I saw a 3 Act Task called Rope Jumper that @gfletchy created out of this video:
He shows the first few seconds of the video and you have to guess how many skips are done in 30s. It's a good 3 Act task, but that's not what we're doing here. Here I've actually collected the time data from each skip to do a bit of analysis (I had to slow the video down to 50% speed in order to catch every skip).

Analysis

As you would guess, it's pretty linear, but you might notice as you watch the video that she seems to slow down at times. It's not super exciting in terms of the actual data, but it could be used simply to help students in determining the least-squares line.
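If students want to compute the least-squares line themselves rather than letting the software do it, a minimal sketch looks like this. The (seconds, cumulative skips) readings below are made up for illustration, not taken from the actual video data:

```python
def least_squares(xs, ys):
    """Slope and intercept of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (seconds, cumulative skips) readings
t = [0, 5, 10, 15, 20, 25, 30]
skips = [0, 28, 55, 80, 104, 130, 158]

m, b = least_squares(t, skips)
print(f"skips = {m:.2f} * t + {b:.2f}")
```

The slope is her skipping rate in skips per second, which connects directly to the rate questions below.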

Sample Questions

  • When was she skipping the fastest/slowest and what was the rate?
  • How many skips do you think she would make in 1 minute?
  • If she was to keep the pace that she had in the first few seconds, how many skips would she have made in 30s?
  • If she had skipped at the same rate as she did in her slowest section, would she still have broken the record?

Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, October 26, 2018

Walnut Crushing World Record (with 3 Act Task)

Check out this video (thanks to @ddmeyer for pointing this one out).

So the guy crushes walnuts with his head and what we get is a linear relationship. There are a few things here. First off, there is a 3 Act task, which I modelled on @Gfletchy's similar task for rope jumping. Secondly, I timed how long it took for each walnut to get crushed and collected the times in a file (if you are interested, I slowed the video down to 50% speed and then used an online timer to get the splits). So now you can do some analysis. It's not a particularly interesting data set, but it might give a fun context for looking at linear relationships.

3 Act Task

Act 1 - Watch the movie
How many walnuts will he be able to crush with his head in 60 seconds? Make an estimate.
Write an estimate you know is too high. Write an estimate you know is too low.

Act 2a - Before you show this ask students what information they would like to have.

Act 2b - Show this video for information with more accessible math

Act 2c - Show this video for information with even more accessible math

Act 3 - Show this video to reveal the answer.

Analysis

I guess the question that most comes to my mind (after "does he have a headache?") is whether he is crushing the walnuts at a constant rate. Careful observation might find a couple of spots where he hesitates a bit, and you might want to discuss whether that shows up in the data. But is the data linear? Looking at the graph you can see that, for the most part, it is, but the rate is slightly faster at the beginning and slightly slower at the end, and each section seems pretty linear.
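One way to see the hesitations numerically is to compute the crushing rate over each gap between consecutive crushes. This sketch uses made-up cumulative crush times, not the real splits:

```python
# Hypothetical cumulative times (seconds at which each walnut broke)
crush_times = [0.5, 1.0, 1.4, 1.9, 2.5, 3.2, 3.8, 4.6]

# Rate over each gap: one walnut per interval between crushes
rates = [1 / (b - a) for a, b in zip(crush_times, crush_times[1:])]
for i, r in enumerate(rates, start=1):
    print(f"walnut {i} -> {i + 1}: {r:.2f} walnuts/s")

# A hesitation shows up as a noticeably lower rate (a wider gap)
print("slowest stretch:", round(min(rates), 2), "walnuts/s")
```

Plotting these interval rates against time makes a hesitation jump out far more clearly than the cumulative graph does.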

Another thing you might want to discuss is whether it should be Time vs Walnuts or Walnuts vs Time. Since rates are usually per unit time then it probably makes sense to do Walnuts vs Time but you could argue that the total time depends on the number of walnuts or that the total number of walnuts you could crush depends on how much time you have. Note that the easiest way to swap the axes in a Google spreadsheet is by changing the position of the columns so to do that I just copied the Time column to both sides of the Walnuts column.

Sample Questions

Besides the above questions you could certainly ask:
  • What's the line of best fit?
  • What's the correlation?
  • How many walnuts do you think he could crush if it were two minutes? 10 minutes?
  • Is there a better fit than linear?
  • How many nuts would he have cracked if he kept at the same pace as the first 10 seconds?
  • If you only saw the first 5 seconds, what would be your prediction of the number crushed in 1 minute?
  • Can you tell, on the graph, when he hesitated?
  • What if he had kept the pace he finished with throughout the whole minute; how many nuts would he have cracked then? (I think the previous record was 281.)

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Sunday, August 26, 2018

Using the CODAP Online Statistics Software for Simple Analysis

So for years I have been a user of Fathom. Fathom is a dynamic statistical software package that has been available for teachers and students, free, here in Ontario. However, the software itself has not been updated over time and currently won't even run on a relatively recently purchased Mac. Not to fear, some of the creators of Fathom have come together to create the Common Online Data Analysis Platform (CODAP).

And because it was created by the people who gave us Fathom, it has a lot of similarities in style and function. It's not exactly the same, but the biggest advantage is that it resides online, so you can assign data for students to analyze and they can do so on any platform (probably not very easily on a small-screen phone, but still technically possible).
For simple analysis, it does almost all the same things that Fathom did: categorical and numerical analysis, mean & median, dot plots, scatter plots, linear regression, movable lines, sum of squares, box plots, outliers and more. Some things it doesn't do (yet) are bar graphs (though it makes the equivalent with dot plots) and histograms (though these may become added features). You can watch how easy it is to do some of this simple analysis in the video below. If you want to play along with the video, here is the file that I used.

Once you know how to use the app, getting the data to your students is the next step. My preference is to have a pre-made CODAP file available for upload to CODAP. You can upload a file directly from any computer or from a Google Drive; my preference is the Google Drive route. I have taken the liberty of converting many of the data sets on this blog to CODAP files. I have tagged all of them with the CODAP label here (also seen on the right side of the blog), and I have collected all the CODAP files in this folder. Alternatively, you can upload your own data as a .csv file, though it does not seem like you can do this directly from a Google Drive. So I would stick to creating the CODAP files and sharing those with your students (either on Google Drive or a local network drive). Either way, if you use any of these files, download them from this blog and then upload them to your preferred place.

And being redundant, here is a list of the past posts that I have done the conversion for; future posts will also have CODAP versions included.
Anscombe's Quartet
Smoking and Cancer
Movie Data
How Much Would you Pay for a $50 Gift Card?
Earthquake Data
Trending Data
Magazines
Speed Data
Electric Car Rebates
Is Levelling Up in Pokemon Go Exponential
Collecting Data from Pokemon Go

Don't forget to look at the CODAP site for lots of great resources: more data sets, tutorials, FAQs and, even though we haven't talked about them here, simulations. Or just look at the Educator Resources page.

Download the Data

All the Posts
Folder of CODAP files


Saturday, September 17, 2016

Collecting Data from Pokemon Go

It's the beginning of the school year now and the dust is starting to settle from the summer's obsession with Pokemon Go. So why not try to leverage that obsession by having students collect some data? The data comes in the form of how many times each Pokemon was seen and caught by each user. I got the idea for this data set from this post from @lesliefarooq, where she pointed out that when you look in the Pokedex, there is data about how many times each Pokemon was both seen and caught. At first glance this is a simple data set, but it turns out there is a lot you can do with it.

So I started to collect some of that data using a Google Form and generated two types of graphs. The first was a graph of the most often seen Pokemon (no surprise to players what the top three were). The second was the linear relationship between the number caught and the number seen. What follows are ways that you can either use my data or collect your own with your students.

Analysis

So the first thing you need to do is get the data. Once in the game, tap on the Pokeball at the bottom of the screen, then the Pokedex, and then tap on any Pokemon that shows up. On that Pokemon's screen you can collect the Pokemon number, the name (optional; to make entry into the form quicker, I only required the number), how many they saw, how many they caught, and finally the type of Pokemon. Swiping left or right will cycle between Pokemon so you can collect the data faster. So if you have students that have been playing the game, they can collect the data there. You might want them to collect it manually, or they can use this form to add to my data electronically, or you can make a copy of this form to create your own class set.

Once you have the data, the first thing that you can have students do is create a bar graph of their most popular Pokemon, like @lesliefarooq did. I took that a step further: since I collected the data via a Google Form, I used a bit of spreadsheet wizardry to tally up the total number of Pokemon of each type seen, given all the data. You can see that in my data sheet, where I have added some columns to the right of where the data is collected. The nice thing about this is that as more people add their data to my form, it will continue to update the totals. So with this data you can do some of the same things that @lesliefarooq did: ask students about their most popular Pokemon and compare to the graphic that shows how popular or rare each Pokemon is.
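The spreadsheet tally can also be reproduced in a few lines of Python with a Counter. The records below are made up to show the shape of the form data, not my actual responses:

```python
from collections import Counter

# Hypothetical (name, type, seen) records like the form collects
records = [
    ("Pidgey", "Normal", 120),
    ("Rattata", "Normal", 95),
    ("Zubat", "Poison", 60),
    ("Weedle", "Bug", 40),
    ("Caterpie", "Bug", 35),
]

# Tally total sightings by Pokemon type
seen_by_type = Counter()
for name, ptype, seen in records:
    seen_by_type[ptype] += seen

for ptype, total in seen_by_type.most_common():
    print(ptype, total)
```

Swapping the grouping key from type to name gives the most-popular-Pokemon bar graph instead.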

But the nice thing about this data is that you can now use the connection between the sightings and catches to connect to linear relationships. It's not a perfectly linear relationship but it will have a very strong correlation.

NOTE: In the actual game, players will collect Pokemon in two ways. The main way is by having them appear and then catching them by throwing Pokeballs at them. Most Pokemon will be caught this way. The second way is to hatch eggs, and the only way to hatch an egg is to physically walk 2km, 5km or 10km (that is one of the physical activities that the game promotes). When you hatch an egg, it is often a rarer Pokemon that you will never see "in the wild", so these will always be seen once and caught once. This means that if you do any linear regression, you will have a large number of data points at (1, 1), and that will skew your regression, making it appear stronger. So I suggest removing those data points. In the set that I give as a sample, I have already done that (see below).

So this data set will be good for introductory linear relations with interpolation and extrapolation, but I have also extracted some of the data into smaller sets. When we collected the data we also asked for the Pokemon number and Pokemon type, so we can start to use that info. For example, we can break up the big set into smaller sets, each corresponding to a different Pokemon. To facilitate that, I have created both a Fathom file and a Desmos Activity with these smaller sets (try it out here). The Desmos file, as it is set up, would be good for beginners when it comes to interpolation and extrapolation, but it could be augmented for further exploration of lines of best fit. The Fathom file would be good for comparing lines of best fit across the data sets. In the original data set you can also compare the types of Pokemon.

Sample Questions

  • How do your top 20 most popular Pokemon compare to the top 20 of the larger set?
  • How do the numbers of each type of Pokemon compare to each other?
  • Which Pokemon has the highest average number of catches?
  • Which Pokemon is easier to catch, based on the data?
  • How does the linearity of the data relate to how easily the Pokemon could be caught?
  • Which type of Pokemon is easier to catch? Which one has the largest correlation?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Monday, June 6, 2016

Electric Car Rebates

So this article came across my Facebook feed a while back and I thought it was a great potential source of data for discussion at many levels.
It certainly captured my attention as an Ontario resident, but a closer look showed that there was potentially a lot of data to be analyzed. The data is about the Ontario Electric Vehicle Incentive Program. The article was inspired by this news release, but the authors were able to get more specific data about the number of vehicles of each style (which is not in the release).

Analysis

Students are encouraged to look critically at the original article and perhaps talk about how the title and some of the information given are used to incite a reaction.
For example, even though the article gives the overall numbers of almost 4800 people getting around $39 million in rebates, it focuses on just the rebates for the most expensive cars, which total about 2% of the people and of the rebate value. And although it is mentioned, it's not highlighted that about 25% of those rebates went to one vehicle, the Chevrolet Volt.
But looking at the ministry website you can see a nice data set about which cars get which rebates (as well as info about how the program changed once it was pointed out that super expensive luxury cars were getting rebates).
I was able to get this table out and clean it up, as well as add the approximate value of each car to the list (it's approximate because I had to search each one out on the web, so I might have been a bit lazy when it came to options), and now it is good for some simple analysis.
On the "low hanging fruit" end, you can create a bar graph of the number of models for each company. Personally, I wouldn't have guessed GM to be at the top. But you can also create a histogram of the actual rebates to look at the distribution (or perhaps look at the box plot or dot plot). Lastly, you could look at whether there is a connection between the price of the car and how big the rebate is.

Sample Questions

  • Which manufacturer has the most electric models?
  • What is the most common rebate value?
  • Does the rebate get bigger (in general) as the price of the car increases?
  • If you were going to purchase an electric vehicle, which one would benefit the most/least from the rebate program?

Download the Data

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Tuesday, May 24, 2016

Gas Prices in Ontario

A friend, Michael Lieff, pointed out this nice set of data. It is the price of gas in several Ontario cities going as far back as 1990. This is an interesting data set: the price of gas, in general, increases, but you can see that that wasn't always the case (only a few of the cities are shown below).

Analysis

When you go to this website you have several options for prices, and you can download a year of data at a time (with CSV as an option). The obvious choice is regular gasoline, but you might want to consider comparing regular gas to alternative fuels like propane. For example, in this case you can see that, in general, propane has also risen in price over time, but where gasoline seems to fluctuate similarly regardless of the city, propane seems to be more volatile depending on location.

Because of the sheer number of data points possible (you can get a weekly average for the last 25 years for several cities if you want), you may wish to stick to yearly values. Another option is to use some of the weekly values to talk about the dangers of extrapolation.
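The extrapolation danger is easy to demonstrate: fit a line to a short run of weekly prices during a price run-up, then extend it a year out. The weekly prices below are invented for illustration, not taken from the actual Ontario data:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx

# Hypothetical weekly gas prices (cents/L) during a short run-up
weeks = [1, 2, 3, 4, 5, 6]
price = [98.0, 99.5, 101.2, 103.0, 104.8, 106.1]

m, b = fit_line(weeks, price)
print(f"trend: {m:.2f} cents/L per week")
print(f"naive extrapolation to week 52: {m * 52 + b:.1f} cents/L")
```

The six-week trend is real, but extending it to week 52 predicts an absurd price; comparing that prediction to what actually happened in the data makes the point vividly.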



Download the Data

Site: http://www.energy.gov.on.ca/en/fuel-prices/
I have also taken the liberty of downloading all of the data for gasoline (all 25 years of it) in weekly, monthly and yearly form, as well as the yearly propane data. You can get it on this Google sheet (note the tabs) or just the gas prices in Fathom.

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, May 13, 2016

The Data and Story Library - DASL

DASL (pronounced "dazzle"), the Data and Story Library, is an awesome database of data sets that are specifically meant to help teach topics in statistics. They are all real sets and are all categorized by topic/subject (eg automotive, food, health, sports etc) and mathematical method (eg boxplots, mean, outliers, regression, scatterplots etc). So if you wanted to find a set of data that could be used to help teach a specific topic, you could search for, say, "correlation".
These are some great data sets for getting through the mechanical nature of statistics. It's not very current data, but it's great for practicing statistical methods.
For the longest time this set of data was not available, but just recently it was hosted by Data Description Inc., so now we have access to it again.

Analysis

There are far too many sets to talk about analysis, but when the site was down I blogged about one of my favourite sets, on Smoking and Cancer. Take a look at that post to get a sense of the data. When you get to any data set, to see the actual data file, click on the Datafile Name.

This will show you the text file of the data with a download link at the top of the page.
From that point you can do the analysis. Each data set has a detailed description of each variable and a short story with a sample analysis.
There are many data sets on this site for every statistical topic and on a range of subjects. One thing you might have your students do is just explore the site and find data sets that can be used to exemplify a particular statistical concept.

Download the Data


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Thursday, December 17, 2015

Movie Data

Given that, as I type this, the new Star Wars movie is coming out this week, it seems like a perfect time to highlight some places to get data about movies. There are a pile of places to go, and kids (and most humans) love movies, so why not find some data that kids will be more engaged to explore? As it turns out, there are a few really great places to get real-time data on movies. I'm going to focus on two.

Box Office Mojo

The first one is http://www.boxofficemojo.com/. There is a lot of data to choose from and it is almost real time. For example, you can click on Daily and it will give the summary of total domestic (US) ticket sales for each day. Or at the top, if you click the daily summary, you will get the top movies of the day and how much they made (among other things, right down to the dollar). You can even drill down and click on a movie name to get things like how many theatres it is in. One of the other neat things is their "Showdowns" of movies, with comparisons like this one of Interstellar, Gravity and The Martian. But by far the coolest thing is the all-time chart, which gives the records for a huge number of metrics.

The Numbers

The second site I like is http://www.the-numbers.com/. Here you can get some of the same stats, like the box office info from any day of any year, but also stuff on DVD sales as well as how bankable a star is. It even has a special Report Builder page where you can generate your own report with the info you want. But for me, by far, the best part is their movie budgets page, where you can get the all-time list of movies by production budget (over 5000 of them) or the top 20 most profitable movies.

The Analysis

There is so much that you can do with this data that you could probably pick any topic and find something to report on. But let me highlight a few of my favourite things to do with, for example, the daily movie data from Box Office Mojo (Fathom, Fathom Sol, Google Sheet). At the low end you could create histograms, dot plots and box plots, and compare measures of central tendency. At the higher end you can have them look for outliers or compare what happens day to day.

That daily data was a summary; you can also take the daily data from The Numbers (Fathom, Fathom Sol, Google Sheet). My favourite thing to do, after looking at the single variable analysis of the amount of money, is the two variable analysis of how the money compares to the number of theatres each movie was in, and then to see if any of the movies might get lost in that data (like The Big Short, which hardly played in any theatres but had the most tickets sold per theatre, or In the Heart of the Sea doing better than expected and the Peanuts Movie doing worse than expected).
Another of my favourite things is to look at how movies did compared to what it cost to make them. There is a lot of info on this on The Numbers, and one of my favourite examples is The Blair Witch Project: a movie that cost only $60,000 to make yet had a worldwide total gross of almost $250 million. You can get the daily numbers for any movie like this, and in this case see that it started out in one theatre and did well, then expanded to about 30 theatres and did well, and then finally got a much wider distribution and blew up.

That is just a small sample of what you could do with this data, especially if you use the full set from The Numbers (Fathom, Google Sheets).

Sample Questions

  • What I usually do with these sites is ask something more general. I introduce them and then just ask "What story does this data tell? Use graphs and calculations to tell your story."
  • Another thing I ask is for students to look at the all-time list and use a site like http://natoonline.org/data/ticket-price/ to put everything in today's dollars. They can check their answers on the Box Office Mojo summary page, where they show that Gone With the Wind, adjusted for inflation, would have grossed over $1.7 billion domestically (there is no worldwide data). Or even look at the story the adjusted data tells. The data set on movie ticket prices alone is pretty good for analysis.
  • For the younger grades you could make bar graphs or circle graphs about their favourite movie franchise, for example, like Harry Potter (Google Sheets, Google Sheets with Graphs)
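The inflation adjustment in the ticket-price question is just a ratio: estimate tickets sold at the old price, then reprice them at today's price. A sketch with made-up numbers (not real figures for any movie):

```python
def adjusted_gross(gross, ticket_price_then, ticket_price_now):
    """Estimate tickets sold at the historical price,
    then reprice them at today's ticket price."""
    tickets = gross / ticket_price_then
    return tickets * ticket_price_now

# Hypothetical: a $200M gross when tickets were $2, repriced at $9
print(round(adjusted_gross(200_000_000, 2.00, 9.00)))
```

Students can pull the actual per-year ticket prices from the NATO page and apply this to the all-time list themselves.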

Other Movie Resources

The FiveThirtyEight.com site often does stories on movies, and there is a great podcast about the problems with the movie rating sites and how they handle data. Read and listen about it here and here. And of course there are the famous movie quotes as visualizations.

Download the Data


Let me know if you used these data sets or if you have suggestions of what to do with them beyond this. Or if you created a lesson based on this data, share it below.

Friday, December 4, 2015

Smoking and Cancer

For many years I used the Data & Story Library (DASL), but for some reason the data on the site is currently unavailable. There are some great data sets there, so while they are unavailable I thought I would share some of my favs.

The Analysis

Probably my favourite is the Smoking and Cancer story. This is a great data set for talking about correlation. The data gives the average number of cigarettes smoked in each US state, and then the rates of bladder cancer, lung cancer, kidney cancer and leukaemia for each state. So at the very least you can have students create the graphs of each of the afflictions vs the number of cigarettes smoked. When you do, you get the following graphs:

The thing I like the most about this is that when you do that, you see that bladder cancer has the strongest correlation, which is not intuitive. But in the above graph you will notice that the scales are all different. The graph below shows the same graphs but all with the same scale. Here you see that even though bladder cancer may have a correlation with smoking as strong as lung cancer's, there really isn't much of a relationship (ie no matter how many cigarettes are smoked, the rate of bladder cancer barely changes). And since the other two have low or no correlation, you can see that smoking has the strongest connection to lung cancer.

So it's a good lesson about correlation and why it is important to scale the axes similarly when comparing data.

Sample Questions

  • Which pairs of data appear to have a connection to each other?
  • What do each of the numbers represent in each equation?
  • Which of the scatter plots indicate that there is a relationship between the data?
  • Use your least squares equations to predict what the death rate would be for each relationship if the Cig value was 10 or 50. How confident can you be of each prediction?

Download the Data

Fathom (Data) (Solution)
Google Spreadsheet
CODAP file

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Saturday, November 21, 2015

Anscombe's Quartet

Anscombe's Quartet is four two-variable sets of data that have a particularly interesting property.


Upon examination, the first three sets have the same x values, but other than that the y values all seem random. The interesting thing starts when you do some numerical analysis on them. Just start with some simple single variable calculations:
  • Mean of each x set = 9
  • Mean of each y set = 7.50
  • Variance of each x set = 11
  • Variance of each y set = 4.122-4.128
So, almost identical. And then if you take that a step further you can do the two variable analysis on each set and get the following:
  • Correlation of each set = 0.816
  • Line of best fit for each set y = 3 + 0.5x
So with all that analysis done, you might get the impression that these are pretty much just different views of the same set of data. But when you graph them you get something entirely different:
So you really see that they are very different sets of data. The lesson here is that your data cannot be fully described with only numerical or only graphical analysis; both are necessary.

Classroom Connections

So how do you use this in class? This set is really best used for students who have had both single variable and two variable analysis. It really is a great set for tying together many of the concepts of data analysis.

One thing that you can do is use this Desmos Activity Builder that walks students through the analysis. Keep in mind that students should be familiar with calculating mean and variance via a spreadsheet. They should also be familiar with using Desmos in terms of graphing functions and doing linear regression.

To analyze the data, you (or your students) can use this Google Spreadsheet or CODAP file.

Do you have ideas of any leading questions you would ask students? Do you have ways that you could use this dataset with students? Leave your ideas in the comment section.

Resources

Info: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
The Data: Google Sheet CODAP
The Activity: https://teacher.desmos.com/activitybuilder/custom/56364b6f58d09115172b6a3c