Tuesday, March 26, 2019

Mining the Meta Data in your iTunes Library

If you (or your students) use iTunes to keep track of your music, then it turns out you have a rich source of data that might be interesting for your students to analyze. I find that students are more interested in analysis when they are using their own data. In this case, every song on iTunes (and really, on any platform) has a pile of metadata associated with it. That metadata includes things like song name, artist name and album name, but also numerical values like song length, file size, number of plays, etc. So you could have your students get the data from their own library and analyze it.

Getting the data from iTunes is pretty easy. Once in iTunes, students who want the info for all their music can just click on Songs, and those who want the data from a favourite playlist can click on that playlist instead. Then click on File, then Library, then Export Playlist. iTunes will then save a .TXT file to the folder of your choice. That .TXT file will need a bit of cleaning up, but not much. I suggest importing it into Excel or Google Sheets to clean it up. If you are doing the work in that spreadsheet (or uploading to Desmos) then you're all set. If you plan on importing it into CODAP then save the data as a .CSV file (even though you should be able to import a .TXT file into CODAP, the format of this one doesn't seem to work, so you have to convert it to a .CSV).
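If you would rather script the conversion than go through a spreadsheet, here is a minimal Python sketch. It assumes the exported file is tab-delimited text, so check your own file first. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def tab_text_to_csv(text):
    """Convert tab-delimited text (the assumed export format) into CSV text."""
    # splitlines() copes with \n, \r\n and bare \r line endings alike.
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)  # fields containing commas are quoted automatically
    return out.getvalue()


# Tiny made-up sample, not real library data.
sample = "Name\tArtist\tTime\nSong A\tBand, The\t215\rSong B\tSolo\t187\n"
print(tab_text_to_csv(sample))
```

To use it on a real export, read the file into a string first and write the result to a new .CSV file. If the text comes out garbled, the encoding passed to open() is the likely culprit; trying "utf-8" and then "utf-16" is a reasonable guess, not a guarantee.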

Analysis

Though the data itself is not wildly interesting, you can certainly use it to cover topics like mean, median, standard deviation, and other single variable measures. You could also have students compare values from their playlists to those of other students. Note that the song times are in seconds, so if a histogram is created, it is probably appropriate to have bin widths of 30s or 60s (let students figure this out).
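For anyone who wants to check the numbers outside a spreadsheet, here is a rough Python sketch of the single variable measures and the histogram binning. The song lengths are hypothetical and the function names are my own. The standard deviation here is the sample version, which may differ slightly from what your software reports.

```python
import statistics
from collections import Counter


def summarize(values):
    """Return the mean, median and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


def bin_counts(values, width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical song lengths in seconds, for illustration only.
times = [180, 200, 215, 240, 245, 300, 420]
print(summarize(times))
print(bin_counts(times, 60))  # 60 s bins
print(bin_counts(times, 30))  # 30 s bins
```

Comparing the 30s and 60s counts side by side is a quick way to show students how the bin width changes the shape of the histogram.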

One thing that I think is interesting is that you would expect a very strong (if not perfect) relationship between the time of a song and its file size. But as you can see, there seem to be several different relationships. This is due to the bit rate of the file compression. So you might be able to have a conversation about what bit rate is and how it relates to the compression of the file. The lower the bit rate, the smaller the file size (for songs of the same length). So you could talk about why you would want a lower or higher bit rate (hint: a lower bit rate means poorer sound quality but a smaller file size, so there is a trade-off). In CODAP you can create separate graphs of the bit rate data and the scatter plot of size vs time, then highlight parts of the data to show the different relationships. You could also hide or show data based on the bit rate to do more specific analysis by isolating just the data from one bit rate.
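The same idea can be sketched in Python by grouping songs by bit rate and comparing the observed bytes per second with the theoretical value. This assumes the bit rate is in kbps (1000 bits per second) and the size is in bytes; the rows are made up and the function names are hypothetical. Real files will likely come out a little larger than the theoretical value because of tags and artwork.

```python
from collections import defaultdict


def theoretical_bytes_per_second(kbps):
    """kbps * 1000 bits per second, divided by 8 bits per byte."""
    return kbps * 1000 / 8


def bytes_per_second_by_bit_rate(songs):
    """Group (seconds, size_bytes, kbps) rows by bit rate and return
    total bytes / total seconds for each group."""
    totals = defaultdict(lambda: [0, 0])
    for seconds, size_bytes, kbps in songs:
        totals[kbps][0] += size_bytes
        totals[kbps][1] += seconds
    return {kbps: size / secs for kbps, (size, secs) in sorted(totals.items())}


# Made-up rows (seconds, bytes, kbps), not taken from a real library.
songs = [(200, 3_200_000, 128), (300, 4_800_000, 128), (200, 6_400_000, 256)]
for kbps, observed in bytes_per_second_by_bit_rate(songs).items():
    print(kbps, observed, theoretical_bytes_per_second(kbps))
```

Each group's bytes per second is the slope of its own line on the size vs time scatter plot, which is why the plot splits into separate relationships.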

Sample Questions

  • Choose three numerical attributes from your data and determine the mean, median and SD of each. Graph each attribute using an appropriate representation.
  • Which genre of music has the highest average song length?
  • Which song was played the most?
  • Which decade has the most songs?
  • Which song was skipped the most?
  • Determine the relationship between the size of a file and how long the song is for different bit rates. 
  • You have only 50 MB of space left on your device. How many minutes of music could you store using all of the remaining space? (Note that answers will vary based on the bit rate.)
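For checking answers to the last question, here is a small Python sketch. It assumes 1 MB means 1,000,000 bytes and that the whole file is audio at a constant bit rate, so treat the results as estimates. The function name is my own.

```python
def minutes_that_fit(storage_mb, kbps):
    """Minutes of audio that fit in storage_mb megabytes at a given bit rate.

    Assumes 1 MB = 1,000,000 bytes and ignores tags and artwork.
    """
    bytes_per_second = kbps * 1000 / 8
    seconds = storage_mb * 1_000_000 / bytes_per_second
    return seconds / 60


for kbps in (128, 256):
    print(kbps, "kbps:", round(minutes_that_fit(50, kbps), 1), "minutes")
```

Doubling the bit rate halves the minutes that fit, which ties back to the quality vs size trade-off.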

Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Friday, March 22, 2019

Hip Hop Vocabulary

This post originally came out in 2014 (before this blog was created) and so I hadn't thought about it for a while. Then I saw a post by Dane Ehlert on his When Math Happens blog and was not only reminded of it but noticed that the original post had been updated in look and with new data. Basically they take a pile of hip hop artists and count how many unique words they use in their first 35000 lyrics.

Analysis

When you go to the site, the visualization (above) is interactive in that you can search for artists and interact with the visualization. This is neat but on this blog we typically want to do some mathematical analysis. They have other representations like this one that looks like a histogram but for our purposes, we would like some numbers.

 
So if you look way down on the post, they do have a Google Sheet with the number of unique words for each of the over 160 artists. It's not a particularly robust data set but we can do some simple analysis, like histograms, averages, box plots and other single variable analysis. I don't think there is anything particularly mathematically interesting in the data, but it might be interesting for students, so it could be used to practice some standard single variable analysis techniques (central tendency, standard deviation, distributions, dot plots, box plots, histograms, etc.).
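If students want to check for outliers numerically, one common choice is the 1.5 × IQR rule, sketched below in Python. The names and counts are placeholders, not the figures from the original post. Quartile conventions also differ between tools, so the fences may not match your software exactly.

```python
import statistics


def iqr_outliers(values_by_name):
    """Return names whose value lies more than 1.5 * IQR outside the quartiles."""
    values = sorted(values_by_name.values())
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sorted(
        name for name, v in values_by_name.items() if v < low or v > high
    )


# Placeholder names and counts, NOT the figures from the original post.
unique_words = {
    "Artist A": 3900, "Artist B": 4100, "Artist C": 4300, "Artist D": 4500,
    "Artist E": 4700, "Artist F": 4900, "Artist G": 7800,
}
print(iqr_outliers(unique_words))
```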

Sample Questions

  • Who are the outliers in this data set?
  • Which decade has the most verbose rappers?
  • How does your favourite rapper compare to the most/least verbose rapper?
  • Take a look at some of the questions Dane was asking in his post for some more open questions.
  • What does the data in the original post say about the number of words used in different types of music?

Downloads 


Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Sunday, February 24, 2019

Skipping World Record

A few months back I saw a 3Act Task called Rope Jumper that @gfletchy created out of this video:
He shows the first few seconds of the video and you have to guess how many skips are done in 30s. It's a good 3Act task. But that's not what we're doing here. Here I've actually collected the time data from each skip to do a bit of analysis (I had to slow the video down to 50% speed in order to get every skip).

Analysis

As you would guess, it's pretty linear, but you might notice, as you watch the video, that it seems like she might be slowing down at times. It's not super exciting in terms of the actual data, but it could be used simply to help students determine the least squares line.
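For reference, here is a bare-bones Python sketch of the least squares line, using the usual formulas for slope and intercept. The times and skip counts are illustrative values only, not the data collected from the video.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative values only (time in seconds, cumulative skips), not the video data.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
skips = [0, 7, 14, 20, 27]
slope, intercept = least_squares_line(times, skips)
print(round(slope, 2), round(intercept, 2))
```

With time on the horizontal axis, the slope is the skipping rate in skips per second. Fitting separate lines to different sections of the data is one way to compare her fastest and slowest stretches.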

Sample Questions

  • When was she skipping the fastest/slowest and what was the rate?
  • How many skips do you think she would make in 1 minute?
  • If she was to keep the pace that she had in the first few seconds, how many skips would she have made in 30s?
  • If she had skipped at the same rate as she did in her slowest section, would she still have broken the record?

Downloads

Let me know if you used this data set or if you have suggestions of what to do with it beyond this.

Thursday, February 7, 2019

New Desmos Statistics Package

So for years you have been able to do two variable statistics really well in Desmos. Finding the correlation and lines and curves of best fit is pretty easy and works really well. But this week Desmos released a long awaited update that includes a whole suite of new single variable statistical tools, including visualizations like dot plots, box plots and histograms. And of course the great thing about all of this stuff is that all of these visualizations can be made dynamic with a few Desmos slider tricks. For a really nice summary of some of the new features, check out the video from @bobloch below.

But I wanted to point out a couple features that I really like. First of all the new Zoom Fit feature makes it easy to take any set of data and adjust the axes so that all the data can be seen. Basically all you do is create your graph and then click the icon that looks like the little magnifying glass with the plus in it. This icon will show up for any of the visualizations including the distributions. 
Another thing that I like is the control that you get with the various graphs. When you enter any of the functions you are told what its arguments are (for histograms, the data and the bin width), and there are also settings outside the function. For example, for a box plot you can change the vertical position (Offset) of the box and its vertical size (Height). Any of those values can be made dynamic by using sliders or the results of computations.

Like all Desmos graphs, you can save your work, and this is probably the best way to get large data sets to students. And if you want to name your sets, you can get a bit more creative by using subscripts. To get to a subscript, start with a variable and then add a "1" and the subscript will appear. Then you can delete the 1 and add whatever you want in its place. Try it out with these data sets from previous posts: NFL Salaries or Concert Tours

That's a quick intro of the new features. Don't forget to check out the Desmos help files on visualizations, distributions and statistics for more info. Going forward, I will be including Desmos versions of the data sets I post so that you'll have your choice of software to use. Have fun.


Friday, January 4, 2019

Highest Grossing Concert Tours

Concerts are a multi-billion dollar industry now. So why not use some concert data to do some statistical analysis? This data comes from the Wikipedia page on the same subject. On the page, the data is broken up into the top 20 all-time highest grossing concert tours (ordered by gross, not adjusted for inflation). Then it has the top grossing tours for each decade from the 80s until the present. There is data on the decade rank, gross and inflation adjusted gross, the number of shows, attendance and other attributes.

Analysis

You can start with some categorical analysis by just looking at who made the list in each decade. This data runs for four decades, so kids might not be into who was big in the 80s, but if you highlight the biggest acts of the last decade you can still see that more than half of them were artists that were around in the 80s (with U2 being #1), and that U2, Guns N' Roses and The Rolling Stones (twice) were in the top 5 of all time (inflation adjusted).

For more numerical analysis you could pick any of the data sets to do some single variable analysis, whether it be central tendency, distributions, or histograms. There are many choices.

When you create some box plots you will find that some of the data sets have outliers. In particular, I think it's interesting that the outliers when dealing with the money are different from the outliers when dealing with the number of shows. This might lead you to explore things like the Average Gross and compare it to the money and number of shows.

This might lead you to do some two variable analysis. Though there aren't many strong relationships, you could use this to talk about relationships with poor correlations. There is one strong relationship: the one between the Gross and the Inflation adjusted gross. This would be expected, as one relates directly to the other. One thing that I like about this, however, is that it's not a perfect relationship. That is, whoever adjusted for inflation did so using different rates for each year (to make it more realistic, presumably).
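To see why that relationship is strong but not perfect, you can compute the correlation coefficient along with the ratio of adjusted gross to gross for each tour. The Python sketch below uses made-up figures, not the Wikipedia numbers, and the function names are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def inflation_factors(gross, adjusted):
    """Ratio of adjusted gross to gross for each tour."""
    return [a / g for g, a in zip(gross, adjusted)]


# Made-up figures in millions of dollars, not from the Wikipedia table.
gross = [100, 200, 300, 400]
adjusted = [150, 240, 330, 400]
print(round(pearson_r(gross, adjusted), 3))
print(inflation_factors(gross, adjusted))
```

If every tour had been adjusted by the same factor, the ratios would all match and the correlation would be exactly 1. Because the factor depends on the year of the tour, the ratios vary and the correlation lands just below 1.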

Sample Questions


  • Which artist made the most (overall or per concert)?
  • Which decade made the most money (adjusted for inflation)?
  • Which artists are outliers the most often?
  • Calculate the mean and median for each of the numeric attributes. What do these values suggest about the distributions?

Downloads



Let me know if you used this data set or if you have suggestions of what to do with it beyond this.