The Analysis
Probably my favourite is the Smoking and Cancer story. This is a great data set for talking about correlation. The data gives the average number of cigarettes smoked in each US state along with the rates of bladder cancer, lung cancer, kidney cancer and leukaemia for each state. So at the very least you can have students create graphs of each of the afflictions vs the number of cigarettes smoked. When you do, you get the following graphs:
The thing I like the most about this is that when you do, you see that bladder cancer has the strongest correlation, which is not intuitive. But in the above graphs you will notice that the scales are all different. The graphs below show the same data but all with the same scale. Here you see that even though bladder cancer may have as strong a correlation with smoking as lung cancer, there really isn't much of a relationship (i.e. no matter how many cigarettes are smoked, the rate of bladder cancer barely changes). And since the other two have low or no correlation, you can see that smoking has the largest connection to lung cancer.
So it's a good lesson about correlation and why it is important to scale the axes similarly when comparing data.
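If you want to show students the same idea with numbers rather than graphs, here is a minimal sketch in Python. The values are made up for illustration and are not the real state data; the names `steep_noisy` and `shallow_tight` are my own labels for "changes a lot but is scattered" and "barely changes but is tightly linear".

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ls_slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up numbers for illustration only, NOT the real state data.
cigs = [15, 20, 25, 30, 35, 40]
steep_noisy = [12, 20, 17, 26, 22, 30]          # rate changes a lot, with scatter
shallow_tight = [3.0, 3.2, 3.5, 3.6, 3.9, 4.1]  # rate barely changes, little scatter

r_steep, slope_steep = pearson_r(cigs, steep_noisy), ls_slope(cigs, steep_noisy)
r_shallow, slope_shallow = pearson_r(cigs, shallow_tight), ls_slope(cigs, shallow_tight)

print(f"steep/noisy:   r = {r_steep:.2f}, slope = {slope_steep:.3f}")
print(f"shallow/tight: r = {r_shallow:.2f}, slope = {slope_shallow:.3f}")
```

With these made-up values the shallow series has the higher correlation but a much smaller slope, which is the same trap as the differently scaled graphs. If you plot with matplotlib, something like `plt.subplots(1, 2, sharey=True)` should put both panels on one vertical scale so the difference is visible.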
- Which pairs of data appear to have a connection to each other?
- What does each of the numbers represent in each equation?
- Which of the scatter plots indicate that there is a relationship between the data?
- Use your least squares equations to predict what the death rate would be for each relationship if the Cig value was 10 or 50. How confident can you be of each prediction?
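For the prediction question, a short sketch like the one below may help students check their work. It fits a least squares line and flags whether each Cig value falls outside the range of the data, which is one reason to be less confident in a prediction. The data here is made up (not the real state values), and `fit_line` and `is_extrapolation` are hypothetical helper names of my own.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def is_extrapolation(x, xs):
    """True if x lies outside the range of the observed xs."""
    return x < min(xs) or x > max(xs)


# Made-up numbers for illustration only, NOT the real state data.
cigs = [15, 20, 25, 30, 35, 40]
death_rate = [12, 20, 17, 26, 22, 30]

intercept, slope = fit_line(cigs, death_rate)

predictions = {}
for cig in (10, 50):
    predictions[cig] = intercept + slope * cig
    note = "outside the data range" if is_extrapolation(cig, cigs) else "inside the data range"
    print(f"Cig = {cig}: predicted rate = {predictions[cig]:.1f} ({note})")
```

In this made-up example both 10 and 50 fall outside the observed Cig values, so both predictions are extrapolations. Students can swap in the real columns and see whether the same is true there.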
Download the Data
Fathom (Data) (Solution)
Let me know if you used this data set or if you have suggestions of what to do with it beyond this.