Ketchup, Correlation and Outliers

The classic saying "correlation does not imply causation" is still an incredibly important thing to keep in mind when doing data analysis. Spurious regressions will sneak up on you, and before you know it you are trying to predict the value of the Mexican peso from the amount of rainfall in London.
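To see how easily a spurious relationship can show up, here is a minimal Python sketch. It assumes two series that each drift over time (simulated here as independent random walks, a stand-in for things like exchange rates or cumulative rainfall) and counts how often a pair of them looks strongly correlated purely by chance. The function names, the 0.5 threshold, and the trial counts are illustrative choices, not anything from a real data set.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def random_walk(rng, n):
    """A series whose level drifts: each value is the previous one plus noise."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def white_noise(rng, n):
    """A series with no memory: each value is independent noise."""
    return [rng.gauss(0, 1) for _ in range(n)]


def share_strongly_correlated(make_series, trials=500, n=200,
                              threshold=0.5, seed=0):
    """Fraction of independent series pairs with |r| above the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = make_series(rng, n)
        b = make_series(rng, n)  # generated independently of a
        if abs(pearson_r(a, b)) > threshold:
            hits += 1
    return hits / trials


walk_share = share_strongly_correlated(random_walk)
noise_share = share_strongly_correlated(white_noise)
print(f"independent random walks with |r| > 0.5: {walk_share:.0%}")
print(f"independent white noise with |r| > 0.5:  {noise_share:.0%}")
```

Every pair in this sketch is unrelated by construction, so any strong correlation it reports is spurious. Trending series tend to produce many such pairs, while memoryless noise of the same length produces almost none.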

Before you claim that a relationship is causal, ask yourself: does this relationship make sense? That simple question is not asked often enough, so don't make that mistake.

Another common pitfall is discarding outliers so that a model fits the data better. Removing data points is very dangerous, and if you do it, you need to be completely transparent about it when presenting your analysis.

Hot Dog Example:

When I picked up a bottle of ketchup and received an earful from a Chicago local last week, I formed a hypothesis about what the relationship between hot dog sales and ketchup packets might look like.


Question: "Does this relationship make sense?"

Answer: Yes. As restaurants sell more hot dogs, the number of condiments they hand out goes up.

Question: "Should I remove the outlier in order to fit a model with a better R-Squared value?"

Answer: If you are in data exploration mode, the outlier is worth keeping around. The pro-mustard / anti-ketchup City of Chicago certainly appears to be an outlier. If you are building a model and this is the only city out of thousands that stands out, you may choose to build a model without it, or simply use a model that isn't sensitive to outliers.
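The options in that answer can be compared side by side. The sketch below uses made-up numbers (eight hypothetical cities handing out roughly two ketchup packets per hot dog, plus a hypothetical "Chicago" point that sells many hot dogs and almost no ketchup). It fits an ordinary least squares line with and without the outlier, then fits a Theil-Sen slope, which is one example of an estimator that is less sensitive to outliers. All data values and function names here are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean, median


def ols(xs, ys):
    """Ordinary least squares line; returns (slope, intercept, r_squared)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points."""
    points = list(zip(xs, ys))
    return median(
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    )


# Made-up data: hot dogs sold and ketchup packets handed out per city.
hot_dogs = [100, 200, 300, 400, 500, 600, 700, 800]
ketchup = [210, 390, 610, 790, 1010, 1190, 1410, 1590]
chicago = (900, 50)  # hypothetical outlier: many hot dogs, almost no ketchup

all_x = hot_dogs + [chicago[0]]
all_y = ketchup + [chicago[1]]

slope_all, _, r2_all = ols(all_x, all_y)
slope_trim, _, r2_trim = ols(hot_dogs, ketchup)
slope_robust = theil_sen_slope(all_x, all_y)

print(f"OLS with outlier:       slope={slope_all:.2f}, R^2={r2_all:.2f}")
print(f"OLS without outlier:    slope={slope_trim:.2f}, R^2={r2_trim:.2f}")
print(f"Theil-Sen with outlier: slope={slope_robust:.2f}")
```

Dropping the point and using the robust estimator both recover a slope close to two packets per hot dog, but only the robust fit does so with every city still in the data set. Whichever route you take, the choice should be reported alongside the results.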

Obviously, this isn't real data. It does, however, illustrate the need to think about how one thing would have to affect another for the relationship to be considered causal, and whether an outlier is worth keeping around.