Sabermetrics 101: Correlation

This concept is critical for understanding how we determine which statistics are appropriate to use in which situation. This piece is not going to be statistically rigorous, because that's not really the point. Please don't get too agitated, real statisticians!

Prerequisites for understanding: Regression, value.

Prerequisites for derivation: NA; conceptual.

Correlation equals?

Well, it might not equal causation, but correlation is a key concept in sabermetrics. In essence, the correlation between two (or more) variables reflects the relationship between them. Without going too deep into the raw statistics, there are many ways to measure correlation, but we typically use the linear correlation coefficient 'r'. This value ranges between 1 and -1, with a value of 1 meaning that as one variable rises, the other rises with it perfectly (the magnitudes of the relative rises are less important). A value of -1 means that as one rises, the other falls just as perfectly. As might be expected, in between is in between, with zero implying no linear relationship at all. You'll see r² used on occasion, but all you really need to know to get a grasp on baseball analysis is plain old r.
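For the curious, here is a minimal sketch of how r can be worked out by hand. The function name pearson_r and the sample numbers are made up for illustration; this is the standard textbook formula, not any particular analyst's tool.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every value from its own average
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("r is undefined when either variable never changes")
    return covariation / spread

# Hypothetical numbers, purely for illustration.
a = [1, 2, 3, 4, 5]
print(pearson_r(a, [2, 4, 6, 8, 10]))            # rises in lockstep: r is 1, up to rounding
print(pearson_r(a, [100, 200, 300, 400, 500]))   # much bigger rises, same r
print(pearson_r(a, [9, 7, 5, 3, 1]))             # falls as a rises: r is -1, up to rounding
```

The second call is the point about magnitudes: the size of the rises changes, but as long as the two variables move together perfectly, r stays put.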

Note that I've been pretty careful with wording for my definition of the correlation coefficient above. You cannot say that a rise in one factor leads to a rise (or a fall) in the other through correlation alone. There is no causation implied in a correlation analysis, which means we must apply logic whenever we can. Does it make sense for stolen bases to correlate well with triples? Yes. Do stolen bases cause triples? Clearly not - they're both driven by speed. So with that in mind, I hope that the reader will forgive me when I use the language sloppily and state that r indicates how much of a rise in B can be explained by a rise in A.
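The stolen bases and triples example can be sketched with a toy simulation. Everything here is invented: the 'speed' rating, the multipliers and the noise levels are assumptions chosen to make the point, and the totals are not meant to look like real stat lines. Each simulated player gets a speed rating, and speed alone feeds both stats.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient r between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical speed rating between 0 and 1 for 2000 made-up players.
speed = [rng.random() for _ in range(2000)]
# Both stats depend on speed plus noise. Neither depends on the other.
stolen_bases = [40 * s + rng.gauss(0, 5) for s in speed]
triples = [10 * s + rng.gauss(0, 1.5) for s in speed]

r_all = pearson_r(stolen_bases, triples)

# Now look only at players of nearly identical speed.
band = [i for i, s in enumerate(speed) if 0.45 <= s <= 0.55]
r_band = pearson_r([stolen_bases[i] for i in band], [triples[i] for i in band])

print(f"all players:          r = {r_all:.2f}")
print(f"similar-speed players: r = {r_band:.2f}")
```

Across all players the two stats correlate strongly, yet among players of similar speed the relationship mostly disappears. That is what you would expect if speed is doing the work and stolen bases are not causing anything.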

This is useful to us in two ways.

The first is in examining how valuable a statistic is - how well does it explain winning? Obviously, if home runs had no bearing on winning (which, of course, they do), we shouldn't focus our efforts in that direction. A strong relationship between offensive metrics and run scoring is necessary in order to have confidence that we're assessing batters correctly.

The second is to do with predicting performance. Instead of correlating home runs in a season with runs scored in a season, what happens when we run the numbers for home runs by any batter in one season against home runs hit by that same batter in the next? A high number implies that hitting home runs has more to do with skill than with luck, and a low number means the opposite. This year-to-year correlation has an impact on how much we regress our numbers for predictive purposes, because we then have a good idea of how sustainable any set of outcomes is between seasons.
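Here is a rough sketch of the year-to-year idea using simulated players rather than real ones. The assumption is that each season's total is a fixed 'talent' plus a fresh dose of luck; the function names, the talent and luck spreads and the league average of 20 are all invented. The regress_to_mean helper shows one simple way r could feed into a projection, and it is a simplification rather than a full projection system.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient r between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

def year_to_year_r(talent_sd, luck_sd, players=1000, seed=0):
    """Simulate two seasons per player and correlate them.

    talent_sd: how much true ability varies between players (assumed)
    luck_sd:   how much random noise each season adds (assumed)
    """
    rng = random.Random(seed)
    talent = [rng.gauss(20, talent_sd) for _ in range(players)]
    year1 = [t + rng.gauss(0, luck_sd) for t in talent]
    year2 = [t + rng.gauss(0, luck_sd) for t in talent]
    return pearson_r(year1, year2)

def regress_to_mean(observed, league_mean, r):
    """Pull an observed total toward the league mean; lower r pulls harder."""
    return league_mean + r * (observed - league_mean)

skill_stat = year_to_year_r(talent_sd=10, luck_sd=3)   # mostly skill
luck_stat = year_to_year_r(talent_sd=3, luck_sd=10)    # mostly luck
print(f"skill-driven stat: r = {skill_stat:.2f}")
print(f"luck-driven stat:  r = {luck_stat:.2f}")

# Same 40-unit season, very different expectations for next year.
print(regress_to_mean(40, league_mean=20, r=skill_stat))
print(regress_to_mean(40, league_mean=20, r=luck_stat))
```

The skill-driven stat repeats from year to year and barely gets regressed, while the luck-driven one gets dragged most of the way back to average.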

That's really the crash course on what an analyst means when he's talking about correlation. It's a slightly different language to what the real statisticians speak, but for our purposes it suits very well.

Regression Analysis

It's a little odd that 'regression analysis' fits much better here than in the regression post itself, but that's to do with the name being more than a little misleading. Regression analysis takes a series of inputs (with lots of variables) and fits them into an equation to match a given output. For example, you could feed singles, doubles, triples, home runs, and outs into seasonal runs scored and regression analysis would tell you how well each contributes to our goal of scoring more runs. Nice, right?

No, not really. It's actually a very attractive trap for the unwary. If you ran that study, you would find that doubles are worth more than triples, because low scoring teams tend to be of the speedy, triple-hitting variety, and high scoring teams typically hit a lot of doubles and home runs. This result is entirely nonsensical in real life, no matter how well the regression model fits - triples are clearly worth more than doubles, and anything telling you otherwise is wrong. This is another important reminder that you must always have logic firmly on your side before relying on the numbers to tell you the truth.
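The trap can be reproduced with a toy league. This is a sketch under loud assumptions: the 'true' run values (0.8 for a double, 1.1 for a triple), the hidden power and speed traits, and every other number are invented, and the regression is a bare-bones least-squares fit written out by hand. In this made-up world a triple really is worth more than a double, but powerful teams score extra runs in ways the model never sees, and speedy teams score fewer.

```python
import random

def ols(rows, y):
    """Least-squares fit of y on the columns of rows.

    Returns [intercept, coef_1, coef_2, ...], found by solving the
    normal equations with Gauss-Jordan elimination.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix for (X^T X) b = X^T y
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: use the row with the largest entry in this column
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Invented "true" run values: a triple is worth more than a double.
TRUE_DOUBLE, TRUE_TRIPLE = 0.8, 1.1

rng = random.Random(7)
rows, runs = [], []
for _ in range(300):  # 300 hypothetical team-seasons
    power = rng.random()   # hidden trait, never shown to the model
    speed = rng.random()   # hidden trait, never shown to the model
    doubles = 250 + 100 * power + rng.gauss(0, 10)
    triples = 20 + 30 * speed + rng.gauss(0, 4)
    # Runs from everything the model leaves out: power helps, speed teams lag
    other_offense = 150 * power - 60 * speed + rng.gauss(0, 10)
    rows.append((doubles, triples))
    runs.append(TRUE_DOUBLE * doubles + TRUE_TRIPLE * triples + other_offense)

intercept, coef_doubles, coef_triples = ols(rows, runs)
print(f"true values:   double = {TRUE_DOUBLE}, triple = {TRUE_TRIPLE}")
print(f"fitted values: double = {coef_doubles:.2f}, triple = {coef_triples:.2f}")
```

The fit credits doubles with the runs that really came from power and blames triples for the shortcomings of speedy teams, so the regression ranks doubles above triples even though the simulation was built the other way round.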

What Follows

Batting/pitching/fielding statistics; projection.