Sabermetrics 101: Correlation
This concept is critical for understanding how we determine which statistics are appropriate to use in which situation. This piece is not going to be statistically rigorous, because that's not really the point. Please don't get too agitated, real statisticians!
Prerequisites for understanding: Regression, value.
Prerequisites for derivation: NA; conceptual.
Correlation equals?
Well, it might not equal causation, but correlation is a key concept in sabermetrics. In essence, the correlation between two (or more) variables reflects the relationship between them. Without going too deep into the raw statistics, there are many ways to measure correlation, but we typically use the linear correlation coefficient 'r'. This value ranges between 1 and -1, with a value of 1 meaning that as one variable rises, the other rises with it perfectly (the magnitudes of the relative rises are less important). A value of -1 means that as one rises, the other falls just as perfectly. As might be expected, in between is in between, with zero implying no linear relationship at all. You'll see r² used on occasion, but all you really need to know to get a grasp on baseball analysis is plain old r.
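For anyone who wants to see the arithmetic, here is a minimal sketch of how r can be computed. The function name `pearson_r` and the stolen base and triples totals are made up for illustration; they are not real player data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, plus the two sums of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical season totals for six players (illustrative numbers only).
stolen_bases = [2, 5, 11, 18, 27, 40]
triples = [0, 2, 1, 4, 5, 9]

r = pearson_r(stolen_bases, triples)
print(round(r, 2))  # strongly positive: the two tend to rise together
```

A strongly positive r here says only that the two columns move together. It says nothing about which one, if either, drives the other.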
Note that I've been pretty careful with wording for my definition of the correlation coefficient above. You cannot say that a rise in one factor leads to a rise (or a fall) in the other through correlation alone. There is no causation implied in a correlation analysis, which means we must apply logic whenever we can. Does it make sense for stolen bases to correlate well with triples? Yes. Do stolen bases cause triples? Clearly not - they're both driven by speed. So with that in mind, I hope that the reader will forgive me when I use the language sloppily and state that r implies how much of a rise in B can be explained by a rise in A.
This is useful to us in two ways.
The first is in examining how valuable a statistic is - how well does it explain winning? Obviously, if home runs had no bearing on winning (which, of course, they do), we shouldn't focus our efforts in that direction. A strong relationship between offensive metrics and run scoring is necessary in order to have confidence that we're assessing batters correctly.
The second is to do with predicting performance. Instead of correlating home runs in a season with runs scored in a season, what happens when we run the numbers for home runs by any batter in one season with home runs hit by that same batter in the next? A high number implies that hitting home runs has less to do with luck than pure skill, and a low number means the opposite. This year-to-year correlation has an impact on how much we regress our numbers for predictive purposes, because we then have a good idea of how sustainable any set of outcomes is between seasons.
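To make the link between year-to-year correlation and regression concrete, here is a rough sketch. It assumes, as a simplification, that the year-to-year r can be used directly as the weight on a player's own number. The function name `regress_to_mean` and all inputs are hypothetical.

```python
def regress_to_mean(observed, league_mean, year_to_year_r):
    """Shrink an observed rate toward the league mean.

    Simplifying assumption: the year-to-year correlation is used directly
    as the weight on the player's own number.
    """
    return league_mean + year_to_year_r * (observed - league_mean)

# Hypothetical inputs: a hitter with a 6% home run rate in a 3% league.
sticky = regress_to_mean(0.06, 0.03, 0.75)  # a stat that repeats well
noisy = regress_to_mean(0.06, 0.03, 0.25)   # a stat that repeats poorly
print(round(sticky, 4), round(noisy, 4))    # 0.0525 0.0375
```

The higher the year-to-year correlation, the more of the player's own number survives into the projection.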
That's really the crash course on what an analyst means when he's talking about correlation. It's a slightly different language to what the real statisticians speak, but for our purposes it suits very well.
Regression Analysis
It's a little odd that 'regression analysis' fits much better here than in the regression post itself, but that's to do with the name being more than a little misleading. Regression analysis takes a series of inputs (with lots of variables) and fits them into an equation to match a given output. For example, you could feed singles, doubles, triples, home runs, and outs into seasonal runs scored and regression analysis would tell you how well each contributes to our goal of scoring more runs. Nice, right?
No, not really. It's actually a very attractive trap for the unwary. If you ran that study, you would find that doubles are worth more than triples, because low scoring teams tend to be of the speedy, triple hitting variety, and high scoring teams typically hit a lot of doubles and home runs. This result is entirely nonsensical in real life, no matter how well the regression model fits - triples are clearly worth more than doubles, and anything telling you otherwise is wrong. This is another important reminder that you must always have logic firmly in your court before relying on the numbers to tell you the truth.
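A toy simulation shows how this kind of trap arises. It is not the study described above: it fits runs against triples alone in an invented league where team style (speed versus power) drives both triples and home runs. Every number, including the assumed "true" run values, is made up for illustration.

```python
import random

random.seed(0)

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs alone."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

TRIPLE_VALUE, HOMER_VALUE = 1.05, 1.40  # assumed "true" run values

triples, runs = [], []
for _ in range(300):                              # 300 made-up team seasons
    power = random.random()                       # 0 = speedy team, 1 = slugging team
    t = 40 - 20 * power + random.gauss(0, 3)      # speedy teams hit more triples
    hr = 100 + 100 * power + random.gauss(0, 10)  # slugging teams hit more homers
    triples.append(t)
    runs.append(500 + TRIPLE_VALUE * t + HOMER_VALUE * hr)

fitted_triple_value = slope(triples, runs)
print(round(fitted_triple_value, 2))  # negative, though each triple adds 1.05 runs by construction
```

The fit is doing exactly what it was asked to do. The nonsense comes from what the triples column is standing in for, which is why the logic check matters.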
What Follows
Batting/pitching/fielding statistics; projection.
Comments
This might be the most interesting thing I've seen in any of the posts:
This result is entirely nonsensical in real life, no matter how well the regression model fits – triples are clearly worth more than doubles, and anything telling you otherwise is wrong.
It’s interesting to me how without context, many numbers can end up giving completely different findings. It also shows that the strength of sabermetrics isn’t in the numbers themselves, but in the people behind them. One can learn a lot by listening to intelligent people.
I'd sleep at the Internet, but I've found servers don't make for good pillows.
It depends on your sample
Say you looked at week-by-week wOBA. You’d see a pretty low r value – guessing on the range of 0.1 or thereabouts. The sample is small enough to allow huge fluctuations, and if you ran year by year values the numbers would get much higher. Yet wOBA is wOBA and the skill is the same. So you have to take into account what you’re looking at.
by Graham MacAree on Feb 17, 2010 7:20 PM PST up reply actions
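A small simulation can illustrate the point about sample size. It models a plain success rate rather than wOBA itself, and the talent range, player count, and sample sizes are all assumptions chosen for illustration.

```python
import random
from math import sqrt

random.seed(1)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def repeat_r(trials_per_sample, players=300):
    """Correlate two independent samples of the same players' success rates."""
    first, second = [], []
    for _ in range(players):
        talent = random.uniform(0.25, 0.40)  # assumed fixed true success rate
        for sample in (first, second):
            hits = sum(random.random() < talent for _ in range(trials_per_sample))
            sample.append(hits / trials_per_sample)
    return pearson_r(first, second)

small = repeat_r(25)    # roughly a week of plate appearances
large = repeat_r(600)   # roughly a full season
print(round(small, 2), round(large, 2))  # the larger sample repeats far better
```

The underlying talent is identical in both runs. Only the amount of noise sitting on top of it changes.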
I should elaborate
All in all, we can’t be right all the time. That’s impossible. You should just try to be both more right than you were before while making as much or more sense than you did before. I’m not interested in stats that have high correlations and no logic behind them, nor am I interested in the opposite.
by Graham MacAree on Feb 17, 2010 7:28 PM PST up reply actions
I was always taught that r is not very useful for answering that sort of question
and instead you should look at average absolute error (or RMS error) between what your model predicts and your data set. This gives you a number that has some physical meaning.
Another way to do it is to calculate confidence intervals on the coefficients in your equation or look at p-values.
by Edgar for Pres on Feb 17, 2010 9:00 PM PST up reply actions
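The two error measures mentioned here are simple to compute. This is a minimal sketch; the function names and the run totals are hypothetical.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average size of the misses, in the same units as the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rms_error(actual, predicted):
    """Root-mean-square error: like the above, but large misses count for more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical team runs scored versus a model's predictions.
actual_runs = [700, 650, 800, 720]
predicted_runs = [710, 640, 780, 730]

print(mean_absolute_error(actual_runs, predicted_runs))  # 12.5
print(round(rms_error(actual_runs, predicted_runs), 1))  # 13.2
```

Both results are in runs, which is the "physical meaning" the comment refers to.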
In the "doubles are worth more than triples" analogy
are you gauging the “worth” based on the Beta of the resulting linear model, or something else?
If anybody hasn't seen this, I think this will teach everyone to be reluctant to use r or r-squared to evaluate whether something is correlated

[Figure: grid of scatter plots, each labeled with its correlation coefficient]
(Stolen from Wikipedia, who took it from somewhere else)
Something can be correlated and give a bad correlation coefficient. Looking at the data vs the model is the only good way to figure out if things are being accounted for well.
by Edgar for Pres on Feb 17, 2010 8:55 PM PST reply actions 1 recs
That's not really useful for our purposes
Non-linear relationships don’t get picked up by r, but we don’t really look at a whole lot of non-linear relationships. r is fine for us.
by Graham MacAree on Feb 17, 2010 9:00 PM PST up reply actions
We might not have too many non-linear relationships but we can have some issues
such as having two different populations that have different characteristics (see bottom right plot).
Maybe you are trying to plot swinging strikes vs strikeout rate and you have guys who throw hard and others who pitch to contact. You could try to correlate these two variables, but since you have two different populations, your correlation could be low. If you separate the data so that you run a correlation on only the hard throwers, you will probably get a better correlation than you would for the entire population.
Coming up with good correlations can be valuable because it can teach us some things. However, doing it correctly takes some art and intuition.
by Edgar for Pres on Feb 17, 2010 9:09 PM PST up reply actions
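The two-population effect can be shown with a deliberately extreme toy data set. The numbers below are invented and carry no baseball units: each group follows a perfect line, but the lines sit at different heights.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for two hypothetical pitcher populations.
group_a_x = [0, 1, 2, 3, 4]
group_a_y = [0, 1, 2, 3, 4]      # population A: y = x
group_b_x = [0, 1, 2, 3, 4]
group_b_y = [8, 9, 10, 11, 12]   # population B: y = x + 8

print(round(pearson_r(group_a_x, group_a_y), 2))  # 1.0
print(round(pearson_r(group_b_x, group_b_y), 2))  # 1.0
print(round(pearson_r(group_a_x + group_b_x, group_a_y + group_b_y), 2))  # 0.33
```

Pooling the groups drops r from a perfect 1 to about a third, even though the relationship inside each group never changed.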
Sure, but the idea is to have people understand the terminology
I think that sort of depth might be hurting more than helping.
by Graham MacAree on Feb 17, 2010 9:10 PM PST up reply actions
Yeah I agree
I’ll try to keep focused on the basics so I don’t sidetrack us and get you, me and everyone else confused.
by Edgar for Pres on Feb 17, 2010 9:38 PM PST up reply actions
For the record,
I appreciated it, insightful stuff. The two different populations bit was what I thought of when I saw the plot in the bottom right as well. Thanks
by Terminator X on Feb 18, 2010 12:22 AM PST up reply actions
It's good, I misunderstood and thought there was a relationship in the vertical.
The question that arose from that actually answered something unrelated. Expecting everybody to stay on board with everything might be a bit unrealistic, but sharp comment threads are great resources. People often ask questions or address points I wouldn’t even think to ask or have enough fundamentals to recognize.
Unless I'm mistaken, the first line is the only one that reflects a linear relationship, yes?
And that’s the only line that makes sense from an “r” perspective. So for linear relationships there’s no reason to discount it.
No, the second line is all linear relationships too
by Graham MacAree on Feb 17, 2010 9:11 PM PST up reply actions
Ah, yeah you're right
But, it still makes sense that those r values are achieved.
I was similarly confused the first time I saw it
I incorrectly expected the numbers to be equal for each column. Looking at it a fourth time I noticed the difference, and now it all makes much better sense.
Rooting for lovable losers since 1984.
by seattlecougar on Feb 17, 2010 11:13 PM PST up reply actions
I'm going to take one for the team here, and air out some ignorence.
What am I looking at here? The left to right relationships I get, the vertical… what’s this and how does it relate to correlation? I get the concept, all poodles are dogs but all dogs are not poodles, but this is breaking my brain.
by Kermit. on Feb 17, 2010 9:19 PM PST up reply actions 1 recs
Ignorance, crud.
Just looking for some clarification so I can see this from your perspective, and I honestly don’t understand what I’m looking at.
So the top line is the relevant one for us
Let’s call the horizontal axis x and the vertical y
Notice how the leftmost entry is a perfect linear relationship. Higher x goes with higher y. R is 1. The second from left displays a similar relationship, but less strong. Higher x generally goes with higher y. R is 0.8. Third is a weaker relationship. Higher x still generally goes with higher y, but there’s more scatter and variability. R is 0.4. The central figure is a completely random plot with no relationship between x and y. R is 0.
by Graham MacAree on Feb 17, 2010 9:24 PM PST up reply actions
My bad, this figure might be much more confusing than I originally thought
Anyway, the point here is that boiling how good a fit is down to one number is difficult. It only works well for linear relationships. It's always good to look at the data to make sure the relationship you are modeling with a linear model is actually linear.
by Edgar for Pres on Feb 17, 2010 9:45 PM PST up reply actions
I'm currently taking elementary stats at school (only math I have to take - hooray) and it's nice how these posts are complementing what I'm learning in class.
If only it could be a baseball stats class…
Probably getting ahead of myself again here
As well as showing my ignorance, but it would be really beneficial to have a few examples for these posts of what actually means what.
If I understand right…
Swings & misses probably have a correlation to strikeouts of close to 1.
Ground balls probably have a correlation to GIDP of 0 < X < 1
How you prefer your eggs cooked probably has a correlation of roughly 0
Contact pitchers probably have a correlation to strikeouts of -1 < X < 0
Being drunk probably has a correlation to hand eye coordination of -1 (drinking games “talent” excluded from sample set)
Forgive my ignorance… I’m a bit of a numbers guy and a bit of a baseball guy, but an expert in neither… It would just help me considerably to see the two connected in black and white (to the admittedly limited extent you can be black and white with these) examples of how basic baseball behaviors relate to all these funky numbers, letters and graphs.
Rooting for lovable losers since 1984.
Yeah you are correct
The only one that isn’t really right is “How you prefer your eggs cooked” because you didn’t try to correlate that to anything but I’m just kidding.
by Edgar for Pres on Feb 17, 2010 11:14 PM PST up reply actions
Some easy examples that I had to hand:
Pitcher year to year correlations for:
K%: .76
BB%: .67
HBP%: .38
by Graham MacAree on Feb 18, 2010 7:18 AM PST up reply actions
Be very careful
I want to point out a few important issues here as a “real” statistician.
In a correlation calculation, it doesn’t matter which variable is the independent variable and which is the dependent variable, you are trying to measure how well the data conforms to a line
The value of r^2 is hugely important here. Take the correlation coefficient and square it. This result tells you how much of the variance in one variable is explained by the variance in the other. This in particular is important when you see correlation coefficients around 0.5. Many people think that a correlation of 0.5 is really high – not exactly, the line that the data clusters around only explains a quarter of the variance.
What we are measuring with the correlation coefficient is how well knowledge of one variable helps you predict the value of the other. One example is the fact that beer sales at the ball park are highly correlated with temperature. The usual way to use this information is to look at the thermometer and predict how much beer will be sold at Safeco. On the other hand the other way is useful, too. Suppose you lost your temperature data, but knew how much beer was sold at Safeco that day – you could use the regression equation (the equation that describes the line) to estimate what the temperature was that day. Would it be perfect, no, but it would be as good as using that line to predict beer sales given the templ
by New England Fan on Feb 18, 2010 3:03 AM PST reply actions 1 recs
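Both points in this comment (r squared as variance explained, and prediction working in either direction) can be sketched in a few lines. The temperatures and beer sales are invented, and `fit` is a hypothetical helper, not anything from a particular library.

```python
from math import sqrt

def fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line predicting ys from xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / sqrt(sxx * syy)

# Hypothetical game-day temperatures (F) and beers sold (thousands).
temps = [55, 60, 65, 70, 75, 80]
beers = [10, 12, 11, 15, 16, 18]

slope_b, icpt_b, r = fit(temps, beers)        # predict beer from temperature
slope_t, icpt_t, r_reversed = fit(beers, temps)  # predict temperature from beer

print(round(r, 2), round(r * r, 2))      # r, and the share of variance explained
print(round(icpt_b + slope_b * 72, 1))   # estimated beers sold on a 72-degree day
print(round(icpt_t + slope_t * 14, 1))   # estimated temperature when 14 thousand were sold
print(0.5 ** 2)                          # 0.25: an r of 0.5 explains a quarter of the variance
```

Swapping the variables changes the line but not r, and the two slopes multiply to r squared.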
Try to be more Pro-descending
...and now I'm here
I like your explanation of why r^2 is important
That kind of helps relate r to something I can understand.
by Edgar for Pres on Feb 18, 2010 3:59 AM PST up reply actions
I'm being very careful
This isn’t Statistics 101, and I know that I’m cutting a few corners. But what matters here, beyond everything else, is that people can follow the conversation. Not that they can pass a statistics class. And mostly, the people who aren’t fluent in statistics are going to find a detailed, technically perfect conversation about correlation really, really, really boring.
My intent is to avoid boredom while bringing folks up to speed, and I know that will piss off a few statisticians.
by Graham MacAree on Feb 18, 2010 7:06 AM PST up reply actions
I know you are being careful
Maybe I’m being too detailed, but I think the point about r^2 is important, and the fact that either variable can be used to predict the other is a key point that needs to be made. Too many people jump to conclusions about correlation
by New England Fan on Feb 18, 2010 8:54 AM PST up reply actions
This isn't a course in statistics.
The point is to make sabermetrics more accessible to the people who, for a variety of reasons, have found the barrage of math either daunting or confusing. I assure you that Graham understands the points you are making in these threads, but I can also assure you that he’s keeping it simple for a reason. It’s great that you have a nice understanding of statistics, but for the sake of the intended audience of these pieces, please respect the aim of this series.
It's a fairly straightforward point, and I think it's a fair one
It’s not like it’s a barrage of math, but it’s also getting at a different point. You’d use it to answer a different question.
Is it better if you call variance "uncertainty" or "difference from expectations"?
Variance isn’t really a complicated concept. It might be a little tricky to calculate but if you don’t want to understand that then you don’t have to. I’m not sure how you can understand correlation but not understand that there will be some deviation between your model and the actual data. If there is a large difference between what your model predicts and your data then you have a large uncertainty or variance in your model.
(Stat guys, variance is more complex than this and this definition would get me shunned in a stats class but I’m trying to boil down the essence.)
by Edgar for Pres on Feb 18, 2010 6:12 PM PST up reply actions
A somewhat subtle point:
“Correlation” is often used to refer to two (slightly) different concepts:
1) The numerical correlation (R value) between two variables. This measures the degree to which a straight line describes the relationship between them (with the sign of R indicating the direction). In fact, R is exactly 1 or -1 if and only if one variable is a linear function of the other (with a nonzero slope).
2) A more general dependence between two variables. For example, in the third row of the set of plots above, we would informally say that the two variables are “correlated”, since the values on the Y axis clearly depend on how far along the X axis you travel. However, since this dependence (“correlation”) is not a linear relationship, the R value is zero. This is why one of the important things taught in an introductory statistics course is that R=0 does not automatically mean that two variables are independent.
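A tiny example of the second case, using invented numbers: y is completely determined by x, yet the correlation coefficient comes out to zero because the relationship is a symmetric curve rather than a line.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]   # y depends on x exactly, but not linearly

r = pearson_r(xs, ys)
print(abs(r) < 1e-12)      # True: no linear relationship at all
```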
Very true and well put
I still don’t think that we run into many nonlinearities in baseball, and so our scope is somewhat cut short.
by Graham MacAree on Feb 18, 2010 7:09 AM PST up reply actions
Let me ask
Is it that
1) We don’t observe many nonlinear relationships in baseball
or
2) We don’t care about nonlinear relationships in baseball (linear trends capture all the information we need)
or
3) We don’t know how to deal with (or detect) nonlinear relationships in baseball, so we ignore them?
Not passing judgment on which of these is the most “acceptable” answer, just interested in knowing a sabermetrician’s perspective.
The sabremetric perspective
It’d be really hard to have a non-linear relationship with the level of data we’re used to. It’ll definitely start creeping in with pitch f/x and hit f/x, but in the typical data playground I’ve never encountered a strong nonlinear correlation, nor have I heard of one occurring.
by Graham MacAree on Feb 18, 2010 1:28 PM PST up reply actions
Well, there are some uses for quadratic and logistic relationships
Modeling aging and playoff odds, as well as the new SIERA.
by vivaelpujols on Feb 18, 2010 2:51 PM PST up reply actions
A lot of it is that outcomes and talents are so narrowly distributed
that we can do a great job approximating them as linear. This is completely valid and there really is little to gain from using complex and confusing non-linear relationships.
by Edgar for Pres on Feb 18, 2010 6:14 PM PST up reply actions
Association is a better term
To avoid the confusion between the correlation coefficient and non-independence, I prefer to use the term “association” to describe your case #2.
Technically, there is always a correlation (it may be zero), and a non-zero correlation implies non-independence, but variables may be dependent even with a zero correlation.
Those of you who don’t understand what I just said, please ignore . :)
by New England Fan on Feb 18, 2010 8:58 AM PST up reply actions
This whole comment thread makes my brain hurt.
Fear the NPE
by thewyrm on Feb 18, 2010 9:52 AM PST via mobile up reply actions
I understand this is just a basic primer
However, I think it’s important to mention the effects of sample size on a correlation. There are two types of sample sizes involved in a regression: 1) the number of samples in the regression, and 2) the underlying sample of each sample.
The first one is pretty easy to understand. If you are trying to measure the effects of GB rate on HR/FB ratio, and you only have 5 pitchers to work with, even if there is a relatively strong relationship, it won’t necessarily show up in the correlation because there are only 5 samples. And if it did, the standard errors would be huge. This is pretty important because if you ran two separate regressions, and each had the exact same amount of variance, the one with more samples would have a higher correlation.
The second type of sample size is the sample size for each sample if that makes sense. Take that same hypothetical regression with GB rate and HR/FB ratio. If each player in the sample has only 5 balls in play, then even if you have 1000 samples in the regression, it still won’t have a very high correlation because 5 balls in play is simply not enough data to weed out any little skill.
The point of this is that a correlation doesn’t just depend on the relationship between the two variables, it depends heavily on sample size and that’s something you have to be aware of when comparing correlations.
For example, take wOBA and UZR. David found that they have a similar correlation year to year among players with at least 500 plate appearances and 200 chances. For one, because UZR has fewer chances, that means the “real” correlation is probably higher than the observed correlation in relation to wOBA, which has larger samples.
Yeah, I mentioned this in a reply to Fogel above.
It’s certainly a point worth noting, though.
by Graham MacAree on Feb 18, 2010 9:46 AM PST up reply actions
Correlation between two variables doesn't depend on the sample size
It’s an inherent feature of their joint distribution. What does vary with sample size is your ability to estimate that correlation precisely. Which is kind of what you said at first, but then things got pretty confusing…
The UZR example you give is interesting, but it illustrates a point not directly related to this post, namely that the correlation between measures taken in consecutive years increases as the number of samples on which the measures are based increases. This is hardly surprising; indeed, if you believe that a person’s true talent level stays the same, then their year-to-year correlation should approach 1 as you increase the number of samples in each year. This type of correlation (often called “auto-correlation” to emphasize that it is about repeated measures) could be the subject of a whole other blog post.
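The claim that the year-to-year correlation approaches 1 can be written down under a simple model. The assumptions here are illustrative and not taken from the thread: talent T is constant across the two years, each year's observed rate is X_i = T + e_i, and the noise terms are independent of T and of each other with variance sigma_e^2 / n, where n is the number of samples per year.

```latex
% Assumed model: X_i = T + e_i, with Var(T) = \sigma_T^2 and Var(e_i) = \sigma_e^2 / n.
% Then Cov(X_1, X_2) = \sigma_T^2 and Var(X_i) = \sigma_T^2 + \sigma_e^2 / n, so
\[
\operatorname{corr}(X_1, X_2)
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2 / n}
  \;\longrightarrow\; 1 \quad \text{as } n \to \infty .
\]
```

With small n the noise term dominates and the correlation is low, even though the spread in talent is unchanged.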
The observed correlation doesn't depend on the sample size
But the “true” correlation does. A sample of 5 is much more likely to have a lower correlation than a sample of 50, even if the two relationships are equal. When you see a correlation, that isn’t the actual correlation, it’s the observed correlation of the sample data and is affected heavily by things outside of the actual relationship (sounds kinda like DIPS;)).
by vivaelpujols on Feb 18, 2010 3:19 PM PST up reply actions
Huh?
Say two random variables X and Y are jointly distributed according to F. F determines the covariance between X and Y, and the marginal variances of X and Y, agreed? Hence it determines what their “true” correlation is. This is completely independent of the sample size, and applies whether X and Y are measurements of independent quantities or are measurements of the same quantity taken over time.
Now, if you’re going to estimate the correlation of X and Y, then sample size comes into play. It doesn’t affect the underlying (true) correlation, but you may estimate a different R depending on the sample you get. What is true about a larger sample size is that the same point estimate for R may achieve statistical significance with a larger sample size than with a smaller.
I'm not really understanding exactly what you are saying
That’s my fault, not a problem with your explanation, I just don’t know what this means:
Say two random variables X and Y are jointly distributed according to F. F determines the covariance between X and Y, and the marginal variances of X and Y, agreed?
I’m not very good with notation.
Anyway, the point I’m trying to make is pretty simple. If you looked at two variables that you’d expect to have a relationship (say GB% and HR/BIP), it’s more likely that you would see a higher correlation if you sampled 50 pitchers instead of 5. If you only looked at 5 pitchers, it’s much more likely that you pick the 5 pitchers who display the least relationship of GB% and HR/BIP, than if you looked at 50.
Does that make sense to you?
by vivaelpujols on Feb 18, 2010 8:25 PM PST up reply actions
...
Now, if you’re going to estimate the correlation of X and Y, then sample size comes into play. It doesn’t affect the underlying (true) correlation, but you may estimate a different R depending on the sample you get.
Actually, I think this summarizes my point. And since R can only go up to 1, the estimated R’s are not going to be normally distributed; they are going to be skewed towards the bottom. Hence, more likely to have a lower correlation.
by vivaelpujols on Feb 18, 2010 8:27 PM PST up reply actions
I just did a quick simulation
- Generate
Set A: 5 pairs random variables with correlation 0.5
Set B: 50 pairs random variables with correlation 0.5
- Compute R for Set A, and Set B
- Repeat 500 times
Results:
Median for R values computed from Set A: 0.58
Median for R values computed from Set B: 0.49
If your assertion were correct, we’d expect to see the computed correlations for Set A smaller than those for Set B, rather than larger.
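The commenter did not post code, so what follows is only a reconstruction of the steps listed above. It assumes the pairs are drawn from a bivariate normal distribution with correlation 0.5. The seed and helper names are arbitrary, and the exact medians will differ from the 0.58 and 0.49 reported.

```python
import random
from math import sqrt
from statistics import median

random.seed(2010)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(n, rho):
    """Draw n (x, y) pairs whose true correlation is rho (assumed bivariate normal)."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = rho * x + sqrt(1 - rho ** 2) * random.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys

def simulate(n, rho=0.5, repeats=500):
    """Return the list of sample r values from `repeats` samples of size n."""
    return [pearson_r(*correlated_pairs(n, rho)) for _ in range(repeats)]

set_a = simulate(5)    # Set A: 5 pairs per sample
set_b = simulate(50)   # Set B: 50 pairs per sample

print(round(median(set_a), 2), round(median(set_b), 2))
print(round(max(set_a) - min(set_a), 2), round(max(set_b) - min(set_b), 2))
```

The second line of output shows what really separates the two sets: the small samples scatter far more widely around the true value.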
Good work thanks
Well, I guess I am wrong about that. However, the underlying sample size of each sample is still heavily important.
by vivaelpujols on Feb 20, 2010 12:14 AM PST up reply actions
Hi, my name is Waldo, and I am an idiot.
These primers have been helpful for me to learn broad concepts. I’m often lost when these types of subjects are discussed here. I don’t see myself getting to the point of being able to understand 100% of the statistical analysis on LL, but I will be happy if I can get to ~30% and be able to follow more of the FP posts.
So thank you for these Graham.
XKCD

If you get the joke then you get the correlation vs causation distinction.
by Matthew on Feb 18, 2010 9:27 PM PST reply actions 2 recs
I wanted to let this thread die down a little before I brought up this point because it's slightly off-topic and has the possibility to be confusing as hell.
Statement: Regression can be used to model anything as long as you know all the relevant variables.
Reasoning: Any (sufficiently smooth) function can be expressed as an infinite sum of polynomial terms. For example, you can model the function y=exp(x) using a string of polynomial terms where the accuracy depends on the number of terms you use. This method is often referred to as a Taylor series.
Therefore any data that depends on a variable should be able to be modeled using a large number of polynomials (Taylor series). We can not use an infinite sum of terms because we have a finite number of data points however we should be able to describe the system well without too many terms (as long as we have a big data set). If we know that strikeouts are only a function of swinging strikes (not true, only for example purpose) but we don’t know how they are related, we can use a string of polynomials to model the system and adjust the constants to make it fit the data. This is only when you are trying to find a relationship for something like y = function(x).
Question: If we have an equation where n = f(x,y,z), we should still be ok I think. In this case we need to do a Taylor expansion in every variable and then every combination of variables. For example n = f(x,y) = a + bx + cx^2 + dy + ey^2 + fyx where I have expanded it out to the quadratic form. If I extended the series to infinity it should be able to model any function (where a normal Taylor series would work) where n = f(x,y).
If we did this sort of expansion for runs (using every single possible relevant parameter) we should be able to fit the data as well as anything out there now if we were able to get enough data free of too much bias. Then we could simply take partial differentials of the equation to figure out the value of a single or double for example. The value this method provides should be close to the best metrics out there now if done correctly.
This method probably isn’t the best way to do it but it seems like it could work just as well as long as you had a big enough data set and knew every single variable that was relevant to the final answer. I’m not suggesting anybody do this because this technique is not very elegant or intuitive. Also, since it is difficult to imagine every single relevant variable that could be used to provide unique information, this approach is fraught with possible errors.
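The y=exp(x) example from this comment is easy to check numerically. This is a minimal sketch with a hypothetical helper name; it shows only the Taylor series idea, not a regression fit.

```python
from math import exp, factorial

def taylor_exp(x, terms):
    """Approximate exp(x) with the first `terms` terms of its Taylor series about 0."""
    return sum(x ** k / factorial(k) for k in range(terms))

for terms in (2, 4, 8, 15):
    error = abs(taylor_exp(1.0, terms) - exp(1.0))
    print(terms, round(error, 6))  # the error shrinks as terms are added
```

In a regression setting the coefficients would be fitted to data rather than known in advance, so each extra term also needs more data to pin down.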
In statistics, we would call this a strategy for "nonparametric regression"
“Nonparametric” because it allows you to fit much more complicated shapes than a straight line to the data, and hence is less sensitive to the “parametric” assumption of a linear relationship.
Two issues in practice:
1) Data support. As you increase the dimension of the basis (i.e. the number of terms in the Taylor expansion), you require more data to estimate the parameters. Many nonparametric regression models quickly run into the curse of dimensionality.
2) Interpretability. The nice thing about models with linear predictors is that you can interpret the coefficients relatively easily. For example, you can say things like “given all other factors being fixed, every additional run leads to an increase of 0.1 wins”. With a more complicated model, the output becomes much harder to interpret.
yeah
I agree with part 1. That makes sense. If you have a data set with 100s or 1000s of points it shouldn’t be an issue though.
I don’t agree with part 2. If you just take a partial derivative of your final expression, you should be able to find the marginal value of any one component.
by Edgar for Pres on Feb 20, 2010 12:00 AM PST up reply actions
Non-linear terms
Ah, but when you take a partial derivative of higher-order (i.e. non-linear) terms, say with respect to x, the answer will still depend on x. So the change in the outcome associated with a one-unit change in x won’t be constant in x, making it harder to summarize. This is what I mean by interpretability.
Well yeah but you can always input the average expectation
so it's not that big of a deal. It also makes it more universal. What is the value of a HR if you don’t have any other hits? 1 run. What is it with an average hitter? 1.4 runs. This method doesn’t just give the right numbers, it could give you much more.
Granted, there would be error in determining the coefficients, but I don’t think it's much harder to understand. If you are interested in figuring out the average then just enter the average environment into the equation. That the value of a HR might depend on the number of HRs isn’t a problem, it's reality.
by Edgar for Pres on Feb 20, 2010 2:06 PM PST up reply actions
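Both sides of this exchange can be seen in a short sketch built on the quadratic form n = a + bx + cx^2 + dy + ey^2 + fyx quoted earlier in the thread. The coefficients below are hypothetical placeholders, not fitted values.

```python
# Hypothetical coefficients for n = a + b*x + c*x^2 + d*y + e*y^2 + f*x*y
A, B, C, D, E, F = 0.0, 1.0, 0.02, 0.5, 0.01, 0.03

def model(x, y):
    return A + B * x + C * x * x + D * y + E * y * y + F * x * y

def marginal_x(x, y):
    """Partial derivative of the model with respect to x."""
    return B + 2 * C * x + F * y

# The marginal value of x is not a single number: it depends on where you evaluate it.
print(round(marginal_x(0, 0), 2))    # 1.0
print(round(marginal_x(10, 20), 2))  # 2.0
```

Plugging in an average environment gives one usable number, as suggested above, but that number moves whenever the environment does.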













