This is going to be a fun one.
Prerequisites for understanding: None.
Prerequisites for derivation: N/A; conceptual.
The concept of regression towards the (I really should say a) mean is important in fields far beyond baseball analysis, so I suppose we should start with an easy non-baseball example. To Wikipedia!
A class of students takes two editions of the same test on two successive days. It has frequently been observed that the worst performers on the first day will tend to improve their scores on the second day, and the best performers on the first day will tend to do worse on the second day. The phenomenon occurs because student scores are determined in part by underlying ability and in part by chance.
The last sentence is the critical one to understand. Most measurements of human ability are partly achieved by skill and partly achieved by luck. This means that data cannot always be taken at face value. Since we cannot always be completely confident that we've measured what we want to measure, we can apply an expected regression to the mean to get a truer idea of talent. We all do this, whether we mean to or not. The rookie who comes up in September and gets a hit in his first at-bat? The numbers say he's on pace for a career batting average of 1.000. Does anyone expect said rookie to never make an out in his life? Of course not.
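To see the two-test classroom example in action, here is a minimal simulation sketch. Everything numeric in it is an assumption made for illustration (the class size, the average score of 70, and equal spreads for ability and luck); none of it comes from the Wikipedia example. Each student's score is a fixed ability plus a fresh dose of luck on each day.

```python
import random
import statistics

def simulate_two_tests(n_students=1000, skill_sd=10.0, luck_sd=10.0, seed=0):
    """Return (day1, day2) score lists; each score is ability plus fresh luck.

    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(70.0, skill_sd) for _ in range(n_students)]
    day1 = [a + rng.gauss(0.0, luck_sd) for a in ability]
    day2 = [a + rng.gauss(0.0, luck_sd) for a in ability]
    return day1, day2

def average_change(day1, day2, fraction=0.1):
    """Mean (day2 - day1) change for the worst and best day-one performers."""
    order = sorted(range(len(day1)), key=lambda i: day1[i])
    k = max(1, int(len(day1) * fraction))
    bottom, top = order[:k], order[-k:]

    def change(indices):
        return statistics.mean(day2[i] - day1[i] for i in indices)

    return change(bottom), change(top)

day1, day2 = simulate_two_tests()
bottom_change, top_change = average_change(day1, day2)
print(f"worst 10% on day one, average change: {bottom_change:+.1f}")
print(f"best 10% on day one, average change: {top_change:+.1f}")
```

Nobody's ability changes between the two days in this sketch, yet the worst group improves and the best group slips, simply because the luck that put them at the extremes does not repeat.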
The interesting question is which mean to apply our expected regression towards. What if our rookie is reckoned by scouts to be an excellent pure hitter? What if he's a guy who swings from the heels and misses half the time he offers at a pitch? Clearly, we expect different batting averages from the two, and one at-bat isn't going to influence our expectations either way. We'd regress the first player towards the 'good hitter' population mean, and the second towards the 'bad hitter' population mean. Eventually (given enough at-bats), we simply use their career numbers as the population mean for the player. This is a shortcut rather than being analytically rigorous, as some element of randomness always influences career numbers, meaning that barring other information players should always be expected to be slightly more average than they have been historically. It's not a big effect, however; I merely highlight it to demonstrate the difficulty in choosing the mean towards which we expect a player to regress.
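One common way to put numbers on this is to blend the observed results with a fixed amount of 'ballast' at the chosen population mean. The sketch below is an illustration under stated assumptions, not a calibrated method: the function name, the two population means (.300 and .220), and the ballast of 200 at-bats are all hypothetical.

```python
def regress_to_mean(hits, at_bats, population_mean, ballast_at_bats=200):
    """Blend an observed batting average with a population mean.

    ballast_at_bats is a hypothetical constant: the number of at-bats of
    league-typical performance mixed in with the player's own record.
    """
    return (hits + population_mean * ballast_at_bats) / (at_bats + ballast_at_bats)

# Hypothetical population means, chosen only for illustration.
GOOD_HITTER_MEAN = 0.300
BAD_HITTER_MEAN = 0.220

# Both rookies are 1-for-1, but they are regressed towards different means.
pure_hitter = regress_to_mean(1, 1, GOOD_HITTER_MEAN)
free_swinger = regress_to_mean(1, 1, BAD_HITTER_MEAN)

# A long career (a .300 average over 5000 at-bats) dominates the chosen mean.
veteran = regress_to_mean(1500, 5000, BAD_HITTER_MEAN)

print(f"pure hitter:  {pure_hitter:.3f}")
print(f"free swinger: {free_swinger:.3f}")
print(f"veteran:      {veteran:.3f}")
```

After one at-bat the estimate is essentially the population mean, so the choice of mean is everything. After thousands of at-bats the career number dominates, but the estimate still sits slightly closer to average than the raw record, which is the small effect described above.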
Clearly the idea of regression towards a mean is the force behind the need for large samples of data in order to have strong conclusions about talent level. However, while we intuitively know that one at-bat doesn't mean a whole lot, we don't have a good grasp on how strong our conclusions are for any apparently reasonable sample size. This is dangerous, and leads to poor conclusions and arguments, as well as the occasional misinformed request for larger sample sizes. The requisite sample size depends on the proportion of skill and luck inherent in a measurement: the higher the proportion of skill, the smaller the sample required for a given level of confidence in one's results, and vice versa.
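The link between the skill/luck mix and sample size can be sketched for a simple rate statistic. This assumes each trial is an independent success-or-failure event, so the luck in an observed rate behaves like binomial noise; that is a simplification, and the function names and example numbers below are hypothetical.

```python
def skill_share(n, mean_rate, talent_sd):
    """Fraction of the spread in observed rates over n trials that is talent.

    Assumes binomial luck: the noise variance of a rate is p * (1 - p) / n.
    """
    luck_var = mean_rate * (1.0 - mean_rate) / n
    talent_var = talent_sd ** 2
    return talent_var / (talent_var + luck_var)

def trials_for_share(target_share, mean_rate, talent_sd):
    """Trials needed before talent makes up target_share of the observed spread."""
    talent_var = talent_sd ** 2
    odds = target_share / (1.0 - target_share)
    return odds * mean_rate * (1.0 - mean_rate) / talent_var

# Two hypothetical metrics with the same average rate but different talent spreads.
skill_heavy = trials_for_share(0.5, mean_rate=0.25, talent_sd=0.03)
luck_heavy = trials_for_share(0.5, mean_rate=0.25, talent_sd=0.01)

print(f"skill-heavy metric: about {skill_heavy:.0f} trials to reach half skill")
print(f"luck-heavy metric:  about {luck_heavy:.0f} trials to reach half skill")
```

The metric where players genuinely differ more reaches the half-skill point in far fewer trials, which is why a sample that is plenty for one statistic can be hopeless for another.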
Lies, Damned Lies, and Statistics
We all know that numbers are manipulable, and that it's possible to draw completely ludicrous conclusions from them that simply don't bear up to even basic common sense. With a good grasp on the theory behind regression towards the mean, one can avoid the pitfalls of putting too much faith in a poor sample size. However, we remain unsure of what sample size is actually required for a given metric until regression's close cousin correlation comes into play. Regression also does not protect us from statistical arguments based on irrational theories of value (i.e. over/undervaluing a specific skill or statistic).
- 'Regression' in this case carries no positive or negative connotation. It can push numbers up or drag them down.
- Regression analysis is a statistical technique used for modelling systems, and is not strictly related to regression towards the mean.
- The more statistically inclined may note that estimating talent via a combination of measurements and regression is actually a form of Bayesian inference, with the population mean as the prior.
- Regression is not the same as an actual change in talent level. A player slowing down due to age is not 'regressing'; he is declining (or perhaps doing a combination of both).
- As with most statistical concepts, there is no guarantee here. Just because a player appears likely to regress does not mean he will do so. This is important to bear in mind.
Correlation; hitting, pitching, and fielding metrics.