Now seemed as good a time as any.
Prerequisites for understanding: Regression, correlation.
Prerequisites for derivation: Data, regression, correlation.
We're familiar with regression and correlation, so let's get a little more in depth with the nature of sample sizes. If we have a certain set of data, how can we assess its reliability? What is it actually telling us? We unpacked a little of this in a previous post, but didn't touch on how it applies to the data we wish to analyse. There are many, many resources that describe this problem and its solutions in minute, tortuous (to some) detail. We don't need to rehash them here - these posts are intended to be more overview than encyclopedia. So - an overview:
The idea is that different skills and stats have different thresholds for sample size tolerance. We know that we must regress our measurements towards a mean, and we've thought a little bit about which means we should be using. What we haven't really discussed is how far we should be regressing our given values. This is governed by our sample size and the stability of the statistic - larger samples and higher stability mean less regression. The important thing to point out is that the amount of regression we apply should be a continuum, rather than a step - meaning that every sample size carries a certain amount of information. The smallest sample (i.e. zero) tells you nothing, and we slowly work our way up the ladder until we reach the largest samples, which still don't tell you everything.
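To make the continuum concrete, here is a minimal sketch of one common way to do it: weight the observed value by n / (n + k) and the mean by the remainder. This particular formula is an assumption on our part rather than something derived in this post, and the constant k (the sample size at which we trust the observation and the mean equally) is a hypothetical value chosen purely for illustration.

```python
def regression_weight(n, k):
    """Share of the observed value we keep after n samples.

    k is a hypothetical stabilisation constant: the sample size at which
    the observation and the population mean get equal weight.
    """
    if n < 0 or k <= 0:
        raise ValueError("n must be non-negative and k must be positive")
    return n / (n + k)


def regress_to_mean(observed, mean, n, k):
    """Blend the observed value with the mean according to sample size."""
    w = regression_weight(n, k)
    return w * observed + (1 - w) * mean


# Illustrative numbers only: a .400 observed rate against a .320 mean,
# with an assumed k of 400.
for n in (0, 100, 400, 2000):
    print(n, round(regress_to_mean(0.400, 0.320, n, 400), 3))
```

With zero samples the estimate is simply the mean, and the weight on the observation creeps towards one as n grows without ever reaching it, which is the "ladder" described above.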
Some Rules of Thumb
The way we determine the information associated with a given sample size for a given statistic is to look at its stability across the MLB population while taking into account the relative persistence of the statistic from year to year in individuals. Needless to say, this can be a daunting task. In lieu of pursuing some intensive mathematics, here are some rules of thumb:
- Using sample size on a per plate appearance basis can be misleading. Remember that if a statistic is based around pitches seen, the actual sample is going to be three to four times as large as the number of plate appearances. This makes statistics like swing% appear far more stable than they actually are (although we can treat them as highly stable since we have a large sample anyway), and statistics like home run per fly ball become artificially destabilised, since our sample isn't really plate appearances but plate appearances that end in a fly ball.
- It should be noted that even if you consider total fly balls as your sample, pitcher home run per fly is still highly unstable.
- In general, the more players involved in our measurement of a statistic, the less stable it is. Batting average for both pitchers and batters is determined by the pitcher, the batter, and the defence. Strikeouts cut out the defence, and they're far more stable on both ends than average is.
- By and large, batting statistics are more reliable than pitching statistics, which are more reliable than defensive statistics. Batting statistics, namely on-base percentage and slugging percentage, tend to stabilise with two-thirds or so of a season (~400 PA). Pitchers only see strikeout, line drive, and groundball rates stabilise early, with walks coming late to the party. Nowhere to be seen is ERA. Fielding measurements such as UZR require something like three years of data to be comfortable with. Putting up great fielding numbers for a season is like hitting well for April and half of May.
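The first rule of thumb, that the true sample is not always plate appearances, can be sketched with a little arithmetic. The 3.8 pitches per plate appearance sits inside the "three to four" range mentioned above; the share of plate appearances ending in a fly ball is a hypothetical placeholder, not a measured figure.

```python
def effective_samples(plate_appearances, pitches_per_pa=3.8, fly_ball_share=0.25):
    """Rough event counts behind statistics with different denominators.

    pitches_per_pa and fly_ball_share are illustrative assumptions; swap in
    real rates for the player or league you care about.
    """
    return {
        "per_pa": plate_appearances,                       # e.g. on-base percentage
        "per_pitch": plate_appearances * pitches_per_pa,   # e.g. swing%
        "per_fly_ball": plate_appearances * fly_ball_share,  # e.g. home run per fly ball
    }


samples = effective_samples(400)
for name, count in samples.items():
    print(f"{name}: about {count:.0f} events")
```

Under these assumptions, the same 400 plate appearances give a per-pitch statistic several times as many events to work with, and a per-fly-ball statistic only a fraction as many, which is why quoting stability in plate appearances alone can mislead.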
Projection systems, understanding splits.