Sabermetrics 101: Sample Sizes

Now seemed as good a time as any.

Prerequisites for understanding: Regression, correlation.

Prerequisites for derivation: Data, regression, correlation.

Sample Sizes

We're familiar with regression and correlation, so let's go a little deeper into the nature of sample sizes. If we have a certain set of data, how can we assess its reliability? What is it actually telling us? We unpacked a little of this in a previous post, but didn't touch on how it applies to the data we wish to analyse. There are many resources that describe this problem and its solutions in minute, tortuous (to some) detail. We don't need to rehash them here - these posts are intended to be more overview than encyclopedia. So - an overview:

The idea is that different skills and stats have different thresholds for sample size tolerance. We know that we must regress our measurements towards a mean, and we've thought a little bit about which means we should be using. What we haven't really discussed is how far we should be regressing our given values. This is governed by our sample size and the stability of the statistic - larger samples and higher stability mean less regression. The important thing to point out is that the amount of regression we apply should be a continuum, rather than a step - meaning that every sample size carries a certain amount of information. The smallest sample (i.e. zero) tells you nothing, and we slowly work our way up the ladder until we reach the largest samples, which still don't tell you everything.
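To make the continuum concrete, here is a minimal sketch of one common way to weight an observed rate against a mean. It assumes the simple shrinkage form n / (n + k), where n is the sample size and k is a "stabilisation point" for the statistic; that form, the function name, and the numbers in the usage lines are illustrative assumptions, not values taken from this post.

```python
def regress_to_mean(observed, sample_size, population_mean, stabilization_point):
    """Shrink an observed rate toward a population mean.

    Assumed model (not from the post): the weight on the observation is
    n / (n + k), where k is the sample size at which the observation and
    the mean are trusted equally.
    """
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    if stabilization_point <= 0:
        raise ValueError("stabilization_point must be positive")

    # Weight rises smoothly from 0 (no data) toward 1 (huge sample),
    # but never actually reaches 1.
    weight = sample_size / (sample_size + stabilization_point)
    return weight * observed + (1 - weight) * population_mean


# Hypothetical numbers: a .400 OBP observed against a .320 mean, with k = 400 PA.
for pa in (0, 50, 400, 2000):
    print(pa, round(regress_to_mean(0.400, pa, 0.320, 400), 3))
```

With zero plate appearances the estimate is just the mean, and even at very large samples some regression remains, which is the "continuum, not a step" behaviour described above.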

Some Rules of Thumb

The way we determine the information associated with a given sample size for a given statistic is to look at its stability across the MLB population while taking into account the relative persistence of the statistic year by year in individuals. Needless to say, this can be a daunting task. In lieu of pursuing some intensive mathematics, here are some rules of thumb:

  • Using sample size on a per plate appearance basis can be misleading. Remember that if a statistic is based around pitches seen, the actual sample is going to be three to four times as large as the number of plate appearances. This makes statistics like swing% appear far more stable than they actually are (although we can treat them as highly stable since we have a large sample anyway), and statistics like home run per fly ball become artificially destabilised, since our sample isn't really plate appearances but plate appearances that end in a fly ball.
  • It should be noted that even if you consider total fly balls as your sample, home runs per fly ball for pitchers is still highly unstable.
  • In general, the more players involved in our measurement of a statistic, the less stable it is. Batting average for both pitchers and batters is determined by the pitcher, the batter, and the defence. Strikeouts cut out the defence, and they're far more stable on both ends than average is.
  • By and large, batting statistics are more reliable than pitching statistics, which are more reliable than defensive statistics. Batting statistics, namely on-base percentage and slugging percentage, tend to stabilise with two-thirds or so of a season (~400 PA). Pitchers only see strikeout, line drive, and groundball rates stabilise early, with walks coming late to the party. Nowhere to be seen is ERA. Fielding measurements such as UZR require something like three years of data before we can be comfortable with them. Putting up great fielding numbers for a season is like hitting well for April and half of May.
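For the curious, the "stabilises at" figures in rules like these can be backed out of a year-to-year (or split-half) correlation. The sketch below assumes the simple model in which the correlation at sample size n is n / (n + k), so k is the sample at which the correlation reaches 0.5; the model, the function name, and the example inputs are assumptions for illustration rather than anything measured here.

```python
def stabilization_point(sample_size, correlation):
    """Estimate k, the sample size at which correlation would be 0.5.

    Assumed model (not from the post): r = n / (n + k), which rearranges
    to k = n * (1 - r) / r.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 < correlation < 1:
        raise ValueError("correlation must be strictly between 0 and 1")
    return sample_size * (1 - correlation) / correlation


# Hypothetical inputs: the same 600-unit sample, two different correlations.
print(stabilization_point(600, 0.75))  # a fairly persistent statistic
print(stabilization_point(600, 0.20))  # a much noisier one
```

Note that `sample_size` should be counted in the statistic's real denominator (pitches for swing%, fly balls for home runs per fly ball) rather than plate appearances, for the reasons given in the first rule of thumb.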

What Follows

Projection systems, understanding splits.
