Opening day statistical dos and don'ts

It's opening day! Here are some suggestions for rationally interpreting early statistics.

wharglblargloooooo (Photo: USA TODAY Sports)

With Saturday's conclusion of the Mariners' final spring "series" against the Rockies, the SCREW INTRODUCTORY PARAGRAPHS SPRING TRAINING IS FINALLY OVER WOOOOOOOOOOOOOOO

Baseball is back! And if you're excited, I understand completely, because I am too. I can't wait for the first Safeco Felix Day, the first Brad Miller home run, the first game I get to watch on my new 46" flatscreen TV that I found in perfect working condition at the local dump... it's a pretty awesome time of year.

Nevertheless, I am one of Lookout Landing's designated statheads, and in my official capacity as stick in the mud I feel compelled to deliver this reminder. A lot of things have changed now that the team's out of Arizona, but one big thing hasn't: the stats still don't really matter. Forget first-week batting averages and ERAs - April is the time of year when we renounce small sample sizes, eschew BABIP, and learn the statistical meaning of the word "stabilization".

These, then, are the dos and don'ts of early season statistics.

DO: Adjust projections slightly according to spring training performance

Turns out spring training actually does matter, a little. Read this article from Nate Silver's fabulous 538 blog, in which Neil Paine finds that projections adjusted for extreme spring performance are slightly better than unadjusted ones. The effect is tiny - one point of wOBA adjustment for every seventeen points of difference between spring wOBA and Marcel-projected wOBA - but it's enough to suggest that we should project Dustin Ackley as a league average hitter and Brad Miller as a Seageresque stud. On the other side of the coin, Paine's article suggests that expectations for Corey Hart and Abe Almonte should be tempered.
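To make the arithmetic concrete, here's a minimal Python sketch of that one-in-seventeen rule. It assumes the adjustment is a simple linear nudge, which is a simplification of the summary above and not Paine's published method. The function name and the example hitter's numbers are made up for illustration.

```python
def spring_adjusted_woba(projected_woba, spring_woba, points_per_adjustment=17):
    """Nudge a projection toward spring results.

    Assumption: one point of wOBA adjustment for every 17 points of gap
    between spring wOBA and projected wOBA, applied linearly.
    """
    gap = spring_woba - projected_woba
    return projected_woba + gap / points_per_adjustment


# Hypothetical hitter: projected for a .310 wOBA, hit .395 in the spring.
adjusted = spring_adjusted_woba(0.310, 0.395)
print(f"{adjusted:.3f}")  # prints 0.315
```

An 85-point spring explosion buys five points of projected wOBA. That's the whole effect.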

DON'T: Use this as an excuse to care about spring training stats next March

Paine's methodology relies on having a full spring's worth of stats. The only thing worse than analysis based entirely on spring training stats is analysis based entirely on a small sample of spring training stats. In general, continue to rely on the eye test for spring training analysis, and take all "swing change" and "best shape of his life" articles with several pounds of salt. Just... know, now, that statistics compiled during the spring aren't entirely worthless information.

DO: Re-familiarize yourself with Russell Carleton's work on sample sizes

The work that Carleton (AKA Pizza Cutter) did for Fangraphs and Baseball Prospectus is some of the most important baseball writing you're likely to read in your time as a fan. It's worth checking it out again. In short: the first month of the season isn't going to produce much in the way of usable data. By the end of the month of April, we'll have swing rates, contact rates, GB%, FB%, K% and maybe BB% for hitters. By the end of May, HR% and ISO will have stabilized as well. Pitching stats take longer to stabilize, but K and BB rates should be stable by the end of the month. That said,

DON'T: Misunderstand the meaning of the word "stabilize" in those articles

"Stabilize" doesn't mean that a statistic represents true talent level - it just means that the sample is big enough for about half of the variation we see in that statistic to reflect skill rather than luck. In stat-speak, each of Carleton's PA thresholds is the line at which the R^2 value between two split samples of players' performance in a given statistic is .50. It's not that, above the threshold, the new stats are set in stone. It's that, below the threshold, the new stats are completely untrustworthy. Don't confuse those two.
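For the curious, here's a rough Python sketch of what a split-half check looks like. It's a toy simulation, not Carleton's method or data: the league of 500 hitters, the 20% average strikeout rate, and the spread of talent are all invented, and every name in the code is hypothetical. The idea is to give each fake hitter a true strikeout rate, simulate two separate samples of plate appearances, and see how much of one sample the other explains.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_r2(pa_per_sample, n_players=500, seed=0):
    """R^2 between two simulated samples of K% for the same fake hitters."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        # Invented talent distribution: mean 20% K rate, sd 4 points,
        # clamped so no hitter gets an impossible rate.
        talent = min(max(rng.gauss(0.20, 0.04), 0.05), 0.40)
        for sample in (first, second):
            strikeouts = sum(rng.random() < talent for _ in range(pa_per_sample))
            sample.append(strikeouts / pa_per_sample)
    return pearson_r(first, second) ** 2


r2_small = split_half_r2(pa_per_sample=20)
r2_large = split_half_r2(pa_per_sample=400)
print(f"20 PA per sample:  R^2 = {r2_small:.2f}")
print(f"400 PA per sample: R^2 = {r2_large:.2f}")
```

With these made-up inputs, the 20-PA R^2 should land near zero and the 400-PA one comfortably above .50. The exact values depend entirely on the invented talent spread, so don't read real thresholds off of it.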

DO: Check out PitchF/X

SBN's Cardinals blog Viva El Birdos once published the immortal sentence "In small sample sizes, a good scout is ALWAYS better than stats". Well, PitchF/X is the closest thing that we baseball fans have to an online scout. It can give you release point readings, velocities, and pitch movement data, all of which can very rapidly clue you in to a pitcher's plateau leap (or plateau drop). And unlike stats that rely on the outcomes of multiple pitches, things like release point stabilize pretty darn quickly. (If they don't, there's a problem.) PitchF/X is an incredibly valuable tool and a great source of data for early-season analysis.

DON'T: Check out batted ball data

My first FanPost at Lookout Landing was about the Mariners' line drive rates in the month of April 2012. Don't be like me. Line drive rate is a notoriously slow-to-stabilize statistic, and using it in pretty much any analysis of a sample smaller than a full season is inadvisable. Similarly, batted ball distances can fluctuate pretty wildly over a year, leading well-intentioned sportswriters to say things like "Raul Ibanez's power is gone" in May of 2013. Again, don't be like me. Stay away from batted ball data.

DO: Remember to have fun!

Sometimes I feel like statistical pragmatism can suck the joy out of baseball fandom. It's not always fun to have tempered expectations for Dustin Ackley, or to believe that James Paxton is less Clayton Kershaw and more Danny Duffy. And you know what? Now is the time to dream. April is the month where the stats mean nothing, the future is boundless, and we fans can have as much hope as we want. Don't get too bogged down in the numbers - take advantage of this chance to have fun watching the sport we love.

Baseball's back, everybody.

Let's go M's!