Lately, I've been guilty of leaping into the deep end of the sabermetric pool without dipping my toe into the shallow end first, and I worry that my articles have suffered in the process. Too often I throw statistical jargon at the screen; too rarely do I make my basic thought process clear. This article, in which I very deliberately walk through a fairly simple analysis, is an attempt to rectify that error. If you so desire, feel free to follow along, doing each step as I discuss it, to get a better feel for how I write and think.
Spring Training Statistics: Do They Matter?
Every analysis begins with a question. Since this is more or less an example article, I've chosen a really simple one: "do spring training statistics matter?" For the sake of reaching a precise conclusion, we should refine this broad question down to one that's more easily answerable with math. In this case, let's tweak "do spring training statistics matter" to "which predicts a player's season stats better: his spring training, or his previous season?"
Now that we've got a precise question, we need to devise a methodology for answering it. Again, this one's pretty simple: we should download an array of 2013 spring training statistics, an array of 2013 season statistics, and an array of 2012 season statistics. Then we should use Microsoft Excel or a similar program to look for correlation between the data sets.
Next, we must collect data. My go-to site for stat downloads is Fangraphs, but unfortunately they don't seem to carry records of past spring training stats. A quick Google search brings up MLB.com's stats page, which does carry spring training stats but doesn't let you export .csv files. If you like, you can manually copy and paste this data into Excel, but it'll require a little futzing about to get the data to play nicely with Fangraphs' numbers. A better way to go is to download stats from Baseball Reference.
The full-season data is much easier to harvest, since Fangraphs carries it. However, because we want to compare how players performed between time periods, we want the same players in each data set. Also, we have to control for sample size somehow, or our correlation will be wrecked by a bunch of guys who batted 1.000/1.000/4.000 in one spring training plate appearance. To do so, we open up the spring training data set in Microsoft Excel and sort all rows by the PA column. Let's use MLB.com's threshold for "spring training qualifiers" and cut out everyone with fewer than 50 PA. These are the players we need 2013 and 2012 data for.
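If you'd rather script this step than sort in Excel, the filter is a one-liner. This is a minimal sketch with made-up rows; the "Name" and "PA" keys are just stand-ins for however your download labels those columns.

```python
# Toy stand-in for the spring training download; in practice these
# rows would come from the Baseball Reference .csv file.
spring_rows = [
    {"Name": "Player A", "PA": 62},
    {"Name": "Player B", "PA": 17},   # one hot week, tiny sample
    {"Name": "Player C", "PA": 55},
]

PA_THRESHOLD = 50  # MLB.com's "spring training qualifier" cutoff

# Keep only the qualifiers; these names drive the 2012/2013 pulls.
qualifiers = [row for row in spring_rows if row["PA"] >= PA_THRESHOLD]
qualified_names = {row["Name"] for row in qualifiers}
```

The same cutoff logic applies whether you do it with a sort-and-delete in Excel or a list comprehension; the point is simply that nobody below 50 PA survives into the final data set.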
Now that we have our player list, we can head over to the Fangraphs Leaderboards. If we add each player from the spring training list (there should be between 100 and 150) to a custom leaderboard, pick which stats we want to look at, and set a minimum number of PAs for sample size reasons, Fangraphs will let us export the data we want into a .csv file. (For those of you following along: don't actually manually do it. That takes forever. The links for data that line up with the MLB.com ST lists are here and here.)
Next, we can use Excel to sort each of the three .csv files by player name and then paste all three into one file. Here's another gotcha: each set has a different number of rows, because not everyone who played in 2013 Spring Training also appeared in MLB in both 2012 and 2013. We should delete the data for players who don't appear in all three data sets. Lastly, we must make sure that all three data sets use the same stats: we may need to (for example) divide K by PA to get K%.
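The align-and-trim step above can also be sketched in a few lines of code. This is a hypothetical miniature of the three files, keyed by player name, just to show the intersection and the K/PA normalization:

```python
# Hypothetical mini-samples standing in for the three .csv files;
# each maps player name -> counting stats.
spring = {"Player A": {"K": 12, "PA": 60}, "Player B": {"K": 10, "PA": 55}}
season_2013 = {"Player A": {"K": 90, "PA": 600}, "Player C": {"K": 80, "PA": 500}}
season_2012 = {"Player A": {"K": 85, "PA": 580}, "Player B": {"K": 70, "PA": 520}}

# Keep only players present in all three data sets.
common = set(spring) & set(season_2013) & set(season_2012)

def k_rate(stats):
    """Put strikeouts on a common scale: K% = K / PA."""
    return stats["K"] / stats["PA"]

merged = {
    name: {
        "spring_k_pct": k_rate(spring[name]),
        "k_pct_2013": k_rate(season_2013[name]),
        "k_pct_2012": k_rate(season_2012[name]),
    }
    for name in common
}
```

The set intersection does the same job as the sort-and-delete pass in Excel, and dividing K by PA is exactly the kind of unit-matching fix mentioned above.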
Now that we have the final data set, all we have to do is draw our conclusions by making use of Excel's CORREL tool. You can read about how to use CORREL here: essentially, you input the formula =CORREL(START1:FIN1, START2:FIN2) and it spits out a correlation coefficient. If you want to learn a little more about the correlation coefficient that CORREL generates, you can read this Wikipedia article, but for now all we really need to know about the output is that the closer it is to 1, the stronger the relationship. According to well-respected math site Dummies.com, a relationship isn't considered "strong" until the correlation coefficient rises above 0.7.
At any rate, after we've used the CORREL tool, we can go ahead and square the result to produce R^2. R^2 is much easier to intuitively grasp: it represents the amount of variation in one data set that can be "explained" by variation in another data set. If R^2 is greater than 0.5, more than half of the variation in one data set is explained by variation in the other, and we can say there is a strong correlation between the two data sets. (You may notice that R^2 > 0.5 is basically the same as R > 0.7, since 0.7 squared is about 0.5.)
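For anyone following along outside of Excel, the number CORREL computes is the Pearson correlation coefficient, and it's easy to reproduce by hand. The columns below are made-up toy data, not the real spreadsheet values:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: the same number that
    Excel's =CORREL(...) formula returns for two columns."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up K% columns; the real inputs would be two columns of the
# merged data set (e.g. 2013 spring K% vs. 2013 season K%).
spring = [0.18, 0.22, 0.25, 0.30, 0.15]
season = [0.20, 0.21, 0.27, 0.28, 0.17]

r = pearson_r(spring, season)
r_squared = r ** 2  # share of variance "explained"
```

Squaring r gives the R^2 discussed above, which is why the r > 0.7 and R^2 > 0.5 thresholds are nearly interchangeable.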
So what do the results look like? Well, I've uploaded my version of this file to MediaFire. The cells beneath the 2013 statistics are the correlation between 2013 Spring stats and 2013 stats. As you can see, none of the stats I chose produced a strong correlation, with only K% and SB coming particularly close. The picture gets even grimmer if you scroll right and compare to the 2012 correlations: 2012 stats out-predicted 2013 Spring Training stats in every category except for stolen bases (and stolen bases divided by PAs).
The above paragraph can be represented in graph form.
All that's left to do after generating the "punchline graph" is to write up the article.
In conclusion: don't look at spring training statistics. In almost all cases, you're better off looking at the previous season's numbers as a predictor of future success. There is some interesting future work to be done: is the high ST-to-season stolen base correlation real, or is it an artifact of this data set? Perhaps, if you're interested, you can repeat this process with data from 2012, 2012 spring training, and 2011. If you do, please let me know what you find by leaving it in the comments. Good luck!
My goal with this article was to introduce you to some of the basic tools I use in statistics writing, so that you can feel more comfortable doing some of your own work on the side and/or commenting on the work that's posted here on Lookout Landing. My secondary goal was to remind you to never ever ever look at spring training statistics. If I've failed in either of these respects, please let me know, and I'll do what I can to improve this article as a resource for the future.