Baseball-Reference and THT are great ways to kill time at work.

Earlier today I thought I'd do a little strikeout research using B-R's pitch data statistics (StS%, StL%, SOc%, Cntc%, etc.; you can read all about them on any pitcher player card by clicking "Show" next to Pitch Data Summary, and then clicking "Glossary"). Since the game starts in a little over an hour I'll spare you all the gory details, but long story short, it should come as no surprise that the stat with the strongest correlation to strikeouts is Cntc%, the rate at which opposing batters make contact when swinging. Charts and correlation data for 2006 and 2007, including all pitchers with at least 100 innings of work, are shown below:

The neat thing about this is that, by using the best-fit equation of the line in the graph, we can estimate a pitcher's K/G by plugging in his contact rate. We can then compare the resulting value to the pitcher's actual K/G to see who's over/undershooting their estimate.
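For anyone who wants to try this at home, here is a rough sketch of the idea in Python. It assumes the line in the graph is an ordinary least-squares fit of K/G against Cntc%, and it uses the estK/G - actK/G difference for the comparison. The pitchers and numbers in `SAMPLE` are made up purely for illustration (they are not B-R data), and the function names are my own.

```python
# Hypothetical sample rows: (pitcher, contact rate in %, actual K/G).
# These values are invented for illustration; they are NOT real B-R data.
SAMPLE = [
    ("Pitcher A", 74.0, 9.8),
    ("Pitcher B", 78.0, 8.1),
    ("Pitcher C", 81.0, 6.9),
    ("Pitcher D", 84.0, 5.6),
    ("Pitcher E", 87.0, 4.4),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def k_differences(rows):
    """Return (name, estK/G - actK/G), biggest undershoot first."""
    m, b = fit_line([r[1] for r in rows], [r[2] for r in rows])
    diffs = [(name, (m * cntc + b) - k_per_g) for name, cntc, k_per_g in rows]
    return sorted(diffs, key=lambda d: d[1], reverse=True)


if __name__ == "__main__":
    for name, diff in k_differences(SAMPLE):
        print(f"{name}: {diff:+.2f}")
```

A positive difference means the pitcher struck out fewer batters than his contact rate suggests (undershooting), and a negative one means he struck out more (overshooting).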

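Here are the top ten results for 2007 in each category, listed by rank. In parentheses is the difference between each pitcher's estimated K/G and his actual K/G: estK/G - actK/G. A positive number means he struck out fewer batters than his contact rate predicts; a negative number means he struck out more.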

Undershooting Estimated K/G:

  1. Shaun Marcum (2.36)
  2. Lenny DiNardo (1.73)
  3. Jason Hirsh (1.64)
  4. Scott Olsen (1.52)
  5. Brian Burres (1.51)
  6. Tim Hudson (1.47)
  7. Sergio Mitre (1.21)
  8. John Smoltz (1.14)
  9. Tim Wakefield (1.09)
  10. Jorge Sosa (1.09)

Overshooting Estimated K/G:

  1. Erik Bedard (-2.53)
  2. Jake Peavy (-2.19)
  3. Ted Lilly (-2.17)
  4. Josh Beckett (-1.74)
  5. Wandy Rodriguez (-1.51)
  6. Randy Wolf (-1.51)
  7. Rich Hill (-1.44)
  8. Matt Cain (-1.29)
  9. Ben Sheets (-1.27)
  10. Oliver Perez (-1.26)

It should be noted right away that not all of this is random statistical noise. For some pitchers, over- or undershooting their estimated K/G appears to be part of their skillset. For example, Curt Schilling is at -1.49 since 2004, which lines up neatly with 2007's -1.24. Likewise, Tim Hudson's +1.45 figure since 2004 is almost identical to 2007's +1.47. Since contact rate isn't the only factor that goes into the strikeout equation, some pitchers may excel or suck at something else that influences their final strikeout total.

However, if we identify the pitchers for whom over/undershooting their estimated K/G isn't part of their skillset, then we have a potentially useful indicator of future performance. Contact rate tends to remain fairly stable on a year-to-year basis, so if a guy's undershooting, then we should expect him to get more strikeouts next year, and vice versa. A few candidates, mostly from the top ten lists (I didn't look at everyone, because it's an annoying amount of work):

Aaron Harang (-1.06; -0.15 since 2004)
Wandy Rodriguez (-1.51; -0.49 since 2005 debut)
Ted Lilly (-2.17; -0.62 since 2004)
Erik Bedard (-2.52; -1.06 since 2004)
Jake Peavy (-2.19; -0.99 since 2004)
Josh Beckett (-1.74; -1.09 since 2004)
Scott Olsen (+1.52; +0.79 since 2005 debut)
Jeremy Bonderman (+1.01; +0.56 since 2004)
Roy Oswalt (-0.21; -0.97 since 2004)
Dave Bush (+0.27; -0.29 since 2004)
Ben Sheets (-1.27; -1.73 since 2004)

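(Note: the first number is the 2007 difference and the second is the long-term difference. When the 2007 value on the left is more negative than the long-term value on the right, the pitcher may strike out fewer batters going forward; when it's more positive, he may strike out more.)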
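To make the comparison in that list a bit more concrete, here is a small Python sketch that subtracts each pitcher's long-term difference from his 2007 difference and sorts by the gap. The numbers are copied from the candidate list above; treating the simple gap as the indicator is my own shortcut for illustration, not a tested projection method, and the function names are hypothetical.

```python
# (pitcher, 2007 estK/G - actK/G, long-term estK/G - actK/G),
# copied from the candidate list above.
CANDIDATES = [
    ("Aaron Harang", -1.06, -0.15),
    ("Wandy Rodriguez", -1.51, -0.49),
    ("Ted Lilly", -2.17, -0.62),
    ("Erik Bedard", -2.52, -1.06),
    ("Jake Peavy", -2.19, -0.99),
    ("Josh Beckett", -1.74, -1.09),
    ("Scott Olsen", 1.52, 0.79),
    ("Jeremy Bonderman", 1.01, 0.56),
    ("Roy Oswalt", -0.21, -0.97),
    ("Dave Bush", 0.27, -0.29),
    ("Ben Sheets", -1.27, -1.73),
]


def gap(season_diff, long_term_diff):
    """2007 difference minus long-term difference.

    Assumption: a negative gap (overshooting more than usual) hints at
    fewer strikeouts ahead; a positive gap hints at more.
    """
    return season_diff - long_term_diff


def rank_candidates(rows):
    """Return (name, gap) sorted from most negative to most positive gap."""
    gaps = [(name, round(gap(season, career), 2)) for name, season, career in rows]
    return sorted(gaps, key=lambda g: g[1])


if __name__ == "__main__":
    for name, g in rank_candidates(CANDIDATES):
        outlook = "fewer K's?" if g < 0 else "more K's?"
        print(f"{name}: {g:+.2f} ({outlook})")
```

By this crude measure Ted Lilly sits at one end of the list and Roy Oswalt at the other, but again, it's only as good as the assumption that contact rate holds steady.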

This is all preliminary stuff, so don't take it to mean too much right now, but if nothing else it's interesting to look at. For the record, a similar process with the 2006 data identified Chien-Ming Wang, Derek Lowe, Brad Penny, Curt Schilling, and Carlos Zambrano as guys who were due to see their strikeout rates change for better or worse.

What's responsible for pitchers over/undershooting their K/G estimate? That much still needs to be researched, but called strikeouts are clearly one of the main factors. The league average called strikeout rate is 26%; for the top ten overperformers, it's 31%, while for the top ten underperformers, it's 22%. Called strikeouts are inconsistent year to year - witness Erik Bedard's 26% in 2004 to 50% in 2005 to 26% in 2006 to 40% in 2007 - so that's probably a big chunk of it. For some pitchers (Maddux) this is a skill, but it's still prone to wild variation. If you see a guy hit a career high or career low in called strikeout percentage, be wary going forward.

If a pitcher's contact rate changes dramatically, then the predictive value of his previous season's estimated K/G goes out the window. If it doesn't, though, then I think this might have a little potential. Something to keep an eye on, anyway.

Late update: something I forgot to mention is that, judging by early indications, this only works for starters. Relievers appear to have a y = mx + b equation all their own.