For years now, I've had an issue with the most prominently used pitching metrics here. I don't trust home runs allowed as a useful statistic to hold against pitchers. I'm not a professional statistician but in the following, I'll try to use the right methods to explain why I distrust home runs and why I think there are better options, including a new one.

Suppose you took all plate appearances from 2007 until now and randomly split them in half into two buckets. Then you grouped them by pitcher, by pitcher role (i.e. starter or reliever), and by season. So in each bucket for example, you'd have a roughly equal number of PAs where Felix Hernandez, as a starter, from 2011, pitched. Each bucket is a representative sample of pitching performance. The samples wouldn't be fully controlled for because these PAs aren't also grouped by the exact batters faced, in the same park, with the same base runners, and so on. Gathering large enough samples that are perfectly matched would be an impossible task.
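For anyone who wants to try the split themselves, here is a minimal sketch of the bucketing step. It assumes each plate appearance is a dict with the keys `pitcher`, `role` and `season`; those names, the `split_half` helper and the toy data are my own placeholders, not anything from a real data feed.

```python
import random
from collections import defaultdict


def split_half(plate_appearances, seed=0):
    """Randomly split plate appearances into buckets "A" and "B",
    then group each bucket by (pitcher, role, season).

    The dict keys used here are hypothetical field names."""
    rng = random.Random(seed)          # seeded so the split is repeatable
    shuffled = list(plate_appearances)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    buckets = {"A": defaultdict(list), "B": defaultdict(list)}
    for i, pa in enumerate(shuffled):
        name = "A" if i < half else "B"
        key = (pa["pitcher"], pa["role"], pa["season"])
        buckets[name][key].append(pa)
    return buckets


# Toy data: 200 made-up plate appearances for one pitcher-role-season.
pas = [
    {"pitcher": "Felix Hernandez", "role": "starter", "season": 2011}
    for _ in range(200)
]
buckets = split_half(pas)
key = ("Felix Hernandez", "starter", 2011)
print(len(buckets["A"][key]), len(buckets["B"][key]))
```

With many pitchers in the pool, the two counts for any one pitcher would be roughly, not exactly, equal, which is all the comparison needs.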

What is common between the two samples are the things controlled for: who the pitcher is, the role he pitched in, and roughly when. With such a breakdown, you can begin to tell whether a statistic is mostly governed by those controlled factors by examining how closely it matches between the two groups. If something were 100% under the pitcher's control, it should be the same in both buckets. Felix's average release point in sample A is virtually identical to his average release point in sample B. That's intuitive; naturally it is the pitcher who dictates such a thing. So we can say that's a stable variable. If we graphed that for each pitcher, with Felix in sample A on one axis and Felix in sample B on the other, we'd expect to see all the points on a single straight line.

Now consider the average temperature in Edmonton, Alberta during each PA that Felix pitched. That would jump all over the place, and we would expect it to, because obviously Felix has no control or influence over the weather, especially not in Alberta! It's almost completely random, so we can call that a random variable. If we graphed that for each pitcher, we'd expect to see a random blob of points.

Whether the first pitch of the at bat is a fastball lies between those two. The pitcher (and catcher) do get to make that decision, so the control lies with him (them), but the identity of the hitter and the game situation do influence a pitcher's approach. So we can call that a mixed variable. If we graphed that for each pitcher, we'd expect to see something like a smudged, or thick, line because the samples would be highly, but not exactly, correlated.

Identifying where on this spectrum various statistics lie is important because we want to judge pitchers based on statistics that are as stable, as under the pitcher's control, as possible. Take strikeout rate, for instance, and let's look at all the pitchers in the two buckets, matched up with a minimum of 70 plate appearances in each sample.

As mentioned, something that was 100% under the pitcher's purview would look like a perfectly straight line. This is a bit of a smudgy line, but it does have a linear trend to it. If a pitcher had a strikeout rate of 20% in one sample, he was likely to have a strikeout rate close to 20% in the other sample too. And this pattern emerged quickly, with a cutoff of just 70 PAs needed to get the R-squared up to 0.5, which is a typical threshold for declaring a statistic to be worthwhile for prediction or measurement purposes.
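Here is a rough sketch of how that R-squared could be computed once the two buckets exist. It assumes each bucket maps a pitcher key to a list of 0/1 outcomes (1 meaning a strikeout); the function names and the toy numbers are made up for illustration.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two matched pitchers")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("rates do not vary, correlation is undefined")
    return cov / (var_x * var_y) ** 0.5


def split_half_r_squared(bucket_a, bucket_b, min_pa=70):
    """R-squared between each pitcher's rate in bucket A and bucket B,
    keeping only pitchers with at least min_pa outcomes in both."""
    xs, ys = [], []
    for key, outcomes_a in bucket_a.items():
        outcomes_b = bucket_b.get(key, [])
        if len(outcomes_a) < min_pa or len(outcomes_b) < min_pa:
            continue
        xs.append(sum(outcomes_a) / len(outcomes_a))
        ys.append(sum(outcomes_b) / len(outcomes_b))
    return pearson_r(xs, ys) ** 2


# Made-up toy buckets: three pitchers over the cutoff, one under it.
a = {"p1": [1] * 10 + [0] * 90, "p2": [1] * 20 + [0] * 80,
     "p3": [1] * 30 + [0] * 70, "p4": [1] * 25 + [0] * 25}
b = {"p1": [1] * 12 + [0] * 88, "p2": [1] * 19 + [0] * 81,
     "p3": [1] * 31 + [0] * 69, "p4": [0] * 50}
r2 = split_half_r_squared(a, b, min_pa=70)
print(round(r2, 3))
```

Raising `min_pa` drops the pitchers with the least data, which is how you would look for the cutoff at which R-squared reaches 0.5.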

So strikeout rate looks to be mostly the responsibility of the pitcher and we can roughly figure out a pitcher's talent at it quickly. That's an example — the best as it happens — of a good performance metric. Now look at another stat considered important, home run rate (park-adjusted) per flyball, with a cutoff of 50 flyballs.

Oh no. That's not very linear. That's very blobby. Blobs aren't good here. We don't want blobs because blobs tell us nothing. A pitcher who saw 10% of his flyballs go for home runs in one sample might reasonably have given up anywhere from 0% to 20% in the other sample. There's no telling here. The measurement isn't stable, even after accounting for the park. In fact, it isn't stable no matter how high a cutoff we use within a single season.

To me, that suggests that we should at least heavily discount, or ideally eliminate, the use of home runs when evaluating a pitcher's ability. However, nearly the opposite happens. Home runs are a big deal in baseball; they get lots of attention, not just from the public and highlight shows but also from pitching metrics, even advanced ones.

In the standard FIP formula, home runs are multiplied by 13, walks and hit batters by 3, and strikeouts by 2 (strikeouts count in the pitcher's favor). In tRA's 2012 values, a home run adds about 1.4 expected runs to the pitcher's performance while a walk or hit batter contributes about 0.4 expected runs. Those are big penalties.
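To make those weights concrete, here is a small sketch of FIP as a function. The 13, 3 and 2 weights are the ones quoted above. The division by innings pitched and the additive league constant are not spelled out above; they come from the usual published form of FIP, and the constant changes by season, so I leave it as a required argument. The sample line is invented.

```python
def fip(home_runs, walks, hit_batters, strikeouts, innings, constant):
    """Fielding Independent Pitching in its usual published form.

    `constant` is the season-specific value that puts FIP on the ERA
    scale; the caller must supply it, nothing is assumed here."""
    numerator = 13 * home_runs + 3 * (walks + hit_batters) - 2 * strikeouts
    return numerator / innings + constant


# Invented season line, with a placeholder constant of 3.2.
example = fip(home_runs=20, walks=50, hit_batters=5,
              strikeouts=200, innings=200.0, constant=3.2)
print(round(example, 3))
```

In this made-up line, each extra home run moves the result by 13/200 of a run, more than four times what a walk does.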

Of course, home runs are the most damaging event (for a pitcher), so it seems reasonable that pitchers be highly punished for giving one up. And they don't involve the defense so it seems reasonable that they are entirely the pitcher's fault, but as pointed out, neither home runs nor home run rate per fly ball stabilize over a season. That makes me wary of having home runs be as indiscriminately weighted as they currently are.

One way of including the home run component without relying on the unstable reality of actual home runs is to ignore actual home runs and instead craft an expected number of home runs allowed from some combination of other, hopefully more stable, results. The stat xFIP does this using the pitcher's number of flyballs and the league average rate on those becoming home runs. That's a reasonable estimator in part because, unlike home run rate, flyball rate does stabilize for pitchers.
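As a sketch, the xFIP-style estimate is just one multiplication. The function names and the league totals below are placeholders I made up to keep the arithmetic easy to follow, not real league figures.

```python
def league_hr_per_flyball(total_home_runs, total_flyballs):
    """League-wide share of flyballs that became home runs."""
    return total_home_runs / total_flyballs


def xfip_style_expected_hr(pitcher_flyballs, league_rate):
    """Expected home runs: the pitcher's flyballs times the league rate."""
    return pitcher_flyballs * league_rate


# Made-up league totals and a made-up pitcher with 180 flyballs.
league_rate = league_hr_per_flyball(total_home_runs=5_000,
                                    total_flyballs=50_000)
expected = xfip_style_expected_hr(180, league_rate)
print(expected)
```

The pitcher's actual home run total never enters the calculation, which is the point of the estimator.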

And given a large enough sample of the entire league, home run rate per fly ball is fairly stable; you can note that it changes little year to year. This estimation for a pitcher's seasonal home run total ends up working fairly well. Here's a comparison of xFIP-style home run prediction to actual (park-adjusted) home run totals, using a cutoff of at least 50 batted balls.

The graph is a little meteor-y looking, but just using the pitcher's number of flyballs and the league home run per flyball rate does provide a solid estimation. I've been pondering if we can do better though. Back in November I wrote about the direction home runs were hit. I'm going to quote my last image and conclusion here:

In a large swath of territory stretching from the pulled foul pole to about halfway toward center field, about half of fly balls are home runs. And then it drops to almost nothing remarkably quick.

In fact, if you split the field of play in half, with a pull side (from the batter's perspective) and an opposite field side, the league's home run per flyball rate is 25% on the pull side and only 3% on the opposite field side. That's a big difference, and it led me to wonder whether flyballs hit the other way should be contributing equally to a pitcher's expected number of home runs allowed. Perhaps a better estimator of home runs could be built by utilizing the direction of the batted balls.

There's a potential issue with doing that; the rate of pulled flyballs is less stable than the rate of flyballs in general. That's not a surprise, and it's not a non-starter, but it's worth being aware of. However, a composite of all batted balls, split into pulled or not, and multiplied by the league average rate of home runs from each type does appear to be a better predictor. Here's the same data as before, but with xFIP's home run predictor replaced by the more complex version.
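A minimal sketch of that composite follows. The 25% and 3% flyball rates are the league figures quoted above. Everything else is an assumption for illustration: the category names, the pitcher's counts, and treating groundballs as never becoming home runs. A fuller version would need league rates for every batted ball type, which I don't list here.

```python
def expected_home_runs(batted_ball_counts, league_hr_rates):
    """Sum of (count in category) x (league HR rate for that category).

    Categories are (batted ball type, direction) pairs."""
    return sum(count * league_hr_rates[category]
               for category, count in batted_ball_counts.items())


league_hr_rates = {
    ("flyball", "pulled"): 0.25,     # league rate quoted in the text
    ("flyball", "opposite"): 0.03,   # league rate quoted in the text
    ("groundball", "pulled"): 0.0,   # assumption: grounders are never homers
    ("groundball", "opposite"): 0.0, # assumption, as above
}

# Made-up batted ball counts for one pitcher.
counts = {
    ("flyball", "pulled"): 40,
    ("flyball", "opposite"): 60,
    ("groundball", "pulled"): 100,
    ("groundball", "opposite"): 50,
}
x_hr = expected_home_runs(counts, league_hr_rates)
print(round(x_hr, 1))
```

Two pitchers with the same 100 flyballs can now get quite different estimates, depending on how many of those flyballs were pulled.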

This might be tough to see, but this more comprehensive model, which I'll call xHR, does get closer to matching the number of home runs given up, especially for the pitchers who throw the most. And, unlike straight home run rate, xHR does stabilize for pitchers.

It is more noisy than just flyball rate, but the overall prediction seems to be closer to reality. So I think xHR is an improvement over both home runs and xFIP's home run estimator. By itself that's fine, but what I ultimately wanted to know was whether using xHR could help make a stat like FIP or tRA better and more stable.

To test this I had to expand and modify the general tRA model. Current tRA separates home runs from the batted ball types to avoid double counting. I split each batted ball type into pulled or not and re-ran the expected run and out values, this time without excluding the balls that became home runs. That bakes xHR into the batted ball types it was derived from and eliminates the need for a separate home run term.

That creates some meaningful differences. For a simplistic example, imagine a pitcher who faces two hitters. To the first he surrenders a ground out hit the other way, and to the second he gives up a pulled flyball. Now consider two scenarios. In the first, that flyball barely goes for a home run. The pitcher's tRA would be:

The run value of an average groundball (~0.04) + the run value of an average home run (~1.4), divided by the out value of an average groundball (~0.8) + the out value of an average home run (0), multiplied by 27. Or 48.60, really bad.

In the next scenario, imagine that the flyball was instead caught in a leaping grab at the wall. Nothing much changed about the actual events, but tRA is vastly affected. It's now:

The run value of an average groundball (~0.04) + the run value of an average fly ball (~0.03), divided by the out value of an average groundball (~0.8) + the out value of an average fly ball (~0.85), multiplied by 27. Or 1.15, really good!

It's simplistic and small, I know, but it's illustrative of my overall wariness with allowing home runs to have an impact disproportional to their reliability as pitching talent measures. For what it's worth, the swing in FIP between scenario 1 and 2 would be 42.24 to 0.00, equally problematic.

In contrast, the new version, which I'll call xRA, would stay constant across both scenarios, calculated as:

The run value of an average other way groundball (~0.05) + the run value of an average pulled fly ball (~0.4), divided by the out value of an average other way groundball (~0.75) + the out value of an average pulled fly ball (~0.6), multiplied by 27. Or 9.00, back to being bad, which it should be.
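For anyone who wants to check the arithmetic, here is a small sketch that runs the two tRA scenarios and the xRA calculation through one helper. The run and out values are the approximate ones quoted above; the helper name and the event labels are my own.

```python
def per_27_outs(events, values):
    """Expected runs per 27 expected outs for a list of events.

    `values` maps an event label to a (run value, out value) pair."""
    runs = sum(values[event][0] for event in events)
    outs = sum(values[event][1] for event in events)
    return 27 * runs / outs


# Approximate values quoted in the text: (run value, out value).
TRA_VALUES = {
    "groundball": (0.04, 0.80),
    "flyball": (0.03, 0.85),
    "home_run": (1.40, 0.00),
}
XRA_VALUES = {
    "groundball_opposite": (0.05, 0.75),
    "flyball_pulled": (0.40, 0.60),
}

tra_homer = per_27_outs(["groundball", "home_run"], TRA_VALUES)
tra_caught = per_27_outs(["groundball", "flyball"], TRA_VALUES)
# xRA sees the same two batted balls whether or not the flyball clears the wall.
xra_either = per_27_outs(["groundball_opposite", "flyball_pulled"], XRA_VALUES)
print(round(tra_homer, 2), round(tra_caught, 2), round(xra_either, 2))
```

Rounded to two places, these should reproduce the 48.60, 1.15 and 9.00 figures above.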

In fact, not even xFIP can lay claim to staying constant because its denominator is innings pitched, a number that is ironically not fielding independent. If the flyball went for a home run, leaving only one out recorded, the xFIP comes to about 7.39 compared to 5.27 if the flyball were caught, recording a second out.

Now, this example is cherry-picked. One could turn it around the other way and construct a hypothetical where the flyball in question is nowhere near being a home run but instead is hit to nearly dead center field. Other metrics would treat that flyball the same regardless of whether it fell two feet to the pull side of the hitter or two feet to the other side, whereas xRA does change based on that. However, that change in xRA is rather small, about a quarter of a run, especially compared to the home run differences, which exceed a full run.

The more consistent behavior in xRA serves to make it less variable over small samples compared to tRA. I'm not going to try and go through everything, but among the observations I made are:

- Looking at the spread of single-game values using the median absolute deviation (MAD), I found xRA's to be smaller than tRA's, 1.34 to 1.63. That means that, in general, xRA is less prone to super high or super low numbers.
- The same held true for season-long numbers with xRA being smaller, 0.46 to 0.53, than tRA.
- Then I looked at how well each stat was able to predict itself. That is, how far off was each single-game xRA or tRA from that pitcher's eventual season-ending number? Using the root mean square error (RMSE), I found that xRA's is again smaller than the corresponding RMSE for tRA, so xRA appears to be a better predictor.
- Finally, looking at the MAD for single-game xRAs and tRAs within each pitcher — that is, each game for each pitcher was compared only to other games thrown by that pitcher that year — showed the MAD for xRAs to be smaller, 1.35, than tRAs, 1.64.

The idea of that last bullet point is that, if your metric was mostly made up of stuff under the control of the pitcher and assuming that pitchers more or less stay similarly skilled throughout the season, then the metric shouldn't vary much from start to start for the pitcher. Some variation is going to occur of course, but the less there is, the more predictable it is and xRA showed less variation than tRA.
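For reference, here is a sketch of the spread measures used in the bullets above. The within-pitcher version is one plausible reading of that last bullet: take each game's deviation from that pitcher's own median, pool the deviations across pitchers, and take the median. The function names and the sample numbers are made up.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def rmse(game_values, season_value):
    """Root mean square error of single-game values against one
    season-ending number."""
    squared = [(g - season_value) ** 2 for g in game_values]
    return (sum(squared) / len(squared)) ** 0.5


def within_pitcher_mad(games_by_pitcher):
    """Pooled MAD where each game is compared only to the median of
    the same pitcher's games (an assumed reading of the bullet)."""
    deviations = []
    for games in games_by_pitcher.values():
        center = median(games)
        deviations.extend(abs(g - center) for g in games)
    return median(deviations)


# Made-up single-game values for two pitcher-seasons.
games = {"p1": [2.0, 3.0, 4.0, 9.0], "p2": [5.0, 5.5, 6.0]}
print(median_absolute_deviation([1, 2, 3, 4, 100]))
print(rmse([1.0, 3.0], 2.0))
print(within_pitcher_mad(games))
```

MAD largely ignores the one blow-up start (the 100 in the first example), which makes it a reasonable choice for single-game numbers that can run into the 40s.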

Every way I could think of to check showed that xRA produced more consistent results than tRA did, so that in addition to intuitively allaying my concerns over the weightiness of home runs, it appears to offer a genuine improvement in measuring pitchers.