## Can Players Consistently Outperform their Expected Stats? An Analysis

There’s something funny going on with Jose Altuve.

Altuve has been out-performing his expected stats. Significantly. On average per season, he’s out-performed his expected batting average by .032. That means he’s stealing roughly one extra hit for every 30 balls he puts in play.
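As a quick check on that arithmetic, here is a tiny sketch. It reads the .032 gap as a rate per ball in play, the way the paragraph above does, which is a simplification since batting average is technically measured per at-bat.

```python
# Average seasonal gap between Altuve's BA and xBA, from the paragraph above.
avg_gap = 0.032

# If each ball in play carries an extra 0.032 hits on average,
# it takes 1 / 0.032 balls in play to add up to one extra hit.
balls_per_extra_hit = 1 / avg_gap

print(f"One extra hit about every {balls_per_extra_hit:.0f} balls in play")
```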

And this isn’t a small sample - this is across 3,953 balls in play in his Statcast-era career. In fact, expanding the sample to all qualified hitters in the Statcast era, across 756,356 balls in play, hitters are out-performing expected batting average by .0057, expected slugging by .0100, and expected wOBA by .0019.

Now, I’m averaging across player-seasons, meaning that these may be placing higher weights on the seasons of players who make more contact, but still - over a sample this big, we’d expect not to see much of a difference at all between expected stats and actual stats. After all, we’ve been told that these expected stats are based on how, historically, balls hit into play at given launch angles and exit velocities have translated into actual hits. And yet, Altuve has been able to steal nearly a hit a week.
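To make the averaging caveat concrete: a simple mean over player-seasons counts every season equally, while a mean weighted by balls in play counts every batted ball equally, and the two can disagree. A minimal sketch with made-up numbers (the three seasons below are hypothetical, not real players):

```python
# Hypothetical player-seasons: (balls in play, BA - xBA). Not real data.
seasons = [
    (500, 0.010),
    (300, 0.000),
    (100, -0.020),
]

# Simple mean: every player-season counts the same.
simple_mean = sum(diff for _, diff in seasons) / len(seasons)

# Weighted mean: every ball in play counts the same.
total_bip = sum(bip for bip, _ in seasons)
weighted_mean = sum(bip * diff for bip, diff in seasons) / total_bip

print(f"simple:   {simple_mean:+.4f}")
print(f"weighted: {weighted_mean:+.4f}")
```

With these invented numbers the two averages even land on opposite sides of zero, which is why it matters which one a headline figure uses.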

So what gives? Well, armed with a rudimentary knowledge of Python, I’m going to attempt to answer this question. I’m going to look at a couple of statistics that I think might affect how well a batter is able to beat their xBA, and we’ll see if any of them seem to correlate at all with their actual xBA to BA differential.

I’m going to start by looking only at 2023 player-seasons, with an eye specifically on players who hit more than 50 balls into play. The weakness of this approach is that I can’t look at individual types of hits and how they may break xBA - in other words, since I’m looking at player performances, I’m not looking at actual batted balls here. Instead I’m looking in aggregate to see which types of players are the ones who are able to consistently beat their xBA over the span of a season.

First, I wanted to look at something obvious - the relationship between the number of balls hit into play and the difference between a player’s xBA and their BA, which you’ll see as BA - xBA from here on. A positive number indicates that a player outperformed their expected batting average, a negative number indicates they underperformed it.
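For anyone following along at home, the filtering and the BA - xBA calculation might look something like the sketch below. The column names (`player`, `bip`, `ba`, `xba`) and every row are placeholders I made up for illustration; a real Savant export will be laid out differently.

```python
import csv
import io

# Stand-in for a downloaded leaderboard CSV. The column names and
# the rows are hypothetical, not Savant's actual export format.
raw = io.StringIO(
    "player,bip,ba,xba\n"
    "Player A,310,0.300,0.250\n"
    "Player B,540,0.262,0.262\n"
    "Player C,42,0.190,0.240\n"
    "Player D,205,0.230,0.251\n"
)

rows = []
for row in csv.DictReader(raw):
    bip = int(row["bip"])
    if bip <= 50:  # keep only hitters with more than 50 balls in play
        continue
    # Positive means the hitter beat his expected batting average.
    diff = round(float(row["ba"]) - float(row["xba"]), 3)
    rows.append({"player": row["player"], "bip": bip, "ba_minus_xba": diff})

over = sum(r["ba_minus_xba"] > 0 for r in rows)
under = sum(r["ba_minus_xba"] < 0 for r in rows)
print(rows)
print(f"overperformed: {over}, underperformed: {under}")
```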

(Yes, here "balls into play" includes home runs, sorry not sorry that that’s confusing with BABIP where they’re not counted)

Looking at this relationship, we’d expect a scatter plot to look like a bell curve essentially, where as the number of balls in play goes up on the y-axis, the dots cluster along the x-axis near the center, where BA - xBA is close to zero. And indeed, that’s what we get:

The data sorts itself into an almost completely tidy little bell curve, with that dot at about 300 balls in play and beyond the 0.06 mark along the x-axis being our friend Jose Altuve. However, this bell is quite wide - really, if xBA were fantastic, we’d expect more of a pinched bell instead of the wide arc we get from this one. In fact, Alex Bregman is the only player with more than 525 balls in play with a BA - xBA less than .001 in either direction, and Luis Arraez outperformed his by .025 with 544 batted ball events.
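One rough yardstick for how wide the bell "should" be is the standard error of a hit rate under a plain binomial model. This is a sketch under strong assumptions: it treats every ball in play as an independent coin flip with the same hit probability, and the 0.250 used here is an illustrative number, not a league figure.

```python
import math

def hit_rate_std_error(hit_prob: float, balls_in_play: int) -> float:
    """Standard error of an observed hit rate under a simple binomial model."""
    return math.sqrt(hit_prob * (1 - hit_prob) / balls_in_play)

# 0.250 is an illustrative hit probability, not a league figure.
for n in (50, 300, 550):
    se = hit_rate_std_error(0.250, n)
    print(f"{n:>3} balls in play: +/- {se:.3f}")
```

Under those assumptions the spread works out to about .061 at 50 balls in play, .025 at 300, and .018 at 550, so at least some of the width in the chart is what chance alone would produce.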

Another thing we can gather looking at this chart is that we generally speaking have an even split of overperformers and underperformers - indeed, 117 batters overperformed their xBA, while 131 underperformed it. This tracks with the fact that 2023 was the lowest the gap between the MLB average BA and the MLB average xBA has been in the Statcast era; in fact, Savant displays it as 0.000 on the expected stats leaderboard.

So let’s look at what I’m most suspicious of - ground ball exit velocity. I have a suspicion that hitters who can hit faster ground balls may be more consistently beating what xBA thinks they can do.

Nothing. This is just a lot of noise with no real relationship seeming to emerge. In fact, if anything, it looks like it could be that slower ground ball exit velocities result in xBA outperformance, but it’s hard to really tell from this graph.
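If eyeballing a scatter plot feels too hand-wavy, a correlation coefficient puts a number on "noise": values near zero mean little linear relationship, values near +1 or -1 mean a strong one. Here is a small sketch of the Pearson calculation; the six data points are invented for illustration and are not the 2023 sample.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: average ground ball exit velocity (mph) and BA - xBA.
gb_exit_velo = [84.1, 86.5, 88.0, 85.2, 89.3, 87.4]
ba_minus_xba = [0.012, -0.008, 0.003, 0.020, -0.015, 0.001]

print(f"r = {pearson_r(gb_exit_velo, ba_minus_xba):+.2f}")
```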

Moving on to launch angle, another stat that I think may show to relate to xBA outperformance, we see similar noise.

It certainly doesn’t seem like players who outperform their expected stats are hitting the ball in any particular way to do so. So let’s move on to another potential statistic - sprint speed. Perhaps players who are able to get to first quickly are consistently outrunning ground balls, something that xBA would seem not to account for?

Nuts nuts nuts. More noise. We’re really not making much progress, and I’ve downloaded a lot of CSVs and written a lot of bad Python to get to this point. I won’t lie to you, reader - this is the point where I almost didn’t write this article. Because I couldn’t really think of where to go from here.

Luckily, I had one more idea - and it came to me when I was messing with the custom leaderboards on Savant. It turns out, they have a way to simply view the number of balls in play by type - groundball, fly ball, line drive, pop up - and count by whether they were a hit or an out. Using this, I was able to devise the rate at which players were turning different types of balls in play into hits.
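The hit rate calculation itself is just hits divided by hits plus outs for each batted ball type. A minimal sketch, using hypothetical counts for a single made-up player (pop-ups are left out since they don't show up in the rest of the analysis):

```python
# Hypothetical hit and out counts by batted ball type for one player.
counts = {
    "ground_ball": {"hits": 30, "outs": 90},
    "line_drive": {"hits": 50, "outs": 30},
    "fly_ball": {"hits": 12, "outs": 68},
}

def hit_rate(hits: int, outs: int) -> float:
    """Share of balls in play of one type that became hits."""
    total = hits + outs
    return hits / total if total else 0.0

rates = {kind: hit_rate(c["hits"], c["outs"]) for kind, c in counts.items()}
for kind, rate in rates.items():
    print(f"{kind}: {rate:.3f}")
```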

Suddenly, we’re somewhere. While the rate at which hitters were turning fly balls into hits seemed the same level of noisy and uncorrelated as all of the other relationships we’ve looked at, the rates at which they were able to turn line drives and ground balls into hits sure seemed to have a linear relationship with BA - xBA. Altuve, our test case, is that dot all the way to the right - where he's been on all of these charts, as the player with more than 50 batted balls who outperformed his xBA the most in 2023, at a .066 difference. He converted ground balls into hits at a rate of .329 - significantly higher than the sample's average of .238.

Let’s run a linear regression on this, maybe we actually have some decent P-values.

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.1224 | 0.010 | -11.972 | 0.000 | -0.143 | -0.102 |
| GB Hit % | 0.2149 | 0.019 | 11.554 | 0.000 | 0.178 | 0.252 |
| LD Hit % | 0.0986 | 0.015 | 6.538 | 0.000 | 0.069 | 0.128 |
| FB Hit % | 0.0207 | 0.011 | 1.870 | 0.063 | -0.001 | 0.043 |

…alright, I’ll admit I don’t know much about what we’re looking at here either. I just know that this is a thing you can do to see how much influence different variables have on another variable, in this case the unpictured BA - xBA. I’m providing this table for transparency in case someone who does know what they’re looking at wants to tell me how dumb I am for interpreting it this way, but essentially, it’s saying it would use the following formula to predict a batter’s overperformance of xBA based on how often they convert these individual batted ball types into hits:

(GB Hit %) * 0.2149 + (LD Hit %) * 0.0986 + (FB Hit %) * 0.0207 - 0.1224 = Predicted BA - xBA
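This article doesn't show the regression code, so here is one way a fit like this could be reproduced, sketched with NumPy's least-squares solver rather than whatever produced the table. Everything in it is synthetic: the hit rates are randomly generated, and the target is built from the formula above plus a little noise, purely so we can confirm the fit recovers coefficients we already know.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 250  # number of synthetic player-seasons

# Synthetic hit rates by batted ball type (made up, not Savant data).
gb = rng.normal(0.24, 0.05, n)
ld = rng.normal(0.63, 0.06, n)
fb = rng.normal(0.15, 0.05, n)

# Build the target from the article's fitted formula plus a little noise,
# so we know which coefficients the fit should recover.
true_coefs = np.array([-0.1224, 0.2149, 0.0986, 0.0207])
X = np.column_stack([np.ones(n), gb, ld, fb])  # intercept column first
y = X @ true_coefs + rng.normal(0, 0.005, n)

# Ordinary least squares: find coefficients minimizing squared error.
fitted, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(["Intercept", "GB Hit %", "LD Hit %", "FB Hit %"], fitted):
    print(f"{name:<10} {value:+.4f}")
```

A plain least-squares solve only gives the coefficients; the standard errors, t-values, and P-values in the table need a statistics package or some extra math on top.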

So here you can see, the linear regression thinks that turning ground balls into hits at a higher rate has more than double the influence on your ultimate outperformance of xBA as converting your line drives does, while converting your fly balls has a significantly smaller effect. The ground ball and line drive coefficients come with tiny P-values, too; the fly ball one, at 0.063, falls just short of the usual 0.05 cutoff.
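To get a feel for the size of the ground ball coefficient, we can plug in the two ground ball hit rates quoted earlier (Altuve's .329 and the sample average of .238) and see how much the prediction moves. The line drive and fly ball rates below are placeholders I made up; they cancel out of the comparison, so their exact values don't affect the result.

```python
# Coefficients from the regression table above.
INTERCEPT = -0.1224
GB_COEF, LD_COEF, FB_COEF = 0.2149, 0.0986, 0.0207

def predicted_ba_minus_xba(gb_rate: float, ld_rate: float, fb_rate: float) -> float:
    """Apply the fitted formula to a hitter's hit rates by batted ball type."""
    return INTERCEPT + GB_COEF * gb_rate + LD_COEF * ld_rate + FB_COEF * fb_rate

# Ground ball hit rates quoted earlier: Altuve's and the sample average.
altuve_gb, average_gb = 0.329, 0.238

# Placeholder line drive and fly ball rates (assumed, not from the article).
# They cancel out in the difference below, so their values don't matter here.
ld_rate, fb_rate = 0.60, 0.15

gb_effect = (predicted_ba_minus_xba(altuve_gb, ld_rate, fb_rate)
             - predicted_ba_minus_xba(average_gb, ld_rate, fb_rate))
print(f"Predicted gap explained by ground balls alone: {gb_effect:+.3f}")
```

By this formula, the ground ball edge alone accounts for roughly .020 of predicted BA - xBA, a decent chunk of Altuve's .066 but far from all of it.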

Just for fun, I ran it for every player with more than 25 plate appearances this year, regardless of number of batted balls. The regression became even more confident in the influence of ground ball hit rate, and the charts looked like this:

The ground ball rate chart looks like a steep line, the line drive rate chart looks like a cluster drifting slightly upward, and the fly ball rate chart looks like just a ball of nothing.

So what does this mean? I’m not sure. This isn’t an effect of sprint speed - we already dismissed that earlier in this article. No, some players are just able to hit ground balls in such a way that they’re able to outperform their expected batting average. Not with speed, not with batted ball velocity - another thing we dismissed - but with something we’re not measuring here. Spray? Pulling the ball less? Is this just a fancier way to get to what is ultimately BABIP luck?

This is where I’m going to stop writing this article. I am, after all, a complete amateur in this. I took one statistics class, which I’m pretty sure didn’t even go into linear regressions. Before I spend another night diving further into the possibilities, I want to share what I have so far, to see if I’m making sense to anyone else.

The next step would be to look to see if this relationship seems to carry over across all Statcast seasons by involving them all in the regression. After that, I would look into whether pull rate has an effect - if, as I floated a few paragraphs ago, these grounders are beating the shift or in fact being hit where there’s no shift at all, some diving into Fangraphs could certainly show that in a similar way.

In the meantime, I leave you with a thanks - specifically to Nathan Braun's Learn to Code with Baseball, which was heavily referenced while diving into the Python needed to write this article. If you are interested in learning the way to do your own analysis, it's a solid place to start.