Just be happy they aren't eating your soul - Tom Szczerbowski
You, me, everyone; we live in a horrible reality. Instead of robots, vastly superior in every way, squishy, irrational humans are thrust into positions of power. Dumb, smelly humans with their blind spots and cognitive biases dare to stand in judgment instead of being kept as diversionary pets as The Great Marauder intended. One such position where humans have held onto power for far too long is standing behind the catcher at baseball games and deciding whether taken pitches were within the strike zone or not. This should not persist, but it does, and I am powerless to effect change.
That particular situation is vexing to me because I quite like the plate discipline suite of baseball stats, such as swing and contact rates on pitches in and out of the zone. I think they are the best ways of measuring differences in approach and are usually interesting and sometimes even illuminating. However, they are built on the assumption that the strike zone is represented accurately. When you build on a crummy foundation, the whole structure becomes unstable, and so it can be with those stats should the strike zone model be inadequate.
I'm not fully versed in how other baseball sites compute the boundaries of the strike zone. I believe that FanGraphs goes with Baseball Info Solutions' (BIS) determination, and I know that BIS does a lot of their categorizations by hand, but I don't know if they do that for the strike zone. I believe that they do, but even if so, where do they decide the edges are?
I'm not impugning FanGraphs or BIS here or implying that they're using unsophisticated or incorrect strike zone models. For all I know, they've done the same or way better research than what I will lay out below. I don't feel comfortable taking things on faith though, and I like to dig into the stuff on my own and be able to reproduce it. Doing so ups my level of trust.
A few months back I investigated and adjusted how I, and StatCorner by extension, compute the width of the strike zones for left-handed and right-handed hitters. Before, I merely set the horizontal limits to be the rule book (8.5 inches to either side) plus a little extra for "the black." And the strike zone was static, ignorant of the side of the plate on which the batter stood. The top and bottom were determined by the pitch F/X operators for each hitter.
Aside from the all-too-American trait of being a bit too thick in width, the above strike zone represents a model for how the zone should be called, if those responsible were to follow the rule book. Which they should because why else have a rule book? Unfortunately, it is not an accurate model for how the strike zone is actually called.
For that, I first had to separate batters by handedness. The so-called "lefty strike" is a well-known phenomenon in baseball wherein pitches well off the outside of the plate are called for strikes far more often against lefty hitters than righties. Why? Because shut up. Therefore, when computing the practical limits of the zone, it makes obvious sense to generate two pictures, one for right-handed hitters and one for left-handed hitters.
In that past post I established the new specifications for the zone by starting at the edge of the plate and examining thin slices of space moving steadily outward, looking to see where the percentage of taken pitches crossed the 50% mark from called strike to called ball. I stuck with the pitch F/X-established lines for the top and bottom since I think the differing heights and crouches make a static number for the top and bottom of the zone inferior to an admittedly subjective human decision. The zones I came up with are shown below, along with the replaced, naive zone.
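The slicing procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code StatCorner runs: the function name, the half-inch slice width, the 24-inch search limit, and the input format (one side of the plate at a time, as pairs of distance from the center of the plate in inches and whether the taken pitch was called a strike) are all hypothetical.

```python
def find_zone_edge(pitches, start=8.5, slice_width=0.5, max_offset=24.0):
    """Walk outward from the plate edge in thin slices and return the inner
    boundary (in inches from the center of the plate) of the first slice in
    which fewer than half of the taken pitches were called strikes.

    pitches: iterable of (offset_inches, called_strike) pairs for taken
    pitches on ONE side of the plate. All parameter defaults are assumptions
    chosen for illustration.
    """
    pitches = list(pitches)
    n_slices = int((max_offset - start) / slice_width)
    for i in range(n_slices):
        lo = start + i * slice_width
        hi = lo + slice_width
        calls = [strike for offset, strike in pitches if lo <= offset < hi]
        if not calls:
            continue  # no taken pitches landed in this slice, so skip it
        if sum(calls) / len(calls) < 0.5:
            return lo  # called balls now outnumber called strikes
    return None  # the strike rate never dropped below 50% in the range


# Made-up data: ten taken pitches per slice, with fewer strikes farther out.
sample = []
for offset, strikes in [(8.75, 9), (9.25, 7), (9.75, 6), (10.25, 4)]:
    sample += [(offset, True)] * strikes + [(offset, False)] * (10 - strikes)

edge = find_zone_edge(sample)
```

Running it separately on each side of the plate, and separately for left-handed and right-handed batters, would give the four horizontal limits of the two handed rectangles.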
For right-handers, it wasn't too dramatic of a change, just a slight widening from 9.5 inches on each side to about 11 inches. But for lefty-hitters such as Michael Saunders, the zone doesn't just get wider, it shifts significantly. The 50% threshold was crossed a whopping five inches off the outside edge of the actual home plate. Meanwhile, the inside edge of the strike zone to lefties actually held true to the plate.
That change held meaningful implications for how to evaluate the approach taken by hitters such as Saunders and Dustin Ackley. For Saunders, the more accurate zone showed that his more aggressive approach this season was limited entirely to pitches within his likely strike zone. His swing rate on pitches outside the zone actually dropped a minimal amount while his rate of swings in the zone went from 58% all the way up to 70%.
In Ackley's case, it better highlighted how much he was getting hamstrung by pitchers attacking that particular zone, and also that he may finally be adjusting to it. He'll have to, otherwise he'll continue to get eaten alive on those outside pitches and disappoint us Mariner fans, the greatest disappointment any sane person could suffer.
The handed strike zones are a much more accurate representation of the actually called strike zone than the naive, idealistic model. Since hitters and pitchers have to adjust and play to the called strike zone, not the rule book one (looking at you, Dustin), I think that the zone-based stats should reflect that practical zone rather than what the rules counsel.
The strike zones above are improved but are still represented by rectangles. Rectangles are great for coding up a strike zone since it's trivially easy to check if a pitch's location is within its bounds, but alas, they aren't satisfactory at capturing how umpires call the zone. Here's how they are actually called.
I took every pitch F/X recorded pitch from the past six (2007-12) seasons that was either a called ball or strike (intentional balls withheld) and fitted a model* to create a topographical-style chart so that any given (x,y)-coordinate pair (i.e. a pitch location) would yield the likelihood (from 0 to 1) said pitch would be ruled a strike. Then I drew contour lines and splashed on some color, like a boss.
*In case you care, it was a generalized additive model using a binomial distribution (0 for ball, 1 for strike) and a logit link function.
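For the notation-inclined, a model of that type can be written roughly as below. The symbols are introduced here for illustration only, and whether the smooth term was fitted jointly over both coordinates or as separate terms for each is not stated, so the joint form f(x, y) is an assumption.

```latex
\operatorname{logit}\bigl(p(x, y)\bigr)
  = \log\frac{p(x, y)}{1 - p(x, y)}
  = \beta_0 + f(x, y),
\qquad
\text{call} \sim \operatorname{Bernoulli}\bigl(p(x, y)\bigr)
```

Here p(x, y) is the probability that a taken pitch at location (x, y) is called a strike, \beta_0 is an intercept, and f is a smooth function estimated from the data. The 50% contour is the set of locations where the right-hand side equals zero.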
You can click on the images, any of them in this post, to get much bigger versions. Shrunk down as they appear here, the two zones might look pretty similar. They are, but if you want a better comparison between the two to see how the lefty zone shifts over, here's a GIF. You're welcome.
In contrast to the models above, these zones are not very rectangular at all. They look quite circular, actually, though you might be getting fooled by the differing axis-lengths. The horizontal axis covers four feet while the vertical axis covers three feet because I didn't want a bunch of excess space.
Those heat maps may be pretty (well, I think so) in a very 8-bit graphics way, but they don't do much to compare to the previous graphs of the strike zone models. So let me strip away everything except that same 50% barrier that was shown in the handed strike zone models, but this time with the fitted actual zone.
That's clearer. But to get a sense of the difference between modeling the actual strike zone calls, which is the goal, and the current, rectangular, handed model, I'll split the graphs back into two, by batter hand, and show the naive model, the handed model and the actual zone.
The handed rectangles are clearly better than the one-size-fits-all original model, but they still show the typical problems of circle versus square, too much space on the corners and not enough on the edges. I mentioned above that rectangular was nice because that's easy to code. Amorphous is very complex, nigh impossible. Luckily, these zones are approximately not amorphous. An ellipse actually comes quite close to matching them perfectly.
They aren't exact matches, but if you want to crib off this, you can reproduce these strike zones with the following equations:
For LHB: ((x+.18) / 1.01)^2 + ((y-2.52) / 1.02)^2 = 1
For RHB: ((x+.04) / 1.04)^2 + ((y-2.54) / 1.02)^2 = 1
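Checking a pitch against an ellipse turns out to be about as easy as checking it against a rectangle. Here is a minimal sketch, assuming the pitch location is given in the same coordinate system and units as the equations above (presumably feet, as in the pitch F/X location fields, but that is an assumption). The names ZONES, in_zone, and zone_rate are made up for this example.

```python
# Ellipse parameters read straight off the two equations above:
# (x + .18) means the center sits at x = -0.18, and so on.
ZONES = {
    "L": {"cx": -0.18, "cy": 2.52, "rx": 1.01, "ry": 1.02},
    "R": {"cx": -0.04, "cy": 2.54, "rx": 1.04, "ry": 1.02},
}


def in_zone(x, y, bats):
    """Return True if location (x, y) falls on or inside the fitted ellipse
    for a batter of the given handedness ("L" or "R")."""
    z = ZONES[bats]
    return ((x - z["cx"]) / z["rx"]) ** 2 + ((y - z["cy"]) / z["ry"]) ** 2 <= 1.0


def zone_rate(pitches):
    """Fraction of pitches inside the zone.

    pitches: list of (x, y, bats) tuples. Hypothetical input format.
    """
    if not pitches:
        return 0.0
    return sum(in_zone(x, y, bats) for x, y, bats in pitches) / len(pitches)


# The same belt-high location counts as in the zone against a lefty
# but out of the zone against a righty, because the lefty ellipse is shifted.
lefty_call = in_zone(-1.1, 2.52, "L")
righty_call = in_zone(-1.1, 2.54, "R")
```

The left-hand side of each equation is simply evaluated at the pitch location, and anything at or below 1 counts as in the zone. A rate like a pitcher's zone percentage is then a count of pitches passing that test divided by the total.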
These elliptical equations are what I am settling on to map out the strike zone. StatCorner has already been updated to use this new model. Did you know that among starting pitchers (min: 100 BF) in 2012, Blake Beavan had the fourth-highest rate of throwing pitches within the strike zone? Beavan's rate was 56%. Worst in the Majors was Kyle Drabek, at only 37%.
Among Mariner relievers, the highest zone rate went to Oliver Perez, at 58%. Wait, what? Oliver Perez? Strikes? Somehow that happened. Perez ranked eighth (min: 50 BF) in the Majors for relievers. There's a lot more that I can and will dig into and report on using these improved zone models, but that will come later. Lastly, here's a summary GIF of the change from the rectangular to elliptical models.