LL Exclusive: Introducing Reformatted Batted Ball Data

Ever wished you could get batted ball data in a format better suited for en masse analysis than Jeff Zimmerman's leaderboards or FanGraphs' spray charts? Well...

one of the good ones
Otto Greule Jr

Those of you who read the Off-Topic threads know that I'm a freshman at the Olin College of Engineering. Olin is a tiny school just west of Boston with a very unusual curriculum. There aren't really conventional lectures here; instead, classes focus on project-based learning. In other schools' intro CS courses, you might attend a two-hour lecture about text scraping from the internet. In Olin's intro CS course, the professor says "you have one week to write a program that scrapes text from the internet and does something cool with it. If you need help, ask me."

I mention this because ten days ago my intro CS professor told me "you have one week to write a program that scrapes text from the internet and does something cool with it. If you need help, ask me." Having recently spent about ten years harvesting batted ball data from Baseball Heat Maps for this article on OPPO%, I knew exactly what I was going to do.

Three paragraphs is enough time spent burying the lede, don't you think?

Click here to download a .zip archive of .csv spreadsheets containing xy positions, difficulties, and outcomes of every batted ball fielded by every FanGraphs-designated center fielder in the last two years. The data is from FanGraphs' Spray Chart pages, which are wonderful but sadly don't really allow for quantitative analysis. I used a Python script that I wrote myself, which relies on some functions from Python's csv and Pattern libraries.
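If you want to pull one of those spreadsheets into Python yourself, something like the sketch below should get you started. Fair warning: the column names (`x`, `y`, `difficulty`, `outcome`), the `"caught"` label, and the sample rows are all placeholders I made up for illustration, so check the actual headers in the .zip and adjust. This is not the scraper itself, just a minimal reader built on the standard `csv` module.

```python
import csv
import io


def load_batted_balls(file_obj):
    """Read batted-ball rows from an open CSV file into a list of dicts.

    Assumes (hypothetically) the columns x, y, difficulty, outcome.
    """
    balls = []
    for row in csv.DictReader(file_obj):
        balls.append({
            "x": float(row["x"]),
            "y": float(row["y"]),
            "difficulty": row["difficulty"],
            # Assumed label: any outcome other than "caught" counts as a miss.
            "caught": row["outcome"] == "caught",
        })
    return balls


# Made-up sample data standing in for one player's spreadsheet.
SAMPLE_CSV = """x,y,difficulty,outcome
120.5,80.0,90-100%,caught
95.0,60.2,40-60%,missed
"""

balls = load_batted_balls(io.StringIO(SAMPLE_CSV))
```

For a real file, swap the `io.StringIO` for `open("some_player.csv", newline="")`, where the filename is again whatever the archive actually uses.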

Now, this data isn't yet perfect. There are still some kinks in the code I need to work out. To see some of what I mean, compare FanGraphs' spray chart for Michael Saunders

Source: FanGraphs

to the one that my program outputs:

Mine is missing the "impossible catches", four of the "difficult" catches are in the wrong place, and one more simply doesn't exist. Whoops. I'm going to get cracking on fixing this up soon, but until I do, it's probably best not to do any analysis that relies very heavily on the exact positions of data points in the "missed catches" columns.

That said, there's a lot of cool work that we can do here. For example, it's possible to use this data to create an extremely simplified version of John Dewan's Plus/Minus system, one of the components of DRS. The idea behind Plus/Minus is to credit fielders for the catches that they make that other fielders don't while penalizing them for balls they miss that other fielders catch. My system loosely replicates Plus/Minus by weighting every batted ball by the average difficulty of its category (as defined by FanGraphs' Inside Edge data): a catch earns one minus the category's average catch rate, and a miss costs the average catch rate. In this manner, a fielder is credited:

+.05 for a catch 90-100% of players make
+.25 for a catch 60-90% of players make
+.5 for a catch 40-60% of players make
+.75 for a catch 10-40% of players make
+.95 for a catch 1-10% of players make
-.95 for a missed catch 90-100% of players make
-.75 for a missed catch 60-90% of players make
-.5 for a missed catch 40-60% of players make
-.25 for a missed catch 10-40% of players make
-.05 for a missed catch 1-10% of players make

Run the numbers, divide by opportunities to convert to a rate stat, and what do you get?
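Here's roughly what "run the numbers" looks like in Python. Treat it as a sketch rather than my actual script: the function name, the bucket labels, and the `(bucket, caught)` input format are all made up for the example. The only things taken from the list above are the weights, expressed here as the average catch rate for each bucket.

```python
# Average catch rate per difficulty bucket, matching the weights listed above.
# The label strings are placeholders; use whatever your data calls them.
CATCH_RATE = {
    "90-100%": 0.95,
    "60-90%": 0.75,
    "40-60%": 0.50,
    "10-40%": 0.25,
    "1-10%": 0.05,
}


def plus_minus_rate(plays):
    """Return simplified Plus/Minus credit per opportunity.

    plays is an iterable of (bucket, caught) pairs, where caught is a bool.
    A catch earns (1 - catch rate); a miss costs the catch rate.
    """
    total = 0.0
    opportunities = 0
    for bucket, caught in plays:
        rate = CATCH_RATE[bucket]
        total += (1.0 - rate) if caught else -rate
        opportunities += 1
    # Avoid dividing by zero for a player with no recorded opportunities.
    return total / opportunities if opportunities else 0.0


# Three made-up plays: a routine catch, a coin-flip catch, a routine miss.
example_plays = [("90-100%", True), ("40-60%", True), ("90-100%", False)]
example_rate = plus_minus_rate(example_plays)
```

In that example the routine miss (-.95) outweighs both catches (+.05 and +.5), so the fielder ends up below zero, which is the behavior you'd want.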

An R^2 of .25 with UZR/150. OK, so that's not super amazing. But you know what? Considering all of the current issues with this data set, I will take 0.25. The %difficulty ratings, as far as I know, are entirely subjective and come from Inside Edge. The batted ball data is from MLBAM, which means it marks where a ball was fielded and not where it landed. My program doesn't include "impossible to make" catches, which is why all of the Plus/Minus ratings are skewed .045 to the right. Also, it's flat-out missing some data points. And it still gets an R^2 of .25, which, as we covered last week, means there's a moderate correlation. Not bad at all.
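If you'd like to check an R^2 like that yourself, here's a small sketch. It assumes a plain one-variable linear fit, where R^2 is just the square of the Pearson correlation between the two columns; the function name and the toy numbers are mine, not real player data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences.

    For a simple linear fit with an intercept, this equals the R^2 of the fit.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")
    return covariance ** 2 / (var_x * var_y)


# Toy numbers only: a perfectly linear pair and a noisy pair.
perfect = r_squared([1, 2, 3], [2, 4, 6])
noisy = r_squared([1, 2, 3, 4], [1, -1, 1, -1])
```

An R^2 of .25 corresponds to a correlation coefficient of .5 in absolute value, which is where "moderate" comes from.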

As I've said above, this data isn't perfect. But keep an eye on Lookout Landing: over the next few weeks, I'll be rolling out new code that'll improve data quality and expand the scope of this project beyond just center fielders. I'm excited.

In the meantime, download that .zip and play around with the numbers for a while. See if you can come to any interesting conclusions. Maybe certain center fielders play deeper than others? Maybe you can approximate range by finding the radius of a CF's fielded balls in play (or the radius of their closest miss)? There's all sorts of cool work to be done here.
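To make the range idea a little more concrete, here's one way you might sketch it. The big assumption is what the radius is measured from: I'm using the average location of a fielder's catches as a stand-in for where he starts, which is a guess, not something the data tells you. The function names and coordinates are invented for the example, and given the known problems with the "missed catches" positions, take any closest-miss number with a grain of salt.

```python
from math import hypot


def centroid(points):
    """Average (x, y) of a non-empty list of points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def farthest_catch(catches, center):
    """Distance from center to the farthest ball the fielder caught."""
    cx, cy = center
    return max(hypot(x - cx, y - cy) for x, y in catches)


def closest_miss(misses, center):
    """Distance from center to the nearest ball the fielder missed."""
    cx, cy = center
    return min(hypot(x - cx, y - cy) for x, y in misses)


# Invented coordinates in whatever units the spreadsheet uses.
catches = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
misses = [(2.0, 7.0), (10.0, 2.0)]

start = centroid(catches)            # assumed proxy for starting position
reach = farthest_catch(catches, start)
gap = closest_miss(misses, start)
```

The y coordinate of `start` is also a crude answer to the "who plays deeper" question, provided you first work out which direction on the chart is away from home plate.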

I look forward to helping facilitate it.