Sabermetrics 101: Splits

I find splits almost fascinating (they're good writing material!), but it's very, very easy to misuse them.

Prerequisites for understanding: Sample sizes

Prerequisites for derivation: Data

Pattern (P)recognition

People like patterns. Love them, in fact. We're also very, very good at seeking them out, spotting patterns in data no computer could ever hope to analyse. Even better, we're able to generalise. Look around for a chair (there's likely one directly under you). Note its inherent chair-ness. People are able to recognise chairs with no problem at all. Swivelling chairs, comfortable chairs, even broken chairs: none of these match your platonic ideal of a chair, but you can identify them without a problem anyway. A computer can't process optical information nearly as efficiently as the human brain.

Then again, computers don't have our false-positive problem either. We see rabbits in clouds, faces on toast, and the Batman logo on Rorschach tests (unless that last one is just me). Clearly, this is silly. There are no giant sky rabbits, toast isn't anthropomorphic, and Batman isn't drawing all over my inkblot tests. Our pattern recognition ability is so strong that it leads us to find patterns where there are none. The examples I just gave are obvious ones, but sometimes silliness is not so clear-cut.

Take, for example, our habit of slicing up baseball statistics. Want to know a player's batting average on Tuesdays? How many stolen bases did Pitcher A give up in May? How about a player's performance in the clutch? We can cut up our data in pretty much any way we like, and non-whole-season numbers are typically referred to as 'splits'. Splits are everywhere, and they're especially ubiquitous in television broadcasts. But what do they mean? Why do we care? How much of our fascination with split statistics is due to false-positive pattern recognition and how much helps us to further our understanding of the game?

Filtering Our Information

Our first filter on interpreting splits must be an old friend - the ever-reliable "does this make sense?". Does it make sense for a player to do especially well on Thursdays in June, or are we merely seeing an artifact of randomness? A player will always do especially well on some day of the week in some month, after all. Why should we care about which specific day, and which specific month? We probably shouldn't. How about a player's offensive ability against same-handed pitchers? Clearly, this makes sense to look at. We know that it's harder for left-handed bats to hit left-handed pitching than right-handed pitching. Hitting in the clutch? This also passes the 'makes sense' test. We all know people who perform in high-pressure situations, and those who wilt, and so we apply that to baseball.
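
To see how easily the "some day of the week in some month" effect turns up, here is a minimal sketch in Python. It simulates a hypothetical hitter whose true average never changes, scatters his at-bats at random across day-and-month buckets, and reports the best and worst buckets. The .270 true average, the 600 at-bats, and the function name are all assumptions made up for illustration, not real data.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Apr", "May", "Jun", "Jul", "Aug", "Sep"]


def best_and_worst_split(true_avg=0.270, at_bats=600, seed=1):
    """Simulate one season for a hitter with a fixed true average and
    return (best_key, best_avg, worst_key, worst_avg) over day/month buckets."""
    rng = random.Random(seed)
    # Each bucket holds [hits, at_bats].
    tallies = {(d, m): [0, 0] for d in DAYS for m in MONTHS}
    keys = list(tallies)
    for _ in range(at_bats):
        key = rng.choice(keys)            # the at-bat lands in a random bucket
        tallies[key][1] += 1
        if rng.random() < true_avg:       # same hit probability everywhere
            tallies[key][0] += 1
    averages = {k: h / ab for k, (h, ab) in tallies.items() if ab > 0}
    best = max(averages, key=averages.get)
    worst = min(averages, key=averages.get)
    return best, averages[best], worst, averages[worst]


best, best_avg, worst, worst_avg = best_and_worst_split()
print(f"best split:  {best} {best_avg:.3f}")
print(f"worst split: {worst} {worst_avg:.3f}")
```

With only about 14 at-bats per bucket, the best bucket will almost always sit far above .270 and the worst far below it, even though nothing but chance separates them.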

Our second filter is appropriate regression. Would you accept a batter going 4 for his first 10 as a .400 hitter? No, so there's no reason to accept a batter going 4-for-10 against a certain pitcher as being indicative of very much at all. Regress everything toward the mean. Handedness splits, clutch splits, day-night splits, everything. Doing so will help to haul you out of the false-pattern trap.
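
One simple way to regress, sketched here under stated assumptions, is to pad the observed numbers with a chunk of league-average performance before computing the average. The .260 league average and the 500 at-bats of padding are illustrative guesses, not values from any study; the right amount of padding depends on the statistic and the split.

```python
def regress_to_mean(hits, at_bats, league_avg=0.260, prior_ab=500):
    """Blend an observed batting line with `prior_ab` at-bats of
    league-average hitting. Both defaults are illustrative assumptions."""
    return (hits + league_avg * prior_ab) / (at_bats + prior_ab)


# The 4-for-10 hitter looks like a .400 hitter on paper...
print(round(4 / 10, 3))                   # 0.4
# ...but after regression he is barely distinguishable from average.
print(round(regress_to_mean(4, 10), 3))   # 0.263
```

The more at-bats a split contains, the less the padding matters, which is the behaviour we want: small samples get pulled hard toward the mean and large ones mostly speak for themselves.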

We can also look at the predictive value of splits (admittedly, this is an offshoot of regression). Can we predict future performance in a given situation from past performance? In most cases, the answer is "not very well," with some statistics faring worse than others. Let's take clutch statistics. Superficially, we can predict clutch performance. Hitters who have performed well in the clutch will likely continue to perform well. However, what we can't predict is the delta between clutch hitting and regular hitting: analysis shows that this has a habit of collapsing to zero - meaning that the best clutch hitters are the best hitters. Left/right splits do much better, but still require significant regression before we can use them to make statements about ability.
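
The clutch point can be illustrated with a toy simulation. This is a sketch, not the analysis referred to above: it invents 1,000 hypothetical hitters with different true averages and, by construction, no clutch skill at all, then checks how well one season predicts the next. The talent spread (.270 mean, .030 standard deviation), the 100 clutch and 500 other at-bats per season, and every function name are assumptions.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def season(rng, talent, clutch_ab=100, other_ab=500):
    """Return (clutch average, clutch-minus-other delta) for one season.
    The hitter has the same hit probability in every situation."""
    clutch = sum(rng.random() < talent for _ in range(clutch_ab)) / clutch_ab
    other = sum(rng.random() < talent for _ in range(other_ab)) / other_ab
    return clutch, clutch - other


def simulate(n_hitters=1000, seed=0):
    """Year-to-year correlation of clutch average and of clutch delta."""
    rng = random.Random(seed)
    clutch1, clutch2, delta1, delta2 = [], [], [], []
    for _ in range(n_hitters):
        talent = rng.gauss(0.270, 0.030)   # assumed spread of true talent
        c1, d1 = season(rng, talent)
        c2, d2 = season(rng, talent)
        clutch1.append(c1)
        clutch2.append(c2)
        delta1.append(d1)
        delta2.append(d2)
    return pearson(clutch1, clutch2), pearson(delta1, delta2)


corr_clutch, corr_delta = simulate()
print(f"clutch average, year to year: {corr_clutch:.2f}")
print(f"clutch delta, year to year:   {corr_delta:.2f}")
```

In this toy world clutch averages carry over from year to year only because good hitters stay good, while the delta carries no signal at all. Real data may contain a small clutch effect that this sketch deliberately leaves out.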

Playing with splits is a dangerous game. Ensure you filter the information appropriately, however, and you can encounter some rewarding results. A pitcher does badly with the bases empty, even with his numbers heavily regressed? Maybe from that we can hypothesise that there's some flaw in his windup, and then we can look for it. Splits (team-based home and away numbers in particular) can also help immensely in determining park factors. There are multiple applications here, and we're finding more and more ways to refine our data. Ultimately, splits are a very useful tool - so long as you don't buy into the false positives that will invariably pop up.
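
As a rough illustration of the park factor idea, here is one common, very basic formulation: compare the runs per game (both teams combined) in a team's home games with the runs per game in its road games. The function name and the team totals below are made up, and a serious park factor would use several seasons of data and regression on top of this.

```python
def park_factor(home_rs, home_ra, home_games, away_rs, away_ra, away_games):
    """Basic runs park factor: combined runs per game at home divided by
    combined runs per game on the road. Above 1.0 suggests a hitter's park."""
    home_rate = (home_rs + home_ra) / home_games
    away_rate = (away_rs + away_ra) / away_games
    return home_rate / away_rate


# Hypothetical team: 400 scored / 380 allowed at home, 350 / 340 on the road.
pf = park_factor(400, 380, 81, 350, 340, 81)
print(f"{pf:.3f}")   # 1.130
```

Using both runs scored and runs allowed keeps the quality of the home team's own offence and pitching from masquerading as a park effect, since the same roster plays in both sets of games.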

Comments

Have there been any studies on relevancy of April and September splits?

i.e., people who do take a while to get good, or those who tend to get tired near the end of the year?

...and now I'm here

by CapSea on Feb 28, 2010 4:41 PM PST

Fatigued pitcher versus fatigued batter

We’ll basically never know. If the pitcher has started 25 games in the last five months and the batter has played almost every day since April 1st, whose fatigue are we measuring?

Besides, the deviation from true talent over 25 or so games can be large and not improbable. It’s best never to read anything into one month’s data, no matter where the month falls in the season.

by philosofool on Feb 28, 2010 5:05 PM PST

September is tough too because of call-ups

Lots of times you’ll hear about how some pitcher really ramped up his game in September – you know “revving up for the playoffs” and all that. In reality, he’s just seeing more AAA hitters who are getting a look in the majors when the rosters expand.

by shuswapslugger on Mar 1, 2010 11:23 AM PST

Yeah, good point. That too.

I’d be interested to know if there are any sample sizes large enough to show that a specific player takes longer to get going than some other player. I’d guess not though. A 12 year career would only be about 2 seasons of data, and it probably is too large of a talent span to show much meaning. But I dunno.

by CapSea on Feb 28, 2010 5:57 PM PST

Doesn't weather play a role in park factor splits?

This seems like the most appropriate way of trying to control for the effects of weather.

by pygmalion on Feb 28, 2010 6:43 PM PST

Well, like Graham said, splits help determine park factors

But monthly splits present a problem in that a month of data wouldn’t be enough to determine park factors for that month. Yet we know that weather conditions change over the course of a season and so park factors change. I think it’s one of those things where you know there are other variables in play but it’s hard to measure them exactly, so you’re stuck taking the average over the course of the season and applying the factors on a yearly basis. Graham knows a lot more about this than I do, maybe he can answer better.

by OlSalty on Feb 28, 2010 11:30 PM PST

Wouldn't these be determined over many seasons?

Obviously, the calculations would be complex, but it seems as if, in principle, you would want to determine park factors by time of year by calculating a given park’s effects on the game over a number of seasons. The main question I have is how many seasons you would need; on the one hand, you are restricting yourself to only 1/6 of a season, more or less, but on the other hand, you have access to, say, 13 home games / month X 76 (or something) plate appearances / game, which comes to 988 plate appearances per month. So – if 1800 plate appearances give you a pretty good picture of a hitter, would you only need two seasons’ worth of data to get a good feel for how a park affects most batting stats? Well, maybe not. I’m not really sure how one determines park effects, and given what Graham has said, I’m not eager to.

But it seems as if there is plenty of data available for a given month in a park over a few seasons: you don’t need 12 seasons to get 2 years’ worth of data, because you have 18+ batters per game to look at. Only I’m not sure how much data you need to determine park effects, so…

by pygmalion on Mar 1, 2010 7:51 AM PST

Weather probably does have a role

but I suspect the best way to understand the effects of weather would be through physics, not direct analysis of baseball. At least, a lot of questions might best be approached that way. The guy at hittracker online claims to have a pretty good model of the way weather conditions affect the motion of a ball once it’s batted. Projectile motion is obviously only a part of the effect of weather, but it’s also obviously a big part.

by philosofool on Mar 1, 2010 12:24 PM PST

I'd be surprised if there aren't.

There are often discussions of certain players being slow starters or second-half players. It would make sense to look at the extreme ends of these, and I’d imagine they would have value: if someone showed a repeated pattern in these situations, it would at least seem that it might actually be a part of their individual makeup and skillset.

by SethGrandpa on Feb 28, 2010 10:53 PM PST

Except that is erroneous too.

It is entirely possible that people can perform worse under the pressure of the postseason. The issue is that there is absolutely no way that there will ever be a sample size large enough to know for certain, and the degree of difference is probably not too large. Think of someone who is a true-talent .350 wOBA player, but under extreme pressure is a true-talent .340 wOBA player.

by CapSea on Mar 1, 2010 12:26 AM PST

It's not just sample size

Because there’s also the factor of facing more talented opposition.

by Milendriel on Mar 1, 2010 12:59 AM PST

That's fine

but it’s also sample size. There simply isn’t enough data for any given player to have any idea if the pressure makes a difference.

by CapSea on Mar 1, 2010 2:14 AM PST
