Coming up with a good model for park factors is a rather difficult proposition. You want to include as much data as possible, but all of it is fraught with bits of bias. It helps to think of it like attempting to quantify pitching. We have a simple end product, runs allowed, but there are so many things that go into producing that result and we have to wade pretty deep in order to extract the parts of it that are due to pitching and strip away the parts contributed by the defense, the park, random luck and the quality of the opposing offense.
Park factors are a lot like that. We have a simple result, runs scored in one park versus runs scored in another, but there's a lot that factors into that result and most of it is not a direct result of the influence of the particular park. Last season, New Yankee Stadium garnered a reputation for being a hitter's haven because of all the high-scoring games played there. However, the 2009 Yankees had one of the best offenses in baseball. So how much of the run scoring was because the park was friendly to hitters, and how much was because the Yankee hitters weren't so friendly to opposing pitchers?
There's no easy answer to these problems. We can arrive at what we consider some decent approximations by using multiple years of data and trying to control for the variety of influences provided by the hometown nine. What began as just one number for each park, its overall run factor, has since blossomed into finding park factors for nearly every stat out there down to even the strike zone.
This might have been done already for all I know, but one part I have long been interested in is how parks affect batters based on handedness. We have deep personal knowledge of how Safeco Field plays differently for left-handed and right-handed hitters. It can be such an extreme split that using just one overall home run factor for Safeco significantly shortchanges right-handed hitters and overcompensates lefties. Investigating a way to rectify that has been on my to-do list for a while now, and I have finally gotten through a first pass at constructing a way to tackle it.
I won't go into the nitty-gritty details, but the high-level concept goes like this. To figure out the strikeout factor for left-handed hitters (LH K) in Safeco, I take four pieces of data:
A: The number of plate appearances made by a hitter from the left side at Safeco (regardless of team)
B: The number of strikeouts recorded during those (A) plate appearances.
C: The number of plate appearances made by a hitter from the left side during Mariner away games (regardless of team)
D: The number of strikeouts recorded during those (C) plate appearances.
B/A gives you the ratio of Mariner-related* plate appearances in Safeco that yielded a LH K.
D/C gives you the ratio of Mariner-related plate appearances not in Safeco that yielded a LH K.
*Mariner-related refers to any plate appearance that the Mariners are a participant in, whether as the offensive or defensive team.
Taking the first ratio (B/A) and dividing it by the second ratio (D/C) gives you a ratio of the ratios, which is your de facto Safeco Field park factor for LH Ks.
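For anyone who wants to see that arithmetic spelled out, here is a minimal sketch. The function name and the counts are made up for illustration and are not real Safeco data, and the multiplication by 100 is my assumption, chosen so that 100 means a neutral park.

```python
def park_factor(home_pa, home_events, road_pa, road_events, scale=100.0):
    """Ratio of ratios: (B/A) / (D/C), scaled so 100 means neutral.

    home_pa     -> A: PAs at the home park
    home_events -> B: events (e.g. strikeouts) in those PAs
    road_pa     -> C: PAs in the team's away games
    road_events -> D: events in those PAs
    """
    if home_pa <= 0 or road_pa <= 0:
        raise ValueError("plate appearance counts must be positive")
    if road_events <= 0:
        raise ValueError("road event count must be positive to form a ratio")
    home_rate = home_events / home_pa   # B/A
    road_rate = road_events / road_pa   # D/C
    return scale * home_rate / road_rate


# Hypothetical counts, for illustration only.
lh_k_factor = park_factor(home_pa=4000, home_events=720,
                          road_pa=4200, road_events=714)
print(round(lh_k_factor, 1))
```

With these invented numbers the home rate is 0.18 and the road rate is 0.17, so the factor comes out a little under 106, meaning the park would be inflating LH Ks by about 6 percent.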
My reason for defining the samples this way is to cancel out the home team bias as much as possible and to match up the samples. If, for example, you replaced D/C with the ratio of LH Ks on all plate appearances in the American League, then the makeup of the Mariner hitters and pitchers would dramatically affect the resulting park factor. Mariner hitters and pitchers would make up half of the B/A sample but only 1/14th of the D/C sample. By using only games the Mariners play in for calculating D/C, they make up half the sample in both. In theory, if the Mariner hitters/pitchers struck out a lot at home, they would do the same on the road as well, canceling their impact out and leaving only the park's influence behind.
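To make the cancellation concrete, here is a toy sketch. It assumes, purely for illustration, that strikeout rates are the product of a league baseline, a team effect and a park effect, that the road parks are neutral on average, and that the league-wide rate is just the baseline (ignoring the Mariners' own small share of it). All three numbers are invented.

```python
# Hypothetical inputs, not measured values.
BASE_K_RATE = 0.18   # league-average LH K rate
TEAM_EFFECT = 1.10   # Mariner-related PAs produce 10% more LH Ks
PARK_EFFECT = 0.95   # the park suppresses LH Ks by 5%

# Rates under the assumed multiplicative model.
home_rate = BASE_K_RATE * TEAM_EFFECT * PARK_EFFECT  # Mariner games at home
road_rate = BASE_K_RATE * TEAM_EFFECT                # Mariner games on the road
league_rate = BASE_K_RATE                            # all league PAs

# Matched samples: the team effect appears in both ratios and divides out.
matched_factor = 100 * home_rate / road_rate

# League-wide denominator: the team effect stays baked into the factor.
league_factor = 100 * home_rate / league_rate

print(round(matched_factor, 1), round(league_factor, 1))
```

Under these assumptions the matched comparison recovers the park effect (about 95), while the league-wide comparison reports about 104.5 and would label a strikeout-suppressing park as strikeout-friendly simply because of who plays there.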
In theory. There are still refinements left to be made, which is why this is a first stab. It would help if teams played a balanced schedule, but with our skewed schedules there are still going to be some unduly large influences on D/C from each team's division mates. I welcome any constructive suggestions on how to deal with this and any other statistical issues. Still, I think this is a good start and a better picture than other park factors give us. With no further boring exposition, here are the Safeco Field results covering 2007-present. A factor greater than 100 indicates that Safeco helps to increase that stat:
K, BB, HBP, GB, FB, LD and IF are all factored on a per PA basis since they are all discrete possible results of a PA. 1B, 2B and 3B are factored on a per batted ball basis. HR is factored by balls in the air (i.e. non-ground-ball batted balls). wOBA is based on what the league average line would have looked like given the above factors.
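A rough sketch of how those denominators could be looked up per stat follows. The dictionary keys and helper names are hypothetical, the counts are invented, and I am assuming that IF stands for infield fly and that GB, FB, LD and IF are mutually exclusive and together cover every batted ball.

```python
# Which denominator each stat is measured against (per the description above).
DENOMINATOR_FOR = {
    "K": "PA", "BB": "PA", "HBP": "PA",
    "GB": "PA", "FB": "PA", "LD": "PA", "IF": "PA",
    "1B": "batted", "2B": "batted", "3B": "batted",
    "HR": "air",
}


def denominators(counts):
    """Build the three denominators from raw counts (assumed layout)."""
    batted = counts["GB"] + counts["FB"] + counts["LD"] + counts["IF"]
    in_air = batted - counts["GB"]  # non-ground-ball batted balls
    return {"PA": counts["PA"], "batted": batted, "air": in_air}


def rate(stat, counts):
    """Rate of `stat` against its own denominator."""
    return counts[stat] / denominators(counts)[DENOMINATOR_FOR[stat]]


# Invented counts, for illustration only.
sample = {"PA": 1000, "K": 180, "GB": 300, "FB": 250, "LD": 150, "IF": 30,
          "1B": 150, "HR": 25}

k_rate = rate("K", sample)    # per plate appearance
hr_rate = rate("HR", sample)  # per ball in the air
```

The home and road rates produced this way would then be divided, one by the other, in the same ratio-of-ratios fashion as the strikeout example.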