In this post, I introduce a new performance metric for both hitters and pitchers called xp_wOBA (Expected Pitch wOBA). It estimates the quality of a single pitch from its release speed, location, and movement, the hitter’s count, and other context.
The primary motivation is to create a metric that attempts to measure a pitch’s quality for a given situation. A pitch that is high and inside is a bad pitch when the hitter has a count of 3 balls and 1 strike because if the hitter does not swing then they get a free base via a walk. That same pitch is a good pitch when the hitter has 0 balls and 2 strikes because if the hitter swings, they are likely to either miss or hit the ball weakly. If they don’t swing then they simply have one ball now. Almost no harm was done.
If we can estimate the quality of a pitch, we can judge both batters and pitchers more accurately. We can use this metric to determine whether a pitcher was making quality pitches throughout an appearance without relying on outcome metrics like Earned Run Average (ERA). A pitcher can make a great pitch that is hit for a home run. Is it fair to judge the pitcher poorly because the batter put a great swing on a difficult-to-hit pitch?
The same goes for hitters. With this metric, we can judge who is hitting better than expected given the quality of pitches they are seeing. This is in a similar vein to Completion Percentage Over Expected (CPOE) in football. A batter might see only high-quality (i.e. difficult-to-hit) pitches in a given at-bat, so should we judge them harshly for making an out?
This metric helps us be more process-driven and less outcome-driven.
The outcome the model is going to be trained on is the wOBA value for a given event. For a full primer, read this FanGraphs article on wOBA. Each batting event has a given wOBA value (these are subject to change for a given year):
Event | wOBA value |
---|---|
walk | 0.70 |
hit by pitch | 0.70 |
field error | 0.90 |
single | 0.90 |
double | 1.25 |
triple | 1.60 |
home run | 2.00 |
all other | 0.00 |
As the article above states, not all hits are created equal. A walk doesn’t have the same value as a single, and a single doesn’t have the same value as a home run. Yet metrics like batting average and on-base percentage treat them equally. Slugging percentage does weight hits, but it does so by total bases, which exaggerates the value of doubles, triples, and home runs.
Because wOBA captures the value of a batting event more accurately, we can use it to judge the value of a given pitch (e.g. a pitch in this location, at this speed, in this count, has an expected value of x). A pitch is of high quality if its expected value is low.
The model is trained on “event”-only data, meaning only the last pitch of each at-bat. This lets us score every pitch as if it were going to end in an “event” (walk, hit, strikeout, groundout, flyout, etc.). If a pitcher throws a fastball down the middle of the plate, we can estimate the potential value if the batter were to put it in play, walk, or strike out.
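To make the difference concrete, here is a minimal sketch in base R that scores two made-up hitters with the weights from the table above. The helper `woba_value_of()` and both hitters are hypothetical and only for illustration; real wOBA weights vary by season.

```r
# wOBA weights copied from the table above (they can change from year to year)
woba_weights <- c(
  walk = 0.70, hit_by_pitch = 0.70, field_error = 0.90,
  single = 0.90, double = 1.25, triple = 1.60, home_run = 2.00
)

# Hypothetical helper: look up each event's wOBA value; unlisted events are worth 0
woba_value_of <- function(events) {
  values <- woba_weights[events]
  values[is.na(values)] <- 0
  unname(values)
}

# Two made-up hitters with five plate appearances each (no walks, so PA = AB)
hitter_a <- c("single", "single", "strikeout", "field_out", "field_out")
hitter_b <- c("home_run", "double", "strikeout", "field_out", "field_out")

# Batting average treats every hit the same, so the two hitters look identical
hits <- c("single", "double", "triple", "home_run")
avg_a <- mean(hitter_a %in% hits)
avg_b <- mean(hitter_b %in% hits)

# The average wOBA value separates them
woba_a <- mean(woba_value_of(hitter_a))
woba_b <- mean(woba_value_of(hitter_b))

print(c(avg_a = avg_a, avg_b = avg_b, woba_a = woba_a, woba_b = woba_b))
```

Both hitters have two hits in five at-bats, but the second hitter’s extra-base hits give a much higher average wOBA value.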
I’m going to be using 2019 data scraped from https://baseballsavant.mlb.com/statcast_search using the code below.
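First, load the needed libraries:
# Packages inferred from the functions used in this post
library(tidyverse)   # dplyr, tidyr, readr
library(tidymodels)  # rsample, recipes, parsnip, yardstick
library(tictoc)      # tic() / toc()
library(gt)          # gt()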
Now scrape the data and save the results to a csv:
# Dates to scrape; widen this range to cover the whole season
dates <- seq(from = as.Date("2019-09-28"), to = as.Date("2019-09-29"), by = 1)
batter_2019 <- data.frame()
count <- 0
tic()
for (i in seq_along(dates)) {
  print(dates[i])
  # Request one day of data at a time
  begin_date <- as.character(dates[i])
  end_date <- as.character(dates[i])
  url <- paste0("https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfGT=R%7C&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfPull=&hfC=&hfSea=&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=",begin_date,"&game_date_lt=",end_date,"&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfBBT=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=api_p_release_speed&sort_order=desc&min_pas=0&type=details")
  df <- readr::read_csv(url)
  if (nrow(df) > 0) {
    # Append this day's pitches to the running 2019 data frame
    batter_2019 <- bind_rows(df, batter_2019)
  }
  print(paste0("dates left ", as.character(length(dates) - i)))
  count <- count + 1
  # Pause for five minutes after every 65 requests
  if (count >= 65) {
    Sys.sleep(60 * 5)
    count <- 0
  }
  print(count)
}
toc()
readr::write_excel_csv(batter_2019, file = "C:\\Users\\louis\\Downloads\\batter_2019.csv")
batter_2019 <- readr::read_delim(file = "C:\\Users\\louis\\Downloads\\batter_2019.csv", delim = ",")
# Events that end a plate appearance and are kept for modeling
desired_events <- c("field_out",
                    "strikeout",
                    "single",
                    "walk",
                    "double",
                    "home_run",
                    "force_out",
                    "grounded_into_double_play",
                    "hit_by_pitch",
                    "field_error",
                    "triple",
                    "fielders_choice",
                    "double_play",
                    "fielders_choice_out",
                    "strikeout_double_play")
batter_2019_events <- batter_2019 %>%
  filter(events %in% desired_events) %>%
  # Flag whether each base is occupied
  mutate(runner_1b = if_else(!is.na(on_1b), 1, 0),
         runner_2b = if_else(!is.na(on_2b), 1, 0),
         runner_3b = if_else(!is.na(on_3b), 1, 0)) %>%
  # Drop pitches that are missing key tracking fields
  drop_na(release_speed) %>%
  drop_na(zone) %>%
  drop_na(pitch_type)
Normally, I would put an EDA section before modeling but I did the EDA separately and it deserves its own post.
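Here are the key features I am going to be using for modeling: release_speed, pitch_type, zone, plate_x and plate_z (pitch location), pfx_x and pfx_z (horizontal and vertical movement), balls, strikes, outs_when_up, stand (batter side), p_throws (pitcher handedness), and the runner_1b, runner_2b, and runner_3b base-occupancy flags. For a glossary of the terms, visit the Baseball Savant CSV documentation.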
Here is the code I used to train the model. I utilized grid search to find the optimal hyperparameter values. Here is a good blog post by Julia Silge that demonstrates the technique I used: Tune XGBoost with tidymodels
set.seed(123)
# Hold out a quarter of the 2019 events as a test set
be_split <- initial_split(batter_2019_events, prop = 3/4)
be_train <- training(be_split)
be_test <- testing(be_split)
# Predict an event's wOBA value from pitch, count, and base-state features
woba_formula <- formula(woba_value ~ release_speed + pitch_type + zone + stand + p_throws + balls + strikes + outs_when_up + pfx_x + pfx_z + runner_1b + runner_2b + runner_3b + plate_x + plate_z)
# Integer-encode the nominal predictors such as pitch_type
preprocessing_recipe_woba <-
  recipes::recipe(woba_formula, data = be_train) %>%
  recipes::step_integer(all_nominal()) %>%
  prep()
# Model specification using the hyperparameter values found by the grid search
xgboost_model_woba <- boost_tree(
  trees = 2000,
  stop_iter = 250,
  tree_depth = 13,
  min_n = 18,
  loss_reduction = 4.355132,
  sample_size = 0.8210649,
  mtry = 54,
  learn_rate = 0.005436754
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
train_processed <- bake(preprocessing_recipe_woba, new_data = be_train)
xgboost_fit_woba <- xgboost_model_woba %>%
  # fit the model on all the training data
  fit(
    formula = woba_formula,
    data = train_processed
  )
saveRDS(xgboost_fit_woba, file = "C:\\Users\\louis\\Documents\\GitHub\\xp_woba\\xgboost_fit_woba.rds")
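The recipe’s `step_integer()` call turns nominal columns such as pitch_type into integer codes learned from the training data. The base R sketch below approximates that behavior as I understand it from the recipes documentation: levels are sorted and numbered from 1, values not seen in training become 0, and missing values stay missing. The helper `encode_integer()` and the toy vectors are hypothetical, not part of the original pipeline.

```r
# Hypothetical helper: integer-encode a nominal column using levels seen in training
encode_integer <- function(train_values, new_values) {
  levels_seen <- sort(unique(train_values[!is.na(train_values)]))
  codes <- match(new_values, levels_seen)
  # Values never seen in training get 0; missing values stay NA
  codes[is.na(codes) & !is.na(new_values)] <- 0L
  codes
}

# Toy pitch types: the eephus (EP) never appears in the training vector
train_pitch_type <- c("FF", "SL", "CU", "FF", "CH")
new_pitch_type <- c("SL", "FF", "EP", NA)

encoded <- encode_integer(train_pitch_type, new_pitch_type)
print(encoded)
```

One consequence worth keeping in mind is that a pitch type that is rare or absent in the 2019 training data carries little or no learned signal when the model scores 2021 pitches.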
Check how the model performs on the test data:
test_processed <- bake(preprocessing_recipe_woba, new_data = be_test)
test_prediction_woba <- xgboost_fit_woba %>%
  # predict wOBA values for the held-out test data
  predict(new_data = test_processed) %>%
  # attach the predictions to the original test rows
  bind_cols(be_test)
test_prediction_woba %>%
  yardstick::metrics(truth = woba_value, estimate = .pred) %>%
  gt()
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 0.4985808 |
rsq | standard | 0.1156307 |
mae | standard | 0.3909066 |
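For readers unfamiliar with these three metrics, here is a small base R sketch of how each one is computed. The `truth` and `pred` vectors are made-up toy values, not output from the model. The `rsq` line uses the squared correlation, which is how yardstick defines `rsq`, and the mean-only baseline is an extra comparison I added for context.

```r
# Toy values for illustration only (not taken from the model)
truth <- c(0, 0.9, 0, 2, 0.7, 0)
pred  <- c(0.2, 0.5, 0.1, 0.9, 0.6, 0.3)

# Root mean squared error: penalizes large misses more heavily
rmse <- sqrt(mean((truth - pred)^2))

# Mean absolute error: the average size of a miss
mae <- mean(abs(truth - pred))

# R-squared as the squared correlation between truth and prediction
rsq <- cor(truth, pred)^2

# Baseline for comparison: always predict the mean outcome
baseline_rmse <- sqrt(mean((truth - mean(truth))^2))

print(c(rmse = rmse, mae = mae, rsq = rsq, baseline_rmse = baseline_rmse))
```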
Here are the top features:
The ball and strike counts are the most important features, with location (plate_x, plate_z, zone) coming in second. Horizontal and vertical movement come next, followed by the speed of the pitch.
Now, the fun part. Let’s utilize this model to find the best pitches of the 2021 season. Using the same scraping technique, I scraped the baseball savant data for the 2021 season.
batter_2021 <- readr::read_delim(file = "C:\\Users\\louis\\Downloads\\batter_2021.csv", delim = ",") %>%
  mutate(runner_1b = if_else(!is.na(on_1b), 1, 0),
         runner_2b = if_else(!is.na(on_2b), 1, 0),
         runner_3b = if_else(!is.na(on_3b), 1, 0))
batter_2021_events <- batter_2021 %>%
  filter(events %in% desired_events) %>%
  drop_na(release_speed) %>%
  drop_na(zone) %>%
  drop_na(pitch_type)
Process the 2021 data and make predictions:
batter_2021_processed <- bake(preprocessing_recipe_woba, new_data = batter_2021)
batter_2021_prediction_woba <- xgboost_fit_woba %>%
  # predict the expected wOBA value of every 2021 pitch
  predict(new_data = batter_2021_processed) %>%
  # attach the predictions to the original pitch rows
  bind_cols(batter_2021) %>%
  select(player_name, game_date, des, events, woba_value, .pred, release_speed, pitch_type, zone, stand, p_throws, balls, strikes, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)
player_name | game_date | .pred | release_speed | pitch_type | zone | balls | strikes |
---|---|---|---|---|---|---|---|
Bichette, Bo | 2021-07-02 | 0.8719966 | 47.5 | EP | 11 | 3 | 1 |
Soler, Jorge | 2021-06-04 | 0.8524219 | 44.6 | EP | 12 | 3 | 1 |
Bichette, Bo | 2021-07-02 | 0.8424013 | 46.8 | EP | 11 | 3 | 0 |
Riley, Austin | 2021-06-30 | 0.8070010 | 64.1 | FA | 11 | 1 | 0 |
Guerrero Jr., Vladimir | 2021-07-02 | 0.8058015 | 49.1 | EP | 14 | 3 | 1 |
Correa, Carlos | 2021-04-24 | 0.8051472 | 60.8 | FA | 8 | 2 | 1 |
Throwing a pitch out of the strike zone with 3 balls is not good. The top three pitches are eephus (EP) pitches thrown in the 40s. I think we can all agree that is a very low-quality pitch. The awesome thing about Baseball Savant is that you can look up specific pitches using its search and find the video. Here is the video of the lowest-quality pitch of the 2021 season:
The first pitch on the list that is neither a 3-ball pitch nor a potential hit-by-pitch is the four-seam fastball thrown to Carlos Correa. The release speed is extremely slow at 60.8 MPH, on a 2-ball, 1-strike count, in the heart of the plate. This is how it looked in real life:
That is a meatball if I have ever seen one.
Now let’s take a look at the best pitches:
player_name | game_date | .pred | release_speed | pitch_type | zone | balls | strikes |
---|---|---|---|---|---|---|---|
Naquin, Tyler | 2021-04-17 | -0.05260795 | 96.4 | FF | 11 | 0 | 2 |
Galvis, Freddy | 2021-05-09 | -0.05136005 | 97.0 | FF | 11 | 0 | 2 |
Myers, Wil | 2021-05-01 | -0.04837190 | 93.4 | FF | 11 | 0 | 2 |
Laureano, Ramón | 2021-06-22 | -0.04584930 | 94.3 | FF | 11 | 0 | 2 |
Bregman, Alex | 2021-06-02 | -0.03250195 | 95.3 | FF | 12 | 0 | 2 |
Zunino, Mike | 2021-05-12 | -0.03191937 | 96.9 | FF | 12 | 0 | 2 |
From this, we can see that throwing high and out of the strike zone on a 0 ball and 2 strike count is a very good idea. This isn’t very interesting to look at so let’s find the top non-0-2 count pitch:
player_name | game_date | .pred | release_speed | pitch_type | zone | balls | strikes |
---|---|---|---|---|---|---|---|
Maldonado, Martín | 2021-05-20 | 0.1083588 | 93.6 | FF | 2 | 0 | 1 |
Lowe, Brandon | 2021-06-05 | 0.1120782 | 92.3 | FF | 8 | 0 | 1 |
Lowe, Brandon | 2021-04-11 | 0.1146214 | 98.2 | FF | 2 | 0 | 1 |
Laureano, Ramón | 2021-06-22 | 0.1163044 | 93.5 | FF | 2 | 0 | 1 |
Naquin, Tyler | 2021-04-17 | 0.1180818 | 95.4 | FF | 2 | 0 | 0 |
Polanco, Gregory | 2021-04-18 | 0.1310713 | 94.5 | FF | 3 | 1 | 0 |
The top pitch now becomes a 93.6 MPH four-seam fastball at the top edge of the strike zone on a 0-ball, 1-strike count. Funnily enough, though it was clearly in the strike zone, it was called a ball. Here is the video:
I think this demonstrates the importance of the ball-strike count when determining a pitch’s quality. Let’s see what the best pitch is for each count:
batter_2021_prediction_woba %>%
  group_by(balls, strikes) %>%
  # keep the pitch with the lowest predicted value in each count
  filter(.pred == min(.pred)) %>%
  ungroup() %>%
  select(-c(des, events, woba_value, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  select(player_name, game_date, balls, strikes, everything()) %>%
  arrange(balls, strikes) %>%
  gt()
player_name | game_date | balls | strikes | .pred | release_speed | pitch_type | zone |
---|---|---|---|---|---|---|---|
Naquin, Tyler | 2021-04-17 | 0 | 0 | 0.118081808 | 95.4 | FF | 2 |
Maldonado, Martín | 2021-05-20 | 0 | 1 | 0.108358778 | 93.6 | FF | 2 |
Naquin, Tyler | 2021-04-17 | 0 | 2 | -0.052607950 | 96.4 | FF | 11 |
Polanco, Gregory | 2021-04-18 | 1 | 0 | 0.131071314 | 94.5 | FF | 3 |
Turner, Trea | 2021-06-08 | 1 | 1 | 0.155475900 | 93.3 | FF | 3 |
Arozarena, Randy | 2021-04-19 | 1 | 2 | -0.024589863 | 95.6 | FF | 11 |
Profar, Jurickson | 2021-04-16 | 2 | 0 | 0.180670932 | 94.4 | FF | 3 |
Rojas, Miguel | 2021-04-27 | 2 | 1 | 0.207329497 | 92.6 | FF | 3 |
Santana, Carlos | 2021-04-11 | 2 | 2 | -0.009279301 | 81.5 | KC | 14 |
Díaz, Yandy | 2021-05-26 | 3 | 0 | 0.399502724 | 95.0 | FF | 1 |
Ohtani, Shohei | 2021-05-10 | 3 | 1 | 0.363996923 | 76.3 | SL | 4 |
Walls, Taylor | 2021-07-10 | 3 | 2 | 0.151510030 | 94.0 | FF | 7 |
Tucker, Kyle | 2021-06-26 | 4 | 2 | 0.404175460 | 92.5 | SI | 13 |
This shows that early in the count, it’s a good idea to throw a strike high in the strike zone. When the count reaches 0 balls and 2 strikes, throw it out of the strike zone to try to get the hitter to chase. If you fall behind in the count, then, again, throw strikes at the top of the zone to try to even the count. The first non-fastball is on a 2-ball, 2-strike count. Here’s what it looked like:
Now, that’s a nasty pitch. The pitch starts at the top of the strike zone and breaks several inches out of the zone. The top pitch for the 3-ball, 1-strike count has great movement, too, but since the hitter has 3 balls, the pitch needs to be in the strike zone. Here is the top 3-ball, 1-strike pitch:
The ball starts as a strike on the inside part of the plate and breaks significantly away from the hitter. It would still have been a strike even if the batter hadn’t swung.
player_name | game_date | woba_value | .pred | release_speed | pitch_type | zone | balls | strikes |
---|---|---|---|---|---|---|---|---|
Hoskins, Rhys | 2021-06-22 | 2 | 0.05392908 | 97.9 | FF | 11 | 1 | 2 |
Altuve, Jose | 2021-06-10 | 2 | 0.05897641 | 85.3 | SL | 13 | 0 | 2 |
Chisholm Jr., Jazz | 2021-04-10 | 2 | 0.07070240 | 100.4 | FF | 12 | 0 | 2 |
Carlson, Dylan | 2021-06-02 | 2 | 0.07594188 | 95.4 | FF | 12 | 2 | 2 |
Astudillo, Willians | 2021-04-23 | 2 | 0.08661124 | 92.8 | FF | 12 | 1 | 2 |
Martinez, J.D. | 2021-07-21 | 2 | 0.09053584 | 98.6 | FF | 12 | 2 | 2 |
The most unlikely home run of 2021 belongs to Rhys Hoskins, who smashed a 98 MPH fastball high and out of the zone.
Look at this golf shot from Altuve that is the second least likely home run.
I could watch these all day. I would highly encourage everyone to use the search feature and watch all of these.
player_name | game_date | woba_value | .pred | release_speed | pitch_type | zone | balls | strikes |
---|---|---|---|---|---|---|---|---|
Adrianza, Ehire | 2021-06-30 | 0 | 0.7647199 | 64.6 | FA | 2 | 2 | 1 |
McCormick, Chas | 2021-04-24 | 0 | 0.7291481 | 60.6 | FA | 6 | 2 | 0 |
Toro, Abraham | 2021-04-24 | 0 | 0.7173564 | 59.6 | FA | 4 | 0 | 0 |
Suzuki, Kurt | 2021-04-16 | 0 | 0.7135440 | 59.0 | FA | 5 | 1 | 0 |
Bregman, Alex | 2021-06-13 | 0 | 0.7102520 | 90.7 | FF | 4 | 2 | 0 |
Bradley, Bobby | 2021-07-25 | 0 | 0.7083586 | 80.2 | SL | 4 | 2 | 0 |
The biggest meatball missed was by Ehire Adrianza, but I like the Chas McCormick one more: a 60 MPH fastball lobbed into the heart of the plate.
As I mentioned earlier, this new metric allows us to measure which batters are performing better than expected given the quality of the pitches they see. Let’s take a look at the top total value over expected:
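The code behind the two tables above is not shown, so the following is only a guess at how they could be produced: keep the home runs (woba_value of 2) and sort by the lowest prediction, then keep the outs (woba_value of 0) and sort by the highest prediction. The `preds` data frame is a made-up stand-in for batter_2021_prediction_woba.

```r
# Made-up stand-in for batter_2021_prediction_woba (values are illustrative only)
preds <- data.frame(
  player_name = c("A", "B", "C", "D", "E"),
  woba_value  = c(2, 2, 0, 0, 0.9),
  .pred       = c(0.05, 0.40, 0.75, 0.10, 0.30),
  stringsAsFactors = FALSE
)

# Least likely home runs: home runs with the lowest predicted value first
home_runs <- preds[preds$woba_value == 2, ]
unlikely_home_runs <- home_runs[order(home_runs$.pred), ]

# Biggest missed meatballs: outs with the highest predicted value first
outs <- preds[preds$woba_value == 0, ]
missed_meatballs <- outs[order(-outs$.pred), ]

print(unlikely_home_runs)
print(missed_meatballs)
```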
player_name | n | xp_wOBAOE_sum |
---|---|---|
Ohtani, Shohei | 381 | 41.02039 |
Tatis Jr., Fernando | 349 | 34.33326 |
Guerrero Jr., Vladimir | 406 | 33.23778 |
Castellanos, Nick | 364 | 30.05024 |
Bogaerts, Xander | 393 | 29.75805 |
Mullins, Cedric | 418 | 28.39065 |
Devers, Rafael | 406 | 27.89499 |
Martinez, J.D. | 405 | 22.14015 |
Perez, Salvador | 404 | 21.58211 |
Reynolds, Bryan | 407 | 19.90814 |
This table shows the summed value over expected, and Shohei Ohtani is running away with first place. By this measure, he has provided the most value at the plate this year. What about the highest average value among batters with at least 100 at-bats?
player_name | n | xp_wOBAOE_avg |
---|---|---|
Buxton, Byron | 110 | 0.17926097 |
Trout, Mike | 141 | 0.11095308 |
Ohtani, Shohei | 381 | 0.10766506 |
Tatis Jr., Fernando | 349 | 0.09837609 |
Castellanos, Nick | 364 | 0.08255561 |
Wisdom, Patrick | 164 | 0.08236782 |
Guerrero Jr., Vladimir | 406 | 0.08186644 |
Marte, Ketel | 147 | 0.07708410 |
Bogaerts, Xander | 393 | 0.07572022 |
Devers, Rafael | 406 | 0.06870686 |
Byron Buxton, while not playing very much, has been tremendous. The same goes for Mike Trout. The big surprise on the list is Patrick Wisdom.
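The calculation behind the two leaderboards is not shown above. A minimal sketch, assuming value over expected is simply the actual woba_value minus the predicted .pred for each event, summed or averaged per batter, could look like this. The `preds` data frame, the `min_events` cutoff, and the `over_expected` column are made up for illustration; the post’s own cutoff is 100.

```r
# Made-up stand-in for the 2021 event-level predictions
preds <- data.frame(
  player_name = c("A", "A", "A", "B", "B", "C"),
  woba_value  = c(2, 0, 0.9, 0, 0.7, 2),
  .pred       = c(0.4, 0.3, 0.35, 0.25, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# Assumption: value over expected = actual wOBA value minus predicted value
preds$over_expected <- preds$woba_value - preds$.pred

# Summarize per batter: number of events, total, and average over expected
by_player <- split(preds$over_expected, preds$player_name)
leaderboard <- data.frame(
  player_name   = names(by_player),
  n             = unname(lengths(by_player)),
  xp_wOBAOE_sum = unname(vapply(by_player, sum, numeric(1))),
  xp_wOBAOE_avg = unname(vapply(by_player, mean, numeric(1))),
  stringsAsFactors = FALSE
)

# Keep batters with enough events, then rank by average value over expected
min_events <- 2  # toy cutoff for this example
qualified <- leaderboard[leaderboard$n >= min_events, ]
qualified <- qualified[order(-qualified$xp_wOBAOE_avg), ]

print(qualified)
```

Ranking by the sum rewards playing time, while ranking by the average with a minimum-events cutoff surfaces hitters like Buxton and Trout who played less.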
The model could be improved with further feature engineering. The two top-of-mind:
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Oberdiear (2021, July 28). Louis Oberdiear: Introducing xp_wOBA (Expected Pitch wOBA). Retrieved from https://thelob.blog/posts/2021-07-28-introducing-xpwoba-expected-pitch-woba/
BibTeX citation
@misc{oberdiear2021introducing, author = {Oberdiear, Louis}, title = {Louis Oberdiear: Introducing xp_wOBA (Expected Pitch wOBA)}, url = {https://thelob.blog/posts/2021-07-28-introducing-xpwoba-expected-pitch-woba/}, year = {2021} }