Introducing xp_wOBA (Expected Pitch wOBA)

In this post, I introduce a new performance metric for both hitters and pitchers called xp_wOBA (Expected Pitch wOBA). It estimates the quality of a single pitch, taking into account release speed, pitch location, the hitter’s count, pitch movement, and other features of the situation.

Author: Louis Oberdiear

Published: July 28, 2021

Citation: Oberdiear, 2021

Motivation

The primary motivation is to create a metric that attempts to measure a pitch’s quality for a given situation. A pitch that is high and inside is a bad pitch when the hitter has a count of 3 balls and 1 strike because if the hitter does not swing then they get a free base via a walk. That same pitch is a good pitch when the hitter has 0 balls and 2 strikes because if the hitter swings, they are likely to either miss or hit the ball weakly. If they don’t swing then they simply have one ball now. Almost no harm was done.

If we can estimate the quality of a pitch, then this opens the door to judging both batters and pitchers more accurately. We can use this metric to determine whether a pitcher was making quality pitches throughout an appearance, without relying on outcome metrics like Earned Run Average (ERA). A pitcher can make a great pitch that is hit for a home run. Is it fair to judge the pitcher poorly because the batter put a great swing on a difficult-to-hit pitch?

The same goes for hitters. With this metric, we can judge who is hitting better than expected given the quality of pitches they are seeing. This is in a very similar vein to Completion Percentage Over Expected (CPOE) in football. A batter might see only high-quality pitches (i.e., difficult-to-hit pitches) in a given at-bat; should we judge them harshly for the result?

This metric helps us be more process-driven and less outcome-driven.

Why wOBA?

The outcome the model is going to be trained on is the wOBA value for a given event. For a full primer, read this FanGraphs article on wOBA. Each batting event has a given wOBA value (these are subject to change for a given year):

Event          wOBA value
walk           0.70
hit by pitch   0.70
field error    0.90
single         0.90
double         1.25
triple         1.60
home run       2.00
all other      0.00

As the article above states, not all hits are created equal. A walk doesn’t have the same value as a single, and a single doesn’t have the same value as a home run. Yet metrics like batting average and on-base percentage treat them equally. While slugging percentage does weight hits, it does so by total bases, which exaggerates the value of doubles, triples, and home runs.

By using a metric that captures the value of a batting event accurately, we can more accurately judge the value of a given pitch (e.g., a pitch in this location, at this speed, in this count, has an expected value of x). From the pitcher’s perspective, a pitch is of high quality if its expected value is low.
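To make the weighting concrete, here is a minimal base R sketch that looks up the wOBA values from the table above for a handful of made-up plate appearances and averages them. The helper name `woba_for_events` and the example sequence are illustrative assumptions, not part of the original analysis.

```r
# wOBA values copied from the table above (they can change from year to year).
woba_weights <- c(
  walk         = 0.70,
  hit_by_pitch = 0.70,
  field_error  = 0.90,
  single       = 0.90,
  double       = 1.25,
  triple       = 1.60,
  home_run     = 2.00
)

# Hypothetical helper: return the wOBA value for each event,
# treating every event that is not in the table as 0 ("all other").
woba_for_events <- function(events) {
  values <- unname(woba_weights[events])
  values[is.na(values)] <- 0
  values
}

# Made-up sequence of five plate appearances.
events <- c("strikeout", "single", "walk", "field_out", "home_run")
event_values <- woba_for_events(events)
mean_woba <- mean(event_values)
mean_woba
```

Batting average would score this sequence as two hits in four at-bats and ignore the walk entirely, whereas the wOBA values give the walk partial credit and the home run more credit than the single.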

Methodology

The model is going to be trained using “event”-only data. This means the data will include only the last pitch of each plate appearance. This will give us the ability to score every pitch as if it were going to be an “event” (walk, hit, strikeout, groundout, flyout, etc.). If a pitcher throws a fastball down the middle of the plate, we can estimate the potential value if the batter were to put it in play, walk, or strike out.

Data

I’m going to be using 2019 data scraped from https://baseballsavant.mlb.com/statcast_search using the code below.

First, load the needed libraries:

Show code
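
The library chunk itself is not shown in this version of the post. Judging from the functions called in the code below, a plausible set is the following; treat the package list as an assumption rather than the original chunk.

```r
# Assumed package list, inferred from the functions used later in the post:
#   tidyverse  -> %>%, filter(), mutate(), bind_rows(), drop_na(), readr
#   tidymodels -> initial_split(), recipe(), boost_tree(), fit(), yardstick
#   tictoc     -> tic(), toc()
#   gt         -> gt()
needed <- c("tidyverse", "tidymodels", "tictoc", "gt")

# Attach what is installed and report what is not, instead of failing outright.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
if (any(!installed)) {
  message("Not installed: ", paste(needed[!installed], collapse = ", "))
}

# The xgboost package only needs to be installed for set_engine("xgboost");
# it does not have to be attached.
has_xgboost <- requireNamespace("xgboost", quietly = TRUE)
```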

Now scrape the data and save the results to a csv:

Show code
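# Dates to scrape, one request per day. Only two days are listed here;
# widen the range to cover the whole 2019 season.
dates <- seq(from = as.Date("2019-09-28"), to = as.Date("2019-09-29"), by = 1)

batter_2019 <- data.frame()  # accumulates the daily results
count <- 0                   # requests made since the last pause
tic()
for (i in seq_along(dates)) {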
  print(dates[i])
  begin_date <- as.character(dates[i])
  end_date <- as.character(dates[i])
  
  url <- paste0("https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfGT=R%7C&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfPull=&hfC=&hfSea=&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=",begin_date,"&game_date_lt=",end_date,"&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfBBT=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=api_p_release_speed&sort_order=desc&min_pas=0&type=details")

  
  df <- readr::read_csv(url)
  # Append the day's pitches to the running 2019 data frame
  if (nrow(df) > 0){
    batter_2019 <- bind_rows(df, batter_2019)
  }
  print(paste0("dates left ", as.character(length(dates) - i)))
  count <- count + 1
  # After 65 requests, pause for five minutes before continuing
  if (count >= 65) {
    Sys.sleep(60*5)
    count <- 0
  }
  print(count)
  
}
toc()

readr::write_excel_csv(batter_2019, file = "C:\\Users\\louis\\Downloads\\batter_2019.csv")
Show code
batter_2019 <- readr::read_delim(file = "C:\\Users\\louis\\Downloads\\batter_2019.csv", delim = ",")

desired_events <- c("field_out",
                    "strikeout",
                    "single",
                    "walk",
                    "double",
                    "home_run",
                    "force_out",
                    "grounded_into_double_play",
                    "hit_by_pitch",
                    "field_error",
                    "triple",
                    "fielders_choice",
                    "double_play",
                    "fielders_choice_out",
                    "strikeout_double_play")

batter_2019_events <- batter_2019 %>%
  filter(events %in% desired_events) %>%
  mutate(runner_1b = if_else(!is.na(on_1b), 1, 0),
         runner_2b = if_else(!is.na(on_2b), 1, 0),
         runner_3b = if_else(!is.na(on_3b), 1, 0)) %>%
  drop_na(release_speed) %>%
  drop_na(zone) %>%
  drop_na(pitch_type)

# Load the previously trained model from disk (the training code is in the Modeling section)
xgboost_fit_woba <- readRDS(file = "C:\\Users\\louis\\Documents\\GitHub\\xp_woba\\xgboost_fit_woba.rds")

set.seed(123)
be_split <- initial_split(batter_2019_events, prop = 3/4)
be_train <- training(be_split)
be_test <- testing(be_split)

woba_formula <- formula(woba_value ~ release_speed + pitch_type + zone + stand + p_throws + balls + strikes + outs_when_up + pfx_x + pfx_z + runner_1b + runner_2b + runner_3b + plate_x + plate_z)

preprocessing_recipe_woba <- 
  recipes::recipe(woba_formula, data = be_train) %>%
  recipes::step_integer(all_nominal()) %>%
  prep()

Modeling

Normally, I would put an EDA section before modeling but I did the EDA separately and it deserves its own post.

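Here are the key features I am going to use for modeling, as they appear in the model formula: release_speed, pitch_type, zone, stand, p_throws, balls, strikes, outs_when_up, pfx_x, pfx_z, plate_x, plate_z, and three base-runner indicators (runner_1b, runner_2b, runner_3b) derived from on_1b, on_2b, and on_3b. For a glossary of the terms, see the Baseball Savant CSV documentation.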

Here is the code I used to train the model. I utilized grid search to find the optimal hyperparameter values. Here is a good blog post by Julia Silge that demonstrates the technique I used: Tune XGBoost with tidymodels
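
The tuning code itself is not shown in this post. As a rough illustration of what grid search does, independent of tidymodels and XGBoost, the base R sketch below tries every combination of two hyperparameters for a simple `loess` smoother on synthetic data and keeps the combination with the lowest validation RMSE. The data, the model, and the candidate values are all assumptions made for the example.

```r
set.seed(123)

# Synthetic data: a noisy curve standing in for a real training set.
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
toy <- data.frame(x = x, y = y)

# Hold out a quarter of the rows for validation.
train_idx <- sample(n, size = 0.75 * n)
toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

# Every combination of the candidate hyperparameter values.
grid <- expand.grid(span = c(0.2, 0.5, 0.9), degree = c(1, 2))

# Fit one model per combination and score it on the validation rows.
grid$rmse <- vapply(seq_len(nrow(grid)), function(i) {
  fit <- loess(y ~ x, data = toy_train,
               span = grid$span[i], degree = grid$degree[i],
               control = loess.control(surface = "direct"))
  pred <- predict(fit, newdata = toy_valid)
  sqrt(mean((toy_valid$y - pred)^2))
}, numeric(1))

# The winning combination is the one with the lowest validation error.
best <- grid[which.min(grid$rmse), ]
best
```

The hyperparameter values hard-coded in the model specification below are the kind of result such a search produces, found there with tidymodels tooling rather than this hand-rolled loop.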

Show code
set.seed(123)
be_split <- initial_split(batter_2019_events, prop = 3/4)
be_train <- training(be_split)
be_test <- testing(be_split)

woba_formula <- formula(woba_value ~ release_speed + pitch_type + zone + stand + p_throws + balls + strikes + outs_when_up + pfx_x + pfx_z + runner_1b + runner_2b + runner_3b + plate_x + plate_z)

preprocessing_recipe_woba <- 
  recipes::recipe(woba_formula, data = be_train) %>%
  recipes::step_integer(all_nominal()) %>%
  prep()

xgboost_model_woba <- boost_tree(
  trees          = 2000, 
  stop_iter      = 250,
  tree_depth     = 13, 
  min_n          = 18, 
  loss_reduction = 4.355132,                    
  sample_size    = 0.8210649, 
  mtry           = 54,         
  learn_rate     = 0.005436754
) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

train_processed <- bake(preprocessing_recipe_woba,  new_data = be_train)


xgboost_fit_woba <- xgboost_model_woba %>%
  # fit the model on all the training data
  fit(
    formula = woba_formula, 
    data    = train_processed
  )

saveRDS(xgboost_fit_woba, file = "C:\\Users\\louis\\Documents\\GitHub\\xp_woba\\xgboost_fit_woba.rds")

Check how the model performs on the test data:

Show code
test_processed <- bake(preprocessing_recipe_woba,  new_data = be_test)

test_prediction_woba <- xgboost_fit_woba %>%
  # predict wOBA values for the held-out test data
  predict(new_data = test_processed) %>%
  bind_cols(be_test)

test_prediction_woba %>%
  yardstick::metrics(truth = woba_value, estimate = .pred) %>%
  gt()
.metric .estimator .estimate
rmse standard 0.4985808
rsq standard 0.1156307
mae standard 0.3909066
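
For readers unfamiliar with these three metrics, here is a minimal base R sketch of how each is computed, using made-up truth and prediction vectors rather than the model’s output. It assumes, per the yardstick documentation as I understand it, that rsq is the squared correlation between truth and estimate.

```r
# Made-up wOBA values and predictions for six plate appearances.
truth    <- c(0, 0, 0.9, 0, 2.0, 0.7)
estimate <- c(0.2, 0.3, 0.4, 0.1, 0.6, 0.5)

rmse <- sqrt(mean((truth - estimate)^2))  # root mean squared error
mae  <- mean(abs(truth - estimate))       # mean absolute error
rsq  <- cor(truth, estimate)^2            # squared correlation

# Baseline for comparison: always predict the average wOBA value.
baseline_rmse <- sqrt(mean((truth - mean(truth))^2))

round(c(rmse = rmse, mae = mae, rsq = rsq, baseline_rmse = baseline_rmse), 3)
```

Because the outcome of a single plate appearance is usually 0 and occasionally as high as 2, a pitch-level model should probably be expected to explain only a small share of the variance, so comparing RMSE against a constant baseline like this one is a useful sanity check.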

Top Features

Here are the top features:

Show code
library(vip)

xgboost_fit_woba %>%
  vip(geom = "col")

The ball and strike counts are very important, with location (plate_x, plate_z, zone) coming in second. Horizontal and vertical movement come next, followed by the speed of the pitch.

Best Pitches of the 2021 season

Now, the fun part. Let’s utilize this model to find the best pitches of the 2021 season. Using the same scraping technique, I scraped the baseball savant data for the 2021 season.

Show code
batter_2021 <- readr::read_delim(file = "C:\\Users\\louis\\Downloads\\batter_2021.csv", delim = ",") %>%
  mutate(runner_1b = if_else(!is.na(on_1b), 1, 0),
         runner_2b = if_else(!is.na(on_2b), 1, 0),
         runner_3b = if_else(!is.na(on_3b), 1, 0))

batter_2021_events <- batter_2021 %>%
  filter(events %in% desired_events) %>%
  drop_na(release_speed) %>%
  drop_na(zone) %>%
  drop_na(pitch_type)

Process the 2021 data and make predictions:

Show code
batter_2021_processed <- bake(preprocessing_recipe_woba,  new_data = batter_2021)

batter_2021_prediction_woba <- xgboost_fit_woba %>%
  # predict wOBA values for every 2021 pitch, not only the event pitches
  predict(new_data = batter_2021_processed) %>%
  bind_cols(batter_2021) %>%
  select(player_name, game_date, des, events, woba_value, .pred, release_speed, pitch_type, zone, stand, p_throws, balls, strikes, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)

Highest Predicted Value

Show code
batter_2021_prediction_woba %>%
  select(-c(des, events, woba_value, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  arrange(desc(.pred)) %>%
  head() %>%
  gt()
player_name game_date .pred release_speed pitch_type zone balls strikes
Bichette, Bo 2021-07-02 0.8719966 47.5 EP 11 3 1
Soler, Jorge 2021-06-04 0.8524219 44.6 EP 12 3 1
Bichette, Bo 2021-07-02 0.8424013 46.8 EP 11 3 0
Riley, Austin 2021-06-30 0.8070010 64.1 FA 11 1 0
Guerrero Jr., Vladimir 2021-07-02 0.8058015 49.1 EP 14 3 1
Correa, Carlos 2021-04-24 0.8051472 60.8 FA 8 2 1

Throwing a pitch out of the strike zone with 3 balls is not good. The top three pitches are Eephus (EP) pitches thrown in the 40s (MPH). I think we can all agree those are very low-quality pitches. The awesome thing about Baseball Savant is that you can use its search to find a specific pitch and watch the video. Here is the video of the lowest-quality pitch of the 2021 season:

The first pitch on the list that is neither on a 3-ball count nor a potential hit-by-pitch is the fastball (FA) thrown to Carlos Correa. The release speed is extremely slow at 60.8 MPH, on a 2 ball and 1 strike count, in the heart of the plate. This is how it looked in real life:

That is a meatball if I have ever seen one.

Lowest Predicted Value

Now let’s take a look at the best pitches:

Show code
batter_2021_prediction_woba %>%
  select(-c(des, events, woba_value, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  arrange(.pred) %>%
  head() %>%
  gt()
player_name game_date .pred release_speed pitch_type zone balls strikes
Naquin, Tyler 2021-04-17 -0.05260795 96.4 FF 11 0 2
Galvis, Freddy 2021-05-09 -0.05136005 97.0 FF 11 0 2
Myers, Wil 2021-05-01 -0.04837190 93.4 FF 11 0 2
Laureano, Ramón 2021-06-22 -0.04584930 94.3 FF 11 0 2
Bregman, Alex 2021-06-02 -0.03250195 95.3 FF 12 0 2
Zunino, Mike 2021-05-12 -0.03191937 96.9 FF 12 0 2

From this, we can see that throwing high and out of the strike zone on a 0 ball and 2 strike count is a very good idea. This isn’t very interesting to look at, so let’s find the top pitches thrown with fewer than two strikes:

Show code
batter_2021_prediction_woba %>%
  select(-c(des, events, woba_value, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  filter(strikes < 2) %>%
  arrange(.pred) %>%
  head() %>%
  gt()
player_name game_date .pred release_speed pitch_type zone balls strikes
Maldonado, Martín 2021-05-20 0.1083588 93.6 FF 2 0 1
Lowe, Brandon 2021-06-05 0.1120782 92.3 FF 8 0 1
Lowe, Brandon 2021-04-11 0.1146214 98.2 FF 2 0 1
Laureano, Ramón 2021-06-22 0.1163044 93.5 FF 2 0 1
Naquin, Tyler 2021-04-17 0.1180818 95.4 FF 2 0 0
Polanco, Gregory 2021-04-18 0.1310713 94.5 FF 3 1 0

The top pitch now becomes a 93.6 MPH four-seam fastball at the top edge of the strike zone on a 0 ball and 1 strike count. Funnily enough, though clearly in the strike zone, it was called a ball. Here is the video:

I think this demonstrates the importance of the ball and strike count when determining a pitch’s quality. Let’s see what the best pitch is for each count:

Show code
batter_2021_prediction_woba %>%
  group_by(balls, strikes) %>%
  filter(.pred == min(.pred)) %>%
  ungroup() %>%
  select(-c(des, events, woba_value, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  select(player_name, game_date, balls, strikes, everything()) %>%
  arrange(balls, strikes) %>%
  gt()
player_name game_date balls strikes .pred release_speed pitch_type zone
Naquin, Tyler 2021-04-17 0 0 0.118081808 95.4 FF 2
Maldonado, Martín 2021-05-20 0 1 0.108358778 93.6 FF 2
Naquin, Tyler 2021-04-17 0 2 -0.052607950 96.4 FF 11
Polanco, Gregory 2021-04-18 1 0 0.131071314 94.5 FF 3
Turner, Trea 2021-06-08 1 1 0.155475900 93.3 FF 3
Arozarena, Randy 2021-04-19 1 2 -0.024589863 95.6 FF 11
Profar, Jurickson 2021-04-16 2 0 0.180670932 94.4 FF 3
Rojas, Miguel 2021-04-27 2 1 0.207329497 92.6 FF 3
Santana, Carlos 2021-04-11 2 2 -0.009279301 81.5 KC 14
Díaz, Yandy 2021-05-26 3 0 0.399502724 95.0 FF 1
Ohtani, Shohei 2021-05-10 3 1 0.363996923 76.3 SL 4
Walls, Taylor 2021-07-10 3 2 0.151510030 94.0 FF 7
Tucker, Kyle 2021-06-26 4 2 0.404175460 92.5 SI 13

This shows that early in the count, it’s a good idea to throw a strike high in the strike zone. When the count reaches 0 balls and 2 strikes, throw out of the strike zone to try to get the hitter to chase. If you fall behind in the count, then, again, throw strikes at the top of the zone to try to even the count. The first non-fastball is on a 2 ball and 2 strike count. Here’s what it looked like:

Now, that’s a nasty pitch. The pitch starts at the top of the strike zone and breaks several inches out of the zone. The top pitch for a 3 ball and 1 strike count has great movement, too, but because the hitter has 3 balls, the pitch needs to be in the strike zone. Here is the top 3 ball and 1 strike pitch:

The ball starts as a strike on the inside part of the plate, breaks significantly away from the hitter, and would still have been a strike even if the batter hadn’t swung.

Most unlikely home runs

Show code
batter_2021_prediction_woba %>%
  select(-c(des, events, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  arrange(desc(woba_value - .pred)) %>%
  head() %>%
  gt()
player_name game_date woba_value .pred release_speed pitch_type zone balls strikes
Hoskins, Rhys 2021-06-22 2 0.05392908 97.9 FF 11 1 2
Altuve, Jose 2021-06-10 2 0.05897641 85.3 SL 13 0 2
Chisholm Jr., Jazz 2021-04-10 2 0.07070240 100.4 FF 12 0 2
Carlson, Dylan 2021-06-02 2 0.07594188 95.4 FF 12 2 2
Astudillo, Willians 2021-04-23 2 0.08661124 92.8 FF 12 1 2
Martinez, J.D. 2021-07-21 2 0.09053584 98.6 FF 12 2 2

The most unlikely home run of 2021 belongs to Rhys Hoskins, who smashed a 98 MPH fastball high and out of the zone.

Look at this golf shot from Altuve that is the second least likely home run.

I could watch these all day. I would highly encourage everyone to use the search feature and watch all of these.

Biggest meatball missed

Show code
batter_2021_prediction_woba %>%
  select(-c(des, events, stand, p_throws, outs_when_up, pfx_x, pfx_z, runner_1b, runner_2b, runner_3b, plate_x, plate_z)) %>%
  arrange(woba_value - .pred) %>%
  head() %>%
  gt()
player_name game_date woba_value .pred release_speed pitch_type zone balls strikes
Adrianza, Ehire 2021-06-30 0 0.7647199 64.6 FA 2 2 1
McCormick, Chas 2021-04-24 0 0.7291481 60.6 FA 6 2 0
Toro, Abraham 2021-04-24 0 0.7173564 59.6 FA 4 0 0
Suzuki, Kurt 2021-04-16 0 0.7135440 59.0 FA 5 1 0
Bregman, Alex 2021-06-13 0 0.7102520 90.7 FF 4 2 0
Bradley, Bobby 2021-07-25 0 0.7083586 80.2 SL 4 2 0

The biggest meatball missed was by Ehire Adrianza, but I like the Chas McCormick one more: a 60 MPH fastball lobbed into the heart of the plate.

xp_wOBAOE (Expected Pitch wOBA Over Expected)

As I mentioned earlier, this new metric allows us to measure which batters are performing better than expected given the quality of the pitches they see. Let’s take a look at the top total value over expected:

Show code
batter_2021_prediction_woba %>%
  filter(!is.na(woba_value)) %>%
  group_by(player_name) %>%
  summarise(n = n(),
            xp_wOBAOE_sum = sum(woba_value - .pred)) %>%
  arrange(desc(xp_wOBAOE_sum)) %>%
  head(10) %>%
  gt()
player_name n xp_wOBAOE_sum
Ohtani, Shohei 381 41.02039
Tatis Jr., Fernando 349 34.33326
Guerrero Jr., Vladimir 406 33.23778
Castellanos, Nick 364 30.05024
Bogaerts, Xander 393 29.75805
Mullins, Cedric 418 28.39065
Devers, Rafael 406 27.89499
Martinez, J.D. 405 22.14015
Perez, Salvador 404 21.58211
Reynolds, Bryan 407 19.90814

This shows the sum of value over expected, and Shohei Ohtani is running away with first place. By this measure, he has added the most value beyond what the pitches he has seen would predict. What about the highest average value among hitters with at least 100 plate appearances?

Show code
batter_2021_prediction_woba %>%
  filter(!is.na(woba_value)) %>%
  group_by(player_name) %>%
  summarise(n = n(),
            xp_wOBAOE_avg = mean(woba_value - .pred)) %>%
  arrange(desc(xp_wOBAOE_avg)) %>%
  filter(n >= 100) %>%
  head(10) %>%
  gt()
player_name n xp_wOBAOE_avg
Buxton, Byron 110 0.17926097
Trout, Mike 141 0.11095308
Ohtani, Shohei 381 0.10766506
Tatis Jr., Fernando 349 0.09837609
Castellanos, Nick 364 0.08255561
Wisdom, Patrick 164 0.08236782
Guerrero Jr., Vladimir 406 0.08186644
Marte, Ketel 147 0.07708410
Bogaerts, Xander 393 0.07572022
Devers, Rafael 406 0.06870686

Byron Buxton, while not playing very much, has been tremendous. The same goes for Mike Trout. The big surprise on the list is Patrick Wisdom.
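
To see why the total and the average can rank hitters differently, here is a small base R sketch with two hypothetical hitters and made-up values. The column names mirror the ones used above, with `pred` standing in for `.pred`.

```r
# Made-up per-plate-appearance results for two hypothetical hitters.
toy <- data.frame(
  player     = c(rep("Hitter A", 6), rep("Hitter B", 2)),
  woba_value = c(0.9, 0, 0, 2.0, 0.7, 0, 2.0, 0),
  pred       = c(0.3, 0.3, 0.2, 0.4, 0.4, 0.2, 0.3, 0.3)
)
toy$over_expected <- toy$woba_value - toy$pred

# The total rewards volume; the average rewards rate.
summary_tbl <- data.frame(
  n             = as.vector(table(toy$player)),
  xp_wOBAOE_sum = as.vector(tapply(toy$over_expected, toy$player, sum)),
  xp_wOBAOE_avg = as.vector(tapply(toy$over_expected, toy$player, mean)),
  row.names     = names(table(toy$player))
)
summary_tbl
```

Hitter A leads on the total because of volume, while Hitter B leads on the average from only two plate appearances, which is why a minimum such as the 100 used above matters.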

Future Improvements

The model could be improved with further feature engineering. Two ideas are top of mind:

  1. Factor in the previous pitch. Some pitches are considered set-up pitches. For example, a pitcher could throw a fastball high and tight, then follow it with a breaking pitch low and outside. The difference in speed, movement, and location could make the second pitch even more effective.
  2. Not all movement is created equal. Eight inches of movement might be a lot for a fastball but very little for a curveball. Scale movement using the release point, speed, and pitch type.
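
Both ideas can be sketched in a few lines of base R. This is only an illustration on made-up pitches: `game_pk`, `at_bat_number`, and `pitch_number` mirror Statcast column names, the derived column names are hypothetical, and the scaling here uses pitch type alone rather than release point and speed as well.

```r
# Made-up pitches: three from one plate appearance, one from another.
pitches <- data.frame(
  game_pk       = c(1, 1, 1, 1),
  at_bat_number = c(1, 1, 1, 2),
  pitch_number  = c(1, 2, 3, 1),
  pitch_type    = c("FF", "FF", "CU", "CU"),
  release_speed = c(95.0, 96.0, 80.0, 78.0),
  pfx_z         = c(1.4, 1.6, -0.9, -1.1)
)

# Sort so that "previous row" means "previous pitch".
pitches <- pitches[order(pitches$game_pk, pitches$at_bat_number, pitches$pitch_number), ]
pa_id <- paste(pitches$game_pk, pitches$at_bat_number)

# Idea 1: speed of the previous pitch in the same plate appearance
# (NA for the first pitch), and the change in speed from it.
lag_one <- function(v) c(NA, head(v, -1))
pitches$prev_speed <- ave(pitches$release_speed, pa_id, FUN = lag_one)
pitches$speed_diff <- pitches$release_speed - pitches$prev_speed

# Idea 2: vertical movement as a z-score within each pitch type,
# so a fastball is compared with fastballs and a curveball with curveballs.
pitches$pfx_z_scaled <- ave(pitches$pfx_z, pitches$pitch_type,
                            FUN = function(v) (v - mean(v)) / sd(v))
pitches
```

The same lagging approach would extend to the previous pitch’s location and movement, and the new columns could then be added to the model formula.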

Future Analysis

  1. Analyze the best pitcher outings based on pitch quality.
  2. Top batters in swinging at quality pitches.

Footnotes

    Corrections

    If you see mistakes or want to suggest changes, please create an issue on the source repository.

    Citation

    For attribution, please cite this work as

    Oberdiear (2021, July 28). Louis Oberdiear: Introducing xp_wOBA (Expected Pitch wOBA). Retrieved from https://thelob.blog/posts/2021-07-28-introducing-xpwoba-expected-pitch-woba/

    BibTeX citation

    @misc{oberdiear2021introducing,
      author = {Oberdiear, Louis},
      title = {Louis Oberdiear: Introducing xp_wOBA (Expected Pitch wOBA)},
      url = {https://thelob.blog/posts/2021-07-28-introducing-xpwoba-expected-pitch-woba/},
      year = {2021}
    }