Predicting Customer Tenure…with MLB Hitters

customer tenure, customer lifetime value, customer series, data science, rstats

How long will a customer stay with our business, or, in this case, how long will an MLB hitter stay in the MLB? In this post, I go through the value of knowing how long a customer will be with an organization, EDA of MLB data, how to clean and set up data for modeling, and how to fit a linear model and a gradient boosted tree model with xgBoost.

Author

Louis Oberdiear

Published

June 24, 2021

Citation

Oberdiear, 2021

Intro: The Why

I believe knowing how long a customer will stay with your business is inherently valuable, but why exactly?

First scenario: Customer A

Let’s pose a few scenarios. The first scenario is that we know Customer A will only be a customer with us for two years. In this scenario, customers stay with our business for ten years on average, so two years is relatively short. Is it valuable to know that Customer A won’t be with us very long? What can we do about it?

Well, in my mind, we have two different paths. The Abandonment path and the Retainment path. Knowing Customer A will only be with us for two years, we can simply cut our losses and not put any more money into Customer A whether that be through advertisement or discounts. This would be the Abandonment path.

On the other hand, we could try and figure out different ways to extend their stay with us through different interventions. This would be the Retainment path. We could try several different interventions and measure their effectiveness and their return on investment.

Second Scenario: Customer B

What about the other end of the spectrum, where we know a customer will be with us for 15 years? This is Customer B. Is it valuable to know we have a potentially very loyal customer?

Again, I see two paths of treatment. The first path is the Stay the Course path where we treat them exactly how we would normally treat them. We know they will be with us for a long time so why change anything?

The second path is the Up Sell path, where we give them more attention and try to get our hooks in deeper by offering them more and more products. Maybe we even give them special discounts and reward them for being a loyal customer, hopefully turning Customer B into a brand ambassador who will recommend our organization to their friends and family.

I think looking at these two extreme scenarios shows how knowing customer tenure is valuable, but I have another reason in mind. We are going to predict customer tenure to calculate Customer Lifetime Value.

Customer Lifetime Value

There are many ways to calculate Customer Lifetime Value, and different formulas operate at different levels of the organization. For example, some formulas calculate an overall Customer Lifetime Value, which tells you the value of an average customer. Knowing the value of an average customer makes sense if an organization is trying to calculate the return on investment of different customer acquisition and retainment efforts.

That’s not my goal here, however. I want to know how valuable a customer is to the company on an individual basis. Some customers are more valuable than others and I want to be able to identify them. The main goal here is to be able to identify early on the potential lifetime value of a customer and then use the treatments mentioned above.

I have seen some customer-level CLV calculations simply multiply average yearly monetary value by average customer tenure. That works well as a rough estimate, but I want to be more precise, and a more precise tenure estimate will lead to a more precise CLV estimate.
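
To make that concrete, here is a minimal sketch of both flavors of estimate with made-up numbers; the customer-level version is the one this series is working toward, with tenure eventually coming from a model:

# A minimal sketch of both CLV flavors; all numbers are made up for illustration
library(tidyverse)

customers <- tibble(
  customer_id  = 1:4,
  yearly_value = c(1200, 800, 1500, 950),  # average yearly revenue per customer
  tenure_years = c(2, 10, 15, 7)           # observed (or predicted) tenure in years
)

# Rough, organization-level estimate: one number for the "average" customer
mean(customers$yearly_value) * mean(customers$tenure_years)

# Customer-level estimate: tenure_years would eventually come from a model
customers %>%
  mutate(clv = yearly_value * tenure_years)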

In this example, our customers are going to be MLB hitters. We are going to be predicting how many years a hitter stays in the MLB. Then in the next blog post, we will use the predicted tenure to estimate the lifetime value for hitters.

The Data

Here are the packages used for this post:

Show code
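
The chunk itself is folded away; roughly, it loads something like the following (the exact list is inferred from the functions used later in the post, so treat it as an approximation):

# Approximate setup, reconstructed from the functions used throughout the post
library(tidyverse)    # dplyr, tidyr, ggplot2, forcats, ...
library(lubridate)    # ymd(), year()
library(Lahman)       # the baseball datasets
library(gt)           # formatted tables
library(broom)        # tidy() for model output
library(performance)  # compare_performance(), check_model()
library(tidymodels)   # initial_split(), linear_reg(), boost_tree(), metrics()
library(GGally)       # ggpairs()
# corrplot, corrr, vip, and DataExplorer are loaded (or called via ::) where used;
# the xgboost package also needs to be installed for the boosted tree engine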

We’ll again be using the R package Lahman. The Lahman package contains multiple datasets; here is a description of each:

Show code
library(Lahman)
Lahman::LahmanData %>% 
  as_tibble() %>%
  gt()
file class nobs nvar title
AllstarFull data.frame 4993 8 AllstarFull table
Appearances data.frame 99466 21 Appearances table
AwardsManagers data.frame 171 6 AwardsManagers table
AwardsPlayers data.frame 6026 6 AwardsPlayers table
AwardsShareManagers data.frame 401 7 AwardsShareManagers table
AwardsSharePlayers data.frame 6705 7 AwardsSharePlayers table
Batting data.frame 99846 22 Batting table
BattingPost data.frame 11294 22 BattingPost table
CollegePlaying data.frame 17350 3 CollegePlaying table
Fielding data.frame 167938 18 Fielding table
FieldingOF data.frame 12028 6 FieldingOF table
FieldingPost data.frame 11924 17 FieldingPost data
HallOfFame data.frame 4088 9 Hall of Fame Voting Data
Managers data.frame 3370 10 Managers table
ManagersHalf data.frame 93 10 ManagersHalf table
Master data.frame 18589 26 Master table
Pitching data.frame 43330 30 Pitching table
PitchingPost data.frame 4945 30 PitchingPost table
Salaries data.frame 24758 5 Salaries table
Schools data.frame 1207 5 Schools table
SeriesPost data.frame 298 9 SeriesPost table
Teams data.frame 2775 48 Teams table
TeamsFranchises data.frame 120 4 TeamFranchises table
TeamsHalf data.frame 52 10 TeamsHalf table

The People dataset looks like a good place to start.

Show code
data(People)
glimpse(People) 
Rows: 20,093
Columns: 26
$ playerID     <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ birthYear    <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877,~
$ birthMonth   <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 3, 10, 2, 8,~
$ birthDay     <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 16, 22, 16,~
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA~
$ birthState   <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA"~
$ birthCity    <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm B~
$ deathYear    <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1~
$ deathMonth   <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, 2, 6, NA, NA,~
$ deathDay     <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, 13, 11, ~
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "US~
$ deathState   <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "~
$ deathCity    <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberto~
$ nameFirst    <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fern~
$ nameLast     <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "A~
$ nameGiven    <chr> "David Allan", "Henry Louis", "Tommie Lee", "Do~
$ weight       <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 16~
$ height       <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 71, 70,~
$ bats         <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, R,~
$ throws       <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, L, L, R,~
$ debut        <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977~
$ finalGame    <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990~
$ retroID      <chr> "aardd001", "aaroh101", "aarot101", "aased001",~
$ bbrefID      <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ deathDate    <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-0~
$ birthDate    <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08~

We can use their debut, finalGame, and birthDate columns. These give us the ability to calculate a player’s year number in the league (1st year, 2nd year, etc.), their age, and their final tenure.
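
As a quick, hypothetical sketch of what those columns give us before any real cleaning (the modeling pipeline below instead counts the number of seasons a player actually has batting records for):

# Hypothetical sketch: debut age and a calendar-based career length from People.
# The real pipeline below derives tenure from seasons with batting records.
People %>%
  filter(!is.na(debut), !is.na(finalGame)) %>%
  mutate(
    debut_age    = as.double(ymd(debut) - birthDate) / 365,
    career_years = year(ymd(finalGame)) - year(ymd(debut)) + 1
  ) %>%
  select(playerID, nameLast, debut, finalGame, debut_age, career_years) %>%
  head()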

We are going to be predicting tenure for hitters, so we will use the Batting dataset.

Show code
data(Batting)
glimpse(Batting)
Rows: 108,789
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", ~
$ yearID   <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 187~
$ stint    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ teamID   <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, C~
$ lgID     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ G        <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, ~
$ AB       <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36~
$ R        <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24~
$ H        <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 3~
$ X2B      <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, ~
$ X3B      <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0,~
$ HR       <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, ~
$ RBI      <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21~
$ SB       <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0,~
$ CS       <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, ~
$ BB       <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0,~
$ SO       <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, ~
$ IBB      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ HBP      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SH       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SF       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ GIDP     <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, ~

Now let’s combine the People and the Batting datasets.

Show code
pitchers <- Lahman::Pitching %>%
  select(playerID) %>%
  distinct()

old_players <- People %>%
  filter(ymd(debut) < '1961-01-01')

people_tenure <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter(!is.na(finalGame)) %>%
  filter(ymd(finalGame) < '2018-01-01') %>%
  anti_join(old_players) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  anti_join(pitchers) %>%
  filter(lgID %in% c('NL', 'AL')) %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  group_by(playerID) %>%
  mutate(tenure = n()) %>%
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF))     # remove unneeded columns

Let me walk through the code above and explain what is happening and my thought process.

First, we need to identify the pitchers in our Batting dataset and get rid of them.

pitchers <- Lahman::Pitching %>%
  select(playerID) %>%
  distinct()

Next, I want to remove players from older eras.

old_players <- People %>%
  filter(ymd(debut) < '1961-01-01')

Here is a good quote from The Sport Journal explaining the MLB eras:

A common list presented at Baseball-Reference described the eras as the Dead Ball Era (1901-1919), the Live Ball Era (1920-1941), the Integration Era (1942-1960), the Expansion Era (1961-1976), the Free Agency Era (1977-1993) and the Long Ball/Steroid Era (1994-2005) (17). This study runs through the 2011 season and a seventh era will be added and labeled the Post Steroid Era (2006-2011).

I removed anyone who debuted before the Expansion Era (1961). Average tenure might change depending on the decade so this is something we should investigate in the EDA stage.

I then removed anyone who doesn’t have a final game, as well as anyone whose final game came in 2018 or later. I want players who are no longer playing and have finished their MLB careers. The cutoff doesn’t guarantee that, but it comes close enough.

filter(!is.na(finalGame)) %>%
  filter(ymd(finalGame) < '2018-01-01')

The next step is to make sure each player has only one entry per year. Some players have two (or more) rows in a year because they were traded or waived and then picked up by another team. I don’t care about that distinction, so I am going to aggregate their stats to the year level.

group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop")

The next steps:

  1. Only include players from the NL and AL:

filter(lgID %in% c('NL', 'AL'))

  2. Use the battingStats function to calculate rate statistics like Batting Average, On-base Percentage, Slugging, and OPS (the standard formulas are sketched just after this list):

Lahman::battingStats(data = .)

  3. Calculate the year number in the league for each player:

group_by(playerID, stint) %>%
  mutate(year_num = row_number())

  4. Calculate the total tenure for each player:

group_by(playerID) %>%
  mutate(tenure = n())

  5. Calculate the player’s age for each given year:

mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365))

  6. Calculate the decade:

mutate(decade = yearID - (yearID %% 10))

  7. Finally, calculate total walks (TBB) and total at-bats (TAB):

mutate(TBB = BB + IBB + HBP,     # combine all walk types
       TAB = AB + TBB + SH + SF) # create total at-bats
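
For reference, here is a minimal, self-contained sketch of the standard formulas behind those rate stats, using a made-up stat line; battingStats() computes these (plus TB, BABIP, and a few others) for every row of the data:

# Hand-rolled rate-stat formulas for a single, made-up stat line
line <- tibble(AB = 500, H = 150, X2B = 30, X3B = 5, HR = 10, BB = 50, HBP = 5, SF = 5)

line %>%
  mutate(
    TB  = H + X2B + 2 * X3B + 3 * HR,             # total bases: 220
    BA  = H / AB,                                  # batting average: .300
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),   # on-base percentage: ~.366
    SLG = TB / AB,                                 # slugging percentage: .440
    OPS = OBP + SLG                                # on-base plus slugging: ~.806
  )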

All of this together gives us data only for batters who debuted in 1961 or later, with one row for each year a batter was in the league. To illustrate the structure, let’s take a look at my favorite player of all time and the greatest shortstop to ever play baseball: Ozzie Smith. Having each row be one year of a player’s career lets us teach a model what a rookie season looks like for someone who ultimately played in the league for 19 years. Ozzie debuted at age 23, got a ton of at-bats (668), and posted what is likely a below-average OPS.

Show code
people_tenure %>%
  filter(playerID == 'smithoz01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure
Ozzie Smith 1978 1 23.03288 668 0.623 19
Ozzie Smith 1979 2 24.03288 654 0.522 19
Ozzie Smith 1980 3 25.03288 713 0.589 19
Ozzie Smith 1981 4 26.03562 508 0.550 19
Ozzie Smith 1982 5 27.03562 579 0.653 19
Ozzie Smith 1983 6 28.03562 635 0.656 19
Ozzie Smith 1984 7 29.03562 489 0.684 19
Ozzie Smith 1985 8 30.03836 626 0.716 19
Ozzie Smith 1986 9 31.03836 622 0.709 19
Ozzie Smith 1987 10 32.03836 709 0.775 19
Ozzie Smith 1988 11 33.03836 671 0.686 19
Ozzie Smith 1989 12 34.04110 667 0.696 19
Ozzie Smith 1990 13 35.04110 596 0.635 19
Ozzie Smith 1991 14 36.04110 643 0.747 19
Ozzie Smith 1992 15 37.04110 594 0.709 19
Ozzie Smith 1993 16 38.04384 604 0.693 19
Ozzie Smith 1994 17 39.04384 436 0.675 19
Ozzie Smith 1995 18 40.04384 182 0.526 19
Ozzie Smith 1996 19 41.04384 261 0.728 19

EDA

Always look at the distributions of your variables.

Show code
DataExplorer::plot_histogram(people_tenure)

Nothing jumps out at me. We have some outliers, but we will check their effect after modeling.

Next, let’s check out how the variables interact with each other. A good way to investigate this is a correlation matrix. If you are modeling with a linear model, you need to be concerned with how correlated the predictor variables are with each other.

The corrplot package has a great way to visualize this and, on top of that, will cluster your variables together using different clustering techniques. Here I chose a hierarchical clustering method with 5 clusters.

Show code
library(corrplot)

cor_matrix <- people_tenure %>%
  select(-c(playerID, yearID, nameFirst, nameLast, birthDate, 
             debut, finalGame, stint, teamID, lgID)) %>%
  select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
           BA, SlugPct, OBP, OPS, BABIP)) %>%
  cor()

corrplot(cor_matrix, order = "hclust", addrect = 5)

What’s cool about this is how we can see five different types of variables. In the left corner, we have the ‘Time’ stats: year_num & age. We then have the ‘Rate’ stats: BA, OBP, SlugPct, and OPS. Triples get grouped on their own because they happen so infrequently. Our target variable, tenure, is also by itself, and then we have the ‘Counting’ stats in the bottom right: hits, doubles, HRs, SOs, walks, at-bats, & games. Pretty neat to see them grouped like that.

Let’s see how the variables specifically correlate with our target variable Tenure.

Show code
library(corrr)

people_tenure %>%
  select(-c(playerID, yearID, nameFirst, nameLast, birthDate, 
             debut, finalGame, stint, teamID, lgID)) %>%
  select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
           SlugPct, OBP, OPS, BABIP)) %>%
  correlate() %>%
  stretch() %>%
  filter(x == "tenure") %>%
  arrange(desc(r)) %>%
  gt()
x y r
tenure year_num 0.6174973
tenure TAB 0.5603746
tenure TB 0.5591402
tenure H 0.5585071
tenure G 0.5406922
tenure TBB 0.5334992
tenure X2B 0.5233772
tenure HR 0.4550314
tenure SO 0.4323623
tenure age 0.3500512
tenure X3B 0.3350727
tenure OPS 0.3186440
tenure SlugPct 0.3057516
tenure OBP 0.2930648
tenure BABIP 0.1349052
tenure tenure NA

Interesting. We can use these correlations to help determine which variables to include in the models.

Let’s take a look at how the decade relates to tenure to wrap up the EDA section.

Show code
people_tenure %>%
  group_by(decade) %>%
  summarise(avg_tenure = mean(tenure)) %>%
  ggplot(aes(x = as_factor(decade), y = avg_tenure, fill = as_factor(decade))) +
  geom_col() +
  theme(legend.position = "none")

There are slight differences in tenure at the tail ends.

Modeling

I have a rough outline when I model:

  1. Build a simple linear baseline model
  2. Build a simple black-box model
  3. Analyze residuals from black-box model
  4. Feature engineer to try and correct large residuals
  5. Tune hyperparameters for black-box model

Build A Simple Linear Baseline model

The reason to build a simple linear baseline model is to have a reference point for how much the model improves as we try new algorithms, tune hyperparameters, and engineer new features. It also gives you an idea of how the variables relate to the target and how large their effects are.

I recommend building the first model using only the features you think are the most important. Include the bare minimum number of features.

For predicting MLB tenure, I think the most important features are year_num, age, TAB (total at-bats), and OPS (on-base percentage plus slugging percentage). I think year_num is going to be important because the longer you have already been in the league, the longer your final tenure will be. Age will be important because the younger you are, the longer you can stay in the league. Total at-bats represent opportunity: if a player is getting at-bats, either the manager believes in the player or they are performing well. OPS represents performance. It is the sum of on-base percentage and slugging percentage: on-base percentage is the rate at which a batter reaches base, regardless of how many bases the hit was worth, while slugging percentage is total bases divided by at-bats, so it rises the more bases a batter racks up (home run hitters will have a higher slugging percentage than singles hitters). OPS is nice because it accounts for how often a batter gets on base and how many bases they get, all in one metric.

The first model will be based on theory but I am also going to build a few other models using the variables that have the highest correlations to tenure.

Here’s the formula for the theory model: tenure ~ year_num + age + TAB + OPS

Since we have isolated the target and the features we are most interested in, we can visualize the relationships between all of the variables using a pair plot. The GGally package has a function, ggpairs, that produces pair plots and is a great way to explore relationships and distributions.

Show code
people_tenure %>%
  select(tenure, age, year_num, TAB, OPS) %>%
  ggpairs()

Let’s fit the models and compare their performances.

Show code
lm_theory <- lm(formula = tenure ~ year_num + age + TAB + OPS
          , data = people_tenure)

lm_corr1 <- lm(formula = tenure ~ year_num + TAB + OPS
          , data = people_tenure)

lm_corr2 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G
            , data = people_tenure)

lm_corr3 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G + TBB + X2B + HR + SO
            , data = people_tenure)
lm_corr4 <- lm(formula = tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP
            , data = people_tenure)

performance::compare_performance(lm_theory, lm_corr1, lm_corr2, lm_corr3, lm_corr4) %>%
  as_tibble() %>%
  select(Name, Model, R2, RMSE) %>%
  arrange(desc(R2)) %>%
  gt()
Name Model R2 RMSE
lm_corr4 lm 0.6110783 3.248215
lm_theory lm 0.6017334 3.287007
lm_corr3 lm 0.5387190 3.537501
lm_corr2 lm 0.5296241 3.572204
lm_corr1 lm 0.5251697 3.589078

Nice, this is good to see. The theory model came in second and was only marginally behind the model that used every variable. An R^2 of .60 is also pretty good for only using 4 variables.

Let’s look at the coefficients to see which ones have the greatest impact on tenure.

Show code
tidy(lm_theory) %>%
  mutate(estimate = round(estimate, digits = 3),
         std.error = round(std.error, digits = 3),
         statistic = round(statistic, digits = 3),
         p.value = round(p.value, digits = 3)) %>%
  gt()
term estimate std.error statistic p.value
(Intercept) 19.966 0.250 79.895 0
year_num 1.221 0.011 115.778 0
age -0.706 0.010 -68.910 0
TAB 0.007 0.000 60.932 0
OPS 1.896 0.109 17.441 0

All of the features are significant predictors of tenure, with year_num being by far the most important. Each additional year of year_num is associated with about 1.22 more years of tenure, and each additional year of age is associated with about 0.7 fewer years. The theory that time in the league, age, opportunity, and performance determine tenure seems to hold up.
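
As a quick sanity check on those coefficients, we can plug a hypothetical stat line into the fitted equation by hand (the player and numbers below are made up for illustration):

# Hand-check of the fitted equation for a hypothetical 23-year-old rookie
# with 600 total at-bats and a .700 OPS (coefficients rounded from the table above)
19.966 + 1.221 * 1 - 0.706 * 23 + 0.007 * 600 + 1.896 * 0.700
# ~10.5 years of predicted tenure

# The same thing using the fitted model directly
predict(lm_theory, newdata = tibble(year_num = 1, age = 23, TAB = 600, OPS = 0.700))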

Let’s check the model for the appropriate assumptions.

Show code
check_model(lm_theory, panel = TRUE)

Build a Simple Blackbox Model

Now it is time to build the simple black-box model. In this case, we are going to use a gradient boosted tree algorithm, xgBoost. I call it simple because on the first go-around we will use the default hyperparameters. There is a great collection of packages called tidymodels that lets us create workflows and quickly swap different machine learning algorithms. To learn more, the Tidy Modeling with R bookdown (TMwR) is a great resource.

First, split the data into train and test datasets.

Show code
set.seed(123)
pt_split <- initial_split(people_tenure, prop = 3/4)
pt_train <- training(pt_split)
pt_test <- testing(pt_split)
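
One nice consequence of the tidymodels interface mentioned above is that trying a different algorithm only means changing the model specification; the fit() call stays the same. A hypothetical sketch (not fit or compared anywhere in this post, and assuming the ranger package is installed):

# Hypothetical illustration only: a random forest spec drops into the same fit() call
rf_sketch <- rand_forest(mode = "regression") %>%
  set_engine("ranger") %>%
  fit(tenure ~ year_num + age + TAB + OPS, data = pt_train)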

We define the two formulas that we are going to be using and fit the models. We then collect the metrics and put them together for easy comparison.

Show code
theory_formula <- formula(tenure ~ year_num + age + TAB + OPS)
all_metrics_formula <- formula(tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP)

lm_theory <- linear_reg() %>%
  fit(theory_formula, data = pt_train)

lm_all <- linear_reg() %>%
  fit(all_metrics_formula, data = pt_train)

xgb_theory <- boost_tree(mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(theory_formula, data = pt_train)

xgb_all <- boost_tree(mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(all_metrics_formula, data = pt_train)

lm_theory_metrics <- lm_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "lm_theory") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

lm_all_metrics <- lm_all %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "lm_all") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

xgb_theory_metrics <- xgb_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "xgb_theory") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

xgb_all_metrics <- xgb_all %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "xgb_all") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

lm_theory_metrics %>%
  bind_rows(lm_all_metrics, xgb_theory_metrics, xgb_all_metrics) %>%
  arrange(rmse) %>%
  gt()
model rmse rsq mae
xgb_all 3.204862 0.6282084 2.434345
xgb_theory 3.218378 0.6249189 2.449568
lm_theory 3.315353 0.6019646 2.581725
lm_all 3.315353 0.6019646 2.581725

The xgBoost model using the theory formula and default hyperparameters outperformed the linear model, though not by much. We know the most important variables in the linear model were year_num, age, TAB, and OPS, in that order. Let’s see which variables were most important to the xgBoost theory model.

Show code
library(vip)

xgb_theory %>%
  vip(geom = "col")

Age becomes less important in the xgBoost model and total at-bats becomes more important.

Examine xgBoost Predictions

We know how the model performed on the test data: on average, the xgBoost model was off by about 2.4 years (the MAE). Let’s see how it predicted Ozzie Smith.

Show code
xgb_theory %>% 
  predict(people_tenure) %>%
  bind_cols(people_tenure) %>%
  filter(playerID == 'smithoz01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Ozzie Smith 1978 1 23.03288 668 0.623 19 11.06559
Ozzie Smith 1979 2 24.03288 654 0.522 19 10.88405
Ozzie Smith 1980 3 25.03288 713 0.589 19 11.97203
Ozzie Smith 1981 4 26.03562 508 0.550 19 12.02471
Ozzie Smith 1982 5 27.03562 579 0.653 19 11.93142
Ozzie Smith 1983 6 28.03562 635 0.656 19 12.37301
Ozzie Smith 1984 7 29.03562 489 0.684 19 11.75753
Ozzie Smith 1985 8 30.03836 626 0.716 19 13.14181
Ozzie Smith 1986 9 31.03836 622 0.709 19 13.40460
Ozzie Smith 1987 10 32.03836 709 0.775 19 15.87435
Ozzie Smith 1988 11 33.03836 671 0.686 19 15.32062
Ozzie Smith 1989 12 34.04110 667 0.696 19 15.36666
Ozzie Smith 1990 13 35.04110 596 0.635 19 15.70914
Ozzie Smith 1991 14 36.04110 643 0.747 19 18.33076
Ozzie Smith 1992 15 37.04110 594 0.709 19 17.10520
Ozzie Smith 1993 16 38.04384 604 0.693 19 17.98851
Ozzie Smith 1994 17 39.04384 436 0.675 19 18.13917
Ozzie Smith 1995 18 40.04384 182 0.526 19 18.45537
Ozzie Smith 1996 19 41.04384 261 0.728 19 19.86943

You can see the model is off by quite a bit for Ozzie Smith. It predicts around 11 to 12 years until about year number 8. From the average tenure by decade plot, we know the average tenure is around 9.5 years, so the model was defaulting close to the average. I picked Ozzie not only because he is my favorite player, but because his value was derived from his defense, and this model currently does not account for defensive production. Ozzie Smith won 13 Gold Gloves, meaning he was considered the best fielder at his position for 13 of his 19 years in the MLB. We would expect the model to be off for this type of player, and in the future we will need to account for it.
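
As a rough sketch of the kind of defensive feature we could add later (not used in this post, and assuming the award is labeled "Gold Glove" in the AwardsPlayers table), we could count Gold Gloves won through each season and join that onto our data:

# Hypothetical future feature: cumulative Gold Gloves won through each season
gold_glove_seasons <- Lahman::AwardsPlayers %>%
  filter(awardID == "Gold Glove") %>%
  distinct(playerID, yearID) %>%
  mutate(gold_glove = 1)

people_tenure %>%
  left_join(gold_glove_seasons, by = c("playerID", "yearID")) %>%
  mutate(gold_glove = coalesce(gold_glove, 0)) %>%
  group_by(playerID) %>%
  arrange(yearID, .by_group = TRUE) %>%
  mutate(gold_gloves_to_date = cumsum(gold_glove)) %>%
  ungroup()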

Let’s take a look at a great offensive player and see if the model does better. Unfortunately, the greatest offensive player, Hank Aaron, isn’t in our dataset because he debuted before 1961, but the next best hitter in our dataset might be Tony Gwynn so let’s take a look at him.

Show code
xgb_theory %>% 
  predict(people_tenure) %>%
  bind_cols(people_tenure) %>%
  filter(playerID == 'gwynnto01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Tony Gwynn 1982 1 21.66301 209 0.726 20 11.95321
Tony Gwynn 1983 2 22.66301 339 0.727 20 11.77514
Tony Gwynn 1984 3 23.66301 688 0.854 20 15.18588
Tony Gwynn 1985 4 24.66575 675 0.772 20 13.59647
Tony Gwynn 1986 5 25.66575 712 0.848 20 16.38655
Tony Gwynn 1987 6 26.66575 706 0.958 20 16.89046
Tony Gwynn 1988 7 27.66575 591 0.788 20 13.86852
Tony Gwynn 1989 8 28.66849 695 0.813 20 16.08902
Tony Gwynn 1990 9 29.66849 649 0.772 20 16.23166
Tony Gwynn 1991 10 30.66849 577 0.787 20 15.32837
Tony Gwynn 1992 11 31.66849 581 0.786 20 15.56111
Tony Gwynn 1993 12 32.67123 545 0.895 20 17.28474
Tony Gwynn 1994 13 33.67123 491 1.022 20 17.95643
Tony Gwynn 1995 14 34.67123 587 0.888 20 18.29347
Tony Gwynn 1996 15 35.67123 510 0.841 20 18.77420
Tony Gwynn 1997 16 36.67397 663 0.956 20 19.80710
Tony Gwynn 1998 17 37.67397 511 0.865 20 19.59279
Tony Gwynn 1999 18 38.67397 451 0.858 20 19.84833
Tony Gwynn 2000 19 39.67397 142 0.805 20 20.26111
Tony Gwynn 2001 20 40.67671 113 0.845 20 21.19039

It’s good to see the model start higher for Gwynn’s rookie year and even predict 15 years of tenure by year number 3. The model then gets closer and closer to the right answer as the years pass.

Now, let’s do the fun part and see how the model predicts some current MLB players early in their careers. The three I am going to look at are young phenoms: Ronald Acuna, Fernando Tatis, & Juan Soto.

Show code
young_phenoms <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter(playerID %in% c('acunaro01', 'tatisfe02', 'sotoju01')) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  filter(yearID == 2019) %>%              # keep only the 2019 season
  group_by(playerID) %>%
  mutate(tenure = n()) %>%
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%  # match the decade definition used earlier
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF))     # remove unneeded columns

xgb_theory %>% 
  predict(young_phenoms) %>%
  bind_cols(young_phenoms) %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Ronald Acuna 2019 2 21.05205 716 0.883 1 17.25071
Juan Soto 2019 2 20.20000 662 0.949 1 16.44127
Fernando Tatis 2019 1 20.01096 373 0.969 1 15.79035

Based on 2019 data, the model expects Ronald Acuna to be in the league the longest, but they are all very close to each other. Again, this is good to see. Most experts would predict these young players to be in the league for a very long time, and 16 years is a long time.

Now let’s look at rookies who debuted in 2018, 2019, or 2020 and predict their tenure using their most recent (2020) season. 2020 was only a 60-game season due to the COVID-19 pandemic, so we will extrapolate total at-bats from the 60-game season to a regular 162-game season.

Show code
rooks <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter( year(ymd(debut)) %in% c(2018, 2019, 2020)) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  group_by(playerID) %>%
  filter(yearID == max(yearID)) %>%       # keep each player's most recent season
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%  # match the decade definition used earlier
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF)) %>% # remove unneeded columns
  mutate(TAB = (TAB/60)*162)           # scale TAB from a 60-game pace to a 162-game pace

xgb_theory %>% 
  predict(rooks) %>%
  bind_cols(rooks) %>%
  select(nameFirst, nameLast, year_num, age, TAB, OPS, .pred) %>%
  arrange(desc(.pred)) %>%
  head(10) %>%
  gt()
nameFirst nameLast year_num age TAB OPS .pred
Juan Soto 3 21.20000 561.6 1.185 17.60920
Fernando Tatis 2 21.01096 696.6 0.937 16.89737
Luis Garcia 1 19.64110 375.3 0.668 16.45621
Ronald Acuna 3 22.05205 550.8 0.987 15.95809
Vladimir Guerrero 2 20.81096 658.8 0.791 15.93926
Kyle Tucker 3 22.96986 621.0 0.837 14.69650
Eloy Jimenez 2 23.10959 610.2 0.891 13.75860
Franmil Reyes 3 24.50411 650.7 0.794 13.58006
Bo Bichette 2 21.84110 348.3 0.840 13.55870
Trent Grisham 2 23.18082 677.7 0.808 13.43157

All of those young players are among the best in the league, and we would expect them to be near the top of the list when predicting tenure.

Next Steps

Now that we have a base model, we can start to improve it. The next steps are:

  1. Analyze the cases where our model was most wrong and look for patterns
  2. Based on step one, introduce new features to correct the large errors
  3. Finally, tune the hyperparameters for our model.

All those steps will be covered in the next blog post.
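
As a small preview of step 1, here is a minimal sketch of pulling the largest test-set errors from the xgBoost theory model:

# A minimal sketch of step 1: inspect the largest test-set residuals
xgb_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  mutate(resid = tenure - .pred) %>%
  arrange(desc(abs(resid))) %>%
  select(playerID, yearID, year_num, age, TAB, OPS, tenure, .pred, resid) %>%
  head(10)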

Footnotes

    Corrections

    If you see mistakes or want to suggest changes, please create an issue on the source repository.

    Citation

    For attribution, please cite this work as

    Oberdiear (2021, June 24). Louis Oberdiear: Predicting Customer Tenure...with MLB Hitters. Retrieved from https://thelob.blog/posts/2021-07-08-predicting-customer-tenurewith-mlb-hitters/

    BibTeX citation

    @misc{oberdiear2021predicting,
      author = {Oberdiear, Louis},
      title = {Louis Oberdiear: Predicting Customer Tenure...with MLB Hitters},
      url = {https://thelob.blog/posts/2021-07-08-predicting-customer-tenurewith-mlb-hitters/},
      year = {2021}
    }