Predicting Customer Tenure…with MLB Hitters

customer tenure, customer lifetime value, customer series, data science, rstats

How long will a customer stay with our business, or, in this case, how long will an MLB hitter stay in the MLB? In this post, I go through the value of knowing how long a customer will be with an organization, EDA of MLB data, how to clean and set up data for modeling, and how to fit a linear model and a gradient boosted tree model with xgBoost.

Author

Louis Oberdiear

Published

June 24, 2021

Citation

Oberdiear, 2021

Intro: The Why

I believe knowing how long a customer will stay with your business is inherently valuable, but why exactly?

First scenario: Customer A

Let’s pose a few scenarios. The first scenario is that we know Customer A will only be a customer with us for two years. In this scenario, customers stay with our business for ten years on average, so two years is relatively short. Is it valuable to know that Customer A won’t be with us very long? What can we do about it?

Well, in my mind, we have two different paths. The Abandonment path and the Retainment path. Knowing Customer A will only be with us for two years, we can simply cut our losses and not put any more money into Customer A whether that be through advertisement or discounts. This would be the Abandonment path.

On the other hand, we could try and figure out different ways to extend their stay with us through different interventions. This would be the Retainment path. We could try several different interventions and measure their effectiveness and their return on investment.

Second Scenario: Customer B

What about the other end of the spectrum, where we know a customer will be with us for 15 years? This is Customer B. Is it valuable to know we have a potentially very loyal customer?

Again, I see two paths of treatment. The first path is the Stay the Course path where we treat them exactly how we would normally treat them. We know they will be with us for a long time so why change anything?

The second path is the Up Sell path, where we give them more attention and try to get our hooks in deeper by offering them more and more products. Maybe we even give them special discounts and reward them for being a loyal customer, hopefully turning Customer B into a brand ambassador who will recommend our organization to their friends and family.

I think looking at these two extreme scenarios shows how knowing customer tenure is valuable, but I have another reason in mind. We are going to predict customer tenure to calculate Customer Lifetime Value.

Customer Lifetime Value

There are many ways to calculate Customer Lifetime Value, and different formulas operate at different levels of the organization. For example, some formulas calculate an overall Customer Lifetime Value, which tells you the value of an average customer. Knowing the value of an average customer makes sense if an organization is trying to calculate the return on investment of different customer acquisition and retainment efforts.

That’s not my goal here, however. I want to know how valuable a customer is to the company on an individual basis. Some customers are more valuable than others and I want to be able to identify them. The main goal here is to be able to identify early on the potential lifetime value of a customer and then use the treatments mentioned above.

I have seen some customer-level CLV calculations simply multiply average yearly monetary value by average customer tenure. That works well as a rough estimate, but I want to be more precise, and a more precise tenure estimate will lead to a more precise CLV estimate.
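
To make that concrete, here is a minimal sketch of both flavors of estimate with made-up numbers; the customer-level version is the one this series is working toward, with tenure eventually coming from a model:

# A minimal sketch of both CLV flavors; all numbers are made up for illustration
library(tidyverse)

customers <- tibble(
  customer_id  = 1:4,
  yearly_value = c(1200, 800, 1500, 950),  # average yearly revenue per customer
  tenure_years = c(2, 10, 15, 7)           # observed (or predicted) tenure in years
)

# Rough, organization-level estimate: one number for the "average" customer
mean(customers$yearly_value) * mean(customers$tenure_years)

# Customer-level estimate: tenure_years would eventually come from a model
customers %>%
  mutate(clv = yearly_value * tenure_years)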

In this example, our customers are going to be MLB hitters. We are going to be predicting how many years a hitter stays in the MLB. Then in the next blog post, we will use the predicted tenure to estimate the lifetime value for hitters.

The Data

Here are the packages used for this post:

Show code
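
The chunk itself is folded away; roughly, it loads something like the following (the exact list is inferred from the functions used later in the post, so treat it as an approximation):

# Approximate setup, reconstructed from the functions used throughout the post
library(tidyverse)    # dplyr, tidyr, ggplot2, forcats, ...
library(lubridate)    # ymd(), year()
library(Lahman)       # the baseball datasets
library(gt)           # formatted tables
library(broom)        # tidy() for model output
library(performance)  # compare_performance(), check_model()
library(tidymodels)   # initial_split(), linear_reg(), boost_tree(), metrics()
library(GGally)       # ggpairs()
# corrplot, corrr, vip, and DataExplorer are loaded (or called via ::) where used;
# the xgboost package also needs to be installed for the boosted tree engine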

We’ll again be using the R package Lahman. The Lahman package contains multiple datasets; here is a description of each:

Show code
library(Lahman)
Lahman::LahmanData %>% 
  as_tibble() %>%
  gt()
file class nobs nvar title
AllstarFull data.frame 4993 8 AllstarFull table
Appearances data.frame 99466 21 Appearances table
AwardsManagers data.frame 171 6 AwardsManagers table
AwardsPlayers data.frame 6026 6 AwardsPlayers table
AwardsShareManagers data.frame 401 7 AwardsShareManagers table
AwardsSharePlayers data.frame 6705 7 AwardsSharePlayers table
Batting data.frame 99846 22 Batting table
BattingPost data.frame 11294 22 BattingPost table
CollegePlaying data.frame 17350 3 CollegePlaying table
Fielding data.frame 167938 18 Fielding table
FieldingOF data.frame 12028 6 FieldingOF table
FieldingPost data.frame 11924 17 FieldingPost data
HallOfFame data.frame 4088 9 Hall of Fame Voting Data
Managers data.frame 3370 10 Managers table
ManagersHalf data.frame 93 10 ManagersHalf table
Master data.frame 18589 26 Master table
Pitching data.frame 43330 30 Pitching table
PitchingPost data.frame 4945 30 PitchingPost table
Salaries data.frame 24758 5 Salaries table
Schools data.frame 1207 5 Schools table
SeriesPost data.frame 298 9 SeriesPost table
Teams data.frame 2775 48 Teams table
TeamsFranchises data.frame 120 4 TeamFranchises table
TeamsHalf data.frame 52 10 TeamsHalf table

The People dataset looks like a good place to start.

Show code
data(People)
glimpse(People) 
Rows: 20,093
Columns: 26
$ playerID     <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ birthYear    <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877,~
$ birthMonth   <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 3, 10, 2, 8,~
$ birthDay     <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 16, 22, 16,~
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA~
$ birthState   <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA"~
$ birthCity    <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm B~
$ deathYear    <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1~
$ deathMonth   <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, 2, 6, NA, NA,~
$ deathDay     <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, 13, 11, ~
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "US~
$ deathState   <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "~
$ deathCity    <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberto~
$ nameFirst    <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fern~
$ nameLast     <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "A~
$ nameGiven    <chr> "David Allan", "Henry Louis", "Tommie Lee", "Do~
$ weight       <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 16~
$ height       <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 71, 70,~
$ bats         <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, R,~
$ throws       <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, L, L, R,~
$ debut        <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977~
$ finalGame    <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990~
$ retroID      <chr> "aardd001", "aaroh101", "aarot101", "aased001",~
$ bbrefID      <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ deathDate    <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-0~
$ birthDate    <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08~

We can use their debut, finalGame, and birthDate columns. These give us the ability to calculate a player’s year number in the league (1st year, 2nd year, etc.), their age, and their final tenure.
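
As a quick, hypothetical sketch of what those columns give us before any real cleaning (the modeling pipeline below instead counts the number of seasons a player actually has batting records for):

# Hypothetical sketch: debut age and a calendar-based career length from People.
# The real pipeline below derives tenure from seasons with batting records.
People %>%
  filter(!is.na(debut), !is.na(finalGame)) %>%
  mutate(
    debut_age    = as.double(ymd(debut) - birthDate) / 365,
    career_years = year(ymd(finalGame)) - year(ymd(debut)) + 1
  ) %>%
  select(playerID, nameLast, debut, finalGame, debut_age, career_years) %>%
  head()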

We are going to be predicting tenure for hitters, so we will use the Batting dataset.

Show code
data(Batting)
glimpse(Batting)
Rows: 108,789
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", ~
$ yearID   <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 187~
$ stint    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ teamID   <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, C~
$ lgID     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ G        <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, ~
$ AB       <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36~
$ R        <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24~
$ H        <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 3~
$ X2B      <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, ~
$ X3B      <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0,~
$ HR       <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, ~
$ RBI      <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21~
$ SB       <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0,~
$ CS       <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, ~
$ BB       <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0,~
$ SO       <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, ~
$ IBB      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ HBP      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SH       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SF       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ GIDP     <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, ~

Now let’s combine the People and the Batting datasets.

Show code
pitchers <- Lahman::Pitching %>%
  select(playerID) %>%
  distinct()

old_players <- People %>%
  filter(ymd(debut) < '1961-01-01')

people_tenure <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter(!is.na(finalGame)) %>%
  filter(ymd(finalGame) < '2018-01-01') %>%
  anti_join(old_players) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  anti_join(pitchers) %>%
  filter(lgID %in% c('NL', 'AL')) %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  group_by(playerID) %>%
  mutate(tenure = n()) %>%
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF))     # remove unneeded columns

Let me walk through the code above and explain what is happening and my thought process.

First, we need to identify the pitchers in our Batting dataset and get rid of them.

pitchers <- Lahman::Pitching %>%
  select(playerID) %>%
  distinct()

Next, I want to remove players from older eras.

old_players <- People %>%
  filter(ymd(debut) < '1961-01-01')

Here is a good quote from The Sport Journal explaining the MLB eras:

A common list presented at Baseball-Reference described the eras as the Dead Ball Era (1901-1919), the Live Ball Era (1920-1941), the Integration Era (1942-1960), the Expansion Era (1961-1976), the Free Agency Era (1977-1993) and the Long Ball/Steroid Era (1994-2005) (17). This study runs through the 2011 season and a seventh era will be added and labeled the Post Steroid Era (2006-2011).

I removed anyone who debuted before the Expansion Era (1961). Average tenure might change depending on the decade so this is something we should investigate in the EDA stage.

I then removed anyone who doesn’t have a final game, as well as anyone whose final game came in 2018 or later. I want players who are no longer playing and have finished their MLB careers. The cutoff doesn’t guarantee that, but it comes close enough.

filter(!is.na(finalGame)) %>%
  filter(ymd(finalGame) < '2018-01-01')

The next step is to make sure each player has only one entry per year. Some players have two (or more) rows in a year because they were traded or waived and then picked up by another team. I don’t care about that distinction, so I am going to aggregate their stats to the year level.

group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop")

The next steps:

  1. Only include players from the NL and AL:

filter(lgID %in% c('NL', 'AL'))

  2. Use the battingStats function to calculate rate statistics like Batting Average, On-base Percentage, Slugging, and OPS (the standard formulas are sketched just after this list):

Lahman::battingStats(data = .)

  3. Calculate the year number in the league for each player:

group_by(playerID, stint) %>%
  mutate(year_num = row_number())

  4. Calculate the total tenure for each player:

group_by(playerID) %>%
  mutate(tenure = n())

  5. Calculate the player’s age for each given year:

mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365))

  6. Calculate the decade:

mutate(decade = yearID - (yearID %% 10))

  7. Finally, calculate total walks (TBB) and total at-bats (TAB):

mutate(TBB = BB + IBB + HBP,     # combine all walk types
       TAB = AB + TBB + SH + SF) # create total at-bats
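
For reference, here is a minimal, self-contained sketch of the standard formulas behind those rate stats, using a made-up stat line; battingStats() computes these (plus TB, BABIP, and a few others) for every row of the data:

# Hand-rolled rate-stat formulas for a single, made-up stat line
line <- tibble(AB = 500, H = 150, X2B = 30, X3B = 5, HR = 10, BB = 50, HBP = 5, SF = 5)

line %>%
  mutate(
    TB  = H + X2B + 2 * X3B + 3 * HR,             # total bases: 220
    BA  = H / AB,                                  # batting average: .300
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),   # on-base percentage: ~.366
    SLG = TB / AB,                                 # slugging percentage: .440
    OPS = OBP + SLG                                # on-base plus slugging: ~.806
  )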

All of this together gives us data only for batters who debuted in 1961 or later, with one row for each year a batter was in the league. To illustrate the structure, let’s take a look at my favorite player of all time and the greatest shortstop to ever play baseball: Ozzie Smith. Having each row be one year of a player’s career lets us teach a model what a rookie season looks like for someone who ultimately played in the league for 19 years. Ozzie debuted at age 23, got a ton of at-bats (668), and posted what is likely a below-average OPS.

Show code
people_tenure %>%
  filter(playerID == 'smithoz01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure
Ozzie Smith 1978 1 23.03288 668 0.623 19
Ozzie Smith 1979 2 24.03288 654 0.522 19
Ozzie Smith 1980 3 25.03288 713 0.589 19
Ozzie Smith 1981 4 26.03562 508 0.550 19
Ozzie Smith 1982 5 27.03562 579 0.653 19
Ozzie Smith 1983 6 28.03562 635 0.656 19
Ozzie Smith 1984 7 29.03562 489 0.684 19
Ozzie Smith 1985 8 30.03836 626 0.716 19
Ozzie Smith 1986 9 31.03836 622 0.709 19
Ozzie Smith 1987 10 32.03836 709 0.775 19
Ozzie Smith 1988 11 33.03836 671 0.686 19
Ozzie Smith 1989 12 34.04110 667 0.696 19
Ozzie Smith 1990 13 35.04110 596 0.635 19
Ozzie Smith 1991 14 36.04110 643 0.747 19
Ozzie Smith 1992 15 37.04110 594 0.709 19
Ozzie Smith 1993 16 38.04384 604 0.693 19
Ozzie Smith 1994 17 39.04384 436 0.675 19
Ozzie Smith 1995 18 40.04384 182 0.526 19
Ozzie Smith 1996 19 41.04384 261 0.728 19

EDA

Always look at the distributions of your variables.

Show code
DataExplorer::plot_histogram(people_tenure)

Nothing jumps out at me. We have some outliers, but we will check their effect after modeling.

Next, let’s check out how the variables interact with each other. A good way to investigate this is a correlation matrix. If you are modeling with a linear model, you need to be concerned with how correlated the predictor variables are with each other.

The corrplot package has a great way to visualize this and, on top of that, will cluster your variables together using different clustering techniques. Here I chose a hierarchical clustering method with 5 clusters.

Show code
library(corrplot)

cor_matrix <- people_tenure %>%
  select(-c(playerID, yearID, nameFirst, nameLast, birthDate, 
             debut, finalGame, stint, teamID, lgID)) %>%
  select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
           BA, SlugPct, OBP, OPS, BABIP)) %>%
  cor()

corrplot(cor_matrix, order = "hclust", addrect = 5)

What’s cool about this is how we can see five different types of variables. In the left corner, we have the ‘Time’ stats: year_num & age. We then have the ‘Rate’ stats: BA, OBP, SlugPct, and OPS. Triples get grouped on their own because they happen so infrequently. Our target variable, tenure, is also by itself, and then we have the ‘Counting’ stats in the bottom right: hits, doubles, HRs, SOs, walks, at-bats, & games. Pretty neat to see them grouped like that.

Let’s see how the variables specifically correlate with our target variable Tenure.

Show code
library(corrr)

people_tenure %>%
  select(-c(playerID, yearID, nameFirst, nameLast, birthDate, 
             debut, finalGame, stint, teamID, lgID)) %>%
  select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
           SlugPct, OBP, OPS, BABIP)) %>%
  correlate() %>%
  stretch() %>%
  filter(x == "tenure") %>%
  arrange(desc(r)) %>%
  gt()
x y r
tenure year_num 0.6174973
tenure TAB 0.5603746
tenure TB 0.5591402
tenure H 0.5585071
tenure G 0.5406922
tenure TBB 0.5334992
tenure X2B 0.5233772
tenure HR 0.4550314
tenure SO 0.4323623
tenure age 0.3500512
tenure X3B 0.3350727
tenure OPS 0.3186440
tenure SlugPct 0.3057516
tenure OBP 0.2930648
tenure BABIP 0.1349052
tenure tenure NA

Interesting. We can use these correlations to help determine which variables to include in the models.

Let’s take a look at how the decade relates to tenure to wrap up the EDA section.

Show code
people_tenure %>%
  group_by(decade) %>%
  summarise(avg_tenure = mean(tenure)) %>%
  ggplot(aes(x = as_factor(decade), y = avg_tenure, fill = as_factor(decade))) +
  geom_col() +
  theme(legend.position = "none")

There are slight differences in tenure at the tail ends.

Modeling

I have a rough outline when I model:

  1. Build a simple linear baseline model
  2. Build a simple black-box model
  3. Analyze residuals from black-box model
  4. Feature engineer to try and correct large residuals
  5. Tune hyperparameters for black-box model

Build A Simple Linear Baseline model

The reason to build a simple linear baseline model is to have a reference point for how much the model improves as we try new algorithms, tune hyperparameters, and engineer new features. It also gives you an idea of how the variables relate to the target and how large their effects are.

I recommend building the first model using only the features you think are the most important. Include the bare minimum number of features.

For predicting MLB tenure, I think the most important features are year_num, age, TAB (total at-bats), and OPS (on-base percentage plus slugging percentage). I think year_num is going to be important because the longer you have already been in the league, the longer your final tenure will be. Age will be important because the younger you are, the longer you can stay in the league. Total at-bats represent opportunity: if a player is getting at-bats, either the manager believes in the player or they are performing well. OPS represents performance. It is the sum of on-base percentage and slugging percentage: on-base percentage is the rate at which a batter reaches base, regardless of how many bases the hit was worth, while slugging percentage is total bases divided by at-bats, so it rises the more bases a batter racks up (home run hitters will have a higher slugging percentage than singles hitters). OPS is nice because it accounts for how often a batter gets on base and how many bases they get, all in one metric.

The first model will be based on theory but I am also going to build a few other models using the variables that have the highest correlations to tenure.

Here’s the formula for the theory model: tenure ~ year_num + age + TAB + OPS

Since we have isolated the target and the features we are most interested in, we can visualize the relationships between all of the variables using a pair plot. The GGally package has a function, ggpairs, that produces pair plots and is a great way to explore relationships and distributions.

Show code
people_tenure %>%
  select(tenure, age, year_num, TAB, OPS) %>%
  ggpairs()

Let’s fit the models and compare their performances.

Show code
lm_theory <- lm(formula = tenure ~ year_num + age + TAB + OPS
          , data = people_tenure)

lm_corr1 <- lm(formula = tenure ~ year_num + TAB + OPS
          , data = people_tenure)

lm_corr2 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G
            , data = people_tenure)

lm_corr3 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G + TBB + X2B + HR + SO
            , data = people_tenure)
lm_corr4 <- lm(formula = tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP
            , data = people_tenure)

performance::compare_performance(lm_theory, lm_corr1, lm_corr2, lm_corr3, lm_corr4) %>%
  as_tibble() %>%
  select(Name, Model, R2, RMSE) %>%
  arrange(desc(R2)) %>%
  gt()
Name Model R2 RMSE
lm_corr4 lm 0.6110783 3.248215
lm_theory lm 0.6017334 3.287007
lm_corr3 lm 0.5387190 3.537501
lm_corr2 lm 0.5296241 3.572204
lm_corr1 lm 0.5251697 3.589078

Nice, this is good to see. The theory model came in second and was only marginally behind the model that used every variable. An R^2 of .60 is also pretty good for only using 4 variables.

Let’s look at the coefficients to see which ones have the greatest impact on tenure.

Show code
tidy(lm_theory) %>%
  mutate(estimate = round(estimate, digits = 3),
         std.error = round(std.error, digits = 3),
         statistic = round(statistic, digits = 3),
         p.value = round(p.value, digits = 3)) %>%
  gt()
term estimate std.error statistic p.value
(Intercept) 19.966 0.250 79.895 0
year_num 1.221 0.011 115.778 0
age -0.706 0.010 -68.910 0
TAB 0.007 0.000 60.932 0
OPS 1.896 0.109 17.441 0

All of the features are significant predictors of tenure, with year_num being by far the most important. Each additional year of year_num is associated with about 1.22 more years of tenure, and each additional year of age is associated with about 0.7 fewer years. The theory that time in the league, age, opportunity, and performance determine tenure seems to hold up.
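
As a quick sanity check on those coefficients, we can plug a hypothetical stat line into the fitted equation by hand (the player and numbers below are made up for illustration):

# Hand-check of the fitted equation for a hypothetical 23-year-old rookie
# with 600 total at-bats and a .700 OPS (coefficients rounded from the table above)
19.966 + 1.221 * 1 - 0.706 * 23 + 0.007 * 600 + 1.896 * 0.700
# ~10.5 years of predicted tenure

# The same thing using the fitted model directly
predict(lm_theory, newdata = tibble(year_num = 1, age = 23, TAB = 600, OPS = 0.700))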

Let’s check the model for the appropriate assumptions.

Show code
check_model(lm_theory, panel = TRUE)

Build a Simple Blackbox Model

Now it is time to build the simple black-box model. In this case, we are going to use a gradient boosted tree algorithm, xgBoost. I call it simple because on the first go-around we will use the default hyperparameters. There is a great collection of packages called tidymodels that lets us create workflows and quickly swap different machine learning algorithms. To learn more, the Tidy Modeling with R bookdown (TMwR) is a great resource.

First, split the data into train and test datasets.

Show code
set.seed(123)
pt_split <- initial_split(people_tenure, prop = 3/4)
pt_train <- training(pt_split)
pt_test <- testing(pt_split)
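
One nice consequence of the tidymodels interface mentioned above is that trying a different algorithm only means changing the model specification; the fit() call stays the same. A hypothetical sketch (not fit or compared anywhere in this post, and assuming the ranger package is installed):

# Hypothetical illustration only: a random forest spec drops into the same fit() call
rf_sketch <- rand_forest(mode = "regression") %>%
  set_engine("ranger") %>%
  fit(tenure ~ year_num + age + TAB + OPS, data = pt_train)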

We define the two formulas that we are going to be using and fit the models. We then collect the metrics and put them together for easy comparison.

Show code
theory_formula <- formula(tenure ~ year_num + age + TAB + OPS)
all_metrics_formula <- formula(tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP)

lm_theory <- linear_reg() %>%
  fit(theory_formula, data = pt_train)

lm_all <- linear_reg() %>%
  fit(all_metrics_formula, data = pt_train)

xgb_theory <- boost_tree(mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(theory_formula, data = pt_train)

xgb_all <- boost_tree(mode = "regression") %>%
  set_engine("xgboost") %>%
  fit(all_metrics_formula, data = pt_train)

lm_theory_metrics <- lm_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "lm_theory") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

lm_all_metrics <- lm_all %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "lm_all") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

xgb_theory_metrics <- xgb_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "xgb_theory") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

xgb_all_metrics <- xgb_all %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  metrics(truth = tenure, estimate = .pred) %>%
  mutate(model = "xgb_all") %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

lm_theory_metrics %>%
  bind_rows(lm_all_metrics, xgb_theory_metrics, xgb_all_metrics) %>%
  arrange(rmse) %>%
  gt()
model rmse rsq mae
xgb_all 3.204862 0.6282084 2.434345
xgb_theory 3.218378 0.6249189 2.449568
lm_theory 3.315353 0.6019646 2.581725
lm_all 3.315353 0.6019646 2.581725

The xgBoost model using the theory formula and default hyperparameters outperformed the linear model, though not by much. We know the most important variables in the linear model were year_num, age, TAB, and OPS, in that order. Let’s see which variables were most important to the xgBoost theory model.

Show code
library(vip)

xgb_theory %>%
  vip(geom = "col")

Age becomes less important in the xgBoost model and total at-bats becomes more important.

Examine xgBoost Predictions

We know how the model performed on the test data: on average, the xgBoost model was off by about 2.4 years (the MAE). Let’s see how it predicted Ozzie Smith.

Show code
xgb_theory %>% 
  predict(people_tenure) %>%
  bind_cols(people_tenure) %>%
  filter(playerID == 'smithoz01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Ozzie Smith 1978 1 23.03288 668 0.623 19 11.06559
Ozzie Smith 1979 2 24.03288 654 0.522 19 10.88405
Ozzie Smith 1980 3 25.03288 713 0.589 19 11.97203
Ozzie Smith 1981 4 26.03562 508 0.550 19 12.02471
Ozzie Smith 1982 5 27.03562 579 0.653 19 11.93142
Ozzie Smith 1983 6 28.03562 635 0.656 19 12.37301
Ozzie Smith 1984 7 29.03562 489 0.684 19 11.75753
Ozzie Smith 1985 8 30.03836 626 0.716 19 13.14181
Ozzie Smith 1986 9 31.03836 622 0.709 19 13.40460
Ozzie Smith 1987 10 32.03836 709 0.775 19 15.87435
Ozzie Smith 1988 11 33.03836 671 0.686 19 15.32062
Ozzie Smith 1989 12 34.04110 667 0.696 19 15.36666
Ozzie Smith 1990 13 35.04110 596 0.635 19 15.70914
Ozzie Smith 1991 14 36.04110 643 0.747 19 18.33076
Ozzie Smith 1992 15 37.04110 594 0.709 19 17.10520
Ozzie Smith 1993 16 38.04384 604 0.693 19 17.98851
Ozzie Smith 1994 17 39.04384 436 0.675 19 18.13917
Ozzie Smith 1995 18 40.04384 182 0.526 19 18.45537
Ozzie Smith 1996 19 41.04384 261 0.728 19 19.86943

You can see the model is off by quite a bit for Ozzie Smith. It predicts around 11 to 12 years until about year number 8. From the average tenure by decade plot, we know the average tenure is around 9.5 years, so the model was defaulting close to the average. I picked Ozzie not only because he is my favorite player, but because his value was derived from his defense, and this model currently does not account for defensive production. Ozzie Smith won 13 Gold Gloves, meaning he was considered the best fielder at his position for 13 of his 19 years in the MLB. We would expect the model to be off for this type of player, and in the future we will need to account for it.
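
As a rough sketch of the kind of defensive feature we could add later (not used in this post, and assuming the award is labeled "Gold Glove" in the AwardsPlayers table), we could count Gold Gloves won through each season and join that onto our data:

# Hypothetical future feature: cumulative Gold Gloves won through each season
gold_glove_seasons <- Lahman::AwardsPlayers %>%
  filter(awardID == "Gold Glove") %>%
  distinct(playerID, yearID) %>%
  mutate(gold_glove = 1)

people_tenure %>%
  left_join(gold_glove_seasons, by = c("playerID", "yearID")) %>%
  mutate(gold_glove = coalesce(gold_glove, 0)) %>%
  group_by(playerID) %>%
  arrange(yearID, .by_group = TRUE) %>%
  mutate(gold_gloves_to_date = cumsum(gold_glove)) %>%
  ungroup()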

Let’s take a look at a great offensive player and see if the model does better. Unfortunately, the greatest offensive player, Hank Aaron, isn’t in our dataset because he debuted before 1961, but the next best hitter in our dataset might be Tony Gwynn so let’s take a look at him.

Show code
xgb_theory %>% 
  predict(people_tenure) %>%
  bind_cols(people_tenure) %>%
  filter(playerID == 'gwynnto01') %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Tony Gwynn 1982 1 21.66301 209 0.726 20 11.95321
Tony Gwynn 1983 2 22.66301 339 0.727 20 11.77514
Tony Gwynn 1984 3 23.66301 688 0.854 20 15.18588
Tony Gwynn 1985 4 24.66575 675 0.772 20 13.59647
Tony Gwynn 1986 5 25.66575 712 0.848 20 16.38655
Tony Gwynn 1987 6 26.66575 706 0.958 20 16.89046
Tony Gwynn 1988 7 27.66575 591 0.788 20 13.86852
Tony Gwynn 1989 8 28.66849 695 0.813 20 16.08902
Tony Gwynn 1990 9 29.66849 649 0.772 20 16.23166
Tony Gwynn 1991 10 30.66849 577 0.787 20 15.32837
Tony Gwynn 1992 11 31.66849 581 0.786 20 15.56111
Tony Gwynn 1993 12 32.67123 545 0.895 20 17.28474
Tony Gwynn 1994 13 33.67123 491 1.022 20 17.95643
Tony Gwynn 1995 14 34.67123 587 0.888 20 18.29347
Tony Gwynn 1996 15 35.67123 510 0.841 20 18.77420
Tony Gwynn 1997 16 36.67397 663 0.956 20 19.80710
Tony Gwynn 1998 17 37.67397 511 0.865 20 19.59279
Tony Gwynn 1999 18 38.67397 451 0.858 20 19.84833
Tony Gwynn 2000 19 39.67397 142 0.805 20 20.26111
Tony Gwynn 2001 20 40.67671 113 0.845 20 21.19039

It’s good to see the model start higher for Gwynn’s rookie year and even predict 15 years of tenure by year number 3. The model then gets closer and closer to the right answer as the years pass.

Now, let’s do the fun part and see how the model predicts some current MLB players early in their careers. The three I am going to look at are young phenoms: Ronald Acuna, Fernando Tatis, & Juan Soto.

Show code
young_phenoms <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter(playerID %in% c('acunaro01', 'tatisfe02', 'sotoju01')) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  filter(yearID == 2019) %>%              # keep only the 2019 season
  group_by(playerID) %>%
  mutate(tenure = n()) %>%
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%  # match the decade definition used earlier
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF))     # remove unneeded columns

xgb_theory %>% 
  predict(young_phenoms) %>%
  bind_cols(young_phenoms) %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()
nameFirst nameLast yearID year_num age TAB OPS tenure .pred
Ronald Acuna 2019 2 21.05205 716 0.883 1 17.25071
Juan Soto 2019 2 20.20000 662 0.949 1 16.44127
Fernando Tatis 2019 1 20.01096 373 0.969 1 15.79035

Based on 2019 data, the model expects Ronald Acuna to be in the league the longest, but they are all very close to each other. Again, this is good to see. Most experts would predict these young players to be in the league for a very long time, and 16 years is a long time.

Now let’s look at rookies who debuted in 2018, 2019, or 2020 and predict their tenure using their most recent (2020) season. 2020 was only a 60-game season due to the COVID-19 pandemic, so we will extrapolate total at-bats from the 60-game season to a regular 162-game season.

Show code
rooks <- People %>%
  select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
  filter( year(ymd(debut)) %in% c(2018, 2019, 2020)) %>%
  inner_join(Batting) %>%
  group_by(playerID, yearID) %>%
  summarise(nameFirst = first(nameFirst),
            nameLast = first(nameLast),
            birthDate = first(birthDate),
            debut = first(debut),
            finalGame = first(finalGame),
            stint = first(stint),
            teamID = first(teamID),
            lgID = first(lgID),
            G = sum(G),
            AB = sum(AB),
            R = sum(R),
            H = sum(H),
            X2B = sum(X2B),
            X3B = sum(X3B),
            HR = sum(HR),
            RBI = sum(RBI),
            SB = sum(SB),
            CS = sum(CS),
            BB = sum(BB),
            SO = sum(SO),
            IBB = sum(IBB),
            HBP = sum(HBP),
            SH = sum(SH),
            SF = sum(SF),
            GIDP = sum(GIDP),
            .groups = "drop") %>%
  Lahman::battingStats(data = .) %>%
  group_by(playerID, stint) %>%
  mutate(year_num = row_number()) %>%
  ungroup() %>%
  group_by(playerID) %>%
  filter(yearID == max(yearID)) %>%       # keep each player's most recent season
  ungroup() %>%
  mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
  drop_na() %>%
  mutate(decade = yearID - (yearID %% 10)) %>%  # match the decade definition used earlier
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats
  select(-c(BB, IBB, HBP, SH, SF)) %>% # remove unneeded columns
  mutate(TAB = (TAB/60)*162)           # scale TAB from a 60-game pace to a 162-game pace

xgb_theory %>% 
  predict(rooks) %>%
  bind_cols(rooks) %>%
  select(nameFirst, nameLast, year_num, age, TAB, OPS, .pred) %>%
  arrange(desc(.pred)) %>%
  head(10) %>%
  gt()
nameFirst nameLast year_num age TAB OPS .pred
Juan Soto 3 21.20000 561.6 1.185 17.60920
Fernando Tatis 2 21.01096 696.6 0.937 16.89737
Luis Garcia 1 19.64110 375.3 0.668 16.45621
Ronald Acuna 3 22.05205 550.8 0.987 15.95809
Vladimir Guerrero 2 20.81096 658.8 0.791 15.93926
Kyle Tucker 3 22.96986 621.0 0.837 14.69650
Eloy Jimenez 2 23.10959 610.2 0.891 13.75860
Franmil Reyes 3 24.50411 650.7 0.794 13.58006
Bo Bichette 2 21.84110 348.3 0.840 13.55870
Trent Grisham 2 23.18082 677.7 0.808 13.43157

All of those young players are among the best in the league, and we would expect them to be near the top of the list when predicting tenure.

Next Steps

Now that we have a base model, we can start to improve it. The next steps are:

  1. Analyze the cases where our model was most wrong and look for patterns
  2. Based on step one, introduce new features to correct the large errors
  3. Finally, tune the hyperparameters for our model.

All those steps will be covered in the next blog post.
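
As a small preview of step 1, here is a minimal sketch of pulling the largest test-set errors from the xgBoost theory model:

# A minimal sketch of step 1: inspect the largest test-set residuals
xgb_theory %>%
  predict(pt_test) %>%
  bind_cols(pt_test) %>%
  mutate(resid = tenure - .pred) %>%
  arrange(desc(abs(resid))) %>%
  select(playerID, yearID, year_num, age, TAB, OPS, tenure, .pred, resid) %>%
  head(10)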

Footnotes

    Corrections

    If you see mistakes or want to suggest changes, please create an issue on the source repository.

    Citation

    For attribution, please cite this work as

    Oberdiear (2021, June 24). Louis Oberdiear: Predicting Customer Tenure...with MLB Hitters. Retrieved from https://thelob.blog/posts/2021-07-08-predicting-customer-tenurewith-mlb-hitters/

    BibTeX citation

    @misc{oberdiear2021predicting,
      author = {Oberdiear, Louis},
      title = {Louis Oberdiear: Predicting Customer Tenure...with MLB Hitters},
      url = {https://thelob.blog/posts/2021-07-08-predicting-customer-tenurewith-mlb-hitters/},
      year = {2021}
    }