How long will a customer stay with our business or, in this case, how long will an MLB hitter stay in the MLB? In this post, I go through the value of knowing how long a customer will be with an organization, EDA of the MLB data, how to clean and set up the data for modeling, and how to fit a linear model and a gradient boosted tree model with xgBoost.
I believe being able to know how long a customer will be with your business is inherently valuable, but why exactly?
Let’s pose a few scenarios. The first scenario is that we know Customer A will only be a customer with us for two years. In this scenario, customers stay with our business on average for ten years so two years is relatively short. Is it valuable to know that Customer A won’t be with us very long? What can we do about it?
Well, in my mind, we have two different paths: the Abandonment path and the Retainment path. Knowing Customer A will only be with us for two years, we can simply cut our losses and not put any more money into Customer A, whether that be through advertising or discounts. This would be the Abandonment path.
On the other hand, we could try and figure out different ways to extend their stay with us through different interventions. This would be the Retainment path. We could try several different interventions and measure their effectiveness and their return on investment.
What about the other end of the spectrum, where we know a customer will be with us for 15 years? This is Customer B. Is it valuable to know we have a potentially very loyal customer?
Again, I see two paths of treatment. The first path is the Stay the Course path where we treat them exactly how we would normally treat them. We know they will be with us for a long time so why change anything?
The second path is the Up Sell path, where we give them more attention and try to get our hooks in deeper by offering them more and more products. We might even give them special discounts and reward them for being a loyal customer, hopefully turning Customer B into a brand ambassador who recommends our organization to their friends and family.
I think looking at these two extreme scenarios shows how knowing customer tenure is valuable, but I have another reason in mind. We are going to predict customer tenure to calculate Customer Lifetime Value.
There are many ways to calculate Customer Lifetime Value, and different formulas operate at different organizational levels. For example, some formulas calculate an overall Customer Lifetime Value, which tells you the value of an average customer. Knowing the value of an average customer makes sense if an organization is trying to calculate the return on investment of different customer acquisition and retainment efforts.
That’s not my goal here, however. I want to know how valuable a customer is to the company on an individual basis. Some customers are more valuable than others and I want to be able to identify them. The main goal here is to be able to identify early on the potential lifetime value of a customer and then use the treatments mentioned above.
I have seen some customer-level CLV calculations simply multiply the average yearly monetary value by the average customer tenure. This works well as a rough estimate, but I want to be more precise, and a more precise tenure estimate will lead to more precise CLV estimates.
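As a toy illustration of that rough formula (the numbers below are hypothetical, not from any dataset in this post):

# Rough customer-level CLV: average yearly value times average tenure
avg_yearly_value <- 500   # hypothetical: dollars per year
avg_tenure       <- 10    # hypothetical: years
avg_yearly_value * avg_tenure   # rough CLV of 5000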
In this example, our customers are going to be MLB hitters. We are going to be predicting how many years a hitter stays in the MLB. Then in the next blog post, we will use the predicted tenure to estimate the lifetime value for hitters.
Here are the packages used for this post:
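The setup chunk isn't shown, but based on the functions called throughout the post, the library calls likely look something like this:

library(tidyverse)    # dplyr, tidyr, ggplot2, etc.
library(lubridate)    # ymd() and other date helpers
library(Lahman)       # MLB datasets
library(gt)           # display tables
library(DataExplorer) # quick EDA plots
library(GGally)       # ggpairs()
library(performance)  # compare_performance(), check_model()
library(tidymodels)   # rsample, parsnip, yardstick, ...
library(xgboost)      # engine for boost_tree()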
We’ll again be using the R package Lahman. The Lahman package contains multiple datasets; here is a description of each dataset in the package:
library(Lahman)
Lahman::LahmanData %>%
as_tibble() %>%
gt()
file | class | nobs | nvar | title |
---|---|---|---|---|
AllstarFull | data.frame | 4993 | 8 | AllstarFull table |
Appearances | data.frame | 99466 | 21 | Appearances table |
AwardsManagers | data.frame | 171 | 6 | AwardsManagers table |
AwardsPlayers | data.frame | 6026 | 6 | AwardsPlayers table |
AwardsShareManagers | data.frame | 401 | 7 | AwardsShareManagers table |
AwardsSharePlayers | data.frame | 6705 | 7 | AwardsSharePlayers table |
Batting | data.frame | 99846 | 22 | Batting table |
BattingPost | data.frame | 11294 | 22 | BattingPost table |
CollegePlaying | data.frame | 17350 | 3 | CollegePlaying table |
Fielding | data.frame | 167938 | 18 | Fielding table |
FieldingOF | data.frame | 12028 | 6 | FieldingOF table |
FieldingPost | data.frame | 11924 | 17 | FieldingPost data |
HallOfFame | data.frame | 4088 | 9 | Hall of Fame Voting Data |
Managers | data.frame | 3370 | 10 | Managers table |
ManagersHalf | data.frame | 93 | 10 | ManagersHalf table |
Master | data.frame | 18589 | 26 | Master table |
Pitching | data.frame | 43330 | 30 | Pitching table |
PitchingPost | data.frame | 4945 | 30 | PitchingPost table |
Salaries | data.frame | 24758 | 5 | Salaries table |
Schools | data.frame | 1207 | 5 | Schools table |
SeriesPost | data.frame | 298 | 9 | SeriesPost table |
Teams | data.frame | 2775 | 48 | Teams table |
TeamsFranchises | data.frame | 120 | 4 | TeamFranchises table |
TeamsHalf | data.frame | 52 | 10 | TeamsHalf table |
The People dataset looks like a good place to start.
data(People)
glimpse(People)
Rows: 20,093
Columns: 26
$ playerID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ birthYear <int> 1981, 1934, 1939, 1954, 1972, 1985, 1850, 1877,~
$ birthMonth <int> 12, 2, 8, 9, 8, 12, 11, 4, 11, 10, 3, 10, 2, 8,~
$ birthDay <int> 27, 5, 5, 8, 25, 17, 4, 15, 11, 14, 16, 22, 16,~
$ birthCountry <chr> "USA", "USA", "USA", "USA", "USA", "D.R.", "USA~
$ birthState <chr> "CO", "AL", "AL", "CA", "FL", "La Romana", "PA"~
$ birthCity <chr> "Denver", "Mobile", "Mobile", "Orange", "Palm B~
$ deathYear <int> NA, 2021, 1984, NA, NA, NA, 1905, 1957, 1962, 1~
$ deathMonth <int> NA, 1, 8, NA, NA, NA, 5, 1, 6, 4, 2, 6, NA, NA,~
$ deathDay <int> NA, 22, 16, NA, NA, NA, 17, 6, 11, 27, 13, 11, ~
$ deathCountry <chr> NA, "USA", "USA", NA, NA, NA, "USA", "USA", "US~
$ deathState <chr> NA, "GA", "GA", NA, NA, NA, "NJ", "FL", "VT", "~
$ deathCity <chr> NA, "Atlanta", "Atlanta", NA, NA, NA, "Pemberto~
$ nameFirst <chr> "David", "Hank", "Tommie", "Don", "Andy", "Fern~
$ nameLast <chr> "Aardsma", "Aaron", "Aaron", "Aase", "Abad", "A~
$ nameGiven <chr> "David Allan", "Henry Louis", "Tommie Lee", "Do~
$ weight <int> 215, 180, 190, 190, 184, 235, 192, 170, 175, 16~
$ height <int> 75, 72, 75, 75, 73, 74, 72, 71, 71, 68, 71, 70,~
$ bats <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, R, L, R,~
$ throws <fct> R, R, R, R, L, L, R, R, R, L, R, R, R, L, L, R,~
$ debut <chr> "2004-04-06", "1954-04-13", "1962-04-10", "1977~
$ finalGame <chr> "2015-08-23", "1976-10-03", "1971-09-26", "1990~
$ retroID <chr> "aardd001", "aaroh101", "aarot101", "aased001",~
$ bbrefID <chr> "aardsda01", "aaronha01", "aaronto01", "aasedo0~
$ deathDate <date> NA, 2021-01-22, 1984-08-16, NA, NA, NA, 1905-0~
$ birthDate <date> 1981-12-27, 1934-02-05, 1939-08-05, 1954-09-08~
We can use their debut, finalGame, and birthDate. This will give us the ability to calculate year number in the league (e.g. 1st year in the league, 2nd, etc.), their age, and their final tenure.
We are going to be predicting hitters so we will use the Batting dataset.
data(Batting)
glimpse(Batting)
Rows: 108,789
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", ~
$ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 187~
$ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ teamID <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, C~
$ lgID <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, ~
$ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36~
$ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24~
$ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 3~
$ X2B <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, ~
$ X3B <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0,~
$ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, ~
$ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21~
$ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0,~
$ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, ~
$ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0,~
$ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, ~
$ IBB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SH <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SF <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ GIDP <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, ~
Now let’s combine the People and the Batting datasets.
pitchers <- Lahman::Pitching %>%
select(playerID) %>%
distinct()
old_players <- People %>%
filter(ymd(debut) < '1961-01-01')
people_tenure <- People %>%
select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
filter(!is.na(finalGame)) %>%
filter(ymd(finalGame) < '2018-01-01') %>%
anti_join(old_players) %>%
inner_join(Batting) %>%
group_by(playerID, yearID) %>%
summarise(nameFirst = first(nameFirst),
nameLast = first(nameLast),
birthDate = first(birthDate),
debut = first(debut),
finalGame = first(finalGame),
stint = first(stint),
teamID = first(teamID),
lgID = first(lgID),
G = sum(G),
AB = sum(AB),
R = sum(R),
H = sum(H),
X2B = sum(X2B),
X3B = sum(X3B),
HR = sum(HR),
RBI = sum(RBI),
SB = sum(SB),
CS = sum(CS),
BB = sum(BB),
SO = sum(SO),
IBB = sum(IBB),
HBP = sum(HBP),
SH = sum(SH),
SF = sum(SF),
GIDP = sum(GIDP),
.groups = "drop") %>%
anti_join(pitchers) %>%
filter(lgID %in% c('NL', 'AL')) %>%
Lahman::battingStats(data = .) %>%
group_by(playerID, stint) %>%
mutate(year_num = row_number()) %>%
ungroup() %>%
group_by(playerID) %>%
mutate(tenure = n()) %>%
ungroup() %>%
mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
drop_na() %>%
mutate(decade = yearID - (yearID %% 10)) %>%
mutate(TBB = BB + IBB + HBP, # combine all walk types
TAB = AB + TBB + SH + SF) %>% # create total at-bats
select(-c(BB, IBB, HBP, SH, SF)) # remove unneeded columns
Let me walk through the code above and explain what is happening and my thought process.
First, we need to identify the pitchers in our Batting dataset and get rid of them.
pitchers <- Lahman::Pitching %>%
select(playerID) %>%
distinct()
Next, I want to remove players from older eras.
old_players <- People %>%
filter(ymd(debut) < '1961-01-01')
Here is a good quote from The Sport Journal explaining the MLB eras:
A common list presented at Baseball-Reference described the eras as the Dead Ball Era (1901-1919), the Live Ball Era (1920-1941), the Integration Era (1942-1960), the Expansion Era (1961-1976), the Free Agency Era (1977-1993) and the Long Ball/Steroid Era (1994-2005) (17). This study runs through the 2011 season and a seventh era will be added and labeled the Post Steroid Era (2006-2011).
I removed anyone who debuted before the Expansion Era (1961). Average tenure might change depending on the decade so this is something we should investigate in the EDA stage.
I then removed anyone who doesn’t have a final game, as well as anyone whose final game came in 2018 or later. I want players who are not currently playing and have finished their MLB careers. This doesn’t guarantee that, but it comes close enough.
filter(!is.na(finalGame)) %>%
filter(ymd(finalGame) < '2018-01-01')
The next step is to make sure all players have only one entry per year. Some players have two (or more) entries in a year because they were traded, or waived and then picked up by another team. I don’t care about this, so I am going to aggregate their stats to the year level.
group_by(playerID, yearID) %>%
summarise(nameFirst = first(nameFirst),
nameLast = first(nameLast),
birthDate = first(birthDate),
debut = first(debut),
finalGame = first(finalGame),
stint = first(stint),
teamID = first(teamID),
lgID = first(lgID),
G = sum(G),
AB = sum(AB),
R = sum(R),
H = sum(H),
X2B = sum(X2B),
X3B = sum(X3B),
HR = sum(HR),
RBI = sum(RBI),
SB = sum(SB),
CS = sum(CS),
BB = sum(BB),
SO = sum(SO),
IBB = sum(IBB),
HBP = sum(HBP),
SH = sum(SH),
SF = sum(SF),
GIDP = sum(GIDP),
.groups = "drop")
The next steps:
1. Only include players from the NL and AL
filter(lgID %in% c('NL', 'AL'))
2. Calculate derived batting stats (BA, OBP, SlugPct, OPS, etc.)
Lahman::battingStats(data = .)
3. Add the year number in the league for each player
group_by(playerID, stint) %>%
mutate(year_num = row_number())
4. Calculate each player's tenure (the number of seasons they played)
group_by(playerID) %>%
mutate(tenure = n())
5. Calculate the player's age at the start of each season
mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365))
6. Create a decade variable from the season year
mutate(decade = yearID - (yearID %% 10))
7. Combine all walk types and create a total at-bats variable
mutate(TBB = BB + IBB + HBP, # combine all walk types
TAB = AB + TBB + SH + SF) # create total at-bats
All of this together gives us data for batters only that debuted after 1961, with one row for each year a batter was in the league. To illustrate the structure, let’s take a look at my favorite player of all time and the greatest shortstop to ever play baseball: Ozzie Smith. Having each row be a year a player played in the league lets us teach a model what a rookie who ultimately played 19 years in the league looks like. Ozzie debuted at age 23, got a ton of at-bats (668), and posted what is likely a below-average OPS.
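The table below was presumably built with something like the following (a sketch, not the original chunk):

people_tenure %>%
  filter(nameFirst == "Ozzie", nameLast == "Smith") %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure) %>%
  gt()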
nameFirst | nameLast | yearID | year_num | age | TAB | OPS | tenure |
---|---|---|---|---|---|---|---|
Ozzie | Smith | 1978 | 1 | 23.03288 | 668 | 0.623 | 19 |
Ozzie | Smith | 1979 | 2 | 24.03288 | 654 | 0.522 | 19 |
Ozzie | Smith | 1980 | 3 | 25.03288 | 713 | 0.589 | 19 |
Ozzie | Smith | 1981 | 4 | 26.03562 | 508 | 0.550 | 19 |
Ozzie | Smith | 1982 | 5 | 27.03562 | 579 | 0.653 | 19 |
Ozzie | Smith | 1983 | 6 | 28.03562 | 635 | 0.656 | 19 |
Ozzie | Smith | 1984 | 7 | 29.03562 | 489 | 0.684 | 19 |
Ozzie | Smith | 1985 | 8 | 30.03836 | 626 | 0.716 | 19 |
Ozzie | Smith | 1986 | 9 | 31.03836 | 622 | 0.709 | 19 |
Ozzie | Smith | 1987 | 10 | 32.03836 | 709 | 0.775 | 19 |
Ozzie | Smith | 1988 | 11 | 33.03836 | 671 | 0.686 | 19 |
Ozzie | Smith | 1989 | 12 | 34.04110 | 667 | 0.696 | 19 |
Ozzie | Smith | 1990 | 13 | 35.04110 | 596 | 0.635 | 19 |
Ozzie | Smith | 1991 | 14 | 36.04110 | 643 | 0.747 | 19 |
Ozzie | Smith | 1992 | 15 | 37.04110 | 594 | 0.709 | 19 |
Ozzie | Smith | 1993 | 16 | 38.04384 | 604 | 0.693 | 19 |
Ozzie | Smith | 1994 | 17 | 39.04384 | 436 | 0.675 | 19 |
Ozzie | Smith | 1995 | 18 | 40.04384 | 182 | 0.526 | 19 |
Ozzie | Smith | 1996 | 19 | 41.04384 | 261 | 0.728 | 19 |
Always look at the distributions of your variables.
DataExplorer::plot_histogram(people_tenure)
Nothing jumps out to me. We have some outliers, but we will check their effect after modeling.
Next, let’s check out how the variables interact with each other. A good way to investigate this is through a correlation matrix. If you’re modeling with a linear model, you need to be concerned with how correlated the variables are with each other.
The corrplot package has a great way to visualize this and, on top of that, will cluster your variables together using different clustering techniques. Here I chose a hierarchical clustering method with 5 clusters.
library(corrplot)
cor_matrix <- people_tenure %>%
select(-c(playerID, yearID, nameFirst, nameLast, birthDate,
debut, finalGame, stint, teamID, lgID)) %>%
select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
BA, SlugPct, OBP, OPS, BABIP)) %>%
cor()
corrplot(cor_matrix, order = "hclust", addrect = 5)
What’s cool about this is how we can see we have five different types of variables. In the left corner, we have the ‘Time’ stats: year_num & age. We then have the ‘Rate’ stats: BA, OBP, SlugPct, and OPS. Triples get grouped on their own because they happen so infrequently. Our target variable is by itself as well, and then we have the ‘Counting’ stats in the bottom right: hits, doubles, HRs, SOs, walks, at-bats, & games. Pretty neat to see them group like that.
Let’s see how the variables specifically correlate with our target variable Tenure.
library(corrr)
people_tenure %>%
select(-c(playerID, yearID, nameFirst, nameLast, birthDate,
debut, finalGame, stint, teamID, lgID)) %>%
select(c(year_num, tenure, age, G, TAB, TBB, TB, SO, H, X2B, X3B, HR,
SlugPct, OBP, OPS, BABIP)) %>%
correlate() %>%
stretch() %>%
filter(x == "tenure") %>%
arrange(desc(r)) %>%
gt()
x | y | r |
---|---|---|
tenure | year_num | 0.6174973 |
tenure | TAB | 0.5603746 |
tenure | TB | 0.5591402 |
tenure | H | 0.5585071 |
tenure | G | 0.5406922 |
tenure | TBB | 0.5334992 |
tenure | X2B | 0.5233772 |
tenure | HR | 0.4550314 |
tenure | SO | 0.4323623 |
tenure | age | 0.3500512 |
tenure | X3B | 0.3350727 |
tenure | OPS | 0.3186440 |
tenure | SlugPct | 0.3057516 |
tenure | OBP | 0.2930648 |
tenure | BABIP | 0.1349052 |
tenure | tenure | NA |
Interesting. We can use these correlations to help determine which variables to include in the models.
Let’s take a look at how the decade relates to tenure to wrap up the EDA section.
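The plot’s code chunk isn’t shown; a minimal sketch of computing and plotting average tenure by decade (assuming that is roughly what was plotted) would be:

people_tenure %>%
  distinct(playerID, decade, tenure) %>%      # one row per player per decade
  group_by(decade) %>%
  summarise(avg_tenure = mean(tenure), .groups = "drop") %>%
  ggplot(aes(x = factor(decade), y = avg_tenure)) +
  geom_col() +
  labs(x = "Decade", y = "Average tenure (years)")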
There are slight differences in tenure at the tail ends.
I follow a rough outline when I model: build a simple linear baseline model first, then fit a black-box model with default hyper-parameters, and only then move on to tuning and feature engineering.
The reason to build a simple linear baseline model is to have an idea of how much the model improves with the use of new algorithms, hyper-parameter tuning, and future feature engineering. It also gives you an idea of how the variables relate to the target and the size of their effects.
I recommend making the first model by using the features you think are the most important. Include the bare minimum number of features.
In this case of trying to predict MLB tenure, I think the most important features are year_num, age, TAB (total at-bats), and OPS (on-base percentage + slugging percentage). I think year_num is going to be important because the longer you have been in the league, the longer your tenure will be. Age will be important because the younger you are, the longer you can stay in the league. Total at-bats represent opportunity: if a player is getting opportunity, it either means the manager believes in the player or they are performing well. OPS represents performance. OPS is the sum of on-base percentage and slugging percentage. On-base percentage is the rate of getting on base, without consideration of how many bases a hit resulted in. Slugging percentage is total bases divided by at-bats, so it is higher the more bases a batter gets; home run hitters will have a higher slugging percentage than singles hitters. OPS is nice because it accounts for how often a batter gets on base and their total bases, all in one metric.
The first model will be based on theory but I am also going to build a few other models using the variables that have the highest correlations to tenure.
Here’s the formula for the theory model: tenure ~ year_num + age + TAB + OPS
Since we have isolated the target and the features we are most interested in, we can visualize the relationships between all of the variables using a pair plot. The GGally package has a function, ggpairs, that produces pair plots and is a great way to explore relationships and distributions.
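A minimal sketch of that pair plot for the theory-model variables (not the original chunk) would be:

library(GGally)

people_tenure %>%
  select(tenure, year_num, age, TAB, OPS) %>%
  ggpairs()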
Let’s fit the models and compare their performances.
lm_theory <- lm(formula = tenure ~ year_num + age + TAB + OPS
, data = people_tenure)
lm_corr1 <- lm(formula = tenure ~ year_num + TAB + OPS
, data = people_tenure)
lm_corr2 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G
, data = people_tenure)
lm_corr3 <- lm(formula = tenure ~ year_num + TAB + OPS + TB + H + G + TBB + X2B + HR + SO
, data = people_tenure)
lm_corr4 <- lm(formula = tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP
, data = people_tenure)
performance::compare_performance(lm_theory, lm_corr1, lm_corr2, lm_corr3, lm_corr4) %>%
as_tibble() %>%
select(Name, Model, R2, RMSE) %>%
arrange(desc(R2)) %>%
gt()
Name | Model | R2 | RMSE |
---|---|---|---|
lm_corr4 | lm | 0.6110783 | 3.248215 |
lm_theory | lm | 0.6017334 | 3.287007 |
lm_corr3 | lm | 0.5387190 | 3.537501 |
lm_corr2 | lm | 0.5296241 | 3.572204 |
lm_corr1 | lm | 0.5251697 | 3.589078 |
Nice, this is good to see. The theory model came in second and was only marginally behind the model that used every variable. An R^2 of .60 is also pretty good for only using 4 variables.
Let’s look at the coefficients to see which ones have the greatest impact on tenure.
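A sketch of how the coefficient table below could be produced with broom (the original chunk isn’t shown):

library(broom)

tidy(lm_theory) %>%
  gt()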
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 19.966 | 0.250 | 79.895 | 0 |
year_num | 1.221 | 0.011 | 115.778 | 0 |
age | -0.706 | 0.010 | -68.910 | 0 |
TAB | 0.007 | 0.000 | 60.932 | 0 |
OPS | 1.896 | 0.109 | 17.441 | 0 |
All of the features are significant predictors of tenure, with year_num being by far the most important. Every one-year increase in year_num is associated with 1.22 more years of tenure, and every one-year increase in age is associated with 0.7 fewer years of tenure. The theory that tenure is determined by how long a player has been in the league, their age, their opportunity, and their performance seems to hold up.
Let’s check the model for the appropriate assumptions.
check_model(lm_theory, panel = TRUE)
Now is the time to build the simple black-box model. In this case, we are going to use the gradient boosted tree algorithm, xgBoost. I call it simple because on the first go-around we will be using the default hyper-parameters. There is a great package called tidymodels that allows us to create workflows and quickly swap in different machine learning algorithms. To learn more, here is a great resource: the Tidy Modeling with R (TMwR) bookdown.
First, split the data into train and test datasets.
set.seed(123)
pt_split <- initial_split(people_tenure, prop = 3/4)
pt_train <- training(pt_split)
pt_test <- testing(pt_split)
We define the two formulas that we are going to be using and fit the models. We then collect the metrics and put them together for easy comparison.
theory_formula <- formula(tenure ~ year_num + age + TAB + OPS)
all_metrics_formula <- formula(tenure ~ year_num + age + TAB + OPS + TB + H + G + TBB + X2B + HR + SO + X3B + BABIP)
lm_theory <- linear_reg() %>%
fit(theory_formula, data = pt_train)
lm_all <- linear_reg() %>%
fit(all_metrics_formula, data = pt_train)
xgb_theory <- boost_tree(mode = "regression") %>%
set_engine("xgboost") %>%
fit(theory_formula, data = pt_train)
xgb_all <- boost_tree(mode = "regression") %>%
set_engine("xgboost") %>%
fit(all_metrics_formula, data = pt_train)
lm_theory_metrics <- lm_theory %>%
predict(pt_test) %>%
bind_cols(pt_test) %>%
metrics(truth = tenure, estimate = .pred) %>%
mutate(model = "lm_theory") %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
lm_all_metrics <- lm_all %>%
predict(pt_test) %>%
bind_cols(pt_test) %>%
metrics(truth = tenure, estimate = .pred) %>%
mutate(model = "lm_all") %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
xgb_theory_metrics <- xgb_theory %>%
predict(pt_test) %>%
bind_cols(pt_test) %>%
metrics(truth = tenure, estimate = .pred) %>%
mutate(model = "xgb_theory") %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
xgb_all_metrics <- xgb_all %>%
predict(pt_test) %>%
bind_cols(pt_test) %>%
metrics(truth = tenure, estimate = .pred) %>%
mutate(model = "xgb_all") %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
lm_theory_metrics %>%
bind_rows(lm_all_metrics, xgb_theory_metrics, xgb_all_metrics) %>%
arrange(rmse) %>%
gt()
model | rmse | rsq | mae |
---|---|---|---|
xgb_all | 3.204862 | 0.6282084 | 2.434345 |
xgb_theory | 3.218378 | 0.6249189 | 2.449568 |
lm_theory | 3.315353 | 0.6019646 | 2.581725 |
lm_all | 3.315353 | 0.6019646 | 2.581725 |
The xgBoost model using the theory formula and default hyper-parameters outperformed the linear model, though not by much. We know the most important variables in the linear model were year_num, age, TAB, and OPS, in that order. Let’s see which variables were the most important to the xgBoost theory model.
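The importance plot’s code chunk isn’t shown; a minimal sketch using the vip package (an assumption, not necessarily the original approach) could be:

library(vip)

# Pull the underlying xgb.Booster out of the parsnip fit and plot
# gain-based variable importance
vip(xgb_theory$fit)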
Age becomes less important in the xgBoost model and total at-bats becomes more important.
We know how the model performed on the test data: on average, the xgBoost model was off by +/- 2.44 years. Let’s see how it predicted Ozzie Smith.
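The table below can be reproduced with something along these lines (a sketch following the same prediction pattern used later in the post):

ozzie <- people_tenure %>%
  filter(nameFirst == "Ozzie", nameLast == "Smith")

xgb_theory %>%
  predict(ozzie) %>%
  bind_cols(ozzie) %>%
  select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
  gt()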
nameFirst | nameLast | yearID | year_num | age | TAB | OPS | tenure | .pred |
---|---|---|---|---|---|---|---|---|
Ozzie | Smith | 1978 | 1 | 23.03288 | 668 | 0.623 | 19 | 11.06559 |
Ozzie | Smith | 1979 | 2 | 24.03288 | 654 | 0.522 | 19 | 10.88405 |
Ozzie | Smith | 1980 | 3 | 25.03288 | 713 | 0.589 | 19 | 11.97203 |
Ozzie | Smith | 1981 | 4 | 26.03562 | 508 | 0.550 | 19 | 12.02471 |
Ozzie | Smith | 1982 | 5 | 27.03562 | 579 | 0.653 | 19 | 11.93142 |
Ozzie | Smith | 1983 | 6 | 28.03562 | 635 | 0.656 | 19 | 12.37301 |
Ozzie | Smith | 1984 | 7 | 29.03562 | 489 | 0.684 | 19 | 11.75753 |
Ozzie | Smith | 1985 | 8 | 30.03836 | 626 | 0.716 | 19 | 13.14181 |
Ozzie | Smith | 1986 | 9 | 31.03836 | 622 | 0.709 | 19 | 13.40460 |
Ozzie | Smith | 1987 | 10 | 32.03836 | 709 | 0.775 | 19 | 15.87435 |
Ozzie | Smith | 1988 | 11 | 33.03836 | 671 | 0.686 | 19 | 15.32062 |
Ozzie | Smith | 1989 | 12 | 34.04110 | 667 | 0.696 | 19 | 15.36666 |
Ozzie | Smith | 1990 | 13 | 35.04110 | 596 | 0.635 | 19 | 15.70914 |
Ozzie | Smith | 1991 | 14 | 36.04110 | 643 | 0.747 | 19 | 18.33076 |
Ozzie | Smith | 1992 | 15 | 37.04110 | 594 | 0.709 | 19 | 17.10520 |
Ozzie | Smith | 1993 | 16 | 38.04384 | 604 | 0.693 | 19 | 17.98851 |
Ozzie | Smith | 1994 | 17 | 39.04384 | 436 | 0.675 | 19 | 18.13917 |
Ozzie | Smith | 1995 | 18 | 40.04384 | 182 | 0.526 | 19 | 18.45537 |
Ozzie | Smith | 1996 | 19 | 41.04384 | 261 | 0.728 | 19 | 19.86943 |
You can see the model is off by quite a bit for Ozzie Smith. The model predicts around 11 years of tenure until year number 8. From the average tenure by decade plot, we know the average tenure is around 9.5 years, so the model was defaulting close to the average. I picked Ozzie not only because he is my favorite player, but because his value was derived from his defense, and this model currently does not account for defensive production. Ozzie Smith won 13 Gold Gloves, meaning he was considered the best fielder at his position for 13 of his 19 years in the MLB. We would expect the model to be off for this type of player. In the future, we will need to account for this.
Let’s take a look at a great offensive player and see if the model does better. Unfortunately, the greatest offensive player, Hank Aaron, isn’t in our dataset because he debuted before 1961, but the next best hitter in our dataset might be Tony Gwynn so let’s take a look at him.
nameFirst | nameLast | yearID | year_num | age | TAB | OPS | tenure | .pred |
---|---|---|---|---|---|---|---|---|
Tony | Gwynn | 1982 | 1 | 21.66301 | 209 | 0.726 | 20 | 11.95321 |
Tony | Gwynn | 1983 | 2 | 22.66301 | 339 | 0.727 | 20 | 11.77514 |
Tony | Gwynn | 1984 | 3 | 23.66301 | 688 | 0.854 | 20 | 15.18588 |
Tony | Gwynn | 1985 | 4 | 24.66575 | 675 | 0.772 | 20 | 13.59647 |
Tony | Gwynn | 1986 | 5 | 25.66575 | 712 | 0.848 | 20 | 16.38655 |
Tony | Gwynn | 1987 | 6 | 26.66575 | 706 | 0.958 | 20 | 16.89046 |
Tony | Gwynn | 1988 | 7 | 27.66575 | 591 | 0.788 | 20 | 13.86852 |
Tony | Gwynn | 1989 | 8 | 28.66849 | 695 | 0.813 | 20 | 16.08902 |
Tony | Gwynn | 1990 | 9 | 29.66849 | 649 | 0.772 | 20 | 16.23166 |
Tony | Gwynn | 1991 | 10 | 30.66849 | 577 | 0.787 | 20 | 15.32837 |
Tony | Gwynn | 1992 | 11 | 31.66849 | 581 | 0.786 | 20 | 15.56111 |
Tony | Gwynn | 1993 | 12 | 32.67123 | 545 | 0.895 | 20 | 17.28474 |
Tony | Gwynn | 1994 | 13 | 33.67123 | 491 | 1.022 | 20 | 17.95643 |
Tony | Gwynn | 1995 | 14 | 34.67123 | 587 | 0.888 | 20 | 18.29347 |
Tony | Gwynn | 1996 | 15 | 35.67123 | 510 | 0.841 | 20 | 18.77420 |
Tony | Gwynn | 1997 | 16 | 36.67397 | 663 | 0.956 | 20 | 19.80710 |
Tony | Gwynn | 1998 | 17 | 37.67397 | 511 | 0.865 | 20 | 19.59279 |
Tony | Gwynn | 1999 | 18 | 38.67397 | 451 | 0.858 | 20 | 19.84833 |
Tony | Gwynn | 2000 | 19 | 39.67397 | 142 | 0.805 | 20 | 20.26111 |
Tony | Gwynn | 2001 | 20 | 40.67671 | 113 | 0.845 | 20 | 21.19039 |
It’s good to see the model start higher for Gwynn’s rookie year; it even predicts 15 years of tenure for him in year number 3. The model then gets closer and closer to the right answer as the years pass.
Now, let’s do the fun part and see how the model predicts for some MLB players early in their careers. The three I am going to look at are young phenoms: Ronald Acuna, Fernando Tatis, & Juan Soto.
young_phenoms <- People %>%
select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
filter(playerID %in% c('acunaro01', 'tatisfe02', 'sotoju01')) %>%
inner_join(Batting) %>%
group_by(playerID, yearID) %>%
summarise(nameFirst = first(nameFirst),
nameLast = first(nameLast),
birthDate = first(birthDate),
debut = first(debut),
finalGame = first(finalGame),
stint = first(stint),
teamID = first(teamID),
lgID = first(lgID),
G = sum(G),
AB = sum(AB),
R = sum(R),
H = sum(H),
X2B = sum(X2B),
X3B = sum(X3B),
HR = sum(HR),
RBI = sum(RBI),
SB = sum(SB),
CS = sum(CS),
BB = sum(BB),
SO = sum(SO),
IBB = sum(IBB),
HBP = sum(HBP),
SH = sum(SH),
SF = sum(SF),
GIDP = sum(GIDP),
.groups = "drop") %>%
Lahman::battingStats(data = .) %>%
group_by(playerID, stint) %>%
mutate(year_num = row_number()) %>%
ungroup() %>%
filter(yearID == 2019) %>%
group_by(playerID) %>%
mutate(tenure = n()) %>%
ungroup() %>%
mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
drop_na() %>%
mutate(decade = yearID - (yearID %% 10)) %>%
mutate(TBB = BB + IBB + HBP, # combine all walk types
TAB = AB + TBB + SH + SF) %>% # create total at-bats
select(-c(BB, IBB, HBP, SH, SF)) # remove unneeded columns
xgb_theory %>%
predict(young_phenoms) %>%
bind_cols(young_phenoms) %>%
select(nameFirst, nameLast, yearID, year_num, age, TAB, OPS, tenure, .pred) %>%
gt()
nameFirst | nameLast | yearID | year_num | age | TAB | OPS | tenure | .pred |
---|---|---|---|---|---|---|---|---|
Ronald | Acuna | 2019 | 2 | 21.05205 | 716 | 0.883 | 1 | 17.25071 |
Juan | Soto | 2019 | 2 | 20.20000 | 662 | 0.949 | 1 | 16.44127 |
Fernando | Tatis | 2019 | 1 | 20.01096 | 373 | 0.969 | 1 | 15.79035 |
Based on 2019 data, the model expects Ronald Acuna to be in the league the longest, but they are all very close to each other. Again, this is good to see. Most experts would predict these young players to be in the league for a very long time, and 16 years is a long time.
Now let’s look at rookies from 2018, 2019, and 2020 and predict their tenure using 2020 data. 2020 was a short season due to the COVID-19 pandemic, so we will have to extrapolate total at-bats from the 60-game season to a regular 162-game season.
rooks <- People %>%
select(playerID, nameFirst, nameLast, birthDate, debut, finalGame) %>%
filter( year(ymd(debut)) %in% c(2018, 2019, 2020)) %>%
inner_join(Batting) %>%
group_by(playerID, yearID) %>%
summarise(nameFirst = first(nameFirst),
nameLast = first(nameLast),
birthDate = first(birthDate),
debut = first(debut),
finalGame = first(finalGame),
stint = first(stint),
teamID = first(teamID),
lgID = first(lgID),
G = sum(G),
AB = sum(AB),
R = sum(R),
H = sum(H),
X2B = sum(X2B),
X3B = sum(X3B),
HR = sum(HR),
RBI = sum(RBI),
SB = sum(SB),
CS = sum(CS),
BB = sum(BB),
SO = sum(SO),
IBB = sum(IBB),
HBP = sum(HBP),
SH = sum(SH),
SF = sum(SF),
GIDP = sum(GIDP),
.groups = "drop") %>%
Lahman::battingStats(data = .) %>%
group_by(playerID, stint) %>%
mutate(year_num = row_number()) %>%
ungroup() %>%
group_by(playerID) %>%
filter(yearID == max(yearID)) %>%
ungroup() %>%
mutate(age = as.double((ymd(paste0(yearID, '-01-01')) - ymd(birthDate))/365)) %>%
drop_na() %>%
mutate(decade = yearID - (yearID %% 10)) %>%
mutate(TBB = BB + IBB + HBP, # combine all walk types
TAB = AB + TBB + SH + SF) %>% # create total at-bats
select(-c(BB, IBB, HBP, SH, SF)) %>% # remove unneeded columns
mutate(TAB = (TAB/60)*162)
xgb_theory %>%
predict(rooks) %>%
bind_cols(rooks) %>%
select(nameFirst, nameLast, year_num, age, TAB, OPS, .pred) %>%
arrange(desc(.pred)) %>%
head(10) %>%
gt()
nameFirst | nameLast | year_num | age | TAB | OPS | .pred |
---|---|---|---|---|---|---|
Juan | Soto | 3 | 21.20000 | 561.6 | 1.185 | 17.60920 |
Fernando | Tatis | 2 | 21.01096 | 696.6 | 0.937 | 16.89737 |
Luis | Garcia | 1 | 19.64110 | 375.3 | 0.668 | 16.45621 |
Ronald | Acuna | 3 | 22.05205 | 550.8 | 0.987 | 15.95809 |
Vladimir | Guerrero | 2 | 20.81096 | 658.8 | 0.791 | 15.93926 |
Kyle | Tucker | 3 | 22.96986 | 621.0 | 0.837 | 14.69650 |
Eloy | Jimenez | 2 | 23.10959 | 610.2 | 0.891 | 13.75860 |
Franmil | Reyes | 3 | 24.50411 | 650.7 | 0.794 | 13.58006 |
Bo | Bichette | 2 | 21.84110 | 348.3 | 0.840 | 13.55870 |
Trent | Grisham | 2 | 23.18082 | 677.7 | 0.808 | 13.43157 |
All of those young players are some of the best in the league, and we would expect them to be at the top of the list if we were to predict tenure.
Now that we have a base model, we can start to improve it. The next steps are hyper-parameter tuning and more feature engineering (for example, accounting for defensive production). All of those steps will be covered in the next blog post.