Customer Segmentation…with MLB players Pt. 1

customer segmentation customer series cluster analysis data science rstats

Many businesses segment their customers to better understand their customer base. This post shows how to do this from start to finish, and how to interpret and validate the resulting segments. Throughout the example, I note how each step applies to retail data.

Author: Louis Oberdiear

Published: June 18, 2021

Citation: Oberdiear, 2021

Customer Segmentation of MLB Baseball Players

The goal of customer segmentation is to find hidden groups in data. The customers in this example are MLB hitters. We will approach the problem as if we knew nothing about MLB hitters and use clustering algorithms to discover the different types of hitters.

The Data

For customer segmentation, you need data that describes the customer. In retail, this could be how recently they made a purchase, how many times they purchased in the last 12 months, and the total amount they spent in the last 12 months. These describe a customer’s shopping behavior. You could also include age and area demographics such as household income and household size for the customer’s ZIP code. You just need relevant data that describes the customer.
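
As a rough illustration of the retail features described above, here is a minimal base R sketch that derives recency, frequency, and monetary value per customer. The `transactions` table, its column names, and the `as_of` date are all made up for this example.

```r
# Hypothetical transaction log: one row per purchase
transactions <- data.frame(
  customer_id   = c("a", "a", "b", "c", "c", "c"),
  purchase_date = as.Date(c("2021-01-05", "2021-05-20", "2020-11-02",
                            "2021-03-14", "2021-04-01", "2021-06-10")),
  amount        = c(40, 25, 120, 15, 30, 22)
)
as_of <- as.Date("2021-06-18")  # assumed analysis date

# Keep only purchases from the trailing 12 months
recent <- transactions[transactions$purchase_date > as_of - 365, ]

# One row per customer: days since last purchase, purchase count, total spend
rfm <- do.call(rbind, lapply(split(recent, recent$customer_id), function(d) {
  data.frame(
    customer_id  = d$customer_id[1],
    recency_days = as.numeric(as_of - max(d$purchase_date)),
    frequency    = nrow(d),
    monetary     = sum(d$amount)
  )
}))
print(rfm)
```

Each row of `rfm` plays the same role as one hitter's row in the baseball data below.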

In this example, we need data that describes MLB hitters. We are going to use 2018 data for the following: walks, strikeouts, hits, doubles, triples, and home runs.

These data points were chosen because they are the outcomes of an at-bat. The same idea applies to retail settings: the number of times a person visits your site, time spent browsing the site, products looked at, the number of products added to the cart, and the number of products purchased.

We can find the data in the R package ‘Lahman’, which is built from Sean Lahman’s baseball database. More info can be found here: www.seanlahman.com

Install the package:

install.packages("Lahman")
library(Lahman)
library(gt)
data(Batting)
head(Batting) %>%
  gt()
playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 0
addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA 0
allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA 1
allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA 0
ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA 0
armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA 0

Glimpse the data:

library(dplyr)
glimpse(Batting)
Rows: 108,789
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", ~
$ yearID   <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 187~
$ stint    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ teamID   <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, C~
$ lgID     <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ G        <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, ~
$ AB       <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36~
$ R        <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24~
$ H        <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 3~
$ X2B      <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, ~
$ X3B      <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0,~
$ HR       <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, ~
$ RBI      <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21~
$ SB       <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0,~
$ CS       <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, ~
$ BB       <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0,~
$ SO       <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, ~
$ IBB      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ HBP      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SH       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SF       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ GIDP     <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, ~

We only want data from 2018 so we need to filter down and select the data points we want:

batting_2018 <- Batting %>%
  filter(yearID == 2018) %>%
  select(c(playerID, teamID, lgID, G, AB, SH, SF, BB, HBP, IBB, SO, H, X2B, X3B, HR))
head(batting_2018) %>% gt()
playerID teamID lgID G AB SH SF BB HBP IBB SO H X2B X3B HR
abreujo02 CHA AL 128 499 0 6 37 11 7 109 132 36 1 22
acunaro01 ATL NL 111 433 0 3 45 6 2 123 127 26 4 26
adamewi01 TBA AL 85 288 1 2 31 1 3 95 80 7 0 10
adamja01 KCA AL 31 0 0 0 0 0 0 0 0 0 0 0
adamsau02 WAS NL 2 0 0 0 0 0 0 0 0 0 0 0
adamsch01 NYA AL 3 0 0 0 0 0 0 0 0 0 0 0

For the non-baseball people:

The reason we are pulling extra data points like SH (sacrifice hits), SF (sacrifice flies), HBP (hit by pitch), and IBB (intentional walks) is that, in the end, we want Total At Bats (TAB). Baseball data is a little tricky in that only some plate appearances count as an ‘At Bat’. Walks, hit-by-pitches, and sacrifices don’t count, but for this analysis we want the total number of times a hitter comes to the plate in a season, which baseball calls plate appearances.

batting_2018 <- batting_2018 %>%
  # Note: Lahman's BB already includes intentional walks, so adding IBB
  # counts them twice; the outputs below reflect this formula as written
  mutate(TBB = BB + IBB + HBP,         # combine all walk types
         TAB = AB + TBB + SH + SF) %>% # create total at-bats (trips to the plate)
  select(playerID, teamID, lgID, G, TAB, TBB, SO, H, X2B, X3B, HR) # keep only the columns we need

head(batting_2018) %>% gt()
playerID teamID lgID G TAB TBB SO H X2B X3B HR
abreujo02 CHA AL 128 560 55 109 132 36 1 22
acunaro01 ATL NL 111 489 53 123 127 26 4 26
adamewi01 TBA AL 85 326 35 95 80 7 0 10
adamja01 KCA AL 31 0 0 0 0 0 0 0
adamsau02 WAS NL 2 0 0 0 0 0 0 0
adamsch01 NYA AL 3 0 0 0 0 0 0 0
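
One detail worth checking by hand: in Lahman's Batting table, BB already counts intentional walks, so adding IBB on top counts them twice. The sketch below redoes the arithmetic for the first row (abreujo02), using the counts printed in the tables above. `tab_as_computed` and `pa_conventional` are names introduced here for illustration only; the rest of the post keeps the TAB values as originally computed.

```r
# Counts for abreujo02, copied from the tables above
abreu <- list(AB = 499, BB = 37, IBB = 7, HBP = 11, SH = 0, SF = 6)

# TAB as computed in this post: IBB is added on top of BB
tab_as_computed <- with(abreu, AB + (BB + IBB + HBP) + SH + SF)

# Conventional plate appearances: BB is used once, since it includes IBB
pa_conventional <- with(abreu, AB + BB + HBP + SH + SF)

print(c(tab_as_computed = tab_as_computed,
        pa_conventional = pa_conventional,
        difference      = tab_as_computed - pa_conventional))
```

The difference equals the player's intentional walks, so it is small for most hitters and slightly inflates the walk rate of those who are intentionally walked often.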

EDA

Let’s do some exploratory data analysis (EDA) on our data, starting with missing values:

library(DataExplorer)
DataExplorer::profile_missing(batting_2018) %>%
  arrange(desc(num_missing)) %>%
  gt()
feature num_missing pct_missing
playerID 0 0
teamID 0 0
lgID 0 0
G 0 0
TAB 0 0
TBB 0 0
SO 0 0
H 0 0
X2B 0 0
X3B 0 0
HR 0 0

No missing data. Let’s look at the distributions:

DataExplorer::plot_bar(batting_2018, ncol = 4, nrow = 4)
DataExplorer::plot_histogram(batting_2018, ncol = 4, nrow = 4)

A LOT of zero values. We need to set a minimum number of games a hitter must have played to be included in our analysis. Instead of deriving this from the data, I’m going to use business logic and require a player to have appeared in at least 100 games in the 2018 season.

batting_2018 <- batting_2018 %>%
  filter(G >= 100)
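
One caveat with this filter: Batting has one row per player, season, and stint (see the `stint` column in the glimpse output), so a player who changed teams mid-season has his games split across rows and can fall under the cutoff. Below is a hedged sketch of summing stints before filtering. The player IDs and counts are made up, and it uses base R only.

```r
# Toy data: "traded01" played for two teams, so he has two stint rows
toy <- data.frame(
  playerID = c("traded01", "traded01", "stayed01", "bench01"),
  stint    = c(1, 2, 1, 1),
  G        = c(96, 66, 150, 40),
  AB       = c(365, 267, 600, 80),
  H        = c(115, 73, 170, 18)
)

# Filtering stint rows directly drops the traded player
per_stint <- toy[toy$G >= 100, ]

# Summing counts per player first keeps him
season     <- aggregate(cbind(G, AB, H) ~ playerID, data = toy, FUN = sum)
per_season <- season[season$G >= 100, ]

print(per_stint$playerID)
print(per_season$playerID)
```

Whether to aggregate depends on the question; for season-level hitter profiles, summing stints seems the more natural unit.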

Look at the distributions again:

DataExplorer::plot_histogram(batting_2018, ncol = 4, nrow = 4)

Much better. Most distributions (other than triples) are looking a lot more normal. Triples happen so infrequently that I’m going to combine them with doubles and call them extra-base hits.

In cluster analysis, it’s important to get all variables on the same scale. K-means groups points by Euclidean distance, so if you run unscaled data through the algorithm it can over-emphasize a variable that is on a larger scale. This is common in retail. If you have the number of store visits and dollars per visit, kmeans will over-emphasize dollars per visit because it can be in the hundreds or thousands of dollars while visits are only in the single or double digits.
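
To make the scaling point concrete, here is a small sketch on made-up retail numbers. It shows how much of the total variance each variable carries before and after `scale()`, and how k-means on the unscaled data splits customers on dollars alone. The data frame and the `var_share` helper are hypothetical.

```r
# Toy customers: visits are single/double digits, dollars are in the hundreds
retail <- data.frame(
  visits            = c(2, 3, 12, 14, 2, 13),
  dollars_per_visit = c(310, 890, 305, 900, 880, 320)
)

# Share of the total variance contributed by each column
var_share <- function(df) {
  v <- sapply(df, var)
  v / sum(v)
}

print(round(var_share(retail), 4))   # dollars_per_visit carries nearly all of it
scaled <- as.data.frame(scale(retail))
print(round(var_share(scaled), 4))   # equal shares after scaling

# On the unscaled data, k-means separates customers by dollars only
set.seed(123)
km <- kmeans(retail, centers = 2, nstart = 10)
print(cbind(retail, cluster = km$cluster))
```

After scaling, both variables have unit variance, so neither dominates the distance calculation.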

In this example, we have to take an extra step before we scale: we want everything at a per-at-bat level. This tells us the share of trips to the plate in which a player walks, gets a hit, and so on. Then we scale each variable. The same can be done for retail data by transforming it to a per-visit level.

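batting_2018 <- batting_2018 %>%
  mutate(XBH = X2B + X3B,          # extra-base hits: doubles + triples
         walks = TBB/TAB,
         strikeouts = SO/TAB,
         singles = H/TAB,          # note: H counts all hits, so this is hits per TAB, not singles alone
         extras = XBH/TAB,
         triples = X3B/TAB,
         homeruns = HR/TAB) %>%
  # scale() returns a one-column matrix; as.numeric() keeps each column a plain vector
  mutate(walks_scaled = as.numeric(scale(walks)),
         strikeouts_scaled = as.numeric(scale(strikeouts)),
         singles_scaled = as.numeric(scale(singles)),
         extras_scaled = as.numeric(scale(extras)),
         homeruns_scaled = as.numeric(scale(homeruns)))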

head(batting_2018) %>% gt()
playerID teamID lgID G TAB TBB SO H X2B X3B HR XBH walks strikeouts singles extras triples homeruns walks_scaled strikeouts_scaled singles_scaled extras_scaled homeruns_scaled
abreujo02 CHA AL 128 560 55 109 132 36 1 22 37 0.09821429 0.1946429 0.2357143 0.06607143 0.001785714 0.03928571 -0.1574981 -0.2393866 0.30032062 1.1686432 0.4940999
acunaro01 ATL NL 111 489 53 123 127 26 4 26 30 0.10838446 0.2515337 0.2597137 0.06134969 0.008179959 0.05316973 0.1310304 0.7245586 1.20778753 0.7962888 1.4262794
adriaeh01 MIN AL 114 368 27 82 84 23 1 6 24 0.07336957 0.2228261 0.2282609 0.06521739 0.002717391 0.01630435 -0.8623447 0.2381431 0.01849173 1.1012941 -1.0488797
aguilje01 MIL NL 149 569 67 143 135 25 0 35 25 0.11775044 0.2513181 0.2372583 0.04393673 0.000000000 0.06151142 0.3967440 0.7209049 0.35870476 -0.5768913 1.9863443
ahmedni01 ARI NL 153 566 44 109 121 33 5 16 38 0.07773852 0.1925795 0.2137809 0.06713781 0.008833922 0.02826855 -0.7383973 -0.2743475 -0.52902478 1.2527376 -0.2455976
albieoz01 ATL NL 158 684 41 116 167 40 5 24 45 0.05994152 0.1695906 0.2441520 0.06578947 0.007309942 0.03508772 -1.2432994 -0.6638652 0.61936959 1.1464083 0.2122445
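
In the code above, `singles` is computed as H/TAB. In the Batting table, H counts every hit (doubles, triples, and home runs included), so that column is really hits per TAB. If singles alone were wanted, one way to get them is sketched below for the abreujo02 row, using the counts printed above. `X1B`, `hit_rate`, and `single_rate` are names introduced here for illustration.

```r
# Counts for abreujo02, copied from the table above
abreu <- data.frame(TAB = 560, H = 132, X2B = 36, X3B = 1, HR = 22)

# What the post labels `singles`: all hits per TAB
abreu$hit_rate <- abreu$H / abreu$TAB

# Singles only: hits that were not doubles, triples, or home runs
abreu$X1B         <- abreu$H - abreu$X2B - abreu$X3B - abreu$HR
abreu$single_rate <- abreu$X1B / abreu$TAB

print(abreu)
```

The analysis that follows keeps the original H/TAB definition, so read "singles" in the cluster profiles as overall hit rate.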

K-Means Cluster Analysis

We have the data prepped and scaled, so we are ready to run it through the kmeans algorithm. The tricky part of cluster analysis via kmeans is that it forces the user to select the number of clusters (k). Our first task is determining k, and there are a few ways to do this. The three methods I will showcase are the elbow method, silhouette scores, and the gap statistic.

Elbow Method

library(factoextra)
fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "wss")

The elbow method plots the total within-cluster sum of squares (tot.withinss) against the number of clusters. Because adding a cluster can only tighten the best possible fit, tot.withinss keeps falling as k increases. Our goal isn’t to find the lowest tot.withinss but the point of diminishing returns, the ‘elbow’ in the graph. The elbow seems to be at 4 clusters.
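
For readers who want to see what sits behind the elbow plot, here is a minimal base R sketch that computes tot.withinss for a range of k by hand. It runs on made-up toy data with three obvious groups, not on the hitter data, so the elbow it finds says nothing about the right k for hitters.

```r
set.seed(123)  # k-means uses random starts, so fix the seed

# Toy data (made up): three tight groups in two already-scaled features
x <- rbind(
  matrix(rnorm(60, mean = -3, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  3, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# How much each extra cluster reduces tot.withinss; the elbow is where
# these gains become small
gains <- -diff(wss)
print(round(data.frame(k = k_values, wss = wss, gain = c(NA, gains)), 1))
```

On data like this the gain collapses after k = 3. On real data, as in the plot above, the bend is usually softer and needs judgment.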

Silhouette Method

fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "silhouette")

The silhouette method suggests 2 clusters, with 3 close behind.
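
The silhouette score compares, for each point, the average distance to its own cluster (a) with the average distance to the nearest other cluster (b), as (b - a) / max(a, b). A rough base R sketch on made-up toy data follows. `avg_silhouette` is a helper written for this example, not a function from factoextra.

```r
set.seed(123)

# Toy data (made up): three tight groups in two already-scaled features
x <- rbind(
  matrix(rnorm(60, mean = -3, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  3, sd = 0.5), ncol = 2)
)

# Average silhouette width for a given cluster assignment
avg_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(x)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)              # singleton cluster scores 0
    a <- sum(d[i, own]) / (sum(own) - 1)      # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

k_values <- 2:6
sil <- sapply(k_values, function(k) {
  avg_silhouette(x, kmeans(x, centers = k, nstart = 25)$cluster)
})
best_k <- k_values[which.max(sil)]
print(round(data.frame(k = k_values, avg_silhouette = sil), 3))
```

Values near 1 mean tight, well-separated clusters; the k with the highest average is the one this method prefers.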

Gap Statistic

fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "gap_stat")

The gap statistic suggests 1 cluster. Unfortunately, the methods don’t all agree, which is usually the case when doing this analysis out in the wild. Fortunately, we can visualize the clusters. We are only going to visualize 3 and 4 clusters because 2 clusters wouldn’t give us enough detail to draw any insights.
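
For reference, the gap statistic compares log(tot.withinss) on the real data with what you would get on structureless reference data drawn uniformly over the same ranges. The block below is a simplified base R sketch of that idea on made-up toy data. It is not the exact procedure factoextra runs, and `log_wk`, `ref_sample`, `B`, and `k_hat` are names chosen for this example.

```r
set.seed(123)

# Toy data (made up): three tight groups in two already-scaled features
x <- rbind(
  matrix(rnorm(60, mean = -3, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  3, sd = 0.5), ncol = 2)
)

# log of the total within-cluster sum of squares for k clusters
log_wk <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10)$tot.withinss)
}

# One reference data set: uniform over each feature's observed range
ref_sample <- function(data) {
  apply(data, 2, function(col) runif(length(col), min(col), max(col)))
}

k_max <- 6
B     <- 20  # number of reference data sets (kept small for speed)

obs <- sapply(1:k_max, function(k) log_wk(x, k))
ref <- replicate(B, {
  r <- ref_sample(x)
  sapply(1:k_max, function(k) log_wk(r, k))
})  # k_max x B matrix

gap <- rowMeans(ref) - obs
se  <- apply(ref, 1, sd) * sqrt(1 + 1 / B)

# Rule from Tibshirani, Walther and Hastie: smallest k with
# gap[k] >= gap[k + 1] - se[k + 1]
ok    <- which(gap[-k_max] >= gap[-1] - se[-1])
k_hat <- if (length(ok) > 0) ok[1] else k_max
print(round(data.frame(k = 1:k_max, gap = gap, se = se), 3))
print(k_hat)
```

A result of 1 cluster, as on the hitter data, means the observed data did not look clearly more clustered than uniform noise under this rule.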

set.seed(123)
k3 <- kmeans(select(batting_2018, contains('scaled')), 3, nstart = 25)
k4 <- kmeans(select(batting_2018, contains('scaled')), 4, nstart = 25)
fviz_cluster(k3, data = select(batting_2018, contains('scaled')))
fviz_cluster(k4, data = select(batting_2018, contains('scaled')))

The separation isn’t great, so it’s understandable that the methods didn’t agree on a specific k, but I like 4 clusters so we are going to go with that. Let’s join the clusters back to the data and see if we can determine the types of hitters in each cluster. Remember that since we scaled the data, we are looking at standard scores, or z-scores. A z-score of 0 means average for that stat; the more positive the score, the further above average, and the more negative, the further below.

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(n = n(),
         walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled)) %>%
  gt()
cluster n walks strikeouts singles extras homeruns
1 39 1.4976302 0.4546602 -0.6094075 -0.55614687 0.94189321
2 53 -0.2066408 -0.1284540 0.7968028 1.09735051 0.61576332
3 75 -0.1431369 0.6372036 -0.7537442 -0.49739296 -0.09622473
4 70 -0.5245764 -0.8387708 0.5438166 0.01192319 -0.88789195
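
As a quick sanity check on reading these numbers, the sketch below shows that `scale()` is just (x - mean) / sd, and that the cluster means for walks in the table above, weighted by cluster size, come back to roughly zero, as they must for z-scores. The `walk_rate` vector is made up; the cluster sizes and means are copied from the table.

```r
# Made-up walk rates, only to illustrate what scale() does
walk_rate <- c(0.098, 0.108, 0.073, 0.118, 0.078, 0.060)
z_manual  <- (walk_rate - mean(walk_rate)) / sd(walk_rate)
z_scale   <- as.numeric(scale(walk_rate))
print(all.equal(z_manual, z_scale))

# Cluster sizes and mean walk z-scores, copied from the table above
n          <- c(39, 53, 75, 70)
walks_mean <- c(1.4976302, -0.2066408, -0.1431369, -0.5245764)

# The size-weighted average of the cluster means is the overall mean
overall <- sum(n * walks_mean) / sum(n)
print(round(overall, 6))
```

This is why a cluster mean of +1.5 for walks can be read directly as "about one and a half standard deviations above the average hitter in the sample".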

Let’s visualize these clusters with some bar charts. Might be easier to digest.

library(tidyr)
library(ggplot2)
batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled),
         .groups = 'drop') %>%
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(cluster), nrow = 2, ncol = 2)

Cluster 1 = high walks, high strikeouts, low singles, low extras, high homeruns

Cluster 2 = low walks, low strikeouts, high singles, high extras, high homeruns

Cluster 3 = low walks, highish strikeouts, low singles, low extras, low homeruns

Cluster 4 = low walks, low strikeouts, high singles, avg extras, low homeruns
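
The high/avg/low labels above were read off the charts by eye. As a hedged sketch, the same labels can be derived from the table of cluster means with a cutoff. The ±0.05 cutoff is an arbitrary choice made here only because it reproduces the descriptions above (apart from the "highish" nuance); it is not a recommended threshold.

```r
# Cluster means copied from the summary table above
cluster_means <- rbind(
  c( 1.4976302,  0.4546602, -0.6094075, -0.55614687,  0.94189321),
  c(-0.2066408, -0.1284540,  0.7968028,  1.09735051,  0.61576332),
  c(-0.1431369,  0.6372036, -0.7537442, -0.49739296, -0.09622473),
  c(-0.5245764, -0.8387708,  0.5438166,  0.01192319, -0.88789195)
)
colnames(cluster_means) <- c("walks", "strikeouts", "singles", "extras", "homeruns")

cutoff <- 0.05  # arbitrary, for illustration only

# Label every cell high / low / avg relative to the cutoff
labels <- ifelse(cluster_means >  cutoff, "high",
          ifelse(cluster_means < -cutoff, "low", "avg"))

# One description string per cluster
profile <- apply(labels, 1, function(l) {
  paste(l, colnames(cluster_means), collapse = ", ")
})
print(paste0("Cluster ", seq_along(profile), " = ", profile))
```

A rule like this is handy when there are many clusters or variables, but the cutoff should be chosen with the business context in mind.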

Let’s give them names.

Cluster 1 = The Three True Outcome Hitters

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled),
         .groups = 'drop') %>%
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  filter(cluster == 1) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle(label = "The Three True Outcome Hitters")

These hitters are already known in the baseball world and have been given a name because so many of their at-bats end in one of the ‘Three True Outcomes’: walk, strikeout, or home run. Examples of this type of hitter are Aaron Judge and Bryce Harper. Let’s see if they ended up in cluster 1.

# Aaron Judge, Bryce Harper
batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(playerID %in% c("judgeaa01","harpebr03")) %>%
  select(playerID, cluster, contains('scaled')) %>%  # keep cluster so it can be shown
  pivot_longer(!c(playerID, cluster), names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  # facet by player and cluster so each panel's strip shows the assigned cluster
  facet_wrap(facets = vars(playerID, cluster), nrow = 2, ncol = 2) +
  ggtitle(label = "Bryce Harper and Aaron Judge")

Let’s check the other hitters in this cluster:

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(cluster == 1) %>%
  arrange(desc(G)) %>%
  select(playerID, teamID, cluster, contains('scaled')) %>%
  head() %>%
  select(playerID, contains('scaled')) %>%
  pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
  ggtitle(label = "The Three True Outcome Hitters")

All of these are pretty good examples of ‘Three True Outcome’ hitters except Carlos Santana (santaca01). His defining attribute is that he draws a ton of walks, but he doesn’t strike out much or hit home runs at an above-average rate, so he is not a great fit for this category. He was most likely put here because he fits the walk profile of this group and doesn’t belong in any of the other groups either.
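
K-means assigns every hitter to the nearest cluster center, even when no center is very close. The sketch below shows this for a made-up profile with very high walks, low strikeouts, and average power. The z-scores in `player` are hypothetical and are not Carlos Santana's actual values; the centers are the cluster means from the summary table above.

```r
# Cluster centers (mean z-scores per cluster), copied from the summary table
centers <- rbind(
  c( 1.4976302,  0.4546602, -0.6094075, -0.55614687,  0.94189321),
  c(-0.2066408, -0.1284540,  0.7968028,  1.09735051,  0.61576332),
  c(-0.1431369,  0.6372036, -0.7537442, -0.49739296, -0.09622473),
  c(-0.5245764, -0.8387708,  0.5438166,  0.01192319, -0.88789195)
)
colnames(centers) <- c("walks", "strikeouts", "singles", "extras", "homeruns")

# Hypothetical hitter: lots of walks, few strikeouts, average power
player <- c(walks = 2.0, strikeouts = -0.8, singles = -0.3,
            extras = 0.0, homeruns = 0.1)

# Euclidean distance from the hitter to each center
distances <- sqrt(rowSums(sweep(centers, 2, player)^2))
assigned  <- as.integer(which.min(distances))

print(round(distances, 2))
print(assigned)
```

Comparing a hitter's distance to his own center with the distances of his cluster-mates is one simple way to flag marginal members like this.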

Cluster 2 = All Around Good Hitters

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled),
         .groups = 'drop') %>%
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  filter(cluster == 2) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle(label = "All Around Good Hitters")

They don’t have a weak area: they walk enough, hit a lot of singles, and have power, hitting a lot of home runs. These are the well-rounded great hitters. We would expect the MVP winner Mookie Betts to be in this category. He might well be the most well-rounded hitter in MLB, with almost no weaknesses.

#Mookie Betts
batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(playerID %in% c("bettsmo01")) %>%
  select(playerID, teamID, cluster, contains('scaled')) %>%
  select(playerID, contains('scaled')) %>%
  pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle(label = "Mookie Betts")

Look at those z-scores for Mookie. Incredible.

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(cluster == 2) %>%
  arrange(desc(G)) %>%
  select(playerID, teamID, cluster, contains('scaled')) %>%
  head() %>%
  select(playerID, contains('scaled')) %>%
  pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
  ggtitle(label = "All Around Good Hitters")

All of the hitters shown above are great hitters. This is good to see.

Cluster 3 = Weak Hitters

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled),
         .groups = 'drop') %>%
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  filter(cluster == 3) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle(label = "Weak Hitters")

These batters struggle to get on base (few walks and few hits) and strike out a lot. These are the weak hitters who are most likely at the bottom of the batting order. Most of these names would not be familiar to casual MLB fans.

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(cluster == 3) %>%
  arrange(desc(G)) %>%
  select(playerID, teamID, cluster, contains('scaled')) %>%
  head() %>%
  select(playerID, contains('scaled')) %>%
  pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
  ggtitle(label = "Weak Hitters")

Cluster 4 = Balls-in-Play

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  summarize(walks = mean(walks_scaled),
         strikeouts = mean(strikeouts_scaled),
         singles = mean(singles_scaled),
         extras = mean(extras_scaled),
         homeruns = mean(homeruns_scaled),
         .groups = 'drop') %>%
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  filter(cluster == 4) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle(label = "Balls-in-Play")

They don’t walk, they don’t strike out, and they don’t hit home runs. They just put the ball in play in most at-bats.

batting_2018 %>%
  mutate(cluster = k4$cluster) %>%
  filter(cluster == 4) %>%
  arrange(desc(G)) %>%
  select(playerID, teamID, cluster, contains('scaled')) %>%
  head() %>%
  select(playerID, contains('scaled')) %>%
  pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
  ggtitle(label = "Balls-in-Play")

Nick Markakis is a great example of this cluster. He walks slightly more than average, almost never strikes out, and hits a ton of singles and extras with well below average home runs.

Summary

We went through the process of getting the customer data (MLB hitting data), exploring the missing values and distributions, scaling the data, determining an appropriate k using three different methods, and then interpreting and validating our segments. The next step is to start using our new-found segments to gain insights on things like roster construction and position importance.

    Citation

    For attribution, please cite this work as

    Oberdiear (2021, June 18). Louis Oberdiear: Customer Segmentation...with MLB players Pt. 1. Retrieved from https://thelob.blog/posts/2021-06-18-customer-segmentationwith-mlb-players/

    BibTeX citation

    @misc{oberdiear2021customer,
      author = {Oberdiear, Louis},
      title = {Louis Oberdiear: Customer Segmentation...with MLB players Pt. 1},
      url = {https://thelob.blog/posts/2021-06-18-customer-segmentationwith-mlb-players/},
      year = {2021}
    }