Many businesses segment their customers to gain a better understanding of their customer base. This post walks through a segmentation from start to finish, including how to interpret and validate the newly found segments. Along the way, it notes how each step would apply to retail data.
The goal of customer segmentation is to find hidden groups in data. The customers in this example will be MLB hitters. We are going to approach this problem as if we knew nothing about MLB hitters and use clustering algorithms to discover the different types of hitters.
For customer segmentation, you need data that describes the customer. In retail, this could be how recently they have made a purchase, how many times they have purchased in the last 12 months, and the total amount of money they have spent in the last 12 months. These describe a customer’s shopping behavior. You could also include age and area demographics such as ZIP code household income and household size. You just need relevant data that describes the customer.
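As a rough illustration of the retail features described above, the sketch below derives recency, frequency, and total spend per customer from a made-up transaction log. The table, its column names (customer_id, order_date, amount), and the as_of date are all assumptions for illustration, not data from this post.

```r
# Hypothetical transaction log: one row per purchase (made-up values).
transactions <- data.frame(
  customer_id = c("A", "A", "B", "C", "C", "C"),
  order_date  = as.Date(c("2021-01-15", "2021-05-02", "2020-11-20",
                          "2021-03-01", "2021-04-10", "2021-06-01")),
  amount      = c(40, 60, 200, 15, 25, 30),
  stringsAsFactors = FALSE
)

# Assumed reference date for measuring recency.
as_of <- as.Date("2021-06-18")

# Keep only purchases from the 12 months before the reference date.
recent <- transactions[transactions$order_date > as_of - 365, ]

# Date of each customer's latest purchase, stored as a day number.
last_purchase <- tapply(as.numeric(recent$order_date), recent$customer_id, max)

# One row per customer: days since last purchase, purchase count, total spend.
rfm <- data.frame(
  customer_id = names(last_purchase),
  recency     = as.numeric(as_of) - as.vector(last_purchase),
  frequency   = as.vector(table(recent$customer_id)),
  monetary    = as.vector(tapply(recent$amount, recent$customer_id, sum)),
  stringsAsFactors = FALSE
)

print(rfm)
```

Each row of rfm then plays the same role that a hitter's row plays in the rest of this post.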
In this example, we need data that describes MLB hitters. We are going to use 2018 data on walks, strikeouts, hits, doubles, triples, and home runs.
We chose these specific data points because they are the outcomes of an at-bat. The same idea applies to retail settings, where the analogous measures would be the number of times a person visits your site, time spent browsing the site, products looked at, the number of products added to the cart, and the number of products purchased.
We can find the data in the R package ‘Lahman’, which is built from Sean Lahman’s baseball database. More info can be found here: www.seanlahman.com
Install the package:
install.packages("Lahman")
playerID | yearID | stint | teamID | lgID | G | AB | R | H | X2B | X3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
abercda01 | 1871 | 1 | TRO | NA | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | 0 |
addybo01 | 1871 | 1 | RC1 | NA | 25 | 118 | 30 | 32 | 6 | 0 | 0 | 13 | 8 | 1 | 4 | 0 | NA | NA | NA | NA | 0 |
allisar01 | 1871 | 1 | CL1 | NA | 29 | 137 | 28 | 40 | 4 | 5 | 0 | 19 | 3 | 1 | 2 | 5 | NA | NA | NA | NA | 1 |
allisdo01 | 1871 | 1 | WS3 | NA | 27 | 133 | 28 | 44 | 10 | 2 | 2 | 27 | 1 | 1 | 0 | 2 | NA | NA | NA | NA | 0 |
ansonca01 | 1871 | 1 | RC1 | NA | 25 | 120 | 29 | 39 | 11 | 3 | 0 | 16 | 6 | 2 | 2 | 1 | NA | NA | NA | NA | 0 |
armstbo01 | 1871 | 1 | FW1 | NA | 12 | 49 | 9 | 11 | 2 | 1 | 0 | 5 | 0 | 1 | 0 | 1 | NA | NA | NA | NA | 0 |
Glimpse the data:
Rows: 108,789
Columns: 22
$ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", ~
$ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 187~
$ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
$ teamID <fct> TRO, RC1, CL1, WS3, RC1, FW1, RC1, BS1, FW1, BS1, C~
$ lgID <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, ~
$ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36~
$ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24~
$ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 3~
$ X2B <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, ~
$ X3B <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0,~
$ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, ~
$ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21~
$ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0,~
$ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, ~
$ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0,~
$ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, ~
$ IBB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SH <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ SF <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ GIDP <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 1, 2, 0, ~
We only want data from 2018 so we need to filter down and select the data points we want:
playerID | teamID | lgID | G | AB | SH | SF | BB | HBP | IBB | SO | H | X2B | X3B | HR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
abreujo02 | CHA | AL | 128 | 499 | 0 | 6 | 37 | 11 | 7 | 109 | 132 | 36 | 1 | 22 |
acunaro01 | ATL | NL | 111 | 433 | 0 | 3 | 45 | 6 | 2 | 123 | 127 | 26 | 4 | 26 |
adamewi01 | TBA | AL | 85 | 288 | 1 | 2 | 31 | 1 | 3 | 95 | 80 | 7 | 0 | 10 |
adamja01 | KCA | AL | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
adamsau02 | WAS | NL | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
adamsch01 | NYA | AL | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
For the non-baseball people:
The reason we are pulling extra data points like SH, SF, HBP, & IBB is that in the end, we want Total At Bats (TAB). Baseball data is a little tricky in that only some plate appearances are considered an ‘At Bat’. Walks and sacrifices aren’t considered an ‘At Bat’, but for this analysis we’ll want the total number of times a hitter comes to the plate in a season. We also roll BB, HBP, and IBB into a single total walks column (TBB).
playerID | teamID | lgID | G | TAB | TBB | SO | H | X2B | X3B | HR |
---|---|---|---|---|---|---|---|---|---|---|
abreujo02 | CHA | AL | 128 | 560 | 55 | 109 | 132 | 36 | 1 | 22 |
acunaro01 | ATL | NL | 111 | 489 | 53 | 123 | 127 | 26 | 4 | 26 |
adamewi01 | TBA | AL | 85 | 326 | 35 | 95 | 80 | 7 | 0 | 10 |
adamja01 | KCA | AL | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
adamsau02 | WAS | NL | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
adamsch01 | NYA | AL | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
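The code that builds TAB and TBB is not shown here. The sketch below uses formulas inferred from the two tables (an assumption, not necessarily the author's exact code) and checks them against the first three 2018 rows. Note that official scoring normally counts intentional walks within BB already, so this sum follows the tables in this post rather than the official plate-appearance definition.

```r
# Three 2018 rows copied from the filtered table earlier in the post.
batting <- data.frame(
  playerID = c("abreujo02", "acunaro01", "adamewi01"),
  AB  = c(499, 433, 288),
  SH  = c(0, 0, 1),
  SF  = c(6, 3, 2),
  BB  = c(37, 45, 31),
  HBP = c(11, 6, 1),
  IBB = c(7, 2, 3),
  stringsAsFactors = FALSE
)

# Assumed formulas, chosen because they reproduce the TAB and TBB columns above.
batting$TBB <- batting$BB + batting$HBP + batting$IBB                # total walks
batting$TAB <- batting$AB + batting$SH + batting$SF + batting$TBB    # total trips to the plate

print(batting[, c("playerID", "TAB", "TBB")])
```

The printed TAB values (560, 489, 326) and TBB values (55, 53, 35) match the table above.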
Let’s do some EDA on our data:
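library(dplyr)   # %>% and arrange()
library(gt)      # gt() table rendering
library(DataExplorer)
# Count missing values per column, most-missing first
DataExplorer::profile_missing(batting_2018) %>%
  arrange(desc(num_missing)) %>%
  gt()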
feature | num_missing | pct_missing |
---|---|---|
playerID | 0 | 0 |
teamID | 0 | 0 |
lgID | 0 | 0 |
G | 0 | 0 |
TAB | 0 | 0 |
TBB | 0 | 0 |
SO | 0 | 0 |
H | 0 | 0 |
X2B | 0 | 0 |
X3B | 0 | 0 |
HR | 0 | 0 |
No missing data. Let’s look at the distributions:
DataExplorer::plot_bar(batting_2018, ncol = 4, nrow = 4)
DataExplorer::plot_histogram(batting_2018, ncol = 4, nrow = 4)
A LOT of zero values. We need to pick a minimum number of games played for a hitter to be included in our analysis. Instead of deriving this from the data, I’m going to use business logic and require a player to have appeared in at least 100 games in the 2018 season.
batting_2018 <- batting_2018 %>%
filter(G >= 100)
Look at the distributions again:
DataExplorer::plot_histogram(batting_2018, ncol = 4, nrow = 4)
Much better. Most distributions (other than triples) are looking a lot more normal. Triples happen so infrequently that I’m going to combine them with doubles and call them extra-base hits.
In cluster analysis, it’s important to get all variables on the same scale. If you run unscaled data through the kmeans algorithm, it can over-emphasize a variable that is on a larger scale. This is common in retail. If you have the number of store visits and dollars per visit, the kmeans algorithm will over-emphasize dollars per visit because it can run into the hundreds or thousands of dollars while visits are only in the single- and double-digit range.
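To make the scaling point concrete, here is a small sketch with three made-up retail customers (the names and values are assumptions for illustration). It compares Euclidean distances, which kmeans relies on, before and after standardizing with scale().

```r
# Made-up customers: visits are small numbers, dollars per visit are large.
customers <- data.frame(
  visits            = c(2, 20, 3),
  dollars_per_visit = c(100, 110, 400)
)
rownames(customers) <- c("A", "B", "C")

# Unscaled distances: the dollar column dominates.
raw_dist <- as.matrix(dist(customers))

# Distances after converting each column to z-scores.
scaled_dist <- as.matrix(dist(scale(customers)))

print(round(raw_dist, 2))
print(round(scaled_dist, 2))
```

On the raw data, A looks far closer to B than to C, even though B visits ten times as often as A, because a 300 dollar gap swamps an 18 visit gap. After scaling, A is roughly equally far from B and C, so both variables get a say.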
In this example, we have to take an extra step before we scale. We want to put everything on a per-at-bat basis, which tells us the percentage of the time a player walks, gets a hit, and so on. Then we will scale each variable. The same could be done for retail data by transforming it to a per-visit level.
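batting_2018 <- batting_2018 %>%
  # Convert raw counts to rates per trip to the plate (TAB)
  mutate(XBH = X2B + X3B,            # extra-base hits: doubles + triples
         walks = TBB/TAB,
         strikeouts = SO/TAB,
         singles = H/TAB,            # note: H counts all hits, so this is hits per TAB, not singles only
         extras = XBH/TAB,
         triples = X3B/TAB,
         homeruns = HR/TAB) %>%
  # Standardize each rate to a z-score; as.numeric() keeps plain vectors instead of 1-column matrices
  mutate(walks_scaled = as.numeric(scale(walks)),
         strikeouts_scaled = as.numeric(scale(strikeouts)),
         singles_scaled = as.numeric(scale(singles)),
         extras_scaled = as.numeric(scale(extras)),
         homeruns_scaled = as.numeric(scale(homeruns)))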
head(batting_2018) %>% gt()
playerID | teamID | lgID | G | TAB | TBB | SO | H | X2B | X3B | HR | XBH | walks | strikeouts | singles | extras | triples | homeruns | walks_scaled | strikeouts_scaled | singles_scaled | extras_scaled | homeruns_scaled |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
abreujo02 | CHA | AL | 128 | 560 | 55 | 109 | 132 | 36 | 1 | 22 | 37 | 0.09821429 | 0.1946429 | 0.2357143 | 0.06607143 | 0.001785714 | 0.03928571 | -0.1574981 | -0.2393866 | 0.30032062 | 1.1686432 | 0.4940999 |
acunaro01 | ATL | NL | 111 | 489 | 53 | 123 | 127 | 26 | 4 | 26 | 30 | 0.10838446 | 0.2515337 | 0.2597137 | 0.06134969 | 0.008179959 | 0.05316973 | 0.1310304 | 0.7245586 | 1.20778753 | 0.7962888 | 1.4262794 |
adriaeh01 | MIN | AL | 114 | 368 | 27 | 82 | 84 | 23 | 1 | 6 | 24 | 0.07336957 | 0.2228261 | 0.2282609 | 0.06521739 | 0.002717391 | 0.01630435 | -0.8623447 | 0.2381431 | 0.01849173 | 1.1012941 | -1.0488797 |
aguilje01 | MIL | NL | 149 | 569 | 67 | 143 | 135 | 25 | 0 | 35 | 25 | 0.11775044 | 0.2513181 | 0.2372583 | 0.04393673 | 0.000000000 | 0.06151142 | 0.3967440 | 0.7209049 | 0.35870476 | -0.5768913 | 1.9863443 |
ahmedni01 | ARI | NL | 153 | 566 | 44 | 109 | 121 | 33 | 5 | 16 | 38 | 0.07773852 | 0.1925795 | 0.2137809 | 0.06713781 | 0.008833922 | 0.02826855 | -0.7383973 | -0.2743475 | -0.52902478 | 1.2527376 | -0.2455976 |
albieoz01 | ATL | NL | 158 | 684 | 41 | 116 | 167 | 40 | 5 | 24 | 45 | 0.05994152 | 0.1695906 | 0.2441520 | 0.06578947 | 0.007309942 | 0.03508772 | -1.2432994 | -0.6638652 | 0.61936959 | 1.1464083 | 0.2122445 |
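For readers who want to see what scale() is doing, the sketch below recomputes the walk rate for the six players shown above and standardizes it by hand. Because it uses only these six rows, the resulting z-scores are relative to this small group and will not match the walks_scaled column, which is standardized over every qualifying hitter.

```r
# Walk totals and trips to the plate for the six players in the table above.
TBB <- c(55, 53, 27, 67, 44, 41)
TAB <- c(560, 489, 368, 569, 566, 684)

# Per-at-bat walk rate; the first value matches the 0.09821429 shown in the table.
walks <- TBB / TAB

# z-score by hand: subtract the mean, divide by the standard deviation.
z_manual <- (walks - mean(walks)) / sd(walks)

# scale() performs the same calculation.
z_scale <- as.numeric(scale(walks))

print(round(cbind(walks, z_manual, z_scale), 4))
```

After standardizing, the column has mean 0 and standard deviation 1, which is what puts every hitting category on an equal footing for kmeans.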
We have the data prepped and scaled, so we are ready to run it through the kmeans algorithm. The tricky part of cluster analysis via kmeans is that it forces the user to select the number of clusters (k). Our first task is determining k, and there are a few ways to do this. The three methods I will showcase are the elbow method, silhouette scores, and the gap statistic.
library(factoextra)
fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "wss")
The elbow method plots the total within-cluster sum of squares (tot.withinss) for each k. Because adding a cluster can only reduce the best achievable tot.withinss, it keeps falling as k increases. Our goal isn’t to find the lowest tot.withinss but to find the point of diminishing returns, or the ‘elbow’ point in the graph. The elbow seems to be at 4 clusters.
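To show what the elbow plot is built from, here is a minimal base R sketch that computes tot.withinss for a range of k. It uses simulated data with three obvious groups rather than the hitting data, and the seed and nstart = 25 are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed so the toy data and kmeans starts are reproducible

# Simulated stand-in for the scaled columns: three blobs in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),                 # around (0, 0)
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),                 # around (5, 5)
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 5, sd = 0.5))  # around (0, 5)
)

# Fit kmeans for k = 1..8 and record the total within-cluster sum of squares.
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

print(data.frame(k = ks, tot_withinss = round(wss, 1)))
# plot(ks, wss, type = "b")  # uncomment to draw the elbow curve
```

In this toy case the big drops happen up to k = 3 and the curve flattens afterwards, which is the elbow. Real data, like the hitters here, rarely bends that cleanly.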
fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "silhouette")
The silhouette method suggests 2 clusters, with 3 close behind.
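The average silhouette width behind that plot can also be computed directly. The sketch below does so with silhouette() from the cluster package on simulated three-group data; the data, seed, and nstart = 25 are assumptions for illustration. A silhouette width near 1 means a point sits well inside its own cluster, while values near 0 mean it sits on a boundary.

```r
library(cluster)  # provides silhouette()

set.seed(42)  # arbitrary seed for reproducibility

# Simulated data with three well-separated groups.
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),                  # around (0, 0)
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2),                  # around (10, 10)
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 10, sd = 0.5))  # around (0, 10)
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2..6 (it is not defined for k = 1).
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km  <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
print(best_k)
```

The k with the highest average width is the one this method recommends.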
fviz_nbclust(select(batting_2018, contains('scaled')), kmeans, method = "gap_stat")
The gap statistic suggests 1 cluster. Unfortunately, the methods don’t all agree, which is usually the case when doing this analysis out in the wild. Fortunately, we can visualize the clusters. We are only going to visualize 3 and 4 clusters because 2 clusters wouldn’t be informative enough to give us any insights.
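The call that creates the k4 object used in the rest of this post is not shown. As a minimal sketch, assuming it was fit with stats::kmeans on the scaled columns, it could look like the following. Simulated data stands in for select(batting_2018, contains('scaled')), and the seed, nstart, and iter.max values are assumptions.

```r
set.seed(2018)  # assumed seed; kmeans starts from random centers

# Simulated stand-in for the five scaled hitting columns (237 hitters, z-scored).
scaled_stats <- scale(matrix(
  rnorm(237 * 5), ncol = 5,
  dimnames = list(NULL, c("walks_scaled", "strikeouts_scaled", "singles_scaled",
                          "extras_scaled", "homeruns_scaled"))
))

# Fit the two candidate solutions; nstart reruns kmeans from several random starts.
k3 <- kmeans(scaled_stats, centers = 3, nstart = 25, iter.max = 50)
k4 <- kmeans(scaled_stats, centers = 4, nstart = 25, iter.max = 50)

print(table(k4$cluster))     # number of rows assigned to each cluster
print(round(k4$centers, 2))  # mean z-score of each column within each cluster
```

The cluster vector (k4$cluster) is what gets joined back to the data, and the rows of k4$centers are the per-cluster mean z-scores used to describe each segment.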
fviz_cluster(k4, data = select(batting_2018, contains('scaled')))
The separation isn’t great, so it’s understandable that the methods didn’t agree on a specific k. I like 4 clusters, so we are going to go with that. Let’s join the clusters back to the data and see if we can determine the types of hitters in each cluster. Remember that since we scaled the data, we are looking at standard scores, or z-scores. A z-score of 0 means average for that stat; the more positive the score, the further above average, and vice versa for negative scores.
cluster | n | walks | strikeouts | singles | extras | homeruns |
---|---|---|---|---|---|---|
1 | 39 | 1.4976302 | 0.4546602 | -0.6094075 | -0.55614687 | 0.94189321 |
2 | 53 | -0.2066408 | -0.1284540 | 0.7968028 | 1.09735051 | 0.61576332 |
3 | 75 | -0.1431369 | 0.6372036 | -0.7537442 | -0.49739296 | -0.09622473 |
4 | 70 | -0.5245764 | -0.8387708 | 0.5438166 | 0.01192319 | -0.88789195 |
Let’s visualize these clusters with some bar charts. Might be easier to digest.
library(tidyr)
library(ggplot2)
batting_2018 %>%
  # Attach each hitter's cluster assignment
  mutate(cluster = k4$cluster) %>%
  group_by(cluster) %>%
  # Mean z-score of every hitting category within each cluster
  summarize(walks = mean(walks_scaled),
            strikeouts = mean(strikeouts_scaled),
            singles = mean(singles_scaled),
            extras = mean(extras_scaled),
            homeruns = mean(homeruns_scaled),
            .groups = 'drop') %>%
  # Reshape to one row per cluster and category for plotting
  pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
  # Flag above-average values so they get a different fill color
  mutate(pos = z_score >= 0) %>%
  ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(facets = vars(cluster), nrow = 2, ncol = 2)
Cluster 1 = high walks, high strikeouts, low singles, low extras, high homeruns
Cluster 2 = low walks, low strikeouts, high singles, high extras, high homeruns
Cluster 3 = low walks, highish strikeouts, low singles, low extras, low homeruns
Cluster 4 = low walks, low strikeouts, high singles, avg extras, low homeruns
Let’s give them names.
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
group_by(cluster) %>%
summarize(walks = mean(walks_scaled),
strikeouts = mean(strikeouts_scaled),
singles = mean(singles_scaled),
extras = mean(extras_scaled),
homeruns = mean(homeruns_scaled),
.groups = 'drop') %>%
pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
filter(cluster == 1) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
ggtitle(label = "The Three True Outcome Hitters")
These hitters are already known in the baseball world and have been given a name because so many of their plate appearances end in one of the ‘Three True Outcomes’: a walk, a strikeout, or a home run. Examples of this type of hitter are Aaron Judge and Bryce Harper. Let’s see if they ended up in cluster 1.
#Aaron Judge, Bryce Harper
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(playerID %in% c("judgeaa01","harpebr03")) %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(facets = vars(playerID), nrow = 2, ncol = 2) +
ggtitle(label = "Bryce Harper and Aaron Judge")
Let’s check the other hitters in this cluster:
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(cluster == 1) %>%
arrange(desc(G)) %>%
select(playerID, teamID, cluster, contains('scaled')) %>%
head() %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
ggtitle(label = "The Three True Outcome Hitters")
All of these are pretty good examples of ‘Three True Outcome’ hitters except Carlos Santana (santaca01). His defining attribute is that he draws a ton of walks, but he doesn’t strike out or hit home runs at an above-average rate, so he is not a great fit for this category. The most likely reason he was put here is that he fits the walk profile of this group and doesn’t belong in any of the other groups either.
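One way to check how well any single hitter fits a cluster is to measure the distance from his z-scores to each cluster’s mean profile. The sketch below does this using the cluster means from the summary table earlier in the post. Santana’s individual z-scores are not shown in this post, so abreujo02’s row from the earlier head() output is used purely as an example player.

```r
# Cluster mean z-scores, copied from the summary table earlier in the post.
centroids <- rbind(
  c( 1.4976302,  0.4546602, -0.6094075, -0.55614687,  0.94189321),
  c(-0.2066408, -0.1284540,  0.7968028,  1.09735051,  0.61576332),
  c(-0.1431369,  0.6372036, -0.7537442, -0.49739296, -0.09622473),
  c(-0.5245764, -0.8387708,  0.5438166,  0.01192319, -0.88789195)
)
colnames(centroids) <- c("walks", "strikeouts", "singles", "extras", "homeruns")
rownames(centroids) <- 1:4

# Example player: abreujo02's scaled stats from the earlier head() output.
player <- c(walks = -0.1574981, strikeouts = -0.2393866, singles = 0.30032062,
            extras = 1.1686432, homeruns = 0.4940999)

# Euclidean distance from the player to each cluster's mean profile.
dists <- sqrt(rowSums(sweep(centroids, 2, player)^2))

print(round(sort(dists), 2))
```

For this example player, the cluster 2 profile is clearly the closest. A hitter whose smallest distance is only slightly smaller than the next one is a borderline member, which is the situation described above for Santana.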
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
group_by(cluster) %>%
summarize(walks = mean(walks_scaled),
strikeouts = mean(strikeouts_scaled),
singles = mean(singles_scaled),
extras = mean(extras_scaled),
homeruns = mean(homeruns_scaled),
.groups = 'drop') %>%
pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
filter(cluster == 2) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
ggtitle(label = "All Around Good Hitters")
They don’t have a weak area: they walk enough, hit a lot of singles, and have power, hitting a lot of home runs. These are the well-rounded great hitters. We would expect the 2018 AL MVP, Mookie Betts, to be in this category. He might well be the most well-rounded hitter in MLB, with almost no weaknesses.
#Mookie Betts
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(playerID %in% c("bettsmo01")) %>%
select(playerID, teamID, cluster, contains('scaled')) %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
ggtitle(label = "Mookie Betts")
Look at those z-scores for Mookie. Incredible.
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(cluster == 2) %>%
arrange(desc(G)) %>%
select(playerID, teamID, cluster, contains('scaled')) %>%
head() %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
ggtitle(label = "All Around Good Hitters")
All of the hitters shown above are great hitters. This is good to see.
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
group_by(cluster) %>%
summarize(walks = mean(walks_scaled),
strikeouts = mean(strikeouts_scaled),
singles = mean(singles_scaled),
extras = mean(extras_scaled),
homeruns = mean(homeruns_scaled),
.groups = 'drop') %>%
pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
filter(cluster == 3) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
ggtitle(label = "Weak Hitters")
These batters struggle to get on base via walks and strike out a lot. These are the weak hitters that are most likely at the bottom of the batting order. Most of these names would not be familiar to casual MLB fans.
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(cluster == 3) %>%
arrange(desc(G)) %>%
select(playerID, teamID, cluster, contains('scaled')) %>%
head() %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
ggtitle(label = "Weak Hitters")
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
group_by(cluster) %>%
summarize(walks = mean(walks_scaled),
strikeouts = mean(strikeouts_scaled),
singles = mean(singles_scaled),
extras = mean(extras_scaled),
homeruns = mean(homeruns_scaled),
.groups = 'drop') %>%
pivot_longer(!cluster, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
filter(cluster == 4) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
ggtitle(label = "Balls-in-Play")
They don’t walk, they don’t strike out, and they don’t hit home runs. They just put the ball in play in most at-bats.
batting_2018 %>%
mutate(cluster = k4$cluster) %>%
filter(cluster == 4) %>%
arrange(desc(G)) %>%
select(playerID, teamID, cluster, contains('scaled')) %>%
head() %>%
select(playerID, contains('scaled')) %>%
pivot_longer(!playerID, names_to = "hitting_category", values_to = "z_score") %>%
mutate(pos = z_score >= 0) %>%
ggplot(aes(x = hitting_category, y = z_score, fill = pos)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(facets = vars(playerID), nrow = 2, ncol = 3) +
ggtitle(label = "Balls-in-Play")
Nick Markakis is a great example of this cluster. He walks slightly more than average, almost never strikes out, and hits a ton of singles and extra-base hits with well below average home runs.
We went through the process of getting the customer data (MLB hitting data), exploring the missing values and distributions, scaling the data, determining an appropriate k using three different methods and then interpreting and validating our segments. The next step is to start using our new-found segments to gain insights on things like roster construction and position importance.
For attribution, please cite this work as
Oberdiear (2021, June 18). Louis Oberdiear: Customer Segmentation...with MLB players Pt. 1. Retrieved from https://thelob.blog/posts/2021-06-18-customer-segmentationwith-mlb-players/
BibTeX citation
@misc{oberdiear2021customer, author = {Oberdiear, Louis}, title = {Louis Oberdiear: Customer Segmentation...with MLB players Pt. 1}, url = {https://thelob.blog/posts/2021-06-18-customer-segmentationwith-mlb-players/}, year = {2021} }