Louis Oberdiear: Customer Segmentation...with MLB players Pt. 1

Louis Oberdiear

The goal of customer segmentation is to find hidden groups in data. The customers in this example are MLB hitters. We will approach the problem as if we knew nothing about MLB hitters and use clustering algorithms to discover the different types of hitters.

The Data

For customer segmentation, you need data that describes the customer. In retail, this could be how recently they made a purchase, how many times they purchased in the last 12 months, and the total amount of money they spent in the last 12 months. These describe a customer’s shopping behavior. You could also include age and area demographics such as ZIP-code household income and household size. You just need relevant data that describes the customer.

In this example, we need data that describes MLB hitters. We are going to use 2018 data for the following outcomes: walks, strikeouts, hits, doubles, triples, and home runs.

We chose these specific data points because they are the outcomes of an at-bat. The same idea applies to retail settings: the number of times a person visits your site, the time spent browsing the site, the products looked at, the number of products added to the cart, and the number of products purchased.

We can find the data in the R package ‘Lahman’ which is from Sean Lahman’s baseball database. More info can be found here: www.seanlahman.com

playerID	yearID	stint	teamID	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP
abercda01	1871	1	TRO	NA	1	4	0	0	0	0	0	0	0	0	0	0	NA	NA	NA	NA	0
addybo01	1871	1	RC1	NA	25	118	30	32	6	0	0	13	8	1	4	0	NA	NA	NA	NA	0
allisar01	1871	1	CL1	NA	29	137	28	40	4	5	0	19	3	1	2	5	NA	NA	NA	NA	1
allisdo01	1871	1	WS3	NA	27	133	28	44	10	2	2	27	1	1	0	2	NA	NA	NA	NA	0
ansonca01	1871	1	RC1	NA	25	120	29	39	11	3	0	16	6	2	2	1	NA	NA	NA	NA	0
armstbo01	1871	1	FW1	NA	12	49	9	11	2	1	0	5	0	1	0	1	NA	NA	NA	NA	0

We only want data from 2018 so we need to filter down and select the data points we want:

playerID	teamID	lgID	G	AB	SH	SF	BB	HBP	IBB	SO	H	X2B	X3B	HR
abreujo02	CHA	AL	128	499	0	6	37	11	7	109	132	36	1	22
acunaro01	ATL	NL	111	433	0	3	45	6	2	123	127	26	4	26
adamewi01	TBA	AL	85	288	1	2	31	1	3	95	80	7	0	10
adamja01	KCA	AL	31	0	0	0	0	0	0	0	0	0	0	0
adamsau02	WAS	NL	2	0	0	0	0	0	0	0	0	0	0	0
adamsch01	NYA	AL	3	0	0	0	0	0	0	0	0	0	0	0
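
The filter-and-select step can be sketched in a few lines. This is a minimal Python illustration, not the original analysis code (which used R and the Lahman package); the list-of-dicts layout and the `select_year` helper are assumptions made for the example. The 1871 row is copied from the first table and the 2018 row from the second, abbreviated to the columns shown there.

```python
# Columns kept for the analysis, as listed in the 2018 table above.
KEEP = ["playerID", "teamID", "lgID", "G", "AB", "SH", "SF",
        "BB", "HBP", "IBB", "SO", "H", "X2B", "X3B", "HR"]

# Two sample rows; None stands in for a missing (NA) value.
batting = [
    {"playerID": "abercda01", "yearID": 1871, "teamID": "TRO", "lgID": "NA",
     "G": 1, "AB": 4, "SH": None, "SF": None, "BB": 0, "HBP": None,
     "IBB": None, "SO": 0, "H": 0, "X2B": 0, "X3B": 0, "HR": 0},
    {"playerID": "abreujo02", "yearID": 2018, "teamID": "CHA", "lgID": "AL",
     "G": 128, "AB": 499, "SH": 0, "SF": 6, "BB": 37, "HBP": 11,
     "IBB": 7, "SO": 109, "H": 132, "X2B": 36, "X3B": 1, "HR": 22},
]


def select_year(rows, year, columns):
    """Keep rows from one season and only the requested columns."""
    return [{c: r[c] for c in columns} for r in rows if r["yearID"] == year]


hitters_2018 = select_year(batting, 2018, KEEP)
print([r["playerID"] for r in hitters_2018])
```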

We pull extra columns such as SH (sacrifice hits), SF (sacrifice flies), HBP (hit by pitch), and IBB (intentional walks) because, in the end, we want Total At Bats (TAB). Baseball data is a little tricky in that only some plate appearances count as an ‘At Bat’. Walks and sacrifices don’t, but for this analysis we want the total number of times a player comes to the plate in a season.

playerID	teamID	lgID	G	TAB	TBB	SO	H	X2B	X3B	HR
abreujo02	CHA	AL	128	560	55	109	132	36	1	22
acunaro01	ATL	NL	111	489	53	123	127	26	4	26
adamewi01	TBA	AL	85	326	35	95	80	7	0	10
adamja01	KCA	AL	31	0	0	0	0	0	0	0
adamsau02	WAS	NL	2	0	0	0	0	0	0	0
adamsch01	NYA	AL	3	0	0	0	0	0	0	0
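
The derived columns can be reproduced with simple sums. The formulas in this Python sketch are inferred from the two tables rather than taken from the author's code: TBB = BB + HBP + IBB and TAB = AB + SH + SF + TBB match the values shown for the sample rows. The `add_totals` helper name is an assumption.

```python
def add_totals(row):
    """Return a copy of the row with TBB and TAB added.

    Formulas are inferred from the tables in the post, not from its code.
    """
    tbb = row["BB"] + row["HBP"] + row["IBB"]
    tab = row["AB"] + row["SH"] + row["SF"] + tbb
    return {**row, "TBB": tbb, "TAB": tab}


# Values copied from the 2018 table above.
abreu = {"playerID": "abreujo02", "AB": 499, "SH": 0, "SF": 6,
         "BB": 37, "HBP": 11, "IBB": 7}
acuna = {"playerID": "acunaro01", "AB": 433, "SH": 0, "SF": 3,
         "BB": 45, "HBP": 6, "IBB": 2}

abreu_totals = add_totals(abreu)
acuna_totals = add_totals(acuna)
print(abreu_totals["TAB"], abreu_totals["TBB"])  # 560 55
print(acuna_totals["TAB"], acuna_totals["TBB"])  # 489 53
```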

EDA

feature	num_missing	pct_missing
playerID	0	0
teamID	0	0
lgID	0	0
G	0	0
TAB	0	0
TBB	0	0
SO	0	0
H	0	0
X2B	0	0
X3B	0	0
HR	0	0
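
A missing-value summary like the one above can be built by counting empty cells per column. This is a small Python sketch under assumptions: `missing_summary` is a hypothetical helper, `None` stands in for a missing value, and `pct_missing` is expressed as a percentage. The rows are copied from the previous table.

```python
def missing_summary(rows, columns):
    """Return (feature, num_missing, pct_missing) for each column."""
    n = len(rows)
    summary = []
    for col in columns:
        num_missing = sum(1 for r in rows if r.get(col) is None)
        pct_missing = 100.0 * num_missing / n if n else 0.0
        summary.append((col, num_missing, pct_missing))
    return summary


columns = ["playerID", "G", "TAB", "TBB", "SO", "H", "X2B", "X3B", "HR"]
rows = [
    {"playerID": "abreujo02", "G": 128, "TAB": 560, "TBB": 55, "SO": 109,
     "H": 132, "X2B": 36, "X3B": 1, "HR": 22},
    {"playerID": "acunaro01", "G": 111, "TAB": 489, "TBB": 53, "SO": 123,
     "H": 127, "X2B": 26, "X3B": 4, "HR": 26},
    {"playerID": "adamja01", "G": 31, "TAB": 0, "TBB": 0, "SO": 0,
     "H": 0, "X2B": 0, "X3B": 0, "HR": 0},
]

summary = missing_summary(rows, columns)
for feature, num_missing, pct_missing in summary:
    print(feature, num_missing, pct_missing)
```

Note that a zero is not a missing value: the third row has no missing cells even though every counting stat is 0.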

A LOT of zero values. We need to settle on a minimum number of games played for a hitter to be included in our analysis. For this, instead of using analysis, I’m going to use business logic and require a player to have appeared in at least 100 games in the 2018 season.

Much better. Most distributions (other than triples) are looking a lot more normal. Triples happen so infrequently that I’m going to combine them with doubles and call them extra-base hits.
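
Both steps, the games threshold and the combined extra-base-hit column, are one-liners. The Python sketch below uses rows from the earlier table; `MIN_GAMES` and `qualify` are names made up for the example, and XBH = X2B + X3B follows the description above.

```python
MIN_GAMES = 100  # the "business logic" threshold chosen above

# Rows copied from the TAB/TBB table.
players = [
    {"playerID": "abreujo02", "G": 128, "X2B": 36, "X3B": 1},
    {"playerID": "acunaro01", "G": 111, "X2B": 26, "X3B": 4},
    {"playerID": "adamewi01", "G": 85, "X2B": 7, "X3B": 0},
    {"playerID": "adamja01", "G": 31, "X2B": 0, "X3B": 0},
]


def qualify(rows, min_games=MIN_GAMES):
    """Drop players below the games threshold and add XBH = doubles + triples."""
    return [
        {**r, "XBH": r["X2B"] + r["X3B"]}
        for r in rows
        if r["G"] >= min_games
    ]


qualified = qualify(players)
for r in qualified:
    print(r["playerID"], r["XBH"])
```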

In cluster analysis, it’s important to get all data points on the same scale. If you run unscaled data through the kmeans algorithm, it can over-emphasize whichever variable is on the largest scale. This is common in retail. If you have the number of store visits and dollars per visit, the kmeans algorithm will over-emphasize dollars per visit because it could be in the hundreds or thousands of dollars, while visits are only in the single- and double-digit range.
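
To make the scale problem concrete, here is a small Python sketch with three made-up customers (the numbers are hypothetical, chosen only for illustration). It compares Euclidean distances before and after z-scoring each column; `zscore` uses the sample standard deviation, which is an assumption about how the scaling is done.

```python
import math
import statistics


def zscore(values):
    """Standardize a column: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical customers: (store visits, dollars per visit).
customers = {"A": (2, 500), "B": (20, 510), "C": (3, 900)}
names = list(customers)

visits = zscore([customers[n][0] for n in names])
dollars = zscore([customers[n][1] for n in names])
scaled = {n: (v, d) for n, v, d in zip(names, visits, dollars)}

raw_ab = math.dist(customers["A"], customers["B"])
raw_ac = math.dist(customers["A"], customers["C"])
scaled_ab = math.dist(scaled["A"], scaled["B"])
scaled_ac = math.dist(scaled["A"], scaled["C"])

print("raw:    A-B", round(raw_ab, 2), " A-C", round(raw_ac, 2))
print("scaled: A-B", round(scaled_ab, 2), " A-C", round(scaled_ac, 2))
```

On the raw data, the dollar column decides almost everything: A looks far closer to B than to C even though A and B differ tenfold in visits. After scaling, both differences carry comparable weight.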

In this example, we have to take an extra step before we scale: we want to put everything on a per-at-bat level. This tells us the percentage of times a player walks, gets a hit, and so on. Then we will scale each variable. The same could be done for retail data by transforming it to a per-visit level.

playerID	teamID	lgID	G	TAB	TBB	SO	H	X2B	X3B	HR	XBH	walks	strikeouts	singles	extras	triples	homeruns	walks_scaled	strikeouts_scaled	singles_scaled	extras_scaled	homeruns_scaled
abreujo02	CHA	AL	128	560	55	109	132	36	1	22	37	0.09821429	0.1946429	0.2357143	0.06607143	0.001785714	0.03928571	-0.1574981	-0.2393866	0.30032062	1.1686432	0.4940999
acunaro01	ATL	NL	111	489	53	123	127	26	4	26	30	0.10838446	0.2515337	0.2597137	0.06134969	0.008179959	0.05316973	0.1310304	0.7245586	1.20778753	0.7962888	1.4262794
adriaeh01	MIN	AL	114	368	27	82	84	23	1	6	24	0.07336957	0.2228261	0.2282609	0.06521739	0.002717391	0.01630435	-0.8623447	0.2381431	0.01849173	1.1012941	-1.0488797
aguilje01	MIL	NL	149	569	67	143	135	25	0	35	25	0.11775044	0.2513181	0.2372583	0.04393673	0.000000000	0.06151142	0.3967440	0.7209049	0.35870476	-0.5768913	1.9863443
ahmedni01	ARI	NL	153	566	44	109	121	33	5	16	38	0.07773852	0.1925795	0.2137809	0.06713781	0.008833922	0.02826855	-0.7383973	-0.2743475	-0.52902478	1.2527376	-0.2455976
albieoz01	ATL	NL	158	684	41	116	167	40	5	24	45	0.05994152	0.1695906	0.2441520	0.06578947	0.007309942	0.03508772	-1.2432994	-0.6638652	0.61936959	1.1464083	0.2122445
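
The per-at-bat columns can be checked by hand. The Python sketch below recomputes the rates for the first row of the table (abreujo02) by dividing each count by TAB; `per_tab_rates` is a hypothetical helper. One detail worth noting from the numbers: the table's `singles` value equals H / TAB, so the sketch does the same. The scaled columns are not reproduced here because z-scores depend on every qualifying player, not on a single row.

```python
# Counts copied from the first row of the table above.
abreu = {"TAB": 560, "TBB": 55, "SO": 109, "H": 132,
         "X2B": 36, "X3B": 1, "HR": 22}


def per_tab_rates(row):
    """Convert season counts into per-plate-appearance rates."""
    tab = row["TAB"]
    xbh = row["X2B"] + row["X3B"]  # extra-base hits = doubles + triples
    return {
        "walks": row["TBB"] / tab,
        "strikeouts": row["SO"] / tab,
        "singles": row["H"] / tab,  # matches the table's `singles` column
        "extras": xbh / tab,
        "homeruns": row["HR"] / tab,
    }


rates = per_tab_rates(abreu)
for name, value in rates.items():
    print(name, round(value, 8))
```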

K-Means Cluster Analysis

We have the data prepped and scaled, so we are ready to run it through the kmeans algorithm. The tricky part of cluster analysis via kmeans is that kmeans forces the user to select the number of clusters (k). Our first task is determining k, and there are a few ways to do this. The three methods I will showcase are the elbow method, silhouette scores, and the gap statistic.
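
For readers who want to see what the algorithm does with k, here is a minimal Python sketch of k-means. It is plain Lloyd's algorithm with seeded random starting centers, which may differ from the variant the original analysis ran, and the toy points are made up. It alternates two steps until assignments stop changing: assign each point to its nearest center, then move each center to the mean of its members.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers, tot_withinss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as starting centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: nothing moved
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    tot_withinss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, tot_withinss


# Two well-separated toy blobs (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centers, tot_withinss = kmeans(points, k=2)
print(labels, round(tot_withinss, 4))
```

Because the result depends on the starting centers, real analyses usually run several random starts and keep the best one; this sketch runs only one.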

Elbow Method

The elbow method plots the total within-cluster sum of squares (tot.withinss) against the number of clusters. The way kmeans works guarantees that tot.withinss keeps decreasing as clusters are added, so our goal isn’t to find the lowest tot.withinss but to find the point of diminishing returns, the ‘elbow’ point in the graph. The elbow seems to be at 4 clusters.
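
The quantity behind the elbow plot is easy to compute by hand. The Python sketch below uses four made-up one-dimensional points and hand-picked partitions standing in for k-means output at each k, so it illustrates the shape of the curve rather than the article's results.

```python
def tot_withinss(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        center = sum(members) / len(members)
        total += sum((x - center) ** 2 for x in members)
    return total


# Hand-picked partitions of the toy data [1, 2, 9, 10] for k = 1..4.
partitions = {
    1: [[1, 2, 9, 10]],
    2: [[1, 2], [9, 10]],
    3: [[1, 2], [9], [10]],
    4: [[1], [2], [9], [10]],
}

wss = {k: tot_withinss(clusters) for k, clusters in partitions.items()}
drops = {k: wss[k - 1] - wss[k] for k in range(2, 5)}

print(wss)    # {1: 65.0, 2: 1.0, 3: 0.5, 4: 0.0}
print(drops)  # {2: 64.0, 3: 0.5, 4: 0.5}
```

The value falls all the way to 0 when every point is its own cluster, which is why the minimum is not the target. The large drop from k = 1 to k = 2 followed by small drops is the elbow for this toy data.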

Silhouette Method
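
As a reminder of what a silhouette score measures, here is a small Python sketch on made-up one-dimensional data (not the article's results). For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster; the score is (b - a) / max(a, b), so values near 1 mean the point sits well inside its cluster. The `silhouette_scores` helper is written for this example, needs at least two clusters, and assumes the points are not all identical.

```python
def silhouette_scores(points, labels):
    """Per-point silhouette scores for 1-D points."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        dists = {}  # cluster label -> distances from p to that cluster
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(abs(p - q))
        if own not in dists:  # singleton cluster: score is 0 by convention
            scores.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return scores


points = [1, 2, 9, 10]
good = silhouette_scores(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # deliberately poor grouping

avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(round(avg_good, 4), round(avg_bad, 4))
```

To pick k, the average score is computed for each candidate k and the k with the highest average is preferred.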

Gap Statistic

The gap statistic suggests 1 cluster. Unfortunately, the methods don’t all agree, which is usually the case when doing this analysis out in the wild. Fortunately, we can visualize the clusters. We are only going to visualize 3 and 4 clusters because 2 clusters wouldn’t be informative enough to give us any insights.

The separation isn’t great, so it’s understandable that the methods didn’t agree on a specific k, but I like 4 clusters, so we are going to go with that. Let’s join the clusters back to the data and see if we can determine the types of hitters in each cluster. Remember that since we scaled the data, we are looking at standard scores, or z-scores. A z-score of 0 means average for that stat; the more positive the score, the further above average, and vice versa for negative scores.

cluster	n	walks	strikeouts	singles	extras	homeruns
1	39	1.4976302	0.4546602	-0.6094075	-0.55614687	0.94189321
2	53	-0.2066408	-0.1284540	0.7968028	1.09735051	0.61576332
3	75	-0.1431369	0.6372036	-0.7537442	-0.49739296	-0.09622473
4	70	-0.5245764	-0.8387708	0.5438166	0.01192319	-0.88789195
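
One quick sanity check on a table like this: because every column was z-scored across all players, the overall mean of each column is 0, so the size-weighted average of the cluster means should come out at roughly 0 as well. The Python sketch below runs that check on the numbers copied from the table; the variable names are made up for the example.

```python
# (n, walks, strikeouts, singles, extras, homeruns), copied from the table.
clusters = {
    1: (39, 1.4976302, 0.4546602, -0.6094075, -0.55614687, 0.94189321),
    2: (53, -0.2066408, -0.1284540, 0.7968028, 1.09735051, 0.61576332),
    3: (75, -0.1431369, 0.6372036, -0.7537442, -0.49739296, -0.09622473),
    4: (70, -0.5245764, -0.8387708, 0.5438166, 0.01192319, -0.88789195),
}
stats = ["walks", "strikeouts", "singles", "extras", "homeruns"]

total_n = sum(row[0] for row in clusters.values())

# Size-weighted average of the cluster means for each stat.
weighted_means = {
    stat: sum(row[0] * row[i + 1] for row in clusters.values()) / total_n
    for i, stat in enumerate(stats)
}

print(total_n)
for stat, value in weighted_means.items():
    print(stat, round(value, 6))
```

All five weighted averages land within rounding error of zero, which is what we expect if the cluster means were computed on the scaled columns.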

Let’s visualize these clusters with some bar charts. They might be easier to digest.

Cluster 1 = The Three True Outcome Hitters

These hitters are already known in the baseball world and have been given a name because so many of their at bats end in one of the ‘Three True Outcomes’: a walk, a strikeout, or a home run. Examples of this type of hitter are Aaron Judge and Bryce Harper. Let’s see if they ended up in cluster 1.

All of these are pretty good examples of ‘Three True Outcome’ hitters except Carlos Santana (santaca01). His defining attribute is that he draws a ton of walks, but he doesn’t strike out much or hit home runs at an above-average rate, so he is not a great fit for this category. The most likely reason he landed here is that he fits the walk profile of this group and doesn’t belong in any of the other groups either.

Cluster 2 = All Around Good Hitters

They don’t have a weak area: they walk enough, hit a lot of singles, and show power by hitting a lot of home runs. These are the well-rounded great hitters. We would expect the 2018 AL MVP, Mookie Betts, to be in this category. He might well be the most well-rounded hitter in MLB, with almost no weaknesses.

Cluster 3 = Weak Hitters

These batters struggle to get on base (walks) and strike out a lot. These are the weak hitters who are most likely at the bottom of the batting order. Most of these names would not be familiar to casual MLB fans.

Cluster 4 = Balls-in-Play

They don’t walk, they don’t strike out, and they don’t hit home runs. They just put the ball in play in most at bats.

Nick Markakis is a great example of this cluster. He walks slightly above average, almost never strikes out, and hits a ton of singles and extras with well-below-average home runs.

Summary

We went through the process of getting the customer data (MLB hitting data), exploring the missing values and distributions, scaling the data, determining an appropriate k using three different methods and then interpreting and validating our segments. The next step is to start using our new-found segments to gain insights on things like roster construction and position importance.
