Clustering NBA Players Using K-Means with a Focus on Market Inefficiency

Jack Weyer
Apr 27, 2021

Introduction

In the modern NBA, the notion of ‘positionless basketball’ is making waves across the league. Players like Nikola Jokic are averaging as many assists per game as ‘Point God’ Chris Paul—despite differing by 11 inches and 109 lbs. Basketball Reference defines Jokic as a Center and Paul as a Point Guard, but what does that really tell us? The game of basketball is changing; Chicago Bulls Center Nikola Vucevic shot 28 three point attempts in his first five seasons in the NBA—he now shoots from range at the same per game volume as Philadelphia 76ers sharpshooter Danny Green, the player with the 42nd most made threes in NBA history. The goal of this project is to mathematically classify NBA players into ‘roles,’ going beyond the traditional five positions, in an attempt to more accurately define what each player's skill set actually is.

K-Means clustering will be performed using 56 unique statistics from Basketball Reference. This unsupervised machine learning algorithm automatically ‘clusters’ players based on their similarity to other players across the 56 statistics. The number of clusters is not determined by the algorithm itself; it is chosen by the analyst and is driven by the problem at hand. These clusters become synonymous with the ‘roles’ mentioned above and are useful in many ways. For front offices, these clusters can be used to find replacements for players that bring similar attributes. Lineups can be optimized to find the perfect supporting cast to complement a superstar. Undervalued player types can be found and exploited, making teams more efficient in their spending under the Salary Cap. For broadcasts, announcers can more confidently and accurately make player comparisons. Draft analysts can easily map a prospect's style to players already in the NBA, which will better inform their viewers.

Using clustering techniques to define roles isn't anything groundbreaking; “NBA Lineup Analysis on Clustered Player Tendencies,” “Using Machine Learning to Find the 8 Types of Players in the NBA,” and “Defining NBA players by role with k-means clustering” were all inspirations for this report, along with others credited in the Works Cited. This report is unique in that all code used to complete the project is included, allowing readers to follow along, duplicate, or expand on the work provided. Additionally, I use more statistics and also look at the roles from an economic perspective to find undervalued talent. The approach implemented here has value beyond finding NBA roles; a very similar approach could be applied to the available data in any sport.

Methods

While websites like nba.com/stats and pbpstats.com have more in-depth statistics, Basketball Reference is used here for its relatively easy access to the data. Shown in the image below, player stats for the 2020-21 NBA season are sorted into Totals, Per Game, Per 36 Minutes, Per 100 Possessions, Advanced, Play-by-Play, Shooting, and Adjusted Shooting. The players chosen for this clustering assignment are the top 400 in minutes played for the 2020-21 season, with stats from the beginning of the regular season up until April 19, 2021. Data from every tab except Totals, Per Game, and Per 36 Minutes were included by copying each page and pasting it into Excel. Those three tabs were excluded because the ‘Per 100 Possessions’ page features many of the same stats with the advantage of controlling for games played, minutes played, and tempo. We don't want our model penalizing players who have missed games due to injury, players who perform efficiently in limited playing time, or players who play in a slow offense.

Per 100 Possessions statistics. The Per 100 Possessions statistics included are Field Goals, Field Goal Attempts, Field Goal %, Three Pointers, Three Point Attempts, Three Point %, Two Pointers, Two Point Attempts, Two Point %, Free Throws, Free Throw Attempts, Free Throw %, Personal Fouls, and Points.

Advanced statistics. The Advanced statistics included are Player Efficiency Rating, True Shooting %, Three Point Attempt Rate, Free Throw Rate, Offensive Rebound %, Defensive Rebound %, Total Rebound %, Assist %, Steal %, Block %, Turnover %, Usage %, Win Shares per 48 minutes, Offensive Box Plus-Minus, Defensive Box Plus-Minus, Box Plus-Minus, and Value Over Replacement Player.

Play-by-Play statistics. The Play-by-Play statistics included are all converted from season totals to per 36 minute rates. They are Bad Pass Turnovers, Lost Ball Turnovers, Shooting Fouls, Offensive Fouls, Shooting Fouls Drawn, Offensive Fouls Drawn, Points Generated by Assists, And Ones, and Field Goal Attempts that are blocked.
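As a minimal sketch of the per 36 minute conversion described above, each season total is multiplied by 36 and divided by the player's minutes played. The players, column names, and numbers below are made up for illustration and are not taken from the dataset.

```r
# Hypothetical season totals for two players (illustrative values only)
pbp <- data.frame(
  Player  = c("Player A", "Player B"),
  MP      = c(1800, 900),  # minutes played
  BadPass = c(50, 30),     # season total of bad pass turnovers
  And1    = c(20, 5)       # season total of and ones
)

# Scale every counting column to a per 36 minute rate: total * 36 / minutes
count_cols <- c("BadPass", "And1")
per36 <- pbp
per36[count_cols] <- pbp[count_cols] * 36 / pbp$MP

print(per36)
```

Dividing by minutes puts a reserve who plays 900 minutes on the same footing as a starter who plays 1,800.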

Shooting statistics. The Shooting statistics included are Average Field Goal Attempt Distance, Two Point Attempt Rate, % of Field Goal Attempts Ranging from: 0-<3 feet, 3-<10 feet, 10-<16 feet, and 16 feet-<3 pointer, and Field Goal % in these same ranges. Also included are % of Two Point Field Goals that are assisted, % of Three Point Field Goals that are assisted, % of Field Goal Attempts that are Dunks, Corner Three Proportion of Total Three Attempts, and Corner Three %.

Adjusted Shooting statistics. The only Adjusted Shooting statistic included is Effective Field Goal %.

We combine our five datasets into one, linking the data by Player. This gives us a full dataset of 400 observations (each representing a unique player) and 57 variables (56 statistics plus the players' names). It appears there are 46 players with at least one 'N/A' value for a statistic. Diving into the data, these appear to be mainly big men who don't have any assisted threes, or three point attempts in general. We update our data to assign a value of 0 to replace all 'N/A' values.
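The combine-and-fill step can be sketched on a toy example. This version uses only base R (`merge` and `Reduce`) rather than the packages used in the supplementary code, and the three small tables, their column names, and their values are hypothetical.

```r
# Three hypothetical stat tables keyed by Player (illustrative values only)
per100   <- data.frame(Player = c("Big A", "Guard B"), PTS = c(22.1, 30.4))
shooting <- data.frame(Player = c("Big A", "Guard B"),
                       pct_3P_assisted = c(NA, 0.62))  # Big A has made no threes
advanced <- data.frame(Player = c("Big A", "Guard B"), PER = c(19.0, 24.5))

# Left-join the tables one after another on the Player column
combined <- Reduce(function(x, y) merge(x, y, by = "Player", all.x = TRUE),
                   list(per100, shooting, advanced))

# Count the players with at least one missing statistic
n_missing <- sum(rowSums(is.na(combined)) > 0)

# Replace every missing value with 0
combined[is.na(combined)] <- 0

print(n_missing)
print(combined)
```

One consequence worth keeping in mind: filling with 0 treats "no attempts" the same as "no makes," which is a simplification rather than a measured value.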

Above is a plot showing the correlation for every pairing of our 56 variables. We have 11 pairs of variables with correlations stronger than 0.95, including Field Goals and Points, Free Throw Attempts and Shooting Fouls Drawn, and Defensive Rebound % and Total Rebound %. For this reason, using K-means clustering directly on our data would overvalue features like scoring, getting to the Free Throw line, and rebounding, since each is effectively counted more than once. Our ultimate goal is to group players based on their unique skills; their stats are simply a numerical representation of these skills that can carry some bias. In order to squeeze these unique skills out of our data, a technique called Principal Component Analysis is performed.
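One way to list the strongly correlated pairs is to scan the upper triangle of the correlation matrix. The sketch below does this on simulated data (the variables and their relationship are invented for the example); the 0.95 threshold matches the one used in the text.

```r
set.seed(1)
n   <- 200
fga <- rnorm(n, mean = 15, sd = 4)

# Simulated stats: PTS is almost a linear function of FGA, AST is unrelated
stats <- data.frame(
  FGA = fga,
  PTS = 1.2 * fga + rnorm(n, mean = 0, sd = 0.5),
  AST = rnorm(n, mean = 5, sd = 2)
)

corrs <- cor(stats)

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(corrs) > 0.95 & upper.tri(corrs), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corrs)[high[, "row"]],
  var2 = colnames(corrs)[high[, "col"]],
  r    = corrs[high]
)

print(high_pairs)
```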

Principal Component Analysis (PCA) reduces the dimensionality of the original data by summarizing it in new variables, called 'Principal Components' (PCs), each of which explains successively less of the variability present. Each PC is uncorrelated with the others, which means we are better able to capture the aspects of players' games that are unique. Using only two PCs (2 transformed variables instead of 56 player statistics), we capture 49% of the total variance in the original data. Because sixteen PCs capture 90% of the total variance, those sixteen will be used to cluster the players rather than the 56 correlated variables.
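The rule of keeping enough components to reach 90% of the variance can be sketched as follows. The example runs on R's built-in `mtcars` data rather than the NBA data, and it takes the component values from the `scores` element returned by `princomp`, which may differ from how they are computed in the supplementary code.

```r
# PCA on the correlation matrix, so every variable is standardized first
pca <- princomp(mtcars, cor = TRUE)

# Share of total variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%
n_comps <- which(cum_var >= 0.90)[1]

# Keep only those components for any later clustering
scores <- pca$scores[, 1:n_comps, drop = FALSE]

cat("Components needed for 90% of the variance:", n_comps, "\n")
print(round(cum_var, 3))
```

Because the retained components are mutually uncorrelated, no single skill is counted several times in the distance calculations that K-means relies on.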

Now that we have reduced our data, the question becomes: how many roles/positions/archetypes should we use? This is partly domain specific; NBA2K, the top-selling basketball video game franchise, utilizes over 90 different archetypes to classify players. We will use a couple of methods, starting with 'Silhouette Score,' to semi-objectively select the number of clusters. For each candidate number of clusters, this method compares how close each observation is to the other members of its own cluster with how close it is to the members of the nearest neighboring cluster; higher scores indicate tighter, better-separated clusters. We want more than five clusters because we want to group players into more than the five traditional positions.

7-11 clusters all perform similarly using this method before a drop off in average silhouette score at 12 clusters.

Next, we use average silhouette width to compare how well separated the generated clusters are for each candidate number of clusters.

The given optimal number of clusters is 2, but classifying players into Type A and Type B tells us next to nothing about their skills. 8 clusters gives us a greater silhouette width than both 6 and 7 clusters before dropping off in average silhouette width. For this reason, we will move forward with classifying players into one of eight archetypes.
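For readers who want to see what the average silhouette width measures, here is a minimal base R sketch that computes it directly from its definition on simulated data. The function name `avg_silhouette` and the two artificial groups of points are assumptions made for the example; the supplementary code relies on the `cluster` and `factoextra` packages instead.

```r
# Average silhouette width of a clustering, computed from its definition.
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) next                      # singleton clusters keep a width of 0
    a <- mean(d[i, own])                     # mean distance to its own cluster
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Two well-separated artificial groups of 30 points each (made-up data)
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))

ks  <- 2:4
sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 20)
  avg_silhouette(x, km$cluster)
})

print(data.frame(k = ks, avg_silhouette = round(sil, 3)))
```

On data like this the score should peak at two clusters, which illustrates why the statistically "optimal" answer can be too coarse to be useful and why domain judgment still enters the choice.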

Results

We run our K-Means clustering algorithm in search of our 8 player archetypes. The 'number' assigned to each cluster has little significance; we will assign names to each cluster in the analysis below.
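Because the cluster numbers returned by K-means are arbitrary, names have to be attached after looking at the result. The sketch below shows that step on artificial data; the three groups, the naming rule, and the names themselves are hypothetical and serve only to illustrate the idea.

```r
set.seed(425)  # k-means starts from random centers, so fix the seed for repeatable labels

# Three artificial groups of 20 points each in two dimensions (made-up data)
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 5),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

mod <- kmeans(x, centers = 3, nstart = 50)

# Attach names only after inspecting the centers. Here the clusters are
# named by the order of their first coordinate.
role_names <- character(3)
role_names[order(mod$centers[, 1])] <- c("Low", "Middle", "High")
role <- role_names[mod$cluster]

print(table(role))
```

Rerunning with a different seed can permute the cluster numbers, so any number-to-name mapping is only valid for the run it was derived from.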

Here is each of the 400 players, colored by assigned cluster. The clusters appear to overlap somewhat because the two-dimensional plot can show only two of the 16 dimensions used for K-Means clustering. The eight chosen archetype names are All-Around Forwards, Ball Dominant Superstars, Low Usage Defensive Specialists, Midrange Scorers, Pass-First Defensive Guards, Prehistoric Bigs, Rebounding Bigs, and Three Point Sharpshooters. Outlier players are highlighted in the plot. Of this season's 27 All-Stars, 22 belong in the Ball Dominant Superstar role. Each archetype is broken down below with its unique traits highlighted.

Rebounding Bigs. These players make up the best rebounders in the league. Clint Capela, Jonas Valanciunas, Enes Kanter, Rudy Gobert, and Deandre Ayton lead the league in Total Rebound % and are each classified as a Rebounding Big. Domantas Sabonis and Gobert were represented in the 2021 All-Star Game. Rebounding Bigs shoot the greatest proportion of their attempts from 3-<10 feet (26%) of any cluster, led by Hassan Whiteside and Thaddeus Young. They are the best finishers at the rim, shooting 71% on shots within 3 feet, and last in Steal % at 1.2%.

Midrange Scorers. This group combines to average the second most Field Goal Attempts, Three Point Attempts, and Points of any cluster. They are the best Free Throw shooters and are confident in their midrange jumper and/or hesitant to get to the basket. All-Star Mike Conley is a member of the Midrange Scorers.

Three Point Sharpshooters. Conveniently grouped as Cluster #3, these players shoot the most threes, and at the highest percentage, of any cluster. With an average shot distance of 18 feet, these players spread the floor, shooting over 10 three point attempts per 100 possessions. They rank lowest of any cluster in Free Throw Rate, all rebounding statistics, Defensive Box Plus-Minus, shots within 10 feet, and dunks.

Pass-First Defensive Guards. With players like Chris Paul, Rajon Rondo, Draymond Green, and Marcus Smart, this cluster goes by the nickname ‘the pests.’ On offense they look to get their teammates involved (and are good at doing it), in part because they struggle with shooting. They may have a tendency to force things, with the worst Turnover Rate (16.7%), Two Point % (47%), True Shooting % (53%), Effective Field Goal % (50%), and finishing at the rim (61%) of any cluster. Two All-Stars, Chris Paul and Ben Simmons, belong in this group.

Low Usage Defensive Specialists. Typically referred to as ‘Three and D’ players, the Low Usage Defensive Specialists are mainly classified by their lack of an offensive game which in turn means their defense is what keeps them in the league. On offense they score the least (13.6 Points per 100 possessions), shoot the worst (42% on Field Goals), and impact the game the least (12.8% Usage) of any cluster. You can find them camping in the corner more than any other cluster, where they shoot only the 5th best percentage (38%). Overall, they perform the worst in every all-encompassing metric examined: Player Efficiency Rating (9.1), Win Shares/48 minutes (0.06), Box Plus-Minus (-2.7), and Value Over Replacement Player (0).

All-Around Forwards. Perhaps the most 'average' archetype is the All-Around Forwards. Collectively, these players finish neither first nor last in any of the 56 stats! They skew slightly offensive, finishing 5th in Offensive Box Plus-Minus (-1.2) and 6th in Defensive Box Plus-Minus (-0.14).

Prehistoric Bigs. The cluster name isn't a dig at these players' ages, but rather an acknowledgement of their playstyles, which are seemingly being phased out of the modern NBA. The Prehistoric Bigs take the fewest Field Goal Attempts and Three Point Attempts, and are the worst Free Throw (64%) and Three Point (18%) shooters. They are, however, the most efficient shooters (61% True Shooting & 59% Effective Field Goal), and lead the league in dunks (25% of all attempts). They do not create these efficient opportunities themselves: 70% of their made two pointers are assisted. For that reason, their offensive efficiency is maximized when paired with Ball Dominant Superstars or Pass-First Defensive Guards. On defense, they are rim protectors, blocking 4.3% of shots.

Ball Dominant Superstars. This cluster is tied for the smallest (33 players, or 8.25% of all players), yet it made up 81.5% of the 2021 All-Star roster. Ball Dominant Superstars are household names, the stars of the past, present, and future. They take and make the most Field Goals and Free Throws, and foul the least. They lead the clusters in Player Efficiency Rating (23.4), Win Shares/48 minutes (0.17), Box Plus-Minus (4.9), and Value Over Replacement Player (2.7). While not necessarily ‘Point Guards’ by the traditional definition, the ball is in their hands often because when they have the ball, good things happen.

Discussion

Now that we have clustered all of our players, we shift our focus to finding the most valuable player types. It is apparent that the Ball Dominant Superstars are the most talented, but with an average 2020-21 salary of over $25 million, is the cost worth the value returned in signing a ‘Cluster 8’ player? To attempt to answer this question, player salary data is copied from Basketball Reference. A source of selection bias to note: 18 of our original 400 players are removed from the upcoming analysis because their salary information was not included in the Basketball Reference list. These players mainly have two-way contracts, meaning that they bounce between the NBA and its Development League. Of the 18, five are Low Usage Defensive Specialists, four are All-Around Forwards, and four are Pass-First Defensive Guards.

Ball Dominant Superstars command the highest average salary, followed by Midrange Scorers, who make $14 million on average this season. Last are the Low Usage Defensive Specialists, who earn only $4,722,045 on average. Our salary analysis will look at the return in each of the four all-encompassing statistics included (Player Efficiency Rating, Value Over Replacement Player, Box Plus-Minus, and Win Shares per 48 minutes) per $1 million in salary spent.
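The "return per $1 million" calculation divides each archetype's average metric by its average salary expressed in millions. A minimal sketch with two hypothetical archetypes (the names and numbers are invented for illustration) looks like this:

```r
# Hypothetical cluster averages (illustrative values only)
clusters <- data.frame(
  clus        = c("Archetype A", "Archetype B"),
  mean_salary = c(25e6, 5e6),
  mean_PER    = c(23, 12),
  mean_BPM    = c(4.5, -2.5)
)

# Return per $1 million: metric * 1,000,000 / average salary
metric_cols <- c("mean_PER", "mean_BPM")
per_million <- clusters
per_million[metric_cols] <- clusters[metric_cols] * 1e6 / clusters$mean_salary

print(per_million)
```

Note how a metric that can be negative, such as Box Plus-Minus, behaves under this ratio: a low salary makes a negative average look even worse per dollar, so such results should be read with care.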

By this metric, it appears that Prehistoric Bigs are the most undervalued for their Player Efficiency Rating. Every $1 million spent on Prehistoric Bigs brings about 2.9 points of Player Efficiency Rating. Despite being the best players, Ball Dominant Superstars appear here to be the most overvalued, returning less than a third as much as the Prehistoric Bigs per dollar spent.

Here we see that Ball Dominant Superstars and Rebounding Bigs bring by far the most value in the Value Over Replacement Player metric. This is an example of why it is useful to look at more than one statistic, even if it is ‘all-encompassing.’

Again, Ball Dominant Superstars and Rebounding Bigs bring the best value, this time for Box Plus-Minus. In fact, these are the only two archetypes with a positive BPM. Low Usage Defensive Specialists look extremely overvalued here despite having the lowest salaries of the clusters: dividing their negative average BPM by a small average salary produces a large negative return per $1 million.

The Prehistoric Bigs and Rebounding Bigs reign supreme in Win Shares per $1 million spent. The Ball Dominant Superstars finish near the bottom in Win Shares value. Overall, the Rebounding Bigs finished 2nd in each category suggesting a market inefficiency for these valuable players.

Equally weighting each of the four metrics, many of the clusters appear to be valued in accordance with their production. This is not the case for the Rebounding Bigs or the Prehistoric Bigs, both of whom are being significantly underpaid. By our analysis, Rebounding Bigs are much more valuable than their $8,873,321 average salary, which ranked 4th out of the eight clusters. Some talented Rebounding Bigs making less than their average salary include Michael Porter Jr., John Collins, Enes Kanter, and James Wiseman. Teams would benefit from targeting this archetype in both free agency and the draft to maximize their talent.
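The equal-weight comparison amounts to ranking the archetypes on each per $1 million metric and averaging the four ranks. The sketch below shows one way to compute that average; the three archetypes and their values are hypothetical, and the supplementary code enters the final average ranks by hand rather than computing them this way.

```r
# Hypothetical returns per $1 million for three archetypes (illustrative values only)
returns <- data.frame(
  clus = c("A", "B", "C"),
  PER  = c(0.9, 2.4, 1.5),
  VORP = c(0.10, 0.02, 0.08),
  BPM  = c(0.18, -0.50, 0.05),
  WS48 = c(0.007, 0.012, 0.015)
)

# Rank each metric so that 1 is the best (highest) return
metric_cols <- c("PER", "VORP", "BPM", "WS48")
ranks <- sapply(returns[metric_cols], function(x) rank(-x))

# Equal weighting: average the four ranks for each archetype
returns$avg_rank <- rowMeans(ranks)

print(returns[order(returns$avg_rank), c("clus", "avg_rank")])
```

A lower average rank means better value per dollar across the four metrics.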

In this project we grouped NBA players into archetypes beyond the five traditional positions in an attempt to gain more understanding of how the individuals actually play. We advise that teams invest in ‘Big Men,’ whether Rebounding Bigs, Prehistoric Bigs, or both, in order to maximize talent subject to the salary cap. This project could be improved by using more tracking data on variables like isolation possessions per game, strength of matchup on defense, deflections, etc. Additionally, the interaction of player types within lineups is of importance to teams and is worth investigating. This report does not intend to be a final solution to the NBA market but may serve as guidance to NBA Front Offices, players, and agents.

Supplementary Materials (R Code)

# Data Cleaning
library(readxl)
one <- read_excel("Stat437_data.xlsx", sheet = "Sheet1")
two <- read_excel("Stat437_data.xlsx", sheet = "Sheet2")
thr <- read_excel("Stat437_data.xlsx", sheet = "Sheet3")
fou <- read_excel("Stat437_data.xlsx", sheet = "Sheet4")
fiv <- read_excel("Stat437_data.xlsx", sheet = "Sheet5")
library(dplyr)

# Per 100 possession statistics
one <- one[2:401,]
one <- one %>% select(Player, FG, FGA, 'FG%','3P','3PA','3P%','2P','2PA','2P%',FT,FTA,'FT%',PF,PTS)

# Advanced statistics
two <- two[2:401,]
two <- two %>% select(Player,PER,'TS%','3PAr',FTr,'ORB%','DRB%','TRB%','AST%', 'STL%', 'BLK%', 'TOV%', 'USG%','WS/48',OBPM,DBPM,BPM,VORP)

# Play-by-Play statistics
thr <- thr[2:401,]
thr <- thr %>% select(MP,Player,BadPass,LostBall,Shoot...17,Off....18,Shoot...19,Off....20,PGA,And1,Blkd)
Mp <- as.numeric(thr$MP)
thr[,3:11] <- thr[,3:11]*36/Mp
thr <- thr[,2:11]

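# Shooting
# Note: Excel converted the '3-10' and '10-16' distance headers into date
# serial numbers (44265 and 44485), so those columns are selected under the
# converted names
fou <- fou[2:401,]
fou <- fou %>% select(Player,Dist.,"2P...11","0-3...12", "44265...13","44485...14","16-3P...15",
    "0-3...19","44265...20", "44485...21","16-3P...22","2P...25","3P...26","%FGA","%3PA","3P%")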

# Adjusted Shooting
fiv <- fiv[2:401,]
fiv <- fiv %>% select(Player,eFG)

# Combine datasets
library(purrr)
nba <-list(one,two,thr,fou,fiv) %>%
  reduce(left_join, by='Player')
nas <- nba[rowSums(is.na(nba)) > 0,]
dim(nas)

# Set N/A’s to 0
nba[is.na(nba)] <- 0

# Correlation Plot
library(corrplot)
corrs <- cor(nba[,2:57])
corrplot(corrs, method="color", type='upper', diag=FALSE, tl.srt=45)

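# Principal Component Analysis
# princomp() is part of base R's stats package, so no extra library is needed
nba <- as.data.frame(nba)
pca <- princomp(nba[,2:57], cor=TRUE)
summary(pca)
plot(pca$sdev^2, type='l', xlab='Number of Principal Components', ylab='Variance Explained')
nonames <- nba[,2:57]
ncomps <- 16
# Project the statistics onto the loadings and keep the first 16 components.
# Note: this multiplies the raw (uncentered, unscaled) statistics by the loadings;
# pca$scores holds the projection of the centered and scaled data, which is not the same thing.
prin.comp <- as.matrix(nonames) %*% pca$loadings
reduced <- prin.comp[,1:ncomps]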

# Cluster Selection
library(factoextra)
library(cluster)
set.seed(425)  # k-means uses random starts; fixing the seed makes the scores reproducible
# Average silhouette width of a k-means solution with k clusters
silhouette_score <- function(k){
  km <- kmeans(reduced, centers = k, nstart=50)
  ss <- silhouette(km$cluster, dist(reduced))
  mean(ss[, 3])
}
k <- 2:40
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
fviz_nbclust(reduced, kmeans, method='silhouette', k.max=30)

# K-Means Clustering
set.seed(425)
mod=kmeans(reduced,8,nstart=50)
dat=data.frame(clus=mod$cluster, nonames)
big <- cbind(dat,nba$Player)
ans <- cbind(big$clus,nba$Player)
reduced <- as.data.frame(reduced)
big$clus <- as.factor(big$clus)
added <- cbind(reduced,big)

# Plotting Clusters
# Cluster numbers from kmeans() are arbitrary; these names match the numbering
# obtained with set.seed(425) above
archetypes <- c('Rebounding Bigs', 'Midrange Scorers', 'Three Point Sharpshooters',
                'Pass-First Defensive Guards', 'Low Usage Defensive Specialists',
                'All-Around Forwards', 'Prehistoric Bigs', 'Ball Dominant Superstars')
big$clus <- archetypes[mod$cluster]
outliers = c('Stephen Curry',"Luka Dončić",'Joel Embiid','Zion Williamson','Andre Drummond','Clint Capela','Dwight Howard','Ed Davis','P.J. Tucker','Wesley Matthews','Patty Mills',"Devonte' Graham","D'Angelo Russell")
library(ggplot2)
library(ggrepel)
ggplot(data=reduced, aes(x=Comp.1,y=Comp.2,color=big$clus))+
  geom_point()+
  labs(x='Component 1',y='Component 2',title='8 NBA Player Clusters',subtitle = '2 PCs explains 49% of the total variance')+
  scale_color_discrete(name="Archetype")+
  theme_gray()+
  geom_label_repel(aes(label =ifelse(nba$Player %in% outliers, nba$Player,'')),hjust=0,vjust=0,    box.padding   = 0.35, point.padding = 0.5, segment.color = 'grey50')

# Plotting each archetype and collecting its average statistics
# (columns 18:73 of `added` hold the 56 original statistics)
plot_archetype <- function(cluster_id, title, subtitle){
  members <- added %>%
    filter(clus == cluster_id)
  p <- ggplot(data=members, aes(x=Comp.1,y=Comp.2))+
    geom_label_repel(aes(label = `nba$Player`), box.padding = 0.35, point.padding = 0.5, segment.color = 'grey50')+
    theme_gray()+
    labs(x='Component 1',y='Component 2',title=title,subtitle=subtitle)
  print(p)
  colMeans(members[,18:73])
}

# One row of average statistics per archetype
statsByPos <- rbind(
  'Rebounding Bigs' = plot_archetype(1, 'Rebounding Bigs (35 total)',
    '11.4% Offensive Rebound Rate, 26.0% Defensive Rebound Rate, 18.6% Total Rebound Rate'),
  'Midrange Scorers' = plot_archetype(2, 'Midrange Scorers (59 total)',
    '83% Free Throw %, 10% of Attempts from 16 feet-<3 pointer'),
  'Three Point Sharpshooters' = plot_archetype(3, 'Three Point Sharpshooters (66 total)',
    '37% Three Point %, 59% Three Point Attempt Rate'),
  'Pass-First Defensive Guards' = plot_archetype(4, 'Pass-First Defensive Guards (38 total)',
    '2.0% Steal %, 16.2 Points Generated from Assists/36 min, 0.41 Offensive Fouls Drawn/36 min'),
  'Low Usage Defensive Specialists' = plot_archetype(5, 'Low Usage Defensive Specialists (69 total)',
    '96% of Made Threes Assisted, 39% of Threes from the Corner'),
  'All-Around Forwards' = plot_archetype(6, 'All-Around Forwards (67 total)',
    'Super Average'),
  'Prehistoric Bigs' = plot_archetype(7, 'Prehistoric Bigs (33 total)',
    '60% Two Point %, 60% of attempts within 3 feet, 0.62 Defensive Box Plus-Minus'),
  'Ball Dominant Superstars' = plot_archetype(8, 'Ball Dominant Superstars (33 total)',
    '36 Points per 100 possessions, 30% Usage, 30% Assist Rate')
)

# Discussion Section
nbaSalaries <- read_excel("nbaSalaries.xlsx")
salaries <- nbaSalaries %>% 
  dplyr::select(Player,`2020-21`)
sal_analysis <- big %>%
  dplyr::select(clus,PER, VORP, WS.48,BPM,`nba$Player`)
sal_analysis <- left_join(sal_analysis,salaries, by=c(`nba$Player`='Player'))
sal_analysis$`2020-21` <- as.numeric(sal_analysis$`2020-21`)
sal_analysis <- sal_analysis[rowSums(is.na(sal_analysis)) ==0,]

means <- sal_analysis %>% group_by(clus) %>%
 summarize(mean_salary = mean(`2020-21`),mean_PER = mean(PER),mean_VORP=mean(VORP),
           mean_BPM = mean(BPM),mean_WS.48 = mean(WS.48))%>%
  arrange(desc(mean_salary))

# Plotting Average Salary by Cluster
options(scipen = 10)
ggplot(data=means, aes(x=clus,y=mean_salary,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average 2020-21 Salary',fill='Archetype', title='Salary by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

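# Converting the 4 metrics into per $1 million
# (each cluster average is multiplied by 1,000,000 and divided by the cluster's average salary)
means <- means %>%
  mutate(across(c(mean_PER, mean_VORP, mean_BPM, mean_WS.48), ~ .x * 1000000 / mean_salary))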

# Plotting PER per $1 million
ggplot(data=means,aes(x=clus,y=mean_PER,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average PER per $1,000,000 in salary',fill='Archetype', title='Player Efficiency Rating return by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

# Plotting VORP per $1 million
ggplot(data=means,aes(x=clus,y=mean_VORP,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average VORP per $1,000,000 in salary',fill='Archetype', title='Value Over Replacement Player return by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

# Plotting BPM per $1 million
ggplot(data=means,aes(x=clus,y=mean_BPM,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average BPM per $1,000,000 in salary',fill='Archetype', title='Box Plus-Minus return by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

# Plotting WS/48 per $1 million
ggplot(data=means,aes(x=clus,y=mean_WS.48,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average WS/48 per $1,000,000 in salary',fill='Archetype', title='Win Shares per 48 minutes Player return by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

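# Plotting average rank
# Average of each archetype's rank across the four per-$1-million metrics,
# entered by hand in the row order of `means` (descending average salary)
ranked = c(4.25, 5.25,5,2,5,5,3.25,6.25)
means=cbind(means,ranked)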
ggplot(data=means,aes(x=clus,y=ranked,fill=clus))+
  geom_bar(stat='identity')+
  labs(x='',y='Average Rank of 4 Metrics',fill='Archetype', title='Average Rank of 4 Metrics by Archetype')+
  theme(axis.text.x=element_blank(), axis.ticks.x =element_blank())

Works Cited

“2020-21 NBA Player Contracts.” Basketball Reference, www.basketball-reference.com/contracts/players.html.

“2020-21 NBA Player Stats: Per 100 Possessions.” Basketball Reference, www.basketball-reference.com/leagues/NBA_2021_per_poss.html.

Cheng, Alex. “Using Machine Learning to Find the 8 Types of Players in the NBA.” Medium, Fastbreak Data, 9 Mar. 2017, medium.com/fastbreak-data/classifying-the-modern-nba-player-with-machine-learning-539da03bb824.

“Defining NBA Players by Role with k-Means Clustering.” Dribble Analytics, 26 Apr. 2019, dribbleanalytics.blog/2019/04/positional-clustering/.

Hussain, Haider. “Using K-Means Clustering Algorithm to Redefine NBA Positions and Explore Roster Construction.” Medium, Towards Data Science, 8 Nov. 2019, towardsdatascience.com/using-k-means-clustering-algorithm-to-redefine-nba-positions-and-explore-roster-construction-8cd0f9a96dbb.

James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2021.

Kalman, Samuel, and Jonathan Bosch. NBA Lineup Analysis on Clustered Player Tendencies: A New Approach to the Positions of Basketball Modeling Lineup Efficiency of Soft Lineup Aggregates. global-uploads.webflow.com/5f1af76ed86d6771ad48324b/5f6a65517f9440891b8e35d0_Kalman_NBA_Line_up_Analysis.pdf.

Kassambara, Alboukadel. “Determining The Optimal Number Of Clusters: 3 Must Know Methods.” Datanovia, 21 Oct. 2018, www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/.

Matt.0. “10 Tips for Choosing the Optimal Number of Clusters.” Medium, Towards Data Science, 28 Jan. 2019, towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92.

Schoch, David. “Analyzing NBA Player Data II: Clustering Players.” Analyzing NBA Player Data II: Clustering Players · David Schoch, 4 Mar. 2018, blog.schochastics.net/post/analyzing-nba-player-data-ii-clustering/.
