Using data from the 2005-06 through 2020-21 seasons, this guided tutorial analyzes conference disparity in the NBA. How strong is the disparity? Is it statistically significant? Is it getting worse?
The linked GitHub repository contains two tables. 'combined_standings.csv' contains team-by-season information (one row for each team in each season going back to '05-'06) with variables for wins, losses, win percentage, points scored and allowed per game, and a playoff indicator.
'combined_team_vs_team_records.csv' has one row per team per season, with columns holding that team's matchup record against every team in the NBA that season. For example, the 2016-17 Golden State Warriors won all four regular season matchups against the Los Angeles Clippers, so the 'LAC' column of the 2016 Warriors row contains the value '4-0'. Let's load the data into R.
library(tidyverse)
standings <- read_csv("combined_standings.csv")
team_v_team <- read_csv("combined_team_vs_team_records.csv")
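Before diving in, we can take a quick look at the columns of each table with glimpse():
glimpse(standings)
glimpse(team_v_team)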
Let's get the overall regular season win percentage of Western Conference teams against the East in our 16-year window.
'team_v_team' is useful here, but we need the conference variable from the standings table. Let's combine the tables so all of our information is in one place, joining on team and season.
df = left_join(standings, team_v_team, by = c('bb_ref_team_name', 'season'))
Let's begin by filtering out the Eastern Conference rows, keeping only the West.
WestData = df %>%
filter(conference=='West')
We get the West team names from the 'WestData' table, then remove the columns for West vs. West matchups to isolate West vs. East.
westTeams = unique(WestData$team_short)
westVeast = WestData[,!(names(WestData) %in% westTeams)]
# Clean this up by isolating the 15 West vs. East matchup columns and converting NAs to '0-0'
westVeast = westVeast[12:26]
westVeast[is.na(westVeast)] = '0-0'
Initialize wins and losses vectors
wins=c()
losses=c()
The matchup records are stored as character strings like '4-0'. For each value, we take the first and third characters (wins and losses, respectively), convert them to numbers, and sum across the row to get a West team's total wins and losses against the Eastern Conference in a given season.
for (i in c(1:240)){ # 240 rows = 15 West teams x 16 seasons
wins[i] = sum(as.numeric(substring(westVeast[i,1:15],1,1)))
losses[i] = sum(as.numeric(substring(westVeast[i,1:15],3,3)))
}
Get the total wins and losses and find the win percentage.
westVeastWinPct = sum(wins)/sum(wins+losses)
round(westVeastWinPct,3)*100
From 2005-06 to 2020-21, Western Conference teams won 55.7% of regular season matchups against Eastern Conference opponents.
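As a cross-check, the same number can be reached with a tidier pipeline. This is just a sketch; it assumes the matchup columns are named by the opponents' 'team_short' abbreviations, which is what we relied on when dropping the West columns above.
eastTeams = setdiff(unique(df$team_short), westTeams)
WestData %>%
  select(season, all_of(eastTeams)) %>%
  pivot_longer(all_of(eastTeams), names_to = 'opponent', values_to = 'record') %>%
  replace_na(list(record = '0-0')) %>%
  separate(record, into = c('w', 'l'), sep = '-', convert = TRUE) %>%
  summarise(win_pct = sum(w) / sum(w + l))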
Let's get a sense of the consistency of the inequality.
We can shove the wins and losses vectors into the West table.
WestData$wins = wins
WestData$losses = losses
We group by season and get the total wins and losses for the West against the East. If the West outperforms the East, we set 'westbetter' to 1.
yearByyear = WestData %>%
group_by(season) %>%
summarise(WestvEastwins = sum(wins),
WestvEastlosses = sum(losses)) %>%
mutate(westbetter = ifelse(WestvEastwins > WestvEastlosses, 1, 0))
Then we can sum 'westbetter' to find how many of the past 16 seasons the West has come out ahead.
sum(yearByyear$westbetter)
Western Conference teams have won more regular season games than the East in 15 out of the last 16 seasons.
Let's see when the disparity was most extreme. We define 'disparity' as the absolute difference between inter-conference wins and losses and sort by it.
yearByyear %>%
mutate(disparity = abs(WestvEastwins - WestvEastlosses)) %>%
arrange(desc(disparity))
season WestvEastwins WestvEastlosses westbetter disparity
2013 284 166 1 118
2014 263 187 1 76
2012 262 188 1 74
2010 261 189 1 72
2007 258 192 1 66
2006 257 193 1 64
2005 252 198 1 54
2018 252 198 1 54
2009 246 204 1 42
2011 156 114 1 42
2016 246 204 1 42
2020 242 208 1 34
2019 211 178 1 33
2017 237 213 1 24
2015 232 218 1 14
2008 219 231 0 12
The conference disparity was the most extreme in the 2013-14 season when Western Conference teams won 118 more games than Eastern Conference teams.
Despite the disparity, each conference sends the same number of teams to the playoffs each year. Let's look at the threshold to make the playoffs in each conference: for every season we find the win percentage of the worst-performing playoff team in each conference, then average those values. We can use our standings table for this. We filter to playoff teams, group by conference and season, then take the minimum win percentage.
lowestWpct = standings %>%
filter(playoffs=='Yes') %>%
group_by(conference, season) %>%
select(season, conference, bb_ref_team_name, win_pct) %>%
summarise(team = bb_ref_team_name[which.min(win_pct)], winpct = min(win_pct)*100)
Group by conference and take the average. This is the average win percentage threshold to make the playoffs in each conference.
lowestWpct %>%
group_by(conference) %>%
summarise(avgLowestWinPct = round(mean(winpct),1))
From 2005-06 to 2020-21, the last seed to make the playoffs from the Eastern Conference won 48.5% of their regular season games on average. This mark is 55.1% in the West.
Let's visualize Western Conference success against the East, split by 'Playoff' and 'Lottery' teams. We already have the West data from above; let's grab the East data.
east = df %>%
filter(conference=='East')
Into the West vs East matchups, we add playoff and season information.
westVeast$season = WestData$season
westVeast$playoffs = WestData$playoffs
One wrinkle: the Bulls and Hornets columns appear in the opposite order from their rows. Swap them so the column order matches the row order.
westVeast = westVeast[,c(1:3,5,4,6:17)]
Make a copy to get messy.
wVEastPlayoffs = westVeast
We iterate through rows, nested within seasons. We start by isolating games against Eastern Conference playoff teams: we set the head-to-head record against Eastern Conference lottery teams to '0-0', so that those games are not counted when we sum across rows.
for (yearI in seq(1,240,by=15)){ # step through seasons in blocks of 15 rows
for (i in c(0:14)){ # each East team within the season
if(east$playoffs[i+yearI] == 'No'){
# zero out this season's West records against this East lottery team
wVEastPlayoffs[seq(yearI,yearI+14),i+1] = '0-0'
}
}
}
Next we do a row sum for wins and losses.
for (i in c(1:240)){
wVEastPlayoffs$wins[i] = sum(as.numeric(substring(wVEastPlayoffs[i,1:15],1,1)))
wVEastPlayoffs$losses[i] = sum(as.numeric(substring(wVEastPlayoffs[i,1:15],3,3)))
}
We can now find out how the Western Conference (grouped by playoffs/lottery) did against Eastern Playoff teams.
wVEastPlayoffs = wVEastPlayoffs %>%
group_by(playoffs) %>%
mutate(East='Playoffs', winpct = wins/(wins+losses)) %>%
select(wins, losses, East, winpct)
Make another copy for the non-playoff (lottery) East teams.
wvEastLotto = westVeast
Do the same as above, except set the record to '0-0' against Eastern playoff teams.
for (yearI in seq(1,240,by=15)){
for (i in c(0:14)){
if(east$playoffs[i+yearI] == 'Yes'){
wvEastLotto[seq(yearI,yearI+14),i+1] = '0-0'
}
}
}
Sum across.
for (i in c(1:240)){
wvEastLotto$wins[i] = sum(as.numeric(substring(wvEastLotto[i,1:15],1,1)))
wvEastLotto$losses[i] = sum(as.numeric(substring(wvEastLotto[i,1:15],3,3)))
}
Create a table of how the Western Conference (grouped by playoffs/lottery) did against Eastern lottery teams.
wvEastLotto = wvEastLotto %>%
group_by(playoffs) %>%
mutate(East='Lotto', winpct = wins/(wins+losses)) %>%
select(wins, losses, East, winpct)
Let's combine tables for our visualization.
fullwVe = rbind(wvEastLotto, wVEastPlayoffs)
Create faceted boxplots with jittered points covering all four scenarios. The red dashed line indicates a 50/50 toss-up.
fullwVe %>%
mutate(playoffs = ifelse(playoffs=='Yes','Western Playoff Teams','Western Lottery Teams')) %>%
ggplot(aes(x=East,y=winpct,fill=playoffs))+
labs(title='West teams Win% Against East teams (2005-2020)', x='Eastern Team Outcomes',y='')+
geom_boxplot(fill='#007AC1')+
geom_jitter(width=.2,alpha=.4,color='#EF3B24')+
facet_wrap(~playoffs)+
geom_hline(yintercept=.5, color='red', linetype='dashed')+
theme(legend.position = 'none', plot.title = element_text(hjust=.5))+
scale_y_continuous(labels=scales::percent)
Over the last 16 NBA seasons, inter-conference regular season matchups between two eventual playoff teams, and between two eventual lottery teams, have swung in favor of the Western Conference. Additionally, the gap between Eastern Conference playoff teams and Western Conference lottery teams has been smaller than the gap between Western Conference playoff teams and Eastern lottery opponents.
Let's analyze strength of schedule by examining average opponents' point differential. We'll use the big dataframe and clean up the NA values.
df[is.na(df)] = '0-0'
Make that pesky Bulls and Hornets column swap.
df = df[,c(1:14,16,15,17:41)]
Initialize average opponent point differential, cumulative opponents' point differential, games played, and times played to 0.
df$avgOppMargin = 0
df$cumulativeoppPM = 0
df$cumulativeGP = 0
timesplayed = 0
We use a triple-nested loop: the outer loop steps through seasons, the middle loop through teams (rows), and the inner loop through opponents (columns). For each opponent, we set 'timesplayed' to the sum of our team's wins and losses against that opponent in the given year and look up that opponent's season point differential. We add 'timesplayed' to our team's cumulative games played and add the opponent's point differential, weighted by 'timesplayed', to the cumulative opponents' point differential. Once we have traversed every opponent for a team in a season, we calculate its strength of schedule as the cumulative opponents' point differential divided by games played; in other words, avgOppMargin = sum(games vs. opponent * opponent point differential) / sum(games vs. opponent).
for (yearI in seq(1,480,30)){ # season blocks of 30 rows (480 = 30 teams x 16 seasons)
for(littleindex in c(0:29)){ # vertical index: the team whose schedule we measure
for(smallerindex in c(0:29)){ # horizontal index: the opponent column
timesplayed = as.numeric(substring(df[yearI+littleindex,smallerindex+12],1,1))+as.numeric(substring(df[littleindex+yearI,smallerindex+12],3,3))
PD = df$points_scored_per_game[yearI+smallerindex]-df$points_allowed_per_game[yearI+smallerindex]
df$cumulativeGP[yearI+littleindex] = df$cumulativeGP[yearI+littleindex] + timesplayed
df$cumulativeoppPM[yearI+littleindex] = df$cumulativeoppPM[yearI+littleindex] + PD*timesplayed
}
df$avgOppMargin[yearI+littleindex] = df$cumulativeoppPM[yearI+littleindex]/df$cumulativeGP[yearI+littleindex]
}
}
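For one season, we can cross-check the loop's output with a short vectorized sketch. It assumes, as the loop does, that the 30 matchup columns occupy positions 12:41 and that their order matches the order of that season's rows; 2013 is used purely as an example season.
season_rows = which(df$season == 2013)
rec = as.matrix(df[season_rows, 12:41]) # 30 x 30 character matrix of head-to-head records
games = matrix(as.numeric(substring(rec,1,1)) + as.numeric(substring(rec,3,3)), nrow=30)
pd = df$points_scored_per_game[season_rows] - df$points_allowed_per_game[season_rows]
as.vector(games %*% pd) / rowSums(games) # should match df$avgOppMargin[season_rows]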
Let's visualize strength of schedule by conference.
df$teamyear = paste(df$team_short, df$season)
Label some outliers.
labelled_teams = df %>%
filter(teamyear %in% c('DAL 2011','LAL 2019','CHI 2019'))
Color the points by conference, label the 'teamyear' of the teams in 'labelled_teams', and add regression lines.
library(ggthemes)
df %>%
ggplot(aes(x=points_scored_per_game-points_allowed_per_game, y=avgOppMargin, color=conference))+
geom_point()+
geom_text(aes(label=teamyear), data=labelled_teams, nudge_y = .06,nudge_x = 1.1,color='black',size=3.5)+
scale_color_manual(values = c('#1d428a','#c8102e'))+
geom_smooth(method='lm',se=F)+
labs(x='Team Point Margin', y='Average Opponent Point Margin', title="Team's Point Margin vs. Average Opponent Point Margin",
color='Conference')+
theme_hc()+
theme(legend.position = 'bottom', plot.title = element_text(hjust=.25))
The points in the plot above represent teams from the last 16 NBA seasons, colored by conference. Teams farther to the right had higher point differentials, and teams higher up faced tougher schedules. We see that Western Conference teams consistently have tougher schedules than Eastern Conference teams. Also, teams with higher point differentials tend to have easier schedules.
The 2011 Mavericks and 2019 Lakers had abnormally difficult schedules based on their average opponents' point margins. I hypothesize that 2011 and 2019 teams appear at the extremes of the y-axis because the NBA lockout and the COVID-shortened season left teams playing fewer than the typical 82 games. I would guess that team point differential tends to stabilize over a full season, so a smaller sample of games leaves room for extreme values. 2019 teams may also be influenced by the NBA Bubble excluding the worst teams, making their strength of schedule inconsistent with a normal season.
Let's make a logistic regression model to predict the chance a team makes the playoffs given their win percentage. First we make playoffs a factor variable.
df$playoffs = as.factor(df$playoffs)
logreg = glm(playoffs ~ win_pct, data=df, family = 'binomial')
With the model let's find the probability that a .500 team will make the playoffs. We set beta0 and beta1 to the model's coefficients.
b0 = logreg$coefficients[1]
b1 = logreg$coefficients[2]
Use the logistic function, plugging in 0.50 for win percentage.
round(exp(b0+b1*.5)/(1+exp(b0+b1*.5))*100,1)
We expect a team that wins as many games as it loses to make the playoffs about 58% of the time.
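Equivalently, we can let predict() compute this probability directly from the fitted model; it should return the same value.
predict(logreg, newdata = data.frame(win_pct = 0.5), type = 'response')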
Let's add an 'is_west' indicator variable to see how this probability changes, conditioned on a team's conference.
df$is_west = as.factor(ifelse(df$conference=='West',1,0))
Construct a new model trained on win percentage and conference.
logreg2 = glm(playoffs ~ win_pct + is_west, data=df, family = 'binomial')
summary(logreg2)
Get the 'is_west' coefficient.
round(logreg2$coefficients[3],3)
Update coefficients.
b0 = logreg2$coefficients[1]
b1 = logreg2$coefficients[2]
b2 = logreg2$coefficients[3]
Probability given 50% win and East.
round(exp(b0+b1*.5)/(1+exp(b0+b1*.5))*100,1)
Probability given 50% win and West.
round(exp(b0+b1*.5+b2)/(1+exp(b0+b1*.5+b2))*100,1)
The expected log odds of making the playoffs decrease by 2.875 with a move to the Western Conference, controlling for win percentage. An Eastern Conference .500 team will make the playoffs 83.4% of the time. Western Conference .500 teams will make the playoffs only 22.1% of the time.
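As a cross-check, the same two probabilities can be pulled straight from 'logreg2' with predict():
newdata = data.frame(win_pct = 0.5, is_west = factor(c(0, 1)))
predict(logreg2, newdata = newdata, type = 'response')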
Let's investigate the possibility that the magnitude of this log odds effect is explained by randomness. We use a Monte Carlo simulation: over 10,000 iterations we randomly reassign conference labels so that a random group of 15 teams makes up the "Eastern Conference" while the other 15 teams make up the "West". For each iteration, we fit a new logistic model and record the conference effect on the log odds of making the playoffs. If the observed result were likely a product of randomness, we would see many random iterations where the conference effect is greater in magnitude than 2.875, the observed value.
We set the seed for reproducibility.
set.seed(425)
Initialize a matrix of 10,000 alternate universes that will store the random East/West assignments.
altUniverses = matrix(0,10000,30)
Randomly assign the indices 1:30 to West or East, making sure exactly 15 teams land in each conference.
for (sim in c(1:10000)){
newWest = sample(1:30,15)
altUniverses[sim,newWest] = 1
}
Create a copy of the dataframe.
simdf = df
Initialize the empty vector of is_west coefficients.
Westcoefs = c()
For 10,000 iterations, set the conference to the randomized sim.
for (sim in c(1:10000)){
simdf$conference = rep(altUniverses[sim,],16)
# Fit a logistic regression with win% and the randomized conference label
logreg3 = glm(playoffs ~ win_pct + conference, data=simdf, family = 'binomial')
# Store the conference coefficient
Westcoefs[sim] = logreg3$coefficients[3]
}
Make our coefficients a dataframe for ggplot usage.
Westcoefs = as.data.frame(Westcoefs)
Let's visualize the coefficients of the 10,000 simulations.
Set the Style to 'Simulation' (to contrast with the actual observed coefficient).
Westcoefs$Style = 'Simulation'
Add the observed coefficient to the table.
Westcoefs = Westcoefs %>%
rbind(c(logreg2$coefficients[3],'Actual'))
These coefficients should be numeric.
Westcoefs$Westcoefs = as.numeric(Westcoefs$Westcoefs)
Make the plot. Our observed coefficient is far more extreme than any simulated coefficient, indicating we can be very confident the results are not due to chance. This points to a higher playoff threshold for Western Conference teams.
Westcoefs %>%
ggplot(aes(x=Westcoefs, fill=Style))+
geom_histogram(bins = 100)+
geom_segment(aes(y=75, xend=-2.75, x=-2.5,yend=20),
arrow = arrow(length=unit(.5, 'cm')),
color='red')+
annotate('text', x=-2,y=90, label='Observed Coefficient Size!', size=3)+
labs(x='Western Conference Log Odds Effect on Playoff Chances (Controlled for Win%)',
y='Count',
title='Extremely Unlikely that Western Conference Difficulty is due to Chance',
subtitle = 'More significant than 10,000 Monte Carlo Simulations')+
theme_hc()+
theme(legend.position = 'none', plot.title = element_text(hjust=.5), plot.subtitle = element_text(hjust=.5))
The hypothesis that a higher regular season winning percentage is needed to reach the playoffs in the Western Conference than in the Eastern Conference is supported by the last 16 seasons of data. Our observed coefficient is more extreme than all 10,000 Monte Carlo simulations, indicating that the probability the observed playoff thresholds are due to random chance is essentially zero.
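One way to put a number on that is an empirical p-value: the share of simulated coefficients at least as large in magnitude as the observed one, computed from the objects built above.
mean(abs(Westcoefs$Westcoefs[Westcoefs$Style == 'Simulation']) >= abs(logreg2$coefficients[3]))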