The goal of this mini project is to visualize player aging curves in college baseball. It is assumed that players improve from their freshman to sophomore to junior to senior year, but does that hold up in the data? How much should we expect them to improve? This can help us answer questions like “Should we draft the .350-hitting senior or the freshman with a .325 batting average?” Both players are above average, and the senior is the “better” hitter all else equal, but what would we expect the freshman to turn into by his senior year?
This analysis is conducted with the aid of Nathan Blumenfeld’s NCAA baseball Streamlit app, which contains statistics on every NCAA Division I baseball player dating back to the 2013 season. We start by downloading the rate statistics of every hitter with at least 100 plate appearances from five seasons: 2016-2019 and 2021. These are the five most recent complete seasons, and together they give us a sample of over 14,000 observations. You can download the files from the GitHub repository. We combine the tables after adding a column for the year to each.
library(tidyverse)
rate16 = read_csv('2016_batting_rate_stat_leaders_batting_all_all_minPA_100.csv')
rate17 = read_csv('2017_batting_rate_stat_leaders_batting_all_all_minPA_100.csv')
rate18 = read_csv('2018_batting_rate_stat_leaders_batting_all_all_minPA_100.csv')
rate19 = read_csv('2019_batting_rate_stat_leaders_batting_all_all_minPA_100.csv')
rate21 = read_csv('2021_batting_rate_stat_leaders_batting_all_all_minPA_100.csv')
rate16$Year = 2016
rate17$Year = 2017
rate18$Year = 2018
rate19$Year = 2019
rate21$Year = 2021
bigRate = rbind(rate16,rate17, rate18, rate19, rate21)
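The same load-and-combine step can also be written as a loop over seasons. This is only a sketch: `read_seasons` is a hypothetical helper name, and it assumes the CSVs follow the file-name pattern used above and live in a single directory. It uses `bind_rows()` rather than `rbind()`, which fills in `NA` instead of failing if one season's file has an extra column.

```r
library(tidyverse)

# Hypothetical helper: read one CSV per season from `dir`, tag each table
# with its season, and stack them into a single data frame.
read_seasons <- function(years, dir = ".") {
  files <- file.path(
    dir,
    paste0(years, "_batting_rate_stat_leaders_batting_all_all_minPA_100.csv")
  )
  tables <- map2(files, years, function(file, year) {
    read_csv(file, show_col_types = FALSE) %>% mutate(Year = year)
  })
  bind_rows(tables)
}

# Usage, assuming the five CSVs sit in the working directory:
# bigRate <- read_seasons(c(2016:2019, 2021))
```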
Let’s have a look at how our data is correlated.
library(corrr)
library(paletteer)
numerics = bigRate %>% select_if(is.numeric) %>% correlate()
numerics %>%
stretch() %>%
ggplot(aes(x,y,fill=r))+
geom_tile()+
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_paletteer_c("viridis::magma", limits = c(-1, 1), direction = 1)
The only column with many negative correlations is strikeout percentage, which makes sense because it is the only metric here where lower is better. ‘Year’ has slight correlations with strikeout percentage and home run percentage, indicating a possible shift in the NCAA game.
Now let’s move on to visualizing our aging curves. We start by scaling the data so we can compare the metrics on a common scale.
scaled = bigRate %>%
select_if(is.numeric) %>%
scale()
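To make the scaling step concrete, here is a small check on toy numbers (not real player data). By default `scale()` subtracts each column's mean and divides by its standard deviation, so a scaled value of 0 is exactly average and a value of 1 is one standard deviation above average.

```r
# Toy batting averages, made up for illustration only.
x <- c(0.250, 0.300, 0.350)

# scale() returns a one-column matrix; flatten it to a plain vector.
z_scale <- as.numeric(scale(x))

# The same z-scores computed by hand.
z_manual <- (x - mean(x)) / sd(x)

# TRUE when the two approaches agree (up to floating-point tolerance).
all.equal(z_scale, z_manual)
```

This is why the class-level means later on can be read as "standard deviations above or below the average hitter in the sample".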
Next we convert the scaled data back to a dataframe and add in the students’ class standing (freshman, sophomore, junior, or senior).
df = as.data.frame(scaled)
df$Yr = bigRate$Yr
Lastly, we filter out rows with a missing class standing (recorded as 'N/A') and summarize each metric by its mean scaled value within each class.
averages = df %>%
filter(Yr != 'N/A') %>%
group_by(Yr) %>%
summarise(PA=mean(PA),
wOBA=mean(wOBA),
wRC = mean(wRC),
wRAA = mean(wRAA),
OPS=mean(OPS),
OBP=mean(OBP),
SLG=mean(SLG),
BA = mean(BA),
ISO = mean(ISO),
BABIP = mean(BABIP),
`K%` = mean(`K%`),
`BB%` = mean(`BB%`),
`HR%` = mean(`HR%`))
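If typing out every metric feels repetitive, `across()` can apply `mean` to all numeric columns at once. The sketch below runs on a made-up stand-in for `df` (the `toy` values are placeholders, not results), and note one difference from the code above: it would also average any other numeric column, such as the scaled `Year`.

```r
library(tidyverse)

# Toy scaled data standing in for `df`; the values are made up.
toy <- tibble(
  Yr = c("Fr", "Fr", "Sr", "Sr", "N/A"),
  wOBA = c(-1, -0.5, 0.5, 1, 0),
  `K%` = c(0.8, 0.4, -0.2, -0.6, 0)
)

# Drop rows with no class standing, then average every numeric column by class.
toy_averages <- toy %>%
  filter(Yr != "N/A") %>%
  group_by(Yr) %>%
  summarise(across(where(is.numeric), mean))
```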
From there we create a function to visualize the aging curve for a given metric.
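get_chart <- function(metric) {
  # Capture the column name as typed (e.g. wOBA) to use as the plot title
  metric_label = deparse(substitute(metric))
  averages %>%
    # {{ metric }} forwards the argument into aes(), so a bare column name
    # such as wOBA is looked up inside `averages`
    ggplot(aes(x=Yr, y={{ metric }}, fill=Yr))+
    # Force the class order instead of the default alphabetical order
    scale_x_discrete(limits=c('Fr', 'So','Jr','Sr'))+
    geom_bar(stat='identity')+
    theme(legend.position = "none")+
    labs(x='', y='', title=metric_label)
}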
And finally we use that function to visualize all of the metrics on the same graphic.
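One way to get every metric onto a single graphic is to reshape the class-level means to long format and facet by metric. To be clear, this is a swapped-in alternative to combining the per-metric charts, not necessarily how the original figure was built, and `toy_averages` below is a made-up stand-in for `averages` with placeholder values.

```r
library(tidyverse)

# Toy class-level means standing in for `averages`; the values are made up.
toy_averages <- tibble(
  Yr = c("Fr", "So", "Jr", "Sr"),
  wOBA = c(-0.3, 0.0, 0.15, 0.1),
  `K%` = c(0.2, 0.0, -0.1, -0.1)
)

# Reshape to one row per (class, metric) pair and fix the class order.
long <- toy_averages %>%
  pivot_longer(-Yr, names_to = "metric", values_to = "scaled_mean") %>%
  mutate(Yr = factor(Yr, levels = c("Fr", "So", "Jr", "Sr")))

# One panel per metric, all sharing the same axes.
aging_plot <- ggplot(long, aes(x = Yr, y = scaled_mean, fill = Yr)) +
  geom_col() +
  facet_wrap(~metric) +
  theme(legend.position = "none") +
  labs(x = "", y = "Mean scaled value")

# print(aging_plot) draws the figure.
```

Faceting keeps every panel on the same y-axis, which makes it easy to compare the size of the class effect across metrics.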
We see that the growth curves are nearly identical for each “positive” metric, indicating that skills are impacted uniformly by age. Freshman hitters struggle mightily, sophomores are about average, and juniors and seniors are similarly above average. The biggest jump comes from freshman to sophomore year, with another tangible jump to junior year. Interestingly, the seniors in this sample perform very similarly to the juniors and in most cases drop off slightly. I’m guessing this is due to a mixture of the aging curve flattening and selection bias. It’s possible that many successful juniors forgo their senior year of NCAA play and jump to MLB clubs. They may well have kept improving, but they are not part of the sample of seniors present here.
A natural question is: if the juniors and seniors perform so much better than the freshmen, why are the freshmen playing at all? This may be to develop the young talent. The fact that they reach 100 plate appearances in a season despite their struggles speaks to the level of trust that coaches have in these players. Coaches may see them as a “project” worth developing, accepting weaker results in the short run for improvement later.
BABIP, or ‘batting average on balls in play’, is the only metric here that does not follow the same growth curve. In fact, sophomores have the best BABIP and seniors are actually below average! I’m perplexed by this finding and have a few ‘pie in the sky’ theories, but I will leave that up to the reader to think about :-)