Combine Analysis

Overview

A good draft is a big part of doing well in Fantasy Football. Every year, Fantasy Football pundits write droves of articles on “sleeper” picks, or players that aren’t on most people’s draft sheets. Many of these players are untested rookies drafted straight from college. So how do you differentiate the rookies who will remain sleepers from those who will propel your fantasy team to victory? One place to start is the NFL combine.

Indeed, I’m not the first person to investigate the relationship between combine stats/physical attributes and player performance. NFL teams do it every season, as do fellow data nerds (see here). However, prior analyses have focused on performance over a longer period of time, such as a player’s first three years in the league. This makes sense from the perspective of an NFL team, since rookies are typically signed to contracts longer than one year. In contrast, most fantasy football leagues operate like one-year contracts; you draft a new team at the beginning of each year. Thus, in the case of NFL rookies, I’m interested exclusively in first-year NFL performance. Can we learn anything from the combine that isn’t already baked into a rookie’s draft position?

NFL Combine Performance

The NFL combine is like the SAT or ACT except for football, with more lifting and less clothing. It’s a chance for NFL teams to see how fast and strong NFL hopefuls are in a controlled environment. During the combine players typically complete the following events:

  • 40 yard dash
  • Number of reps of 225lbs on the bench press (my number is 0)
  • How high someone can jump (vertical leap)
  • How far someone can jump (broad jump)
  • Basically run back and forth between two lines as fast as possible (shuttle)
  • Again more running between cones (3cone)

In addition to these drills, height and weight are also measured. This data gives NFL teams a holistic perspective from which to evaluate players. Despite a number of findings suggesting that this information has limited utility for predicting actual NFL performance, the difference between a 4.4 second and a 4.6 second 40 yard dash can mean big differences in draft position.

Draft Position

Most rookies in the NFL enter a team through the draft. A player’s draft position is a reasonable proxy for their perceived value. If a rookie is believed to be a game-changer, then they’ll have a low draft position; if they have some skills but need a little more work, then they’ll have a higher draft position. Indeed, rookie pay is directly correlated with draft position, such that players drafted in earlier rounds are paid more money than those drafted later. Thus draft position should relate to first-year performance, otherwise teams would just choose players at random, hoping to land the next Tom Brady or Julio Jones based on the phases of the moon or their horoscopes.

First Year Performance

To quantify first-year performance, I adopted a scoring system common to standard fantasy football leagues:

  • Touchdowns: 5 points
  • Every 10 receiving yards: 1 point
  • Every 10 rushing yards: 1 point
  • Every fumble: -2 points

For example, let’s say you draft a running back and he scores five touchdowns, rushes for 500 yards, and fumbles the ball twice during his rookie season. The points for this player would be (5 * 5) + (500 * 0.1) + (2 * -2) = 71 points. Note that under this scheme, 50 rushing or receiving yards is worth the same as one touchdown.
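To make the scoring rule concrete, here’s a minimal sketch of it in R (the fantasy_points helper is just for illustration and isn’t used elsewhere in this post):

# a minimal sketch of the scoring rule described above
fantasy_points = function(tds, rushing_yds, receiving_yds, fumbles){
  (tds * 5) + (rushing_yds * 0.1) + (receiving_yds * 0.1) + (fumbles * -2)
}

# the running back example: 5 TDs, 500 rushing yards, 2 fumbles
fantasy_points(tds = 5, rushing_yds = 500, receiving_yds = 0, fumbles = 2)
## [1] 71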

Collecting Combine and Draft Data

The first thing we’ll do is pull combine results and draft position data for all rookies in the NFL from 2011-2016. The rvest package will do the heavy lifting.

libs = c('rvest', 'dplyr', 'janitor',
         'GGally', 'zeallot', 'mgcv',
          'knitr', 'kableExtra', 'readr')
lapply(libs, require, character.only = TRUE)
years = seq(2011, 2016)
combine_data = data.frame(NULL)
draft_position_data = data.frame(NULL)
for(y in years){
  print(paste0("COLLECTING DATA FROM ", y))
  combine_url = paste0('http://nflcombineresults.com/nflcombinedata.php?year=',
                       y,
                       '&pos=&college=')

  # collect combine data
  yearly_combine_data = combine_url %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="datatable"]/table') %>%
        html_table(fill = TRUE, header = TRUE) %>%
        as.data.frame()

  combine_data = bind_rows(combine_data,
                           yearly_combine_data)

  draft_position_url = paste0('http://www.drafthistory.com/index.php/years/',
                              y)

  # collect draft position data
  yearly_draft_data = draft_position_url %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="main"]/table[1]') %>%
        html_table(fill = TRUE, header = TRUE) %>%
        as.data.frame()

  # promote the first row to column names, then drop it
  names(yearly_draft_data) = yearly_draft_data[1,]
  yearly_draft_data = yearly_draft_data[2:nrow(yearly_draft_data),]
  yearly_draft_data$year = y
  draft_position_data = bind_rows(draft_position_data,
                                  yearly_draft_data)
}

Next, we’ll do a bit of data munging and then join the draft and combine data together.

draft_position_data_clean = draft_position_data %>%
                      filter(Player != 'Player') %>%
                      clean_names() %>%
                      select(-pick, -round, -college, -position) %>%
                      dplyr::rename(nfl_team = team,
                                    pick = player)

combine_data_clean = combine_data %>%
               clean_names() %>%
               select(-na, -na_1)

# filter only players that are Running Backs & Wide Receivers
combine_draft_join = left_join(combine_data_clean, 
                               draft_position_data_clean) %>% 
                      filter(pos %in% c('RB', 'WR'))
And let’s have a look at the first few rows of data.
year name college pos height_in weight_lbs wonderlic x40_yard bench_press vert_leap_in broad_jump_in shuttle x3cone pick nfl_team
2011 Darvin Adams Auburn WR 74.13 190 NA 4.56 NA NA NA 9.99 9.99 NA NA
2011 Anthony Allen Georgia Tech RB 72.75 228 NA 4.56 24 41.5 120 4.06 6.79 225 Ravens
2011 Armando Allen Notre Dame RB 68.25 199 NA 4.52 23 NA NA 9.99 9.99 NA NA
2011 Matt Asiata Utah RB 71.00 229 NA 4.77 22 30.0 104 4.37 7.09 NA NA
2011 Jon Baldwin Pittsburgh WR 76.38 228 14 4.49 20 42.0 129 4.34 7.07 26 Chiefs
2011 Damien Berry Miami (FL) RB 70.25 211 NA 4.58 23 33.5 120 4.12 7.00 NA NA
2011 Armon Binns Cincinnati WR 75.00 209 NA 4.50 13 31.5 118 4.31 6.86 NA NA
2011 Allen Bradford Southern California RB 70.88 242 NA 4.53 28 29.0 113 4.39 6.97 187 Buccaneers
2011 DeAndre Brown Southern Mississippi WR 77.63 233 NA 4.59 20 29.0 117 4.33 6.93 NA NA
2011 Vincent Brown San Diego State WR 71.25 187 NA 4.68 12 33.5 121 4.25 6.64 82 Chargers
2011 Stephen Burton West Texas A&M WR 73.38 221 NA 4.50 19 34.5 117 4.31 7.04 236 Vikings
2011 Delone Carter Syracuse RB 68.63 222 NA 4.54 27 37.0 120 4.07 6.92 119 Colts
2011 John Clay Wisconsin RB 72.50 230 NA 4.83 NA 29.0 111 9.99 9.99 NA NA
2011 Randall Cobb Kentucky WR 70.25 191 NA 4.46 16 33.5 115 4.34 7.08 64 Packers
2011 Graig Cooper Miami (FL) RB 70.00 205 NA 4.60 18 NA 114 4.03 6.66 NA NA
2011 Mark Dell Michigan State WR 72.25 193 NA 4.54 14 NA NA 9.99 9.99 NA NA
2011 Noel Devine West Virginia RB 67.50 179 NA 4.34 24 NA NA 9.99 9.99 NA NA
2011 Tandon Doss Indiana WR 74.00 201 NA 4.56 14 NA NA 9.99 9.99 123 Ravens
2011 Shaun Draughn North Carolina RB 70.88 213 NA 4.73 21 34.0 118 4.20 7.15 NA NA
2011 Darren Evans Virginia Tech RB 72.00 227 NA 4.56 26 35.0 111 4.46 6.96 NA NA
2011 Mario Fannin Auburn RB 70.38 231 NA 4.37 21 37.5 115 4.21 6.99 NA NA
2011 Edmond Gates Abilene Christian (TX) WR 71.75 192 NA 4.31 16 40.0 131 9.99 9.99 111 Dolphins
2011 A.J. Green Georgia WR 75.63 211 10 4.48 18 34.5 126 4.21 6.91 4 Bengals
2011 Alex Green Hawaii RB 72.25 225 NA 4.45 20 34.0 114 4.15 6.91 96 Packers
2011 Tori Gurley South Carolina WR 76.13 216 NA 4.53 15 33.5 118 4.25 7.05 NA NA

Overall this looks good. One thing you’ll notice is that a lot of the players who participated in the combine don’t have a draft pick, which means they were never drafted by a team. While there are examples of undrafted players going on to have stellar rookie careers, these guys don’t show up much during the fantasy draft process. Thus any player without a draft pick is excluded from the following analyses.
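As a quick sanity check (not part of the original pipeline), here’s one way to count how many of the RBs and WRs in each combine class went undrafted, using the combine_draft_join data frame from above:

# undrafted players show up as NA picks after the left join
combine_draft_join %>%
  group_by(year) %>%
  summarise(drafted = sum(!is.na(pick)),
            undrafted = sum(is.na(pick)))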

Collecting Rookie Performance Data

Next, we’ll collect the outcome variable (how many points each player scored during their rookie season) by enlisting the services of the nflgame Python package. We’ll write the players and their rookie years out to the rookie_names_years.csv file, pull that data into Python, collect the first-year stats, and then pull the output back into R.

input_file_name = "rookie_names_years.csv"
python_script_name = "collect_rookie_stats.py"
output_file_name = "year_1_rookie_stats.csv"
combine_draft_join %>% 
  select(year, name) %>% 
  write.csv(input_file_name, 
            row.names = FALSE)

exe_pyscript_command = paste0("//anaconda/bin/python ",
                              python_script_name,
                              " ",
                              "'", input_file_name, "'",
                              " ",
                              "'", output_file_name, "'"
                              
                              )
print(exe_pyscript_command)
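With the file names above, the assembled command should print something like this:

## [1] "//anaconda/bin/python collect_rookie_stats.py 'rookie_names_years.csv' 'year_1_rookie_stats.csv'"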

We’ll execute the collect_rookie_stats.py script from within R, then read the year_1_rookie_stats.csv file back into R.

system(exe_pyscript_command)

For reference, here are the full contents of collect_rookie_stats.py:
import sys
import pandas as pd
import nflgame
from pandasql import *

def collect_rushing_stats(year, week, players):
    rushing_stats = list()
    for p in players.rushing():
        rushing_stats.append([year,
                              week,
                              " ".join(str(p.player).split(" ")[:2]), 
                              p.rushing_tds, 
                              p.rushing_yds,
                              p.fumbles_lost])
    rushing_df = pd.DataFrame(rushing_stats)
    rushing_df.columns = ['year', 
                            'week',
                            'name',
                            'rushing_tds',
                            'rushing_yds',
                            'rushing_fb']
    return(rushing_df)

def convert_rushing_pts(td_pts, rushing_pts, fb_pts):
    return(pysqldf("""
        SELECT year, 
               name, 
               td_pts + rushing_pts + fb_pts AS total_pts 
        FROM 
            (SELECT year, 
                    name, 
                    SUM(rushing_tds) * {td_pts} AS td_pts, 
                    SUM(rushing_yds) * {rushing_pts} AS rushing_pts,
                    SUM(rushing_fb) * {fb_pts} AS fb_pts
            FROM rb_df_temp
            GROUP BY year, name)
        NATURAL JOIN 
        rookie_df
        """.format(td_pts = td_pts,
                   rushing_pts = rushing_pts,
                   fb_pts = fb_pts
                    )))

def collect_receiving_stats(year, week, players):
    receiving_stats = list()

    for p in players.receiving():
        receiving_stats.append([year,
                                week,
                                " ".join(str(p.player).split(" ")[:2]), 
                                p.receiving_tds, 
                                p.receiving_yds,
                                p.fumbles_lost])
    receiving_df = pd.DataFrame(receiving_stats)
    receiving_df.columns = ['year', 
                        'week',
                        'name',
                        'receiving_tds',
                        'receiving_yds',
                        'receiving_fb']
    return(receiving_df)

def convert_receiving_pts(td_pts, receiving_pts, fb_pts):
    return(pysqldf("""
        SELECT year, 
               name, 
               td_pts + receiving_pts + fb_pts AS total_pts 
        FROM 
            (SELECT year, 
                    name, 
                    SUM(receiving_tds) * {td_pts} AS td_pts, 
                    SUM(receiving_yds) * {receiving_pts} AS receiving_pts,
                    SUM(receiving_fb) * {fb_pts} AS fb_pts
            FROM wr_df_temp
            GROUP BY year, name)
        NATURAL JOIN 
        rookie_df
        """.format(td_pts = td_pts,
                   receiving_pts = receiving_pts,
                   fb_pts = fb_pts
                    )))
                    

def main(input_file_path, output_file_path):
    # define scoring scheme, years, & weeks here 
    td_pts = 5
    receiving_pts = rushing_pts = 0.1
    fb_pts = -2
    game_years = range(2011, 2017)
    game_weeks = range(1, 17) 
    # input file 
    global rookie_df
    rookie_df = pd.read_csv(input_file_path)
    # store stats for each year in player_stats_df
    global player_stats_df    
    player_stats_df = pd.DataFrame()
    for year in game_years:
        print("Processing Game Data From {year}".format(year = year))
        temp_df = rookie_df[rookie_df['year'] == year]
        global rb_df_temp 
        rb_df_temp = pd.DataFrame()
        global wr_df_temp
        wr_df_temp = pd.DataFrame()
        for week in game_weeks:
            games = nflgame.games(year, week)
            players = nflgame.combine_game_stats(games)

            rb_df_temp = rb_df_temp.append(collect_rushing_stats(year, 
                                                                 week, 
                                                                 players))
            wr_df_temp = wr_df_temp.append(collect_receiving_stats(year, 
                                                                   week, 
                                                                   players))
        print('calculating running back points')
        player_stats_df = player_stats_df.append(convert_rushing_pts(td_pts,
                                                                 rushing_pts,
                                                                 fb_pts,
                                                                 ))
        print('calculating wide receiver points')
        player_stats_df = player_stats_df.append(convert_receiving_pts(td_pts,
                                                                   receiving_pts,
                                                                   fb_pts
                                                                   ))

    # aggregate rookies that have both receiving and running stats
    stats_final_df = pysqldf("""
                             SELECT year, name, SUM(total_pts) as total_pts
                             FROM player_stats_df
                             GROUP BY year, name
                             """)

    stats_final_df.to_csv(output_file_path, index = False)

if __name__ == "__main__":
    pysqldf = lambda q: sqldf(q, globals())
    main(sys.argv[1], sys.argv[2])

Below we’ll read the output from the collect_rookie_stats.py script back into R and examine the first several rows. We’ll also apply a few filters. First, all undrafted players are eliminated from consideration. We want to control for draft pick, given that our central question is whether we can explain variation in first-year performance with information other than draft pick. Second, all rookies with fewer than 10 points (an arbitrary cutoff) during their rookie season are eliminated. This removes some of the noise that is unrelated to performance (e.g., low point totals due to injury rather than poor play).

rookie_stats = read_csv(output_file_name) %>% 
                      inner_join(combine_draft_join) %>% 
                      filter(pos %in% c("WR", "RB") & 
                             is.na(pick) == FALSE & 
                             total_pts >= 10) %>% 
                      mutate(pick = as.numeric(pick),
                             pos = as.factor(pos)) %>% 
                      select(year, name, pos, pick,
                             height_in, weight_lbs, x40_yard,
                             bench_press, vert_leap_in, total_pts
                             ) %>% 
                      data.frame()
year name pos pick height_in weight_lbs x40_yard bench_press vert_leap_in total_pts
2011 A.J. Green WR 4 75.63 211 4.48 18 34.5 143.4
2011 Austin Pettis WR 78 74.63 209 4.56 14 33.5 25.0
2011 Daniel Thomas RB 62 72.25 230 4.63 21 NA 62.3
2011 Delone Carter RB 119 68.63 222 4.54 27 37.0 42.4
2011 Denarius Moore WR 148 71.63 194 4.43 13 36.0 87.5
2011 Evan Royster RB 177 71.63 212 4.65 20 34.0 23.9
2011 Greg Little WR 59 74.50 231 4.51 27 40.5 82.4
2011 Greg Salas WR 112 73.13 210 4.53 15 37.0 25.2
2011 Jacquizz Rodgers RB 145 65.88 196 4.59 NA 33.0 41.4
2011 Jeremy Kerley WR 153 69.50 189 4.56 16 34.5 26.5
2011 Jon Baldwin WR 26 76.38 228 4.49 20 42.0 27.7
2011 Julio Jones WR 6 74.75 220 4.34 17 38.5 121.0
2011 Kendall Hunter RB 115 67.25 199 4.46 24 35.0 68.1
2011 Leonard Hankerson WR 79 73.50 209 4.40 14 36.0 16.3
2011 Mark Ingram RB 28 69.13 215 4.62 21 31.5 75.0
2011 Randall Cobb WR 64 70.25 191 4.46 16 33.5 35.0
2011 Roy Helu RB 105 71.50 219 4.40 11 36.5 98.6
2011 Shane Vereen RB 56 70.25 210 4.49 31 34.0 10.7
2011 Stevan Ridley RB 73 71.25 225 4.65 18 36.0 42.5
2011 Titus Young WR 44 71.38 174 4.43 NA NA 79.8
2011 Torrey Smith WR 58 72.88 204 4.41 19 41.0 119.7
2011 Vincent Brown WR 82 71.25 187 4.68 12 33.5 42.9
2012 Alfred Morris RB 173 69.88 219 4.63 16 35.5 189.1
2012 Alshon Jeffery WR 45 74.88 216 4.48 NA NA 44.1
2012 Bernard Pierce RB 84 72.25 218 4.45 17 36.5 53.6

We now have all of the necessary data. We’ll visualize the bivariate relationships between our variables, segmented by position (Running Back or Wide Receiver) for a quick quality check.

g = ggscatmat(rookie_stats, 
              columns = 4:dim(rookie_stats)[2], 
              color = "pos")
plot(g)

Everything looks good except for the 40 yard dash field. Most players fall between four and five seconds, but there are a few glaring exceptions. Let’s examine observations with a 40 yard dash time greater than six seconds.

year name pos pick height_in weight_lbs x40_yard bench_press vert_leap_in total_pts
2016 Corey Coleman WR 15 70.63 194 9.99 17 40.5 52.4
2016 Devontae Booker RB 136 70.75 219 9.99 22 NA 81.8
2016 Jonathan Williams RB 156 70.00 220 9.99 16 NA 11.3
2016 Jordan Howard RB 150 71.88 230 9.99 16 34.0 180.6

It looks like missing or invalid 40 yard dash times were coded as 9.99. Let’s replace these values with NA and re-run the plot above.

rookie_stats$x40_yard = ifelse(rookie_stats$x40_yard == 9.99, 
                               NA, 
                               rookie_stats$x40_yard)

g = ggscatmat(rookie_stats, 
              columns = 4:dim(rookie_stats)[2], 
              color = "pos")
plot(g)

Ahh that’s better. Now we’re ready to do some analyses.

Explaining Rookie Performance

Based on the scatterplot matrix, draft pick explains a fair amount of variability in first-year fantasy points. Teams shell out millions of dollars for rookie players, so it’s not surprising that points and pick number are inversely related, such that lower picks (1, 2, 3) score more fantasy points during their first season than higher picks. The other variable that exhibits a relationship with points is weight, which is moderated by position. Running Backs (RBs) appear to benefit more from having a few extra pounds relative to Wide Receivers (WRs). Being able to punch the ball in from the 1-yard line means more touchdowns. Having a more robust build might translate into fewer injuries and missed games. And being heavier would make an RB harder to tackle, which means more running yards. Having extra weight is less beneficial to a WR, as they rely on their speed and agility to create separation from defenders. Indeed, heavier players run slower 40-yard dash times, and I haven’t seen too many successful WRs in the NFL who were slow. This suggests that given the option between a lighter or heavier RB, go with the big guy.

Teams do seem to factor the weight of an RB into their draft strategy, as heavier RBs are drafted earlier. But the question is whether weight is still significant after controlling for draft pick number and accounting for position (WR vs. RB). I’ll apply my two favorite modeling approaches when interpretation is paramount: the Linear Model (LM) and the Generalized Additive Model (GAM). Pick number is present in both models. The LM features an interaction term to capture the differing relationship between weight and position. The GAM has a smoother for both weight and pick number, and position is included as a dummy variable. For those unfamiliar with GAMs, they’re a fantastic approach for modeling non-linear relationships while maintaining a highly interpretable model. The one drawback is that the model is additive and thus doesn’t account for interactions out of the box (you can add interaction-style terms, such as position-specific smooths, but then the model is no longer purely additive). We’ll use LOOCV (leave-one-out cross-validation) to determine which way of describing the data-generating process generalizes best.
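As an aside, if you did want an interaction-style term inside the GAM, mgcv supports factor-specific smooths via the by argument. Here’s a minimal sketch of that idea; it isn’t fit or compared below:

# not used in the comparison below: a GAM with a separate weight smooth per position
# (pos was converted to a factor earlier, which the by argument requires)
fit_gam_by_pos = mgcv::gam(total_pts ~ s(weight_lbs, by = pos) + s(pick) + pos,
                           data = rookie_stats)
summary(fit_gam_by_pos)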

# function for cross validation
split_data = function(input_df, pct_train){
  set.seed(1)
  data_list = list()
  random_index = sample(1:nrow(input_df), nrow(input_df))
  train_index = random_index[1:(floor(nrow(input_df) * pct_train))]
  data_list[['train_df']] = input_df[train_index,]
  data_list[['test_df']] = input_df[setdiff(random_index, train_index)  ,]
  return(data_list)
}

input_df = rookie_stats
pct_train = 0.8
fdata = split_data(input_df, pct_train)

row_index = sample(1:nrow(fdata$train_df), nrow(fdata$train_df))

# empty vectors to store prediction errors during LOOCV
lm_pred = c()
gam_pred = c()

for(i in row_index){
  # training data
  temp_train = fdata$train_df[setdiff(row_index, i),]
  
  # validation datum
  temp_validation = fdata$train_df[i,]  
  
  # linear model
  temp_fit_lm = lm(total_pts ~ weight_lbs * pos + pick, 
                   data = temp_train)
  
  # GAM model
  temp_fit_gam = mgcv::gam(total_pts ~ s(weight_lbs) + s(pick) + pos,
                                     data = temp_train)
  # linear model prediction
  lm_pred = c(lm_pred,
               predict(temp_fit_lm, temp_validation))
  # GAM model prediction
  gam_pred = c(gam_pred,
               predict(temp_fit_gam, temp_validation))

}

Let’s consider two different measures of performance.

r_squared = function(actual, predicted){
  return(1 - (sum((actual-predicted )^2)/sum((actual-mean(actual))^2)))
} 

mdn_abs_error = function(actual, predicted){
  return(median(abs(actual-predicted)))
}

val_actual = fdata$train_df$total_pts[row_index]

val_perf = data.frame(model = c("LM", "GAM"),
                      # calculate r-squared
                      r_squared = c(r_squared(val_actual,lm_pred),
                                    r_squared(val_actual,gam_pred)),
                      # calculate MAE
                      median_abs_err = c(mdn_abs_error(val_actual, lm_pred),
                                         mdn_abs_error(val_actual, gam_pred))) %>% 
                      mutate_if(is.numeric, round, 2)
print(val_perf)               
##     model   r_squared    median_abs_err
## 1    LM        0.17         26.15
## 2   GAM        0.17         27.08

The LM performed similarly to the GAM, so we’ll fit the simpler LM on the entire training set and examine the coefficients.

train_fit = lm(total_pts ~ weight_lbs * pos + pick, 
                   data = fdata$train_df)
summary(train_fit)
## Call:
## lm(formula = total_pts ~ weight_lbs * pos + pick, data = fdata$train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.946 -27.584  -9.839  17.944 157.722 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -176.63257  101.66213  -1.737  0.08490 .  
## weight_lbs          1.31363    0.46120   2.848  0.00518 ** 
## posWR             235.87977  119.20973   1.979  0.05016 .  
## pick               -0.31399    0.07396  -4.245 4.35e-05 ***
## weight_lbs:posWR   -1.21318    0.56198  -2.159  0.03288 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 43.19 on 119 degrees of freedom
## Multiple R-squared:  0.2424, Adjusted R-squared:  0.217 
## F-statistic:  9.52 on 4 and 119 DF,  p-value: 1.033e-06

Interestingly, even after controlling for pick number, the interaction term is still significant. Interpreting interaction terms can be tricky, and I’ve found it helpful to plug in a range of values for the variable you’re interested in while holding everything else constant, then see how the predicted value changes. So that’s what we’ll do here. We’ll hold pick number at its mean, enter “RB” for the position dummy variable, and then generate a sequence of weights between the 5th and 95th percentiles. The slope of the predicted values indicates how many additional fantasy points we expect an RB to score in a season for each additional pound.

# select 5th and 95th percentiles for weight
# hold pick number and position constant

q5_95 = unname(quantile(fdata$train_df$weight_lbs, c(0.05, 0.95)))

sim_df = data.frame(weight_lbs = seq(q5_95[1], q5_95[2], 1),
                    pick = mean(fdata$train_df$pick),
                    pos = "RB"
                    )
sim_df$predicted_pts = predict(train_fit, sim_df)
print(diff(sim_df$predicted_pts)[1])
## [1] 1.313628

As the weight of an RB increases by one pound, we expect the total number of fantasy points scored in a season to increase by ~1.3. Thus if you had two RBs drafted around the same position, but one weighed 30lbs more than the other, you would expect 39 more points from the heavier RB. Over a 16 game season, that difference translates to 2.4 more points per game, which could be the difference between making or missing the playoffs.
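Here’s a quick check of that arithmetic using the weight coefficient from the fitted model above (weight_lbs is the RB slope, since RB is the reference level of pos):

# expected difference for an RB who weighs 30 lbs more, holding pick constant
weight_coef = unname(coef(train_fit)["weight_lbs"])
weight_coef * 30        # additional points over the season
(weight_coef * 30) / 16 # additional points per game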

Ninety percent of weight values fall between 180 and 230 lbs, so I’d feel comfortable generalizing these findings to RBs within this range. The data is pretty sparse outside of it, so the relationships in the model would likely break down. Indeed, I’m fairly certain a 320lb RB would not score more points than a 220lb RB; at some point weight likely has a negative effect on performance, but we won’t see that in the data because a 320lb RB will never see the field.

While these findings are interesting, it’s not clear how helpful the model is in terms of informing our draft strategy. To answer that question, we’ll determine how the model performs on the hold-out set.

r_sq_test = round(r_squared(fdata$test_df$total_pts, 
                            predict(train_fit, fdata$test_df)), 
                  1)
median_abs_err = round(mdn_abs_error(fdata$test_df$total_pts, 
                                        predict(train_fit, fdata$test_df)), 
                       2)
print(paste0("TEST R-SQUARED: ", r_sq_test))
print(paste0("TEST MAE: ", median_abs_err))
## [1] "TEST R-SQUARED: 0.3"
## [1] "TEST MAE: 31.7"

We can explain approximately 30 percent of the variance in the test set, and the median absolute error is about 32 points. Performance on the test set is comparable to the validation set, which suggests the model generalizes reasonably well. However, there is a lot of variance left to explain! If we wanted to improve these predictions, there are a number of other variables to consider, including:

  • Injury
  • Strength of Offensive Line
  • Presence of skilled existing running back
  • Ratio of pass to run plays in the prior season

These inputs all have an impact on how many points a fantasy RB will score. But if we want to keep it simple, the key takeaway is this: when it’s fantasy draft time and you’re debating between a few rookie RBs drafted around the same spot, go with the big guy. It just might be the difference between fantasy glory and being on the receiving end of one of these…
