Combine Analysis

Sep 4, 2017 17 min read Fantasy Football, Python, R

Overview

A good draft is a big part of doing well in Fantasy Football. Every year, Fantasy Football pundits write droves of articles on “sleeper” picks, or players that aren’t on most people’s draft sheets. Many of these players are untested rookies drafted straight from college. So how do you differentiate the rookies who will remain sleepers from those who will propel your fantasy team to victory? One place to start is the NFL combine.

Indeed, I’m not the first person to investigate the relationship between combine stats/physical attributes and player performance. NFL teams do it every season, as well as fellow data nerds (see here). However, prior analyses have focused on performance over a longer period of time, such as the first three years in the league. This makes sense from the perspective of an NFL team, as rookies are typically signed for longer than a one-year contract. In contrast, most fantasy football leagues operate like one-year contracts; you draft a new team at the beginning of each year. Thus, in the case of NFL rookies, I’m interested exclusively in first year NFL performance. Can we learn anything from the combine that isn’t already baked into a rookie’s draft position?

NFL Combine Performance

The NFL combine is like the SAT or ACT except for football, with more lifting and less clothing. It’s a chance for NFL teams to see how fast and strong NFL hopefuls are in a controlled environment. During the combine players typically complete the following events:

40 yard dash
Number of reps of 225lbs on the bench press (my number is 0)
How high someone can jump (vertical leap)
How far someone can jump (broad jump)
Basically run back and forth between two lines as fast as possible (shuttle)
Again more running between cones (3cone)

In addition to these drills, height and weight are also measured. This data gives NFL teams a holistic perspective for evaluating players. Despite a number of findings indicating that this information has limited utility for predicting actual performance in the NFL, the difference between a 4.4 second and a 4.6 second 40 yard dash can mean big differences in draft position.

Draft Position

Most rookies in the NFL enter a team through the draft. A player’s draft position is a reasonable proxy for their perceived value. If a rookie is believed to be a game-changer, then they’ll have a low draft position (an early pick); if they have some skills but need a little more work, then they’ll have a higher draft position (a later pick). Indeed, rookie pay is tied to draft position, such that players drafted in earlier rounds are paid more money than those drafted later. Thus draft position should relate to first-year performance; otherwise teams might as well choose players at random, hoping to land the next Tom Brady or Julio Jones based on the phases of the moon or their horoscopes.

First Year Performance

I adopted a common scoring system from most standard fantasy football leagues when quantifying first-year performance:

Touchdowns: 5 points
Every 10 receiving yards: 1 point
Every 10 rushing yards: 1 point
Every fumble: -2 points

For example, let’s say you draft a running back and he scores five touchdowns, rushes for 500 yards, and fumbles the ball twice during his rookie season. The points for this player would be (5 * 5) + (500 * 0.1) + (2 * -2) = 71 points. Under this scheme, 50 rushing or receiving yards are worth the same as one touchdown.
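The scoring rule is simple enough to write down as a function. This is a minimal sketch of the scheme listed above; the function and argument names are mine, not part of any fantasy football library.

```python
def fantasy_points(touchdowns, yards, fumbles):
    """Season fantasy points under the scoring scheme listed above."""
    td_pts = 5            # points per touchdown
    yards_per_pt = 10.0   # one point per 10 rushing or receiving yards
    fumble_pts = -2       # points per fumble
    return touchdowns * td_pts + yards / yards_per_pt + fumbles * fumble_pts

# The running back from the example: 5 TDs, 500 rushing yards, 2 fumbles
print(fantasy_points(5, 500, 2))  # 71.0
```

Because yards and touchdowns share one scale, 50 yards and a single touchdown both come out to five points.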

Collecting Combine and Draft Data

The first thing we’ll do is pull combine results and draft position data for all rookies in the NFL from 2011-2016. The rvest package will do the heavy lifting.

libs = c('rvest', 'dplyr', 'janitor',
         'GGally', 'zeallot', 'mgcv',
          'knitr', 'kableExtra', 'readr')
lapply(libs, require, character.only = TRUE)
years = seq(2011, 2016)
combine_data = data.frame(NULL)
draft_position_data = data.frame(NULL)

for(y in years){
  print(paste0("COLLECTING DATA FROM ", y))
  combine_url = paste0('http://nflcombineresults.com/nflcombinedata.php?year=',
              y,
             '&pos=&college=')

 # collect combine data
  yearly_combine_data = combine_url %>%
        read_html() %>%
        html_nodes(xpath = '//*[@id="datatable"]/table') %>%
        html_table(fill = TRUE, header = TRUE) %>%
        as.data.frame()

  combine_data = bind_rows(combine_data,
                           yearly_combine_data)

  draft_position_url = paste0('http://www.drafthistory.com/index.php/years/',
                              y)
  # collect draft position data
  yearly_draft_data = draft_position_url %>%
       read_html() %>%
       html_nodes(xpath = '//*[@id="main"]/table[1]') %>%
       html_table(fill = TRUE, header = TRUE) %>%
       as.data.frame()

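  # promote the first row to column names, then drop it from the data
  names(yearly_draft_data) = as.character(unlist(yearly_draft_data[1,]))
  yearly_draft_data = yearly_draft_data[2:nrow(yearly_draft_data),]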
  yearly_draft_data$year = y
  draft_position_data = bind_rows(draft_position_data,
                                  yearly_draft_data)

}
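For readers who want to see what the table-scraping step boils down to, here is a rough Python 3 sketch using only the standard library. It parses an inline HTML string rather than hitting either website, and the column layout of that string is invented for illustration; it is not a copy of the real pages. The idea is the same as in the R loop: read the table cells row by row, treat the first row as the header, and tag each record with its year.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # cells of the row being read
        self._cell = None     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up HTML standing in for one scraped draft page (layout is an assumption)
html = """
<table>
  <tr><td>Round</td><td>Pick</td><td>Player</td><td>Name</td></tr>
  <tr><td>1</td><td>4</td><td>4</td><td>A.J. Green</td></tr>
  <tr><td>1</td><td>6</td><td>6</td><td>Julio Jones</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
header, body = parser.rows[0], parser.rows[1:]   # promote first row to header
records = [dict(zip(header, row), year=2011) for row in body]
print(records[0])
```

In practice rvest handles all of this for us; the sketch is only meant to make the "first row becomes the header" step explicit.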

Next, we’ll do a bit of data munging and then join the draft and combine data together.

draft_position_data_clean = draft_position_data %>%
                      filter(Player != 'Player') %>%
                      clean_names() %>%
                      select(-pick, -round, -college, -position) %>%
                      dplyr::rename(nfl_team = team,
                                    pick = player)

combine_data_clean = combine_data %>%
               clean_names() %>%
               select(-na, -na_1)

# join on year and player name, then keep only Running Backs & Wide Receivers
combine_draft_join = left_join(combine_data_clean, 
                               draft_position_data_clean,
                               by = c("year", "name")) %>% 
                      filter(pos %in% c('RB', 'WR'))

And let’s have a look at the first few rows of data.

year	name	college	pos	height_in	weight_lbs	wonderlic	x40_yard	bench_press	vert_leap_in	broad_jump_in	shuttle	x3cone	pick	nfl_team
2011	Darvin Adams	Auburn	WR	74.13	190	NA	4.56	NA	NA	NA	9.99	9.99	NA	NA
2011	Anthony Allen	Georgia Tech	RB	72.75	228	NA	4.56	24	41.5	120	4.06	6.79	225	Ravens
2011	Armando Allen	Notre Dame	RB	68.25	199	NA	4.52	23	NA	NA	9.99	9.99	NA	NA
2011	Matt Asiata	Utah	RB	71.00	229	NA	4.77	22	30.0	104	4.37	7.09	NA	NA
2011	Jon Baldwin	Pittsburgh	WR	76.38	228	14	4.49	20	42.0	129	4.34	7.07	26	Chiefs
2011	Damien Berry	Miami (FL)	RB	70.25	211	NA	4.58	23	33.5	120	4.12	7.00	NA	NA
2011	Armon Binns	Cincinnati	WR	75.00	209	NA	4.50	13	31.5	118	4.31	6.86	NA	NA
2011	Allen Bradford	Southern California	RB	70.88	242	NA	4.53	28	29.0	113	4.39	6.97	187	Buccaneers
2011	DeAndre Brown	Southern Mississippi	WR	77.63	233	NA	4.59	20	29.0	117	4.33	6.93	NA	NA
2011	Vincent Brown	San Diego State	WR	71.25	187	NA	4.68	12	33.5	121	4.25	6.64	82	Chargers
2011	Stephen Burton	West Texas A&M	WR	73.38	221	NA	4.50	19	34.5	117	4.31	7.04	236	Vikings
2011	Delone Carter	Syracuse	RB	68.63	222	NA	4.54	27	37.0	120	4.07	6.92	119	Colts
2011	John Clay	Wisconsin	RB	72.50	230	NA	4.83	NA	29.0	111	9.99	9.99	NA	NA
2011	Randall Cobb	Kentucky	WR	70.25	191	NA	4.46	16	33.5	115	4.34	7.08	64	Packers
2011	Graig Cooper	Miami (FL)	RB	70.00	205	NA	4.60	18	NA	114	4.03	6.66	NA	NA
2011	Mark Dell	Michigan State	WR	72.25	193	NA	4.54	14	NA	NA	9.99	9.99	NA	NA
2011	Noel Devine	West Virginia	RB	67.50	179	NA	4.34	24	NA	NA	9.99	9.99	NA	NA
2011	Tandon Doss	Indiana	WR	74.00	201	NA	4.56	14	NA	NA	9.99	9.99	123	Ravens
2011	Shaun Draughn	North Carolina	RB	70.88	213	NA	4.73	21	34.0	118	4.20	7.15	NA	NA
2011	Darren Evans	Virginia Tech	RB	72.00	227	NA	4.56	26	35.0	111	4.46	6.96	NA	NA
2011	Mario Fannin	Auburn	RB	70.38	231	NA	4.37	21	37.5	115	4.21	6.99	NA	NA
2011	Edmond Gates	Abilene Christian (TX)	WR	71.75	192	NA	4.31	16	40.0	131	9.99	9.99	111	Dolphins
2011	A.J. Green	Georgia	WR	75.63	211	10	4.48	18	34.5	126	4.21	6.91	4	Bengals
2011	Alex Green	Hawaii	RB	72.25	225	NA	4.45	20	34.0	114	4.15	6.91	96	Packers
2011	Tori Gurley	South Carolina	WR	76.13	216	NA	4.53	15	33.5	118	4.25	7.05	NA	NA

Overall looks good. One thing you’ll notice is that a lot of the players who participated in the combine don’t have a draft pick, which means that they were never drafted by a team. While there are many examples of undrafted players going on to have stellar rookie seasons, these guys don’t show up much during the fantasy draft process. Thus any player without a draft pick is excluded from the following analyses.
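If it isn’t obvious why undrafted players end up with a missing pick, the sketch below mimics the left join in plain Python. The three combine rows and two draft rows come from the table above; the left_join helper is a hypothetical stand-in for dplyr’s function, not a real library call.

```python
# Rows below are taken from the table above (pick is the overall draft pick)
combine = [
    {"year": 2011, "name": "Darvin Adams", "pos": "WR"},
    {"year": 2011, "name": "Anthony Allen", "pos": "RB"},
    {"year": 2011, "name": "Jon Baldwin", "pos": "WR"},
]
draft = [
    {"year": 2011, "name": "Anthony Allen", "pick": 225, "nfl_team": "Ravens"},
    {"year": 2011, "name": "Jon Baldwin", "pick": 26, "nfl_team": "Chiefs"},
]

def left_join(left, right, keys):
    """Keep every left row; attach right-hand columns when the keys match."""
    lookup = {tuple(r[k] for k in keys): r for r in right}
    right_cols = [c for c in right[0] if c not in keys]
    joined = []
    for row in left:
        match = lookup.get(tuple(row[k] for k in keys), {})
        extra = {c: match.get(c) for c in right_cols}   # None when unmatched
        joined.append({**row, **extra})
    return joined

joined = left_join(combine, draft, keys=("year", "name"))
drafted = [r for r in joined if r["pick"] is not None]
print([r["name"] for r in drafted])  # ['Anthony Allen', 'Jon Baldwin']
```

Every combine participant survives the join, and anyone with no matching draft row simply carries an empty pick, which is what makes the later filter on pick possible.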

Collecting Rookie Performance Data

Next, we’ll collect the outcome variable, how many points each player scored during their rookie season, and enlist the service of the nflgame Python package. We’ll write out the players and their rookie years to the rookie_names_years.csv file, pull that data into Python, collect the first-year stats, and then pull the output back into R.

input_file_name = "rookie_names_years.csv"
python_script_name = "collect_rookie_stats.py"
output_file_name = "year_1_rookie_stats.csv"
combine_draft_join %>% 
  select(year, name) %>% 
  write.csv(input_file_name, 
            row.names = FALSE)

exe_pyscript_command = paste0("//anaconda/bin/python ",
                              python_script_name,
                              " ",
                              "'", input_file_name, "'",
                              " ",
                              "'", output_file_name, "'"
                              
                              )
print(exe_pyscript_command)

We’ll execute the collect_rookie_stats.py script (listed in full after the system() call) from within R, then read the year_1_rookie_stats.csv file back into R.

system(exe_pyscript_command)

import sys
import pandas as pd
import nflgame
from pandasql import *

def collect_rushing_stats(year, week, players):
    rushing_stats = list()
    for p in players.rushing():
        rushing_stats.append([year,
                              week,
                              " ".join(str(p.player).split(" ")[:2]), 
                              p.rushing_tds, 
                              p.rushing_yds,
                              p.fumbles_lost])
    rushing_df = pd.DataFrame(rushing_stats)
    rushing_df.columns = ['year', 
                            'week',
                            'name',
                            'rushing_tds',
                            'rushing_yds',
                            'rushing_fb']
    return(rushing_df)

def convert_rushing_pts(td_pts, rushing_pts, fb_pts):
    return(pysqldf("""
        SELECT year, 
               name, 
               td_pts + rushing_pts + fb_pts AS total_pts 
        FROM 
            (SELECT year, 
                    name, 
                    SUM(rushing_tds) * {td_pts} AS td_pts, 
                    SUM(rushing_yds) * {rushing_pts} AS rushing_pts,
                    SUM(rushing_fb) * {fb_pts} AS fb_pts
            FROM rb_df_temp
            GROUP BY year, name)
        NATURAL JOIN 
        rookie_df
        """.format(td_pts = td_pts,
                   rushing_pts = rushing_pts,
                   fb_pts = fb_pts
                    )))

def collect_receiving_stats(year, week, players):
    receiving_stats = list()

    for p in players.receiving():
        receiving_stats.append([year,
                                week,
                                " ".join(str(p.player).split(" ")[:2]), 
                                p.receiving_tds, 
                                p.receiving_yds,
                                p.fumbles_lost])
    receiving_df = pd.DataFrame(receiving_stats)
    receiving_df.columns = ['year', 
                        'week',
                        'name',
                        'receiving_tds',
                        'receiving_yds',
                        'receiving_fb']
    return(receiving_df)

def convert_receiving_pts(td_pts, receiving_pts, fb_pts):
    return(pysqldf("""
        SELECT year, 
               name, 
               td_pts + receiving_pts + fb_pts AS total_pts 
        FROM 
            (SELECT year, 
                    name, 
                    SUM(receiving_tds) * {td_pts} AS td_pts, 
                    SUM(receiving_yds) * {receiving_pts} AS receiving_pts,
                    SUM(receiving_fb) * {fb_pts} AS fb_pts
            FROM wr_df_temp
            GROUP BY year, name)
        NATURAL JOIN 
        rookie_df
        """.format(td_pts = td_pts,
                   receiving_pts = receiving_pts,
                   fb_pts = fb_pts
                    )))
                    

def main(input_file_path, output_file_path):
    # define scoring scheme, years, & weeks here 
    td_pts = 5
    receiving_pts = rushing_pts = 0.1
    fb_pts = -2
    game_years = range(2011, 2017)
    # range(1, 17) covers weeks 1-16; week 17 of the regular season is not included
    game_weeks = range(1, 17) 
    # input file 
    global rookie_df
    rookie_df = pd.read_csv(input_file_path)
    # store stats for each year in player_stats_df
    global player_stats_df    
    player_stats_df = pd.DataFrame()
    for year in game_years:
        print("Processing Game Data From {year}".format(year = year))
        temp_df = rookie_df[rookie_df['year'] == year]
        global rb_df_temp 
        rb_df_temp = pd.DataFrame()
        global wr_df_temp
        wr_df_temp = pd.DataFrame()
        for week in game_weeks:
            games = nflgame.games(year, week)
            players = nflgame.combine_game_stats(games)

            rb_df_temp = rb_df_temp.append(collect_rushing_stats(year, 
                                                                 week, 
                                                                 players))
            wr_df_temp = wr_df_temp.append(collect_receiving_stats(year, 
                                                                   week, 
                                                                   players))
        print('calculating running back points')
        player_stats_df = player_stats_df.append(convert_rushing_pts(td_pts,
                                                                 rushing_pts,
                                                                 fb_pts,
                                                                 ))
        print('calculating wide receiver points')
        player_stats_df = player_stats_df.append(convert_receiving_pts(td_pts,
                                                                   receiving_pts,
                                                                   fb_pts
                                                                   ))

    # aggregate rookies that have both receiving and running stats
    stats_final_df = pysqldf("""
                             SELECT year, name, SUM(total_pts) as total_pts
                             FROM player_stats_df
                             GROUP BY year, name
                             """)

    stats_final_df.to_csv(output_file_path, index = False)

if __name__ == "__main__":
    pysqldf = lambda q: sqldf(q, globals())
    main(sys.argv[1], sys.argv[2])
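The SQL in the script can be hard to follow at a glance, so here is a stripped-down Python 3 sketch of the same aggregation logic using plain dictionaries instead of pandasql. The weekly stat lines and player names are invented, and the season_points helper is mine; only the scoring weights come from the script above.

```python
from collections import defaultdict

TD_PTS, YARDS_PER_PT, FUMBLE_PTS = 5, 10.0, -2   # same scoring scheme as the script

# Invented weekly stat lines: (year, week, name, tds, yards, fumbles_lost)
weekly = [
    (2011, 1, "Player A", 1, 80, 0),
    (2011, 2, "Player A", 0, 45, 1),
    (2011, 1, "Player B", 2, 110, 0),
    (2011, 1, "Veteran C", 1, 60, 0),
]
rookies = {(2011, "Player A"), (2011, "Player B")}   # (year, name) pairs

def season_points(weekly, rookies):
    """Sum weekly stats per (year, name), then convert the totals to points."""
    totals = defaultdict(lambda: [0, 0, 0])          # tds, yards, fumbles
    for year, _week, name, tds, yards, fumbles in weekly:
        if (year, name) in rookies:                  # mimics the join on rookie_df
            running = totals[(year, name)]
            running[0] += tds
            running[1] += yards
            running[2] += fumbles
    return {key: tds * TD_PTS + yards / YARDS_PER_PT + fumbles * FUMBLE_PTS
            for key, (tds, yards, fumbles) in totals.items()}

points = season_points(weekly, rookies)
print(points)
```

Non-rookies such as the made-up "Veteran C" drop out because they have no match in the rookie list, just as the NATURAL JOIN on rookie_df discards them in the script.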

Below we’ll read the output from the collect_rookie_stats.py script back into R and examine the first few rows. We’ll also apply a few filters. First, all non-draft picks are eliminated from consideration. We want to control for draft pick, given that our central question is whether we can explain variation in first-year performance from information other than draft pick. Second, all rookies with fewer than 10 points (an arbitrary cutoff) during their rookie season are eliminated. This is to remove some of the noise that is unrelated to performance (e.g., low points due to injury and not poor performance).

# the Python script wrote its output relative to R's working directory
working_directory = getwd()
rookie_stats = read_csv(file.path(working_directory, output_file_name)) %>% 
                      inner_join(combine_draft_join) %>% 
                      filter(pos %in% c("WR", "RB") & 
                             !is.na(pick) & 
                             total_pts >= 10) %>% 
                      mutate(pick = as.numeric(pick),
                             pos = as.factor(pos)) %>% 
                      select(year, name, pos, pick,
                             height_in, weight_lbs, x40_yard,
                             bench_press, vert_leap_in, total_pts
                             ) %>% 
                      data.frame()

year	name	pos	pick	height_in	weight_lbs	x40_yard	bench_press	vert_leap_in	total_pts
2011	A.J. Green	WR	4	75.63	211	4.48	18	34.5	143.4
2011	Austin Pettis	WR	78	74.63	209	4.56	14	33.5	25.0
2011	Daniel Thomas	RB	62	72.25	230	4.63	21	NA	62.3
2011	Delone Carter	RB	119	68.63	222	4.54	27	37.0	42.4
2011	Denarius Moore	WR	148	71.63	194	4.43	13	36.0	87.5
2011	Evan Royster	RB	177	71.63	212	4.65	20	34.0	23.9
2011	Greg Little	WR	59	74.50	231	4.51	27	40.5	82.4
2011	Greg Salas	WR	112	73.13	210	4.53	15	37.0	25.2
2011	Jacquizz Rodgers	RB	145	65.88	196	4.59	NA	33.0	41.4
2011	Jeremy Kerley	WR	153	69.50	189	4.56	16	34.5	26.5
2011	Jon Baldwin	WR	26	76.38	228	4.49	20	42.0	27.7
2011	Julio Jones	WR	6	74.75	220	4.34	17	38.5	121.0
2011	Kendall Hunter	RB	115	67.25	199	4.46	24	35.0	68.1
2011	Leonard Hankerson	WR	79	73.50	209	4.40	14	36.0	16.3
2011	Mark Ingram	RB	28	69.13	215	4.62	21	31.5	75.0
2011	Randall Cobb	WR	64	70.25	191	4.46	16	33.5	35.0
2011	Roy Helu	RB	105	71.50	219	4.40	11	36.5	98.6
2011	Shane Vereen	RB	56	70.25	210	4.49	31	34.0	10.7
2011	Stevan Ridley	RB	73	71.25	225	4.65	18	36.0	42.5
2011	Titus Young	WR	44	71.38	174	4.43	NA	NA	79.8
2011	Torrey Smith	WR	58	72.88	204	4.41	19	41.0	119.7
2011	Vincent Brown	WR	82	71.25	187	4.68	12	33.5	42.9
2012	Alfred Morris	RB	173	69.88	219	4.63	16	35.5	189.1
2012	Alshon Jeffery	WR	45	74.88	216	4.48	NA	NA	44.1
2012	Bernard Pierce	RB	84	72.25	218	4.45	17	36.5	53.6

We now have all of the necessary data. We’ll visualize the bivariate relationships between our variables, segmented by position (Running Back or Wide Receiver) for a quick quality check.

g = ggscatmat(rookie_stats, 
              columns = 4:dim(rookie_stats)[2], 
              color = "pos")
plot(g)

Everything looks good except for the 40 yard dash field. Most times fall between 4 and 6 seconds, with a few exceptions. Let’s examine observations with a 40 yard dash time greater than six seconds.

year	name	pos	pick	height_in	weight_lbs	x40_yard	bench_press	vert_leap_in	total_pts
2016	Corey Coleman	WR	15	70.63	194	9.99	17	40.5	52.4
2016	Devontae Booker	RB	136	70.75	219	9.99	22	NA	81.8
2016	Jonathan Williams	RB	156	70.00	220	9.99	16	NA	11.3
2016	Jordan Howard	RB	150	71.88	230	9.99	16	34.0	180.6

It looks like missing or invalid 40 yard dash times were coded with 9.99. Let’s sub these values with NAs and re-run the plot above.

rookie_stats$x40_yard = ifelse(rookie_stats$x40_yard == 9.99, 
                               NA, 
                               rookie_stats$x40_yard)
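The same 9.99 code also shows up in the shuttle and x3cone columns of the combine table earlier in the post, so it can be handy to recode every timed drill in one pass. Below is a small Python sketch of that idea; the recode_missing helper and the choice of columns are my own assumptions, not part of the analysis above.

```python
MISSING_CODE = 9.99                                # sentinel seen in the scraped tables
TIMED_DRILLS = ("x40_yard", "shuttle", "x3cone")   # columns where 9.99 appears

def recode_missing(rows, columns=TIMED_DRILLS, code=MISSING_CODE):
    """Return copies of the rows with the sentinel replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)                        # leave the input untouched
        for col in columns:
            if new_row.get(col) == code:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned

rows = [
    {"name": "Jordan Howard", "x40_yard": 9.99},
    {"name": "A.J. Green", "x40_yard": 4.48},
]
print(recode_missing(rows))
```

Comparing against 9.99 directly is safe here because the value is a fixed code read from text, not the result of arithmetic.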

g = ggscatmat(rookie_stats, 
              columns = 4:dim(rookie_stats)[2], 
              color = "pos")
plot(g)

Ahh that’s better. Now we’re ready to do some analyses.

Explaining Rookie Performance

Based on the scatterplot matrix, draft pick explains a fair amount of variability in first-year fantasy points. Teams shell out millions of dollars for rookie players, so it’s not surprising that points and pick number are inversely related, such that lower picks (1, 2, 3) score more fantasy points during their first season than higher picks. The other variable that exhibits a relationship with points is weight, which is moderated by position. Running Backs (RBs) appear to benefit more from having a few extra pounds relative to Wide Receivers (WRs). Being able to punch the ball in from the 1-yard line means more touchdowns. Having a more robust build might translate into fewer injuries and missed games. And being heavier would make an RB harder to tackle, which means more running yards. Having extra weight is less beneficial to a WR, as they rely on their speed and agility to create separation from defenders. Indeed, heavier players run slower 40-yard dash times, and I haven’t seen too many successful WRs in the NFL who were slow. This suggests that given the option between a lighter or heavier RB, go with the big guy.

Teams do seem to factor the weight of an RB into their draft strategy, as heavier RBs are drafted earlier on. But the question is whether weight is still significant after controlling for draft pick number and accounting for position (WR vs. RB). I’ll apply my two favorite modeling approaches when interpretation is paramount: Linear Model (LM) and Generalized Additive Model (GAM). Pick number is present in both models. The LM features an interaction term to capture the disparate relationship between weight and position. The GAM has a smoother for both weight and pick number, and position is included as a dummy variable. For those unfamiliar with GAM, it’s a fantastic approach for modeling non-linear relationships while maintaining a highly interpretable model. The one drawback is that, as specified here, it is additive and thus doesn’t account for interactions (you can add interaction terms, for example with by-variable or tensor-product smooths in mgcv, but then the model is no longer purely additive). We’ll use LOOCV (leave-one-out cross-validation) to determine which method of describing the data generating process is most generalizable.

# function to split the data into training and hold-out test sets
split_data = function(input_df, pct_train){
  set.seed(1)
  data_list = list()
  random_index = sample(1:nrow(input_df), nrow(input_df))
  train_index = random_index[1:(floor(nrow(input_df) * pct_train))]
  data_list[['train_df']] = input_df[train_index,]
  data_list[['test_df']] = input_df[setdiff(random_index, train_index)  ,]
  return(data_list)
}

input_df = rookie_stats
pct_train = 0.8
fdata = split_data(input_df, pct_train)

row_index = sample(1:nrow(fdata$train_df), nrow(fdata$train_df))

# empty vectors to store out-of-sample predictions during LOOCV
lm_pred = c()
gam_pred = c()

for(i in row_index){
  # training data
  temp_train = fdata$train_df[setdiff(row_index, i),]
  
  # validation datum
  temp_validation = fdata$train_df[i,]  
  
  # linear model
  temp_fit_lm = lm(total_pts ~ weight_lbs * pos + pick, 
                   data = temp_train)
  
  # GAM model
  temp_fit_gam = mgcv::gam(total_pts ~ s(weight_lbs) + s(pick) + pos,
                                     data = temp_train)
  # linear model prediction
  lm_pred = c(lm_pred,
               predict(temp_fit_lm, temp_validation))
  # GAM model prediction
  gam_pred = c(gam_pred,
               predict(temp_fit_gam, temp_validation))

}
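If LOOCV is new to you, the loop above is easier to see in a stripped-down form. The Python 3 sketch below swaps in a one-predictor least-squares line (points on pick number only) in place of the LM and GAM, purely to keep it short; the helper names are mine, and the six (pick, total_pts) pairs are taken from the rookie table above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope      # (intercept, slope)

def loocv_predictions(xs, ys):
    """Predict each point from a model fit on all the other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]          # drop observation i
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        preds.append(intercept + slope * xs[i])
    return preds

# A handful of (pick, total_pts) pairs from the rookie table above
picks = [4, 26, 58, 82, 119, 153]
points = [143.4, 27.7, 119.7, 42.9, 42.4, 26.5]

preds = loocv_predictions(picks, points)
print([round(p, 1) for p in preds])
```

Each prediction comes from a model that never saw the player it is predicting, which is what makes the resulting errors an honest estimate of out-of-sample performance.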

Let’s consider two different measures of performance.

r_squared = function(actual, predicted){
  return(1 - (sum((actual-predicted )^2)/sum((actual-mean(actual))^2)))
} 

mdn_abs_error = function(actual, predicted){
  return(median(abs(actual-predicted)))
}

val_actual = fdata$train_df$total_pts[row_index]

val_perf = data.frame(model = c("LM", "GAM"),
                      # calculate r-squared
                      r_squared = c(r_squared(val_actual,lm_pred),
                                    r_squared(val_actual,gam_pred)),
                      # calculate median absolute error
                      median_abs_err = c(mdn_abs_error(val_actual, lm_pred),
                                         mdn_abs_error(val_actual, gam_pred))) %>% 
                      mutate_if(is.numeric, round, 2)
print(val_perf)

##     model   r_squared    median_abs_err
## 1    LM        0.17         26.15
## 2   GAM        0.17         27.08
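A quick note on the second metric: it is the median absolute error, not the more common mean absolute error. The toy Python sketch below, with invented numbers, shows why that distinction matters when a few rookies have breakout seasons (think Alfred Morris and his 189.1 points) that no model sees coming.

```python
from statistics import mean, median

def abs_errors(actual, predicted):
    """Absolute prediction error for each observation."""
    return [abs(a - p) for a, p in zip(actual, predicted)]

# Invented numbers: four close predictions and one big miss
actual = [10.0, 20.0, 30.0, 40.0, 200.0]
predicted = [12.0, 18.0, 33.0, 39.0, 60.0]

errors = abs_errors(actual, predicted)
print(median(errors))  # 2.0
print(mean(errors))    # 29.6
```

The single large miss drags the mean up to 29.6 while the median stays at 2.0, so the median describes the typical error rather than the worst one.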

The LM performed similarly to the GAM, so we’ll refit the simpler LM on the entire training set and examine the coefficients.

train_fit = lm(total_pts ~ weight_lbs * pos + pick, 
                   data = fdata$train_df)
summary(train_fit)

## Call:
## lm(formula = total_pts ~ weight_lbs * pos + pick, data = fdata$train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.946 -27.584  -9.839  17.944 157.722 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -176.63257  101.66213  -1.737  0.08490 .  
## weight_lbs          1.31363    0.46120   2.848  0.00518 ** 
## posWR             235.87977  119.20973   1.979  0.05016 .  
## pick               -0.31399    0.07396  -4.245 4.35e-05 ***
## weight_lbs:posWR   -1.21318    0.56198  -2.159  0.03288 *  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 43.19 on 119 degrees of freedom
## Multiple R-squared:  0.2424, Adjusted R-squared:  0.217 
## F-statistic:  9.52 on 4 and 119 DF,  p-value: 1.033e-06

Interestingly, even after controlling for pick number, the interaction term is still significant. Interpreting interaction terms can be tricky, and I’ve found it helpful to just plug in a range of numbers for the variable you’re interested in and hold everything else constant to see how the predicted value changes. So that’s what we’ll do here. We’ll hold pick number at the mean, enter “RB” for the position dummy variable, and then generate a sequence between the 5th and 95th percentiles of weight. The slope of our predicted value indicates how many additional points we expect an RB to score during a season for each additional pound of weight.

# select 5th and 95th percentiles for weight
# hold pick number and position constant

q5_95 = unname(quantile(fdata$train_df$weight_lbs, c(0.05, 0.95)))

sim_df = data.frame(weight_lbs = seq(q5_95[1], q5_95[2], 1),
                    pick = mean(fdata$train_df$pick),
                    pos = "RB"
                    )
sim_df$predicted_pts = predict(train_fit, sim_df)
print(diff(sim_df$predicted_pts)[1])

## [1] 1.313628

As the weight of an RB increases by one pound, we expect the total number of fantasy points scored in a season to increase by ~1.3. Thus if you had two RBs drafted around the same position, but one weighed 30lbs more than the other, you would expect 39 more points from the heavier RB. Over a 16 game season, that difference translates to 2.4 more points per game, which could be the difference between making or missing the playoffs.
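To make the interaction concrete, the sketch below plugs the coefficients from the model summary into a small Python function. The coefficient values are copied from the output above; the function names are mine. It also shows the WR side of the story, which the simulation above skips.

```python
# Coefficients copied from the summary(train_fit) output above
COEF = {
    "intercept": -176.63257,
    "weight_lbs": 1.31363,
    "posWR": 235.87977,
    "pick": -0.31399,
    "weight_lbs:posWR": -1.21318,
}

def predict_points(weight_lbs, pick, pos):
    """Predicted rookie-season points from the fitted linear model."""
    is_wr = 1 if pos == "WR" else 0          # RB is the reference level
    return (COEF["intercept"]
            + COEF["weight_lbs"] * weight_lbs
            + COEF["posWR"] * is_wr
            + COEF["pick"] * pick
            + COEF["weight_lbs:posWR"] * weight_lbs * is_wr)

def points_per_pound(pos):
    """Slope of predicted points with respect to weight for one position."""
    is_wr = 1 if pos == "WR" else 0
    return COEF["weight_lbs"] + COEF["weight_lbs:posWR"] * is_wr

print(round(points_per_pound("RB"), 2))        # 1.31
print(round(points_per_pound("WR"), 2))        # 0.1
print(round(30 * points_per_pound("RB"), 1))   # 39.4
```

For WRs the interaction term nearly cancels the weight effect, leaving roughly 0.1 points per pound. The 39.4 figure differs slightly from the 39 quoted above only because the text rounds the slope to 1.3 first.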

90 percent of weight values fall between 180 and 230, so I’d feel comfortable generalizing these findings to all RBs within this range. The data is pretty sparse outside of this range, so the relationships in the model would likely break down. Indeed, I’m fairly certain a 320lb RB would not score more points than a 220lb RB; there is some point where weight likely has a negative effect on performance, but we won’t see that in the data because a 320lb RB won’t ever see the field.

While these findings are interesting, it’s not clear how helpful the model is in terms of informing our draft strategy. To answer that question, we’ll determine how the model performs on the hold-out set.

r_sq_test = round(r_squared(fdata$test_df$total_pts, 
                            predict(train_fit, fdata$test_df)), 
                  1)
median_abs_err = round(mdn_abs_error(fdata$test_df$total_pts, 
                                        predict(train_fit, fdata$test_df)), 
                       2)
print(paste0("TEST R-SQUARED: ", r_sq_test))
print(paste0("TEST MAE: ", median_abs_err))

## [1] "TEST R-SQUARED: 0.3"
## [1] "TEST MAE: 31.7"

We can explain approximately 30 percent of the variance in the test set, and the median absolute error is about 32 points. Performance on the test set (R-squared of 0.3, median absolute error of 31.7) is in the same ballpark as on the validation folds (0.17 and 26.15), which suggests the model generalizes reasonably well. However, there is a lot of variance left to explain! If we wanted to improve these predictions, there are a number of other variables to consider, including:

Injury
Strength of Offensive Line
Presence of skilled existing running back
Ratio of pass to run plays in the prior season

These inputs have an impact on how many points a fantasy RB will score. But if we want to keep it simple, the key takeaway is this: When it’s fantasy draft time and you’re debating between a few rookie RBs drafted around the same spot, go with the big guy. It just might be the difference between fantasy glory or facing the receiving end of one of these…
