College Rankings and Pay

Overview

Rankings are a pervasive part of modern life, especially for big decisions – like shelling out thousands of dollars for an education. Indeed, college rankings are amongst the most widely cited rankings in existence. Students, parents, and college administrators fret about where their school lands on these hotly debated lists. But does rank really matter in terms of future earnings potential? While an education is more than a simple “means to an end”, making a decent living is amongst the reasons why students pull all-nighters, work terrible summer jobs, and do everything they can to position themselves for a lucrative, meaningful career. Accordingly, this post will focus on the following topics:

🤓 The relationship between Mid Career Pay and College Ranking after accounting for external variables (e.g., % of STEM degrees conferred each year)

🤓 Which variables exhibit the strongest relationship with pay

🤓 Which colleges’ graduates over- or under-earn relative to our predictions

Since I’m a big fan of learning by doing, I’ve included all of the code used to answer these questions. If you’re only interested in the answers, skip ahead to the Analysis section.

Data Collection

We’ll leverage the following data sources for our analysis:

  • 2018 Payscale Data of Median Mid Career Earnings By College (with additional data about each college)

  • 2017 Forbes Undergraduate College Rankings

  • 2018 Cost of Living Index by State

Let’s start by collecting the earnings and rankings data. The reticulate package allows us to call two Python functions (outlined below) that scrape this data and return it as a DataFrame back into R. For the sake of simplicity, all of the scripts referenced herein exist in the same folder as the .R script. The pacman package will do the heavy lifting here, loading all previously installed packages and installing any referenced but uninstalled packages.

library(pacman)
pacman::p_load('readr',
               'reshape',
               'dplyr',
               'janitor',
               'ggplot2',
               'broom',
               'rvest',
               'reticulate',
               'glue',
               'caret',
               'kableExtra',
               'artyfarty',
               'patchwork',
               'sjPlot',
               'purrr',
               'stringr',
               'tibble'
               )
# set plotting theme to black & white
theme_set(theme_bw())

# set plotting colors 
color_pal = pal("five38")

# load custom plotting theme (defined in the Appendix)
source('my_plot_theme.R')

We’ve loaded/installed the relevant packages and loaded our custom plotting theme (the theme itself appears in the Appendix). Next, we’ll tell R which version of Python to use.

reticulate::use_python('/anaconda/envs/py36/bin/python', required = TRUE)

Let’s source the Python functions (i.e., bring them into our R environment) with the following command:

reticulate::source_python('rankings_earnings_post_functions.py')

We now have access to the two functions described below – collect_payscale_info and collect_college_ranks. Each will scrape data from the web, convert the results into a Pandas DataFrame, and then return the final result back to R as a DataFrame.

import pandas as pd
import urllib3
from bs4 import BeautifulSoup
import re
import ast
HTTP = urllib3.PoolManager()
#
# Function to scrape payscale information
#
def collect_payscale_info():
    base_url = 'https://www.payscale.com/college-salary-report/bachelors?page=101'
    page = HTTP.request('GET', base_url)
    soup = BeautifulSoup(page.data, 'lxml')
    all_colleges = (str(soup)
                    .split("var collegeSalaryReportData = ")[1]
                    .split('},{')
                    )
    college_list = []
    for i in all_colleges:
        try:
            college_list.append(ast.literal_eval("{" + i.replace("[{", "") + "}"))
        except Exception as e:
            print(e)
    college_df = pd.DataFrame(college_list)
    return(college_df)
#
# Function to scrape college rankings information  
#
def collect_college_ranks(college_names_list):
    base_url = 'https://www.forbes.com/colleges/'
    rank_list = []
    count = 0
    for college_name in college_names_list:
        print(str(round(count / float(len(college_names_list)) * 100, 2)) + '% Complete')
        try:
            tmp_url = base_url + college_name.lower().replace(" ", "-") + "/"
            page = HTTP.request('GET', tmp_url)
            soup = BeautifulSoup(page.data, 'lxml')
            college_rank = (re.search(r'class="rankonlist">#(.*?) <a href="/top-colleges/list/"', 
                                      str(soup))
                              .group(1)
                           )
            rank_list.append([college_name, college_rank])
        except Exception as e:
            rank_list.append([college_name, None])
            print(e)
        count += 1
    rank_df = pd.DataFrame(rank_list)
    rank_df = rank_df.rename(columns={rank_df.columns[0]: "college_name",
                                      rank_df.columns[1]: "rank"
                                  })
    return(rank_df)

Let’s start by collecting data about each college. We’ll gather the following fields:

  • Percent_Female: Percentage of the student body that is female
  • Percent_Stem: Percentage of students majoring in a Science, Technology, Engineering, or Math field
  • Undergraduate_Enrollment: Number of students enrolled as undergraduate students
  • School_Sector: Whether the college is public or private
  • Rank: 2017 Forbes undergraduate college rankings
  • COL_Index: Cost of Living index at the state level
  • Mid_Career_Median_Pay: Median salary for alumni with 10+ years of experience

college_salary_df = collect_payscale_info() %>% 
  clean_names() %>% 
  select(friendly_name,
         state,
         percent_female,
         percent_stem,
         undergraduate_enrollment,
         school_sector,
         mid_career_median_pay
         ) %>% 
  rename(college_name = friendly_name)

Below we’ll take a glimpse at the first few rows of data for each college.

college_name state percent_female percent_stem undergraduate_enrollment school_sector mid_career_median_pay
Harvey Mudd College California 0.46 0.85 831 Private not-for-profit 155800
Princeton University New Jersey 0.48 0.47 5666 Private not-for-profit 147800
Massachusetts Institute of Technology Massachusetts 0.46 0.69 4680 Private not-for-profit 147000
SUNY Maritime College New York 0.1 0.35 1817 Public 145100
United States Military Academy New York 0.19 0.43 4536 Public 144300
United States Naval Academy Maryland 0.23 0.6 4535 Public 143800
California Institute of Technology California 0.39 0.97 1034 Private not-for-profit 142500
Babson College Massachusetts 0.48 0 2211 Private not-for-profit 141700
Harvard University Massachusetts 0.5 0.19 14128 Private not-for-profit 140700
Stanford University California 0.48 0.49 8302 Private not-for-profit 140400
Dartmouth College New Hampshire 0.49 0.32 4695 Private not-for-profit 140300
Williams College Massachusetts 0.51 0.33 2233 Private not-for-profit 138400
United States Air Force Academy Colorado 0.23 0.49 4112 Public 138300
Webb Institute New York 0.19 1 91 Private not-for-profit 138200
Samuel Merritt University California 0.81 0 851 Private not-for-profit 138100
Colgate University New York 0.56 0.27 2991 Private not-for-profit 137700
Stevens Institute of Technology New Jersey 0.29 0.8 3076 Private not-for-profit 136900
Colorado School of Mines Colorado 0.28 0.94 4821 Public 136100
United States Merchant Marine Academy New York 0.17 0.54 908 Public 135900
University of Pennsylvania Pennsylvania 0.53 0.22 13096 Private not-for-profit 134800

Looks good! Next, we’ll pass the college names as an argument into our collect_college_ranks function. We’ll iterate through each college and extract its ranking, which ranges from 1 to 650.

college_rankings_df = collect_college_ranks(college_salary_df$college_name) %>% 
  clean_names() %>% 
  mutate(rank = as.numeric(rank)) %>% 
  filter(!is.na(rank)) %>% 
  arrange(rank)

Our last piece of data is the Cost of Living for each college’s state of residence. College graduates frequently work in the same state they attend college, and Cost of Living is related to how much companies pay their employees. By accounting for such expenses, we’ll have better estimates of the impact of rankings on pay. This data is housed on a website that publishes current Cost of Living estimates by state. We’ll switch back to R and use the rvest package for web scraping.

table_path = '/html/body/div[5]/div/table'
url = "https://www.missourieconomy.org/indicators/cost_of_living/"

col_table = url %>% 
  read_html() %>% 
  html_nodes(xpath = table_path) %>% 
  html_table(header = TRUE)

# note: col_index = cost of living index
col_df = data.frame(col_table[[1]])[,c(1, 3)] %>% 
  setNames(., c('state', 'col_index')) %>% 
  slice(2:(n() - 1)) %>% 
  mutate(col_index = as.numeric(col_index)) %>% 
  arrange(desc(col_index))

print(head(col_df, 10))
##                      state col_index
## 1                   Hawaii     188.9
## 2  District of    Columbia     161.0
## 3               California     137.2
## 4                 New York     134.0
## 5              Connecticut     133.0
## 6                   Alaska     130.5
## 7                   Oregon     129.4
## 8                 Maryland     129.3
## 9            Massachusetts     129.2
## 10              New Jersey     122.3

Having lived on the West Coast for the past three years, these numbers check out! The next step is to confirm that we can join our Cost of Living data with college salaries. A simple left join will identify any inconsistencies in the key (state) that we are using to merge our datasets.

college_salary_df %>% 
  select(state) %>% 
  distinct() %>% 
  left_join(col_df, by = 'state') %>% 
  filter(is.na(col_index) == TRUE)
## # A tibble: 4 x 2
##   state                col_index
##   <chr>                    <dbl>
## 1 District of Columbia        NA
## 2 Puerto Rico                 NA
## 3 Guam                        NA
## 4 <NA>                        NA

The only mismatch within the United States is the District of Columbia. We’ll reformat the state names and then join all three data sources – salary, rankings, and Cost of Living – to complete our final data set.

final_df = college_salary_df %>% 
  inner_join(college_rankings_df, by = 'college_name') %>% 
  inner_join(col_df %>% 
             mutate(state = ifelse(state == 'District of    Columbia', 
                                            'District of Columbia',
                                             state
                                   )
                                  ),
                    by = 'state'
                    ) %>% 
  arrange(rank) %>% 
  select(-state) %>% 
  select(rank, everything()) %>% 
  mutate(percent_stem = as.numeric(percent_stem),
         percent_female = as.numeric(percent_female),
         undergraduate_enrollment =  as.numeric(undergraduate_enrollment),
         school_sector = factor(if_else(school_sector == 'Public', 1, 0))
                ) %>% 
  na.omit() %>% 
  data.frame()

rank college_name percent_female percent_stem undergraduate_enrollment school_sector mid_career_median_pay col_index
1 Harvard University 0.50 0.19 14128 0 140700 129.2
2 Stanford University 0.48 0.49 8302 0 140400 137.2
3 Yale University 0.51 0.23 7150 0 132100 133.0
4 Princeton University 0.48 0.47 5666 0 147800 122.3
5 Massachusetts Institute of Technology 0.46 0.69 4680 0 147000 129.2
6 California Institute of Technology 0.39 0.97 1034 0 142500 137.2
7 University of Pennsylvania 0.53 0.22 13096 0 134800 98.4
8 Duke University 0.49 0.26 7253 0 134400 94.5
9 Brown University 0.54 0.39 7214 0 132000 120.8
10 Pomona College 0.50 0.43 1677 0 114100 137.2
11 Claremont McKenna College 0.48 0.14 1359 0 130800 137.2
12 Dartmouth College 0.49 0.32 4695 0 140300 106.6
13 Williams College 0.51 0.33 2233 0 138400 129.2
14 Columbia University in the City of New York 0.47 0.30 8623 0 124700 134.0
15 Cornell University 0.51 0.43 14656 0 123900 134.0
16 University of Chicago 0.48 0.20 6046 0 117000 95.7
17 Amherst College 0.50 0.27 1921 0 122800 129.2
18 Harvey Mudd College 0.46 0.85 831 0 155800 137.2
19 Swarthmore College 0.50 0.34 1606 0 126200 98.4
20 United States Naval Academy 0.23 0.60 4535 1 143800 129.3
21 Georgetown University 0.56 0.12 8719 0 129500 161.0
22 Rice University 0.47 0.50 4051 0 130200 91.4
23 Bowdoin College 0.50 0.35 1952 0 115500 115.9
24 United States Military Academy 0.19 0.43 4536 1 144300 134.0
25 Haverford College 0.52 0.38 1305 0 119500 98.4
26 University of Notre Dame 0.48 0.26 8626 0 127900 90.0
27 Vanderbilt University 0.50 0.22 7046 0 116000 89.4
28 Northwestern University 0.51 0.19 10611 0 113300 95.7
29 University of California-Berkeley 0.52 0.34 29591 1 130100 137.2
30 Johns Hopkins University 0.55 0.33 8152 0 112200 129.3
31 Washington and Lee University 0.49 0.21 1867 0 131700 101.1
32 Tufts University 0.51 0.24 5697 0 120200 129.2
33 Wesleyan University 0.55 0.22 3091 0 117700 133.0
34 Davidson College 0.51 0.24 1928 0 111400 94.5
35 Bates College 0.50 0.26 1903 0 125000 115.9
36 Washington University in St Louis 0.54 0.31 8440 0 113400 88.6
37 Carleton College 0.52 0.40 2093 0 113800 103.2
38 University of Michigan-Ann Arbor 0.49 0.38 29124 1 104600 90.2
39 Middlebury College 0.53 0.18 3849 0 113200 117.2
40 University of Virginia-Main Campus 0.55 0.21 18930 1 118400 101.1
41 United States Air Force Academy 0.23 0.49 4112 1 138300 107.3
42 Colgate University 0.56 0.27 2991 0 137700 134.0
43 Scripps College 1.00 0.24 1023 0 96700 137.2
44 University of Southern California 0.50 0.24 20543 0 119300 137.2
45 Carnegie Mellon University 0.44 0.64 6869 0 126400 98.4
47 Wellesley College 0.95 0.29 2644 0 100900 129.2
48 University of California-Los Angeles 0.56 0.31 31906 1 114800 137.2
49 Boston College 0.53 0.11 10297 0 115300 129.2
50 Vassar College 0.57 0.27 2625 0 98900 134.0
51 United States Merchant Marine Academy 0.17 0.54 908 1 135900 134.0
52 New York University 0.57 0.17 30000 0 117000 134.0
53 Oberlin College 0.56 0.23 3042 0 107600 93.0
54 Barnard College 1.00 0.19 2669 0 103700 134.0
55 Bucknell University 0.52 0.37 3721 0 124700 98.4
56 Lafayette College 0.49 0.41 2541 0 123800 98.4
57 Grinnell College 0.55 0.36 1774 0 96200 91.5
58 College of William and Mary 0.57 0.20 6890 1 111900 101.1
59 Pitzer College 0.57 0.13 1094 0 91100 137.2
60 Wake Forest University 0.52 0.14 4966 0 118400 94.5
61 Colby College 0.53 0.28 2055 0 104200 115.9
62 Cooper Union for the Advancement of Science and Art 0.34 0.54 942 0 124200 134.0
63 Kenyon College 0.55 0.17 1764 0 110400 93.0
64 Santa Clara University 0.50 0.30 5759 0 134500 137.2
65 Hamilton College 0.52 0.28 2036 0 107000 134.0
66 Emory University 0.58 0.21 7764 0 109800 91.6
67 Lehigh University 0.44 0.52 5325 0 133900 98.4
68 University of North Carolina at Chapel Hill 0.58 0.18 19312 1 96000 94.5
69 University of Illinois at Urbana-Champaign 0.44 0.37 34687 1 109600 95.7
70 Reed College 0.55 0.26 1491 0 103900 129.4
71 Smith College 1.00 0.23 2603 0 98400 129.2
72 University of Maryland-College Park 0.46 0.32 31050 1 107100 129.3
73 Villanova University 0.54 0.23 7871 0 114700 98.4
74 Babson College 0.48 0.00 2211 0 141700 129.2
75 Yeshiva University 0.45 0.13 2949 0 122300 134.0
76 Colorado College 0.54 0.28 2322 0 97000 107.3
77 Occidental College 0.57 0.25 2178 0 107700 137.2
78 Whitman College 0.58 0.38 1562 0 107700 109.3
79 University of Washington-Seattle Campus 0.52 0.32 34486 1 108800 109.3
80 University of Florida 0.55 0.27 36703 1 97800 99.1
81 Georgia Institute of Technology-Main Campus 0.35 0.78 16694 1 128700 91.6
82 University of Richmond 0.55 0.14 3910 0 106600 101.1
83 University of California-Santa Barbara 0.52 0.28 21897 1 118800 137.2
84 Trinity College 0.48 0.18 2434 0 108700 133.0
85 University of Rochester 0.51 0.32 6821 0 107400 134.0
86 Boston University 0.60 0.23 22685 0 113500 129.2
87 University of Wisconsin-Madison 0.51 0.32 33400 1 98400 96.8
88 George Washington University 0.56 0.14 12039 0 115900 161.0
89 College of the Holy Cross 0.50 0.22 2759 0 119700 129.2
90 Trinity University 0.52 0.22 2298 0 101600 91.4
91 The University of Texas at Austin 0.52 0.26 42193 1 107400 91.4
92 Franklin and Marshall College 0.53 0.21 2417 0 112400 98.4
93 University of California-San Diego 0.47 0.54 27918 1 124900 137.2
94 Skidmore College 0.61 0.21 2798 0 97300 134.0
95 Bryn Mawr College 1.00 0.19 1430 0 89900 98.4
96 Furman University 0.56 0.16 2887 0 104000 97.8
97 University of Miami 0.52 0.22 12243 0 92400 99.1
98 Southern Methodist University 0.50 0.20 6866 0 108500 91.4
99 Brandeis University 0.57 0.26 3852 0 114200 129.2
100 Macalester College 0.60 0.33 2238 0 97700 103.2
101 St Olaf College 0.57 0.32 3122 0 98700 103.2

We have a total of 578 colleges with complete data. Let’s get moving on our analysis!

Analysis

We’ll start by examining the relationship between our two main variables of interest: College Rank and Mid Career Pay.

Earnings drop much more quickly for highly ranked colleges than for lower ranked colleges. Let’s break this out by bucketing the ranks and examining the slope for each bucket – that is, the expected change in salary as a college’s rank drops by one place.

final_df = final_df %>% 
  mutate(rank_bucket = case_when(rank >= 1 & rank <= 50 ~ '1 - 50',
                                 rank >= 51 & rank <= 100 ~ '51 - 100',
                                 rank > 100 & rank <= 200 ~ '101 - 200',
                                 rank > 200 & rank <= 400 ~ '201 - 400',
                                 rank > 400 ~ '> 400'
                                 )
         )

rank_bucket_est = final_df %>% 
  group_by(rank_bucket) %>% 
  do(tidy(lm(mid_career_median_pay ~ rank, data=.))) %>% 
  select(rank_bucket, term, estimate) %>% 
  reshape::cast(rank_bucket ~ term, value = 'estimate') %>% 
  clean_names() %>% 
  dplyr::rename(rank_coeff = rank) %>% 
  data.frame() %>% 
  arrange(rank_coeff)

rank_bucket intercept rank_coeff
1 - 50 140453.6 -599.28741
51 - 100 127265.4 -219.40936
101 - 200 115381.3 -79.85206
201 - 400 110338.7 -55.96746
> 400 102147.0 -35.61756

While the relationship between earnings and rank is non-linear, this provides a rough estimate of what we originally noticed in the first scatterplot. Specifically, for colleges that fall in the 1 - 50 bucket, a one-unit reduction in rank is associated with a decrease in pay of ~$600. In contrast, for those in the > 400 bucket, a one-unit reduction in rank is associated with only a ~$36 reduction in pay. Let’s bring this concept to life by visualizing the relationship with our grouped ranking variable.
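To make those slopes concrete, here’s a quick back-of-the-envelope calculation in Python. The bucket labels and coefficients are copied from the table above; the predicted_pay helper is just the fitted line, pay ≈ intercept + slope × rank.

```python
# Per-bucket linear fits from the table above: pay = intercept + slope * rank
coeffs = {
    '1 - 50': (140453.6, -599.28741),
    '> 400': (102147.0, -35.61756),
}

def predicted_pay(bucket, rank):
    """Predicted mid career pay for a given rank within a bucket."""
    intercept, slope = coeffs[bucket]
    return intercept + slope * rank

# Sliding 49 places inside the top bucket costs roughly $29k in mid career pay...
top_drop = predicted_pay('1 - 50', 1) - predicted_pay('1 - 50', 50)
# ...while the same 49-place slide past rank 400 costs under $2k.
tail_drop = predicted_pay('> 400', 401) - predicted_pay('> 400', 450)
print(round(top_drop), round(tail_drop))  # 29365 1745
```

In other words, the same change in rank is worth roughly 17 times more pay at the top of the list than at the bottom.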

rank_bucket_est %>% 
  inner_join(final_df, by = 'rank_bucket') %>% 
  mutate(predicted_income = rank * rank_coeff + intercept) %>% 
  mutate(rank_bucket = factor(rank_bucket,
                              levels = c('1 - 50',
                                         '51 - 100',
                                         '101 - 200',
                                         '201 - 400',
                                         '> 400'
                                         )
                              )
         ) %>% 
  ggplot(aes(x = rank, y = predicted_income, color = rank_bucket)) + 
    geom_point() + 
    geom_line() + 
    geom_point(data = final_df, aes(x = rank,
                                    y = mid_career_median_pay),
               alpha = 0.25
               ) + 
    my_plot_theme() + 
    ylab("Predicted Mid Career Pay") + 
    scale_y_continuous(labels=scales::dollar) + 
    scale_x_continuous(breaks = seq(0, max(final_df$rank), by = 50)) + 
    labs(color = 'Rank Bucket') + 
    scale_color_manual(values = pal("five38"))

At this point, we’ve established that (1) rank is a good predictor of earnings, and (2) the nature of this relationship varies by the level of the rank, indicating that a modeling approach that handles non-linearity would likely perform better than our current linear model. Accordingly, we’ll try out three separate models – Linear Regression, General Additive Model, and Random Forest – to see which performs best on a holdout set. Once we identify the best approach (from a modeling perspective), we’ll start considering the role of the other variables as well.

row.names(final_df) = final_df$college_name
final_df = final_df %>%
  select(-rank_bucket, -college_name)

y_var = 'mid_career_median_pay'
x_var = setdiff(names(final_df), y_var)
model_form = as.formula(paste0(y_var,
                               " ~ ",
                               paste0(x_var, collapse = " + ")
                               )
                        )
print(model_form)
## mid_career_median_pay ~ rank + percent_female + percent_stem + 
##     undergraduate_enrollment + school_sector + col_index

Now that we’ve established our model formula, let’s do a quick quality check on our data. We’ll first consider the presence of variables with near zero variance. Such variables take on a single value (e.g., all zeros) or are highly unbalanced (e.g., 10000 zeros and 1 one). This can be problematic for certain modeling approaches, in that the model won’t fit properly. Second, we’ll explore the bivariate correlations amongst our predictors. Highly correlated variables can interfere with interpretability and lead to incorrect parameter estimates. If you’ve ever run a linear regression and the direction or size of the resulting coefficients didn’t make sense, this might be the reason. We’ll check for both of these issues below.

# identifies any variables with near zero variance, which can lead to model instability
near_zero_vars = nearZeroVar(final_df[x_var])
# identifies any variables with absolute correlation > .7
correlated_vars = findCorrelation(cor(final_df[x_var[x_var != 'school_sector']]),
                                  cutoff = .7
                                  )

print(c(paste0("N correlated Vars: ", length(correlated_vars)),
        paste0("N low variance Vars: ", length(near_zero_vars))
        )
      )
## [1] "N correlated Vars: 0"   "N low variance Vars: 0"

Looks good. Let’s compare our three models and select the one that provides the best out-of-sample performance using the caret package. We’ll use 10-fold cross validation and repeat the validation process five times.

control = trainControl(method = 'repeatedcv',
                      number = 10,
                      repeats = 5
                      )

set.seed(1)
model_lm = train(model_form, 
                 data=final_df, 
                 method="lm", 
                 trControl=control
                 )

set.seed(1)
model_gam = train(model_form, 
                  data=final_df, 
                  method="gam", 
                  trControl=control
                  )
set.seed(1)
model_rf = train(model_form, 
                 data=final_df, 
                 method="rf", 
                 trControl=control,
                 ntree = 100
                 )

Model performance is based on Root Mean Square Error (RMSE), R-squared, and Mean Absolute Error (MAE). While performance across these metrics will be related, each tells a different story. For example, RMSE squares each of the errors before they are averaged, while MAE simply takes the absolute value. RMSE puts a larger penalty on models with big errors relative to MAE. Thus, if reliability and consistency are particularly important, RMSE might be the better metric. In the current context, we aren’t concerned with this tradeoff – and obviously it’s easiest when all metrics point to the same model!
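The RMSE-versus-MAE distinction is easiest to see with a toy example. Below, two hypothetical sets of model errors (invented for illustration, not from our data) have identical MAE, but RMSE flags the one containing a single large miss:

```python
import math

# Two hypothetical error sets with the same total absolute error
steady_errors = [2.8, 2.8, 2.8, 2.8, 2.8]  # consistent, modest misses
spiky_errors = [1, 1, 1, 1, 10]            # mostly small, one big miss

def mae(errors):
    # Mean Absolute Error: average of the absolute errors
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # Root Mean Square Error: squares errors before averaging
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(round(mae(steady_errors), 2), round(mae(spiky_errors), 2))    # 2.8 2.8
print(round(rmse(steady_errors), 2), round(rmse(spiky_errors), 2))  # 2.8 4.56
```

Because squaring amplifies the 10-point miss, the spiky model’s RMSE jumps well above its MAE, which is exactly the penalty on inconsistency described above.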

model_perf = resamples(list(LM=model_lm, 
                            GAM=model_gam, 
                            RF=model_rf)
                       )$values %>% 
  clean_names() %>% 
  select(-resample) %>% 
  reshape::melt() %>% 
  as_tibble()

model_perf = model_perf %>%
  mutate(variable = as.character(variable),
         model = str_split_fixed(variable, "_", 2)[, 1],
         metric = str_split_fixed(variable, "_", 2)[, 2]) %>%
  select(-variable)

Let’s visualize the performance of each model across 50 trials (10 folds X 5 repeats) with a boxplot.

ggplot(model_perf, aes(x = model, y = value, color = model)) + 
  geom_boxplot() + 
  facet_grid(metric ~ ., scales = 'free') + 
  my_plot_theme() + 
  scale_color_manual(values = color_pal) + 
  ylab("Metric Value") + 
  xlab("Model")

The GAM performed best, followed by the Random Forest and the Linear Regression. Let’s do a quick review of each by considering the in-sample errors – that is, the errors when we make predictions against the same dataset we trained our model on. A simple approach is to plot the fitted (predicted) values against the errors (residuals).

insample_predictions = final_df %>% 
  mutate(lm_pred = as.vector(predict(model_lm, final_df)),
         gam_pred = as.vector(predict(model_gam, final_df)),
         rf_pred = as.vector(predict(model_rf, final_df))
         ) %>% 
  select(y_var, ends_with('_pred')) %>% 
  reshape::melt(y_var) %>% 
  dplyr::rename(predicted_earnings = value,
                actual_earnings = mid_career_median_pay,
                model = variable
                ) %>% 
  mutate(residual = actual_earnings - predicted_earnings)

ggplot(insample_predictions, 
       aes(x = predicted_earnings,
           y = residual,
           color = model
           )) + 
  geom_point(alpha = 0.5, size = 2) + 
  facet_grid(model ~ .) + 
  geom_abline(intercept = 0, slope = 0) + 
  stat_smooth(size = 2, se = FALSE) + 
  scale_color_manual(values = pal("five38")[c(1, 3, 2)]) + 
  my_plot_theme() + 
  xlab("Predicted Earnings") + 
  ylab("Residual")

A few takeaways from this visual. First, it’s apparent that a non-linear model (e.g., GAM or RF) performs best on this dataset, based on the pattern of residuals associated with the linear model. Second, the GAM performed best in terms of limiting bias across the entire prediction range. This means we are just as likely to over- or under-predict across the range of values, whereas RF and LM are biased at the tails of the range. Third, RF performs much better in-sample than out-of-sample, as is indicated by the close spread of the errors around zero. This indicates that the RF model is likely overfitting, which we could address by (1) acquiring more data, and/or (2) changing some of the model parameters to encourage a simpler, more robust model – a process referred to as regularization. However, interpretability is a key component of our analysis, and GAM models have this quality in spades. Let’s move forward with the GAM and rank the overall importance of each variable.
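The “biased at the tails” observation can also be checked numerically rather than visually. A minimal sketch (the function name and the toy numbers below are made up for illustration, not actual model output): bin the predictions and average the residuals within each bin. An unbiased model, like our GAM, hovers near zero in every bin; a model that under-predicts at the upper tail shows a large positive mean in the top bin.

```python
def mean_residual_by_bin(predicted, residuals, n_bins=3):
    """Average residual within equal-width bins of the predicted value."""
    lo, hi = min(predicted), max(predicted)
    width = (hi - lo) / n_bins or 1  # guard against a zero-width range
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, r in zip(predicted, residuals):
        i = min(int((p - lo) / width), n_bins - 1)  # clamp max value into last bin
        sums[i] += r
        counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy example: residuals that grow at the upper tail, like the RF/LM pattern
predicted = [100, 110, 120, 130, 140, 150]
residuals = [0, 0, 0, 0, 8, 10]
print(mean_residual_by_bin(predicted, residuals))  # [0.0, 0.0, 9.0]
```

The nonzero mean in the last bin is the numeric analogue of the residual curve bending away from zero at the right edge of the plot.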

var_imp = varImp(model_gam)$importance %>% 
  dplyr::rename(importance = Overall) %>% 
  rownames_to_column(var = 'variable')

ggplot(var_imp, aes(x = reorder(variable, importance), y = importance)) + 
  geom_bar(stat = 'identity', 
           fill =  pal("five38")[1],
           color = 'black') + 
  coord_flip() + 
  my_plot_theme() + 
  xlab("") + 
  ylab("Importance")

This provides a global measure of importance for all of our variables. Let’s take this a step further and examine the marginal effects of the top four most important variables.

x_var_plot = x_var[!x_var %in% c('undergraduate_enrollment', 'school_sector')]
marginal_effects_plots = list()
for(var in x_var_plot){
  tmp_effects_plot = plot_model(model_gam$finalModel, 
                                 type = "pred", 
                                 terms = var
                                )
  tmp_effects_plot = tmp_effects_plot + 
    ggtitle(glue::glue("Predicted Pay by {var}")) + 
    xlab(glue::glue("{var}")) + 
    ylab("Mid Career Median Pay") + 
    theme_bw() + 
    my_plot_theme() + 
    scale_y_continuous(labels=scales::dollar)
  
  marginal_effects_plots[[var]] = tmp_effects_plot
}

Below we’ll take our four separate plots and organize them nicely into a single, unified plot via the patchwork package.

(marginal_effects_plots$rank + 
  marginal_effects_plots$percent_female + 
  marginal_effects_plots$percent_stem +
  marginal_effects_plots$col_index + 
  patchwork::plot_layout(ncol = 2) 
)

Most of these findings aren’t too surprising: graduates who attended a top-ranked college in a state with a high cost of living, and who likely majored in a STEM subject, tend to have the highest pay. However, the relationship between pay and the percentage of female graduates was unexpected. While the earnings discrepancy between men and women has been well documented, it was surprising to see that the gender distribution of a college was a stronger predictor of earnings than cost of living or major. While I’d put less faith in the predictions at the extreme ranges (i.e., all-male or all-female colleges), it is unfortunate to see such a discrepancy.

Over and Under Performers

The last question to consider is which colleges’ graduates over- or under-perform relative to their expected pay. Looking at where your model goes wrong is a great way to identify missing variables. Let’s have a look at the top and bottom 10 colleges based on the largest positive and negative residual values.

final_df$college_name = row.names(final_df)
final_df$predicted_pay = as.vector(predict(model_gam, final_df))
final_df$residual_pay = final_df$mid_career_median_pay - final_df$predicted_pay  
top10_over_under = bind_rows(
  final_df %>% 
  arrange(desc(residual_pay)) %>% 
  head(10) %>% 
  mutate(earnings_group = 'More Than Expected'),
  final_df %>% 
  arrange(residual_pay) %>% 
  head(10) %>% 
  mutate(earnings_group = 'Less Than Expected')
                            )

ggplot(top10_over_under, 
       aes(x = reorder(college_name, residual_pay),
           y = residual_pay,
           fill = earnings_group
           )
       ) + 
    geom_bar(stat = 'identity', color = 'black') + 
    coord_flip() + 
    my_plot_theme() + 
    ylab("Error") +
    xlab("") + 
    scale_fill_manual(values = color_pal) + 
    theme(legend.title=element_blank(),
          legend.text = element_text(size = 15)
          )

It is interesting to note that Babson and SUNY Maritime are the top two over-performers, meaning that their graduates earn more than expected. A quick Google search reveals what these two colleges have in common: they both emphasize specialization in lucrative fields. Babson focuses exclusively on entrepreneurship and business, while SUNY Maritime prepares graduates for what I believe to be lucrative careers around boats?

In contrast, the only pattern that emerges amongst the under-performers is that most are small, private, liberal arts colleges. While our model accounted for school sector (private vs. public) and undergraduate enrollment, it is possible that including a factor indicating if a college emphasizes a liberal arts education could be valuable. Indeed, the vocational emphasis of the two top over-performers suggests that our model would likely improve with the addition of a feature that captures if a college has a Liberal Arts vs. Vocational focus.

Conclusion: Does Rank Matter?

It’s important to note that our outcome variable – Median Mid Career Pay – is a summary statistic. Thus, we do not know how much information each college contributed to our estimates, as the raw data (i.e., the individual responses associated with each college) are not available. However, these findings seem directionally correct. Even after considering the aforementioned caveat, it is apparent that where you attend college is strongly associated with how much you earn later in life. This is especially true for top ranked colleges. The difference between attending a college ranked in the top 10 relative to the top 100 has substantial pay implications, while this difference is less important amongst lower ranked colleges.

Hopefully you enjoyed the post. I’d love to hear your feedback, so feel free to comment below!

Appendix

my_plot_theme = function(){
  font_family = "Helvetica"
  font_face = "bold"
  return(theme(
    axis.text.x = element_text(size = 16, face = font_face, family = font_family),
    axis.text.y = element_text(size = 16, face = font_face, family = font_family),
    axis.title.x = element_text(size = 16, face = font_face, family = font_family),
    axis.title.y = element_text(size = 16, face = font_face, family = font_family),
    strip.text.y = element_text(size = 16, face = font_face, family = font_family),
    plot.title = element_text(size = 20, face = font_face, family = font_family),
    plot.caption = element_text(size = 11, face = "italic", hjust = 0),
    
    legend.position = "top",
    legend.title = element_text(size = 16,
                                face = font_face,
                                family = font_family),
    legend.text = element_text(size = 20,
                               face = font_face,
                               family = font_family),
    legend.key = element_rect(size = 3),
    legend.key.size = unit(3, 'lines'),
    legend.title=element_blank()
  ))
}
