Overview
Rankings are a pervasive part of modern life, especially for big decisions – like shelling out thousands of dollars for an education. Indeed, college rankings are amongst the most widely cited rankings in existence. Students, parents, and college administrators fret about where their school lands on these highly debated lists. But does it really matter in terms of future earnings potential? While an education is more than a simple “means to an end”, making a decent living is amongst the reasons why students pull all-nighters, work terrible summer jobs, and do everything they can to position themselves for a lucrative, meaningful career. Accordingly, this post will focus on the following topics:
🤓 The relationship between Mid Career Pay and College Ranking after accounting for external variables (e.g., % of STEM degrees conferred each year)
🤓 Which variables exhibit the strongest relationship with pay
🤓 Which colleges’ graduates over/under earn relative to our predictions
Since I’m a big fan of learning by doing, I’ve included all of the code used to answer these questions. If you’re simply interested in the answers, click here.
Data Collection
We’ll leverage the following data sources for our analysis:
2018 Payscale Data of Median Mid Career Earnings By College (with additional data about each college)
2017 Forbes Undergraduate College Rankings
2018 Cost of Living Index by State
Let’s start by collecting the earnings and rankings data. The reticulate package allows us to call two Python functions (outlined below) to scrape this data and return it to R as a DataFrame. For the sake of simplicity, all of the scripts referenced herein exist in the same folder as the .R script. The pacman package will do the heavy lifting here, loading all previously installed packages and installing any referenced but uninstalled packages.
library(pacman)
pacman::p_load('readr',
'reshape',
'dplyr',
'janitor',
'ggplot2',
'broom',
'rvest',
'reticulate',
'glue',
'caret',
'kableExtra',
'artyfarty',
'patchwork',
'sjPlot',
'purrr',
'stringr',
'tibble'
)
# set plotting theme to black & white
theme_set(theme_bw())
# set plotting colors
color_pal = pal("five38")
# set custom plotting theme (defined in the Appendix)
source('my_plot_theme.R')
We’ve loaded/installed the relevant packages and loaded our custom plotting theme (see the Appendix for the theme definition). Next we’ll tell R which version of Python to use.
reticulate::use_python('/anaconda/envs/py36/bin/python', required = TRUE)
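To confirm that reticulate is bound to the intended interpreter, an optional sanity check is py_config:
# print the Python configuration reticulate resolved
reticulate::py_config()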
Let’s source the Python functions (i.e., bring them into our R environment) with the following command:
reticulate::source_python('rankings_earnings_post_functions.py')
We now have access to the two functions described below – collect_payscale_info and collect_college_ranks. Each will scrape data from the web, convert the results into a Pandas DataFrame, and then return the final result back to R as a DataFrame.
import pandas as pd
import urllib3
from bs4 import BeautifulSoup
import re
import ast

HTTP = urllib3.PoolManager()

#
# Function to scrape payscale information
#
def collect_payscale_info():
    base_url = 'https://www.payscale.com/college-salary-report/bachelors?page=101'
    page = HTTP.request('GET', base_url)
    soup = BeautifulSoup(page.data, 'lxml')
    # the salary data lives in a javascript variable embedded in the page
    all_colleges = (str(soup)
                    .split("var collegeSalaryReportData = ")[1]
                    .split('},{')
                    )
    college_list = []
    for i in all_colleges:
        try:
            # convert each college's string representation into a dict
            college_list.append(ast.literal_eval("{" + i.replace("[{", "") + "}"))
        except Exception as e:
            print(e)
    college_df = pd.DataFrame(college_list)
    return(college_df)

#
# Function to scrape college rankings information
#
def collect_college_ranks(college_names_list):
    base_url = 'https://www.forbes.com/colleges/'
    rank_list = []
    count = 1  # progress counter
    for college_name in college_names_list:
        print(str(round(count / float(len(college_names_list)) * 100, 2)) + '% Complete')
        try:
            tmp_url = base_url + college_name.lower().replace(" ", "-") + "/"
            page = HTTP.request('GET', tmp_url)
            soup = BeautifulSoup(page.data, 'lxml')
            college_rank = (re.search(r'class="rankonlist">#(.*?) <a href="/top-colleges/list/"',
                                      str(soup))
                            .group(1)
                            )
            rank_list.append([college_name, college_rank])
        except Exception as e:
            # colleges without a rank on the page are recorded as None
            rank_list.append([college_name, None])
            print(e)
        count += 1
    rank_df = pd.DataFrame(rank_list)
    rank_df = rank_df.rename(columns={rank_df.columns[0]: "college_name",
                                      rank_df.columns[1]: "rank"})
    return(rank_df)
Let’s start by collecting data about each college. We’ll gather the following fields:
- Percent_Female: Percentage of the student body that is female
- Percent_Stem: Percentage of students majoring in a Science, Technology, Engineering, or Math field
- Undergraduate_Enrollment: Number of students enrolled as undergraduate students
- School_Sector: Whether the college is Public or Private
- Rank: 2017 Forbes undergraduate college ranking
- COL_Index: Cost of Living index at the state level
- Mid_Career_Median_Pay: Median Salary for alumni with 10+ years of experience
college_salary_df = collect_payscale_info() %>%
clean_names() %>%
select(friendly_name,
state,
percent_female,
percent_stem,
undergraduate_enrollment,
school_sector,
mid_career_median_pay
) %>%
rename(college_name = friendly_name)
Below is a glimpse of the first few rows of data for each college.
college_name | state | percent_female | percent_stem | undergraduate_enrollment | school_sector | mid_career_median_pay |
---|---|---|---|---|---|---|
Harvey Mudd College | California | 0.46 | 0.85 | 831 | Private not-for-profit | 155800 |
Princeton University | New Jersey | 0.48 | 0.47 | 5666 | Private not-for-profit | 147800 |
Massachusetts Institute of Technology | Massachusetts | 0.46 | 0.69 | 4680 | Private not-for-profit | 147000 |
SUNY Maritime College | New York | 0.1 | 0.35 | 1817 | Public | 145100 |
United States Military Academy | New York | 0.19 | 0.43 | 4536 | Public | 144300 |
United States Naval Academy | Maryland | 0.23 | 0.6 | 4535 | Public | 143800 |
California Institute of Technology | California | 0.39 | 0.97 | 1034 | Private not-for-profit | 142500 |
Babson College | Massachusetts | 0.48 | 0 | 2211 | Private not-for-profit | 141700 |
Harvard University | Massachusetts | 0.5 | 0.19 | 14128 | Private not-for-profit | 140700 |
Stanford University | California | 0.48 | 0.49 | 8302 | Private not-for-profit | 140400 |
Dartmouth College | New Hampshire | 0.49 | 0.32 | 4695 | Private not-for-profit | 140300 |
Williams College | Massachusetts | 0.51 | 0.33 | 2233 | Private not-for-profit | 138400 |
United States Air Force Academy | Colorado | 0.23 | 0.49 | 4112 | Public | 138300 |
Webb Institute | New York | 0.19 | 1 | 91 | Private not-for-profit | 138200 |
Samuel Merritt University | California | 0.81 | 0 | 851 | Private not-for-profit | 138100 |
Colgate University | New York | 0.56 | 0.27 | 2991 | Private not-for-profit | 137700 |
Stevens Institute of Technology | New Jersey | 0.29 | 0.8 | 3076 | Private not-for-profit | 136900 |
Colorado School of Mines | Colorado | 0.28 | 0.94 | 4821 | Public | 136100 |
United States Merchant Marine Academy | New York | 0.17 | 0.54 | 908 | Public | 135900 |
University of Pennsylvania | Pennsylvania | 0.53 | 0.22 | 13096 | Private not-for-profit | 134800 |
Looks good! Next, we’ll pass the college names as an argument to our collect_college_ranks function. We’ll iterate through each college and extract its ranking, which ranges from 1 to 650.
college_rankings_df = collect_college_ranks(college_salary_df$college_name) %>%
clean_names() %>%
filter(is.na(rank) == FALSE) %>%
arrange(rank)
Our last piece of data is the Cost of Living in each college’s state of residence. College graduates frequently work in the same state where they attended college, and Cost of Living is related to how much companies pay their employees. By accounting for such expenses, we’ll have better estimates of the impact of rankings on pay. This data is housed on a website that publishes current Cost of Living estimates by state. We’ll switch back to R and use the rvest package for web scraping.
table_path = '/html/body/div[5]/div/table'
url = "https://www.missourieconomy.org/indicators/cost_of_living/"
col_table = url %>%
read_html() %>%
html_nodes(xpath = table_path) %>%
html_table(header = TRUE)
# note: col_index = cost of living index
col_df = data.frame(col_table[[1]])[,c(1, 3)] %>%
setNames(., c('state', 'col_index')) %>%
slice(2:(n() - 1)) %>%
mutate(col_index = as.numeric(col_index)) %>%
arrange(desc(col_index))
print(head(col_df, 10))
## state col_index
## 1 Hawaii 188.9
## 2 District of Columbia 161.0
## 3 California 137.2
## 4 New York 134.0
## 5 Connecticut 133.0
## 6 Alaska 130.5
## 7 Oregon 129.4
## 8 Maryland 129.3
## 9 Massachusetts 129.2
## 10 New Jersey 122.3
Having lived on the West Coast for the past three years, these numbers check out! The next step is to confirm that we can join our Cost of Living data with college salaries. A simple left join will identify any inconsistencies in the key (state) that we are using to merge our datasets.
college_salary_df %>%
select(state) %>%
distinct() %>%
left_join(col_df, by = 'state') %>%
filter(is.na(col_index) == TRUE)
## # A tibble: 4 x 2
## state col_index
## <chr> <dbl>
## 1 District of Columbia NA
## 2 Puerto Rico NA
## 3 Guam NA
## 4 <NA> NA
The only mismatch within the United States is the District of Columbia. We’ll reformat the state names and then join all three data sources – salary, rankings, and Cost of Living – to complete our final data set.
final_df = college_salary_df %>%
inner_join(college_rankings_df, by = 'college_name') %>%
  inner_join(col_df %>%
               # assumed fix: trim stray whitespace so state names
               # (e.g., 'District of Columbia') match across sources
               mutate(state = str_trim(state)),
             by = 'state'
             ) %>%
arrange(rank) %>%
select(-state) %>%
select(rank, everything()) %>%
mutate(percent_stem = as.numeric(percent_stem),
percent_female = as.numeric(percent_female),
undergraduate_enrollment = as.numeric(undergraduate_enrollment),
school_sector = factor(if_else(school_sector == 'Public', 1, 0))
) %>%
na.omit() %>%
data.frame()
rank | college_name | percent_female | percent_stem | undergraduate_enrollment | school_sector | mid_career_median_pay | col_index |
---|---|---|---|---|---|---|---|
1 | Harvard University | 0.50 | 0.19 | 14128 | 0 | 140700 | 129.2 |
2 | Stanford University | 0.48 | 0.49 | 8302 | 0 | 140400 | 137.2 |
3 | Yale University | 0.51 | 0.23 | 7150 | 0 | 132100 | 133.0 |
4 | Princeton University | 0.48 | 0.47 | 5666 | 0 | 147800 | 122.3 |
5 | Massachusetts Institute of Technology | 0.46 | 0.69 | 4680 | 0 | 147000 | 129.2 |
6 | California Institute of Technology | 0.39 | 0.97 | 1034 | 0 | 142500 | 137.2 |
7 | University of Pennsylvania | 0.53 | 0.22 | 13096 | 0 | 134800 | 98.4 |
8 | Duke University | 0.49 | 0.26 | 7253 | 0 | 134400 | 94.5 |
9 | Brown University | 0.54 | 0.39 | 7214 | 0 | 132000 | 120.8 |
10 | Pomona College | 0.50 | 0.43 | 1677 | 0 | 114100 | 137.2 |
11 | Claremont McKenna College | 0.48 | 0.14 | 1359 | 0 | 130800 | 137.2 |
12 | Dartmouth College | 0.49 | 0.32 | 4695 | 0 | 140300 | 106.6 |
13 | Williams College | 0.51 | 0.33 | 2233 | 0 | 138400 | 129.2 |
14 | Columbia University in the City of New York | 0.47 | 0.30 | 8623 | 0 | 124700 | 134.0 |
15 | Cornell University | 0.51 | 0.43 | 14656 | 0 | 123900 | 134.0 |
16 | University of Chicago | 0.48 | 0.20 | 6046 | 0 | 117000 | 95.7 |
17 | Amherst College | 0.50 | 0.27 | 1921 | 0 | 122800 | 129.2 |
18 | Harvey Mudd College | 0.46 | 0.85 | 831 | 0 | 155800 | 137.2 |
19 | Swarthmore College | 0.50 | 0.34 | 1606 | 0 | 126200 | 98.4 |
20 | United States Naval Academy | 0.23 | 0.60 | 4535 | 1 | 143800 | 129.3 |
21 | Georgetown University | 0.56 | 0.12 | 8719 | 0 | 129500 | 161.0 |
22 | Rice University | 0.47 | 0.50 | 4051 | 0 | 130200 | 91.4 |
23 | Bowdoin College | 0.50 | 0.35 | 1952 | 0 | 115500 | 115.9 |
24 | United States Military Academy | 0.19 | 0.43 | 4536 | 1 | 144300 | 134.0 |
25 | Haverford College | 0.52 | 0.38 | 1305 | 0 | 119500 | 98.4 |
26 | University of Notre Dame | 0.48 | 0.26 | 8626 | 0 | 127900 | 90.0 |
27 | Vanderbilt University | 0.50 | 0.22 | 7046 | 0 | 116000 | 89.4 |
28 | Northwestern University | 0.51 | 0.19 | 10611 | 0 | 113300 | 95.7 |
29 | University of California-Berkeley | 0.52 | 0.34 | 29591 | 1 | 130100 | 137.2 |
30 | Johns Hopkins University | 0.55 | 0.33 | 8152 | 0 | 112200 | 129.3 |
31 | Washington and Lee University | 0.49 | 0.21 | 1867 | 0 | 131700 | 101.1 |
32 | Tufts University | 0.51 | 0.24 | 5697 | 0 | 120200 | 129.2 |
33 | Wesleyan University | 0.55 | 0.22 | 3091 | 0 | 117700 | 133.0 |
34 | Davidson College | 0.51 | 0.24 | 1928 | 0 | 111400 | 94.5 |
35 | Bates College | 0.50 | 0.26 | 1903 | 0 | 125000 | 115.9 |
36 | Washington University in St Louis | 0.54 | 0.31 | 8440 | 0 | 113400 | 88.6 |
37 | Carleton College | 0.52 | 0.40 | 2093 | 0 | 113800 | 103.2 |
38 | University of Michigan-Ann Arbor | 0.49 | 0.38 | 29124 | 1 | 104600 | 90.2 |
39 | Middlebury College | 0.53 | 0.18 | 3849 | 0 | 113200 | 117.2 |
40 | University of Virginia-Main Campus | 0.55 | 0.21 | 18930 | 1 | 118400 | 101.1 |
41 | United States Air Force Academy | 0.23 | 0.49 | 4112 | 1 | 138300 | 107.3 |
42 | Colgate University | 0.56 | 0.27 | 2991 | 0 | 137700 | 134.0 |
43 | Scripps College | 1.00 | 0.24 | 1023 | 0 | 96700 | 137.2 |
44 | University of Southern California | 0.50 | 0.24 | 20543 | 0 | 119300 | 137.2 |
45 | Carnegie Mellon University | 0.44 | 0.64 | 6869 | 0 | 126400 | 98.4 |
47 | Wellesley College | 0.95 | 0.29 | 2644 | 0 | 100900 | 129.2 |
48 | University of California-Los Angeles | 0.56 | 0.31 | 31906 | 1 | 114800 | 137.2 |
49 | Boston College | 0.53 | 0.11 | 10297 | 0 | 115300 | 129.2 |
50 | Vassar College | 0.57 | 0.27 | 2625 | 0 | 98900 | 134.0 |
51 | United States Merchant Marine Academy | 0.17 | 0.54 | 908 | 1 | 135900 | 134.0 |
52 | New York University | 0.57 | 0.17 | 30000 | 0 | 117000 | 134.0 |
53 | Oberlin College | 0.56 | 0.23 | 3042 | 0 | 107600 | 93.0 |
54 | Barnard College | 1.00 | 0.19 | 2669 | 0 | 103700 | 134.0 |
55 | Bucknell University | 0.52 | 0.37 | 3721 | 0 | 124700 | 98.4 |
56 | Lafayette College | 0.49 | 0.41 | 2541 | 0 | 123800 | 98.4 |
57 | Grinnell College | 0.55 | 0.36 | 1774 | 0 | 96200 | 91.5 |
58 | College of William and Mary | 0.57 | 0.20 | 6890 | 1 | 111900 | 101.1 |
59 | Pitzer College | 0.57 | 0.13 | 1094 | 0 | 91100 | 137.2 |
60 | Wake Forest University | 0.52 | 0.14 | 4966 | 0 | 118400 | 94.5 |
61 | Colby College | 0.53 | 0.28 | 2055 | 0 | 104200 | 115.9 |
62 | Cooper Union for the Advancement of Science and Art | 0.34 | 0.54 | 942 | 0 | 124200 | 134.0 |
63 | Kenyon College | 0.55 | 0.17 | 1764 | 0 | 110400 | 93.0 |
64 | Santa Clara University | 0.50 | 0.30 | 5759 | 0 | 134500 | 137.2 |
65 | Hamilton College | 0.52 | 0.28 | 2036 | 0 | 107000 | 134.0 |
66 | Emory University | 0.58 | 0.21 | 7764 | 0 | 109800 | 91.6 |
67 | Lehigh University | 0.44 | 0.52 | 5325 | 0 | 133900 | 98.4 |
68 | University of North Carolina at Chapel Hill | 0.58 | 0.18 | 19312 | 1 | 96000 | 94.5 |
69 | University of Illinois at Urbana-Champaign | 0.44 | 0.37 | 34687 | 1 | 109600 | 95.7 |
70 | Reed College | 0.55 | 0.26 | 1491 | 0 | 103900 | 129.4 |
71 | Smith College | 1.00 | 0.23 | 2603 | 0 | 98400 | 129.2 |
72 | University of Maryland-College Park | 0.46 | 0.32 | 31050 | 1 | 107100 | 129.3 |
73 | Villanova University | 0.54 | 0.23 | 7871 | 0 | 114700 | 98.4 |
74 | Babson College | 0.48 | 0.00 | 2211 | 0 | 141700 | 129.2 |
75 | Yeshiva University | 0.45 | 0.13 | 2949 | 0 | 122300 | 134.0 |
76 | Colorado College | 0.54 | 0.28 | 2322 | 0 | 97000 | 107.3 |
77 | Occidental College | 0.57 | 0.25 | 2178 | 0 | 107700 | 137.2 |
78 | Whitman College | 0.58 | 0.38 | 1562 | 0 | 107700 | 109.3 |
79 | University of Washington-Seattle Campus | 0.52 | 0.32 | 34486 | 1 | 108800 | 109.3 |
80 | University of Florida | 0.55 | 0.27 | 36703 | 1 | 97800 | 99.1 |
81 | Georgia Institute of Technology-Main Campus | 0.35 | 0.78 | 16694 | 1 | 128700 | 91.6 |
82 | University of Richmond | 0.55 | 0.14 | 3910 | 0 | 106600 | 101.1 |
83 | University of California-Santa Barbara | 0.52 | 0.28 | 21897 | 1 | 118800 | 137.2 |
84 | Trinity College | 0.48 | 0.18 | 2434 | 0 | 108700 | 133.0 |
85 | University of Rochester | 0.51 | 0.32 | 6821 | 0 | 107400 | 134.0 |
86 | Boston University | 0.60 | 0.23 | 22685 | 0 | 113500 | 129.2 |
87 | University of Wisconsin-Madison | 0.51 | 0.32 | 33400 | 1 | 98400 | 96.8 |
88 | George Washington University | 0.56 | 0.14 | 12039 | 0 | 115900 | 161.0 |
89 | College of the Holy Cross | 0.50 | 0.22 | 2759 | 0 | 119700 | 129.2 |
90 | Trinity University | 0.52 | 0.22 | 2298 | 0 | 101600 | 91.4 |
91 | The University of Texas at Austin | 0.52 | 0.26 | 42193 | 1 | 107400 | 91.4 |
92 | Franklin and Marshall College | 0.53 | 0.21 | 2417 | 0 | 112400 | 98.4 |
93 | University of California-San Diego | 0.47 | 0.54 | 27918 | 1 | 124900 | 137.2 |
94 | Skidmore College | 0.61 | 0.21 | 2798 | 0 | 97300 | 134.0 |
95 | Bryn Mawr College | 1.00 | 0.19 | 1430 | 0 | 89900 | 98.4 |
96 | Furman University | 0.56 | 0.16 | 2887 | 0 | 104000 | 97.8 |
97 | University of Miami | 0.52 | 0.22 | 12243 | 0 | 92400 | 99.1 |
98 | Southern Methodist University | 0.50 | 0.20 | 6866 | 0 | 108500 | 91.4 |
99 | Brandeis University | 0.57 | 0.26 | 3852 | 0 | 114200 | 129.2 |
100 | Macalester College | 0.60 | 0.33 | 2238 | 0 | 97700 | 103.2 |
101 | St Olaf College | 0.57 | 0.32 | 3122 | 0 | 98700 | 103.2 |
We have a total of 578 colleges with complete data. Let’s get moving on our analysis!
Analysis
We’ll start by examining the relationship between our two main variables of interest: College Rank and Mid Career Pay.
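A minimal sketch of that plot – a scatterplot of pay against rank with a smoother – looks like this:
# scatterplot of mid career pay against college rank
ggplot(final_df, aes(x = rank, y = mid_career_median_pay)) +
  geom_point(alpha = 0.5, color = color_pal[1]) +
  stat_smooth(se = FALSE, color = color_pal[2]) +
  my_plot_theme() +
  scale_y_continuous(labels = scales::dollar) +
  xlab("College Rank") +
  ylab("Mid Career Median Pay")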
Earnings drop much more quickly for highly ranked colleges than for lower ranked ones. Let’s break this out by bucketing the ranks and examining the slope for each bucket – that is, the expected change in salary for each one-position drop in rank.
final_df = final_df %>%
mutate(rank_bucket = case_when(rank >= 1 & rank <= 50 ~ '1 - 50',
rank >= 51 & rank <= 100 ~ '51 - 100',
rank > 100 & rank <= 200 ~ '101 - 200',
rank > 200 & rank <= 400 ~ '201 - 400',
rank > 400 ~ '> 400'
)
)
rank_bucket_est = final_df %>%
group_by(rank_bucket) %>%
do(tidy(lm(mid_career_median_pay ~ rank, data=.))) %>%
select(rank_bucket, term, estimate) %>%
reshape::cast(rank_bucket ~ term, value = 'estimate') %>%
clean_names() %>%
dplyr::rename(rank_coeff = rank) %>%
data.frame() %>%
arrange(rank_coeff)
rank_bucket | intercept | rank_coeff |
---|---|---|
1 - 50 | 140453.6 | -599.28741 |
51 - 100 | 127265.4 | -219.40936 |
101 - 200 | 115381.3 | -79.85206 |
201 - 400 | 110338.7 | -55.96746 |
> 400 | 102147.0 | -35.61756 |
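As a quick arithmetic check on these coefficients, the slope for the 1 - 50 bucket implies that moving one position down the rankings costs about $600 in expected pay:
# expected pay change per one-position drop in rank (1 - 50 bucket)
intercept = 140453.6
rank_coeff = -599.28741
(intercept + rank_coeff * 11) - (intercept + rank_coeff * 10)
## [1] -599.2874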
While the relationship between earnings and rank is non-linear, this provides a rough estimate of what we originally noticed in the first scatterplot. Specifically, for colleges in the 1 - 50 bucket, a one-unit drop in rank is associated with a decrease in pay of ~$600. In contrast, for those in the > 400 bucket, a one-unit drop is associated with only a ~$36 reduction in pay. Let’s bring this concept to life by visualizing the relationship with our grouped ranking variable.
rank_bucket_est %>%
inner_join(final_df, by = 'rank_bucket') %>%
mutate(predicted_income = rank * rank_coeff + intercept) %>%
mutate(rank_bucket = factor(rank_bucket,
levels = c('1 - 50',
'51 - 100',
'101 - 200',
'201 - 400',
'> 400'
)
)
) %>%
ggplot(aes(x = rank, y = predicted_income, color = rank_bucket)) +
geom_point() +
geom_line() +
geom_point(data = final_df, aes(x = rank,
y = mid_career_median_pay),
alpha = 0.25
) +
my_plot_theme() +
ylab("Predicted Mid Career Pay") +
scale_y_continuous(labels=scales::dollar) +
scale_x_continuous(breaks = seq(0, max(final_df$rank), by = 50)) +
labs(color = 'Rank Bucket') +
scale_color_manual(values = pal("five38"))
At this point, we’ve established that (1) rank is a good predictor of earnings, and (2) the nature of this relationship varies by the level of the rank, indicating that a modeling approach that handles non-linearity would likely outperform our current linear model. Accordingly, we’ll try out three separate models – Linear Regression, a Generalized Additive Model (GAM), and a Random Forest – to see which performs best on a holdout set. Once we identify the best approach (from a modeling perspective), we’ll start considering the role of the other variables as well.
row.names(final_df) = final_df$college_name
final_df = final_df %>%
select(-rank_bucket, -college_name)
y_var = 'mid_career_median_pay'
x_var = setdiff(names(final_df), y_var)
model_form = as.formula(paste0(y_var,
" ~ ",
paste0(x_var, collapse = " + ")
)
)
print(model_form)
## mid_career_median_pay ~ rank + percent_female + percent_stem +
## undergraduate_enrollment + school_sector + col_index
Now that we’ve established our model formula, let’s do a quick quality check on our data. We’ll first consider the presence of variables with near zero variance. Such variables take (nearly) the same value for every observation (e.g., all 0s) or are highly unbalanced (e.g., 10,000 zeros and a single one). This can be problematic for certain modeling approaches, in that the model won’t fit properly. Second, we’ll explore the bivariate correlations amongst our predictors. Highly correlated variables can interfere with interpretability and lead to incorrect parameter estimates. If you’ve ever run a linear regression and the direction or size of the resulting coefficients didn’t make sense, this might be the reason. We’ll check for both of these issues below.
# identifies any variables with near zero variance, which can lead to model instability
near_zero_vars = nearZeroVar(final_df[x_var])
# identifies any variables with absolute pairwise correlation > .7
correlated_vars = findCorrelation(cor(final_df[x_var[x_var != 'school_sector']]),
cutoff = .7
)
print(c(paste0("N correlated Vars: ", length(correlated_vars)),
paste0("N low variance Vars: ", length(near_zero_vars))
)
)
## [1] "N correlated Vars: 0" "N low variance Vars: 0"
Looks good. Let’s compare our three models and select the one that provides the best out-of-sample performance using the caret package. We’ll use 10-fold cross-validation and repeat the validation process five times.
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 5
)
set.seed(1)
model_lm = train(model_form,
data=final_df,
method="lm",
trControl=control
)
set.seed(1)
model_gam = train(model_form,
data=final_df,
method="gam",
trControl=control
)
set.seed(1)
model_rf = train(model_form,
data=final_df,
method="rf",
trControl=control,
ntree = 100
)
Model performance is based on Root Mean Square Error (RMSE), R-squared, and Mean Absolute Error (MAE). While performance across these metrics will be related, each tells a different story. For example, RMSE squares each of the errors before they are averaged, while MAE simply takes the absolute value. RMSE puts a larger penalty on models with big errors relative to MAE. Thus, if reliability and consistency are particularly important, RMSE might be the better metric. In the current context, we aren’t concerned with this tradeoff – and obviously it’s easiest when all metrics point to the same model!
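A toy example makes the distinction concrete – a single large error moves RMSE far more than MAE:
# four errors, one of them large
errors = c(1, 1, 1, 10)
sqrt(mean(errors^2))
## [1] 5.074446
mean(abs(errors))
## [1] 3.25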
model_perf = resamples(list(LM=model_lm,
GAM=model_gam,
RF=model_rf)
)$values %>%
clean_names() %>%
select(-resample) %>%
reshape::melt() %>%
as_tibble()
model_perf$model_metric = model_perf$variable %>%
map(function(x) str_split(pattern = "_", x))
model_perf$model = model_perf$model_metric %>%
map(function(x) x[[1]][1]) %>%
unlist()
model_perf$metric =
model_perf$model_metric %>%
map(function(x) x[[1]][2]) %>%
unlist()
model_perf = model_perf %>%
select(-variable, -model_metric)
Let’s visualize the performance of each model across 50 trials (10 folds X 5 repeats) with a boxplot.
ggplot(model_perf, aes(x = model, y = value, color = model)) +
geom_boxplot() +
facet_grid(metric ~ ., scales = 'free') +
my_plot_theme() +
scale_color_manual(values = color_pal) +
ylab("Metric Value") +
xlab("Model")
The GAM performed best, followed by the Random Forest and the Linear Regression model. Let’s do a quick review of each by considering the in-sample errors – that is, the errors we get when making predictions against the same dataset we trained on. A simple approach is to plot the fitted (predicted) values against the errors (residuals).
insample_predictions = final_df %>%
mutate(lm_pred = as.vector(predict(model_lm, final_df)),
gam_pred = as.vector(predict(model_gam, final_df)),
rf_pred = as.vector(predict(model_rf, final_df))
) %>%
select(y_var, ends_with('_pred')) %>%
reshape::melt(y_var) %>%
dplyr::rename(predicted_earnings = value,
actual_earnings = mid_career_median_pay,
model = variable
) %>%
mutate(residual = actual_earnings - predicted_earnings)
ggplot(insample_predictions,
aes(x = predicted_earnings,
y = residual,
color = model
)) +
geom_point(alpha = 0.5, size = 2) +
facet_grid(model ~ .) +
geom_abline(intercept = 0, slope = 0) +
stat_smooth(size = 2, se = FALSE) +
scale_color_manual(values = pal("five38")[c(1, 3, 2)]) +
my_plot_theme() +
xlab("Predicted Earnings") +
ylab("Residual")
A few takeaways from this visual. First, the pattern of residuals from the linear model makes it apparent that a non-linear model (e.g., GAM or RF) is the better fit for this dataset. Second, the GAM did the best job of limiting bias across the entire prediction range: we are about as likely to over- as under-predict anywhere in the range, whereas the RF and LM are biased at the tails. Third, the tight spread of the RF errors around zero indicates that it performs much better in-sample than out-of-sample, suggesting the RF model is overfitting. We could address this by (1) acquiring more data, and/or (2) changing some of the model parameters to encourage a simpler, more robust model – a process referred to as regularization.
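As an illustrative sketch of option (2), we could have caret tune mtry while forcing larger terminal nodes – the specific values below are for illustration only, not a tuned configuration:
# illustrative only: tune mtry and grow simpler trees via a larger nodesize
rf_grid = expand.grid(mtry = c(2, 3, 4))
set.seed(1)
model_rf_reg = train(model_form,
                     data = final_df,
                     method = "rf",
                     trControl = control,
                     tuneGrid = rf_grid,
                     ntree = 100,
                     nodesize = 20 # larger terminal nodes -> shallower, simpler trees
                     )
However, interpretability is a key component of our analysis, and GAMs have this quality in spades. Let’s move forward with the GAM and rank the overall importance of each variable.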
var_imp = varImp(model_gam)$importance %>%
dplyr::rename(importance = Overall) %>%
rownames_to_column(var = 'variable')
ggplot(var_imp, aes(x = reorder(variable, importance), y = importance)) +
geom_bar(stat = 'identity',
fill = pal("five38")[1],
color = 'black') +
coord_flip() +
my_plot_theme() +
xlab("") +
ylab("Importance")
This provides a global measure of importance for each of our variables. Let’s take this a step further and examine the marginal effects of the four most important variables.
x_var_plot = x_var[!x_var %in% c('undergraduate_enrollment', 'school_sector')]
marginal_effects_plots = list()
for(var in x_var_plot){
tmp_effects_plot = plot_model(model_gam$finalModel,
type = "pred",
terms = var
)
tmp_effects_plot = tmp_effects_plot +
ggtitle(glue::glue("Predicted Pay by {var}")) +
xlab(glue::glue("{var}")) +
ylab("Mid Career Median Pay") +
theme_bw() +
my_plot_theme() +
scale_y_continuous(labels=scales::dollar)
marginal_effects_plots[[var]] = tmp_effects_plot
}
Below we’ll take our four separate plots and organize them into a single, unified figure via the patchwork package.
(marginal_effects_plots$rank +
marginal_effects_plots$percent_female +
marginal_effects_plots$percent_stem +
marginal_effects_plots$col_index +
patchwork::plot_layout(ncol = 2)
)
Most of these findings aren’t too surprising: graduates who attended a top-ranked college in a state with a high cost of living, and who likely majored in a STEM subject, tend to have the highest pay. However, the relationship between pay and the percentage of female graduates was unexpected. While the earnings discrepancy between men and women has been well documented, it was surprising to see that the gender distribution of a college was a stronger predictor of earnings than cost of living or major. While I’d put less faith in the predictions at the extreme ranges (i.e., all-male or all-female colleges), it is unfortunate to see such a discrepancy.
Over and Under Performers
The last question to consider is which colleges’ graduates over/under perform relative to their expected pay. Looking at where your model goes wrong is a great way to identify missing variables. Let’s have a look at the top and bottom 10 colleges based on the largest positive and negative residual values.
final_df$college_name = row.names(final_df)
final_df$predicted_pay = as.vector(predict(model_gam, final_df))
final_df$residual_pay = final_df$mid_career_median_pay - final_df$predicted_pay
top10_over_under = bind_rows(
final_df %>%
arrange(desc(residual_pay)) %>%
head(10) %>%
mutate(earnings_group = 'More Than Expected'),
final_df %>%
arrange(residual_pay) %>%
head(10) %>%
mutate(earnings_group = 'Less Than Expected')
)
ggplot(top10_over_under,
aes(x = reorder(college_name, residual_pay),
y = residual_pay,
fill = earnings_group
)
) +
geom_bar(stat = 'identity', color = 'black') +
coord_flip() +
my_plot_theme() +
ylab("Error") +
xlab("") +
scale_fill_manual(values = color_pal) +
theme(legend.title=element_blank(),
legend.text = element_text(size = 15)
)
It is interesting to note that Babson and SUNY Maritime are the top two over-performers, meaning that their graduates have higher pay than expected. A quick Google search reveals what these two colleges have in common: they both emphasize specialization in lucrative fields. Babson focuses exclusively on entrepreneurship and business, while SUNY Maritime prepares graduates for what I believe to be lucrative careers around boats?
In contrast, the only pattern that emerges amongst the under-performers is that most are small, private, liberal arts colleges. While our model accounted for school sector (private vs. public) and undergraduate enrollment, the vocational emphasis of the two top over-performers suggests our model would likely improve with a feature capturing whether a college has a Liberal Arts vs. Vocational focus (a hypothetical sketch follows below).
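As a purely hypothetical sketch of that feature – the flag below is hand-labeled for two colleges, while a real version would require classifying every college in the dataset:
# illustrative only: hand-label a vocational-focus flag
vocational_colleges = c('Babson College', 'SUNY Maritime College')
final_df = final_df %>%
  mutate(vocational_focus = factor(if_else(college_name %in% vocational_colleges, 1, 0)))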
Conclusion: Does Rank Matter?
It’s important to note that our outcome variable – Median Mid Career Pay – is a summary statistic. Thus, we do not know how much information each college contributed to our estimates, as the raw data (i.e., the individual responses associated with each college) are not available. With that caveat in mind, the findings seem directionally correct: where you attend college is strongly associated with how much you earn later in life, and this is especially true for top-ranked colleges. The difference between attending a college ranked in the top 10 versus the top 100 has substantial pay implications, while the same difference matters far less amongst lower-ranked colleges.
Hopefully you enjoyed the post. I’d love to hear your feedback, so feel free to comment below!
Appendix
my_plot_theme = function(){
font_family = "Helvetica"
font_face = "bold"
return(theme(
axis.text.x = element_text(size = 16, face = font_face, family = font_family),
axis.text.y = element_text(size = 16, face = font_face, family = font_family),
axis.title.x = element_text(size = 16, face = font_face, family = font_family),
axis.title.y = element_text(size = 16, face = font_face, family = font_family),
strip.text.y = element_text(size = 16, face = font_face, family = font_family),
plot.title = element_text(size = 20, face = font_face, family = font_family),
plot.caption = element_text(size = 11, face = "italic", hjust = 0),
legend.position = "top",
legend.title = element_text(size = 16,
face = font_face,
family = font_family),
legend.text = element_text(size = 20,
face = font_face,
family = font_family),
legend.key = element_rect(size = 3),
legend.key.size = unit(3, 'lines')
))
}
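The theme drops into any plot like a normal ggplot2 theme layer; for example, with a built-in dataset:
# quick demo of the custom theme on mtcars
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  my_plot_theme()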