Overview
There are a few reasons for reducing the number of potential input features when building a model:
- Faster training times
- Easier interpretation
- Improved accuracy by reducing the chance of overfitting
The focus here will primarily be on point three, as we address two questions:
- Does having noisy, unnecessary variables as inputs reduce classification performance?
- And, if so, which algorithms are the most robust to the ill-effects of these unnecessary variables?
We’ll explore these questions with one of my favorite datasets, located here. Based on the methodology outlined here, a bunch of wine experts got together and evaluated both white and red wines. Each row represents the chemical properties of a wine in addition to a rating (0 = very bad, 10 = excellent). Each wine was rated by at least three people, and the median of their ratings was used as the final score. I have no idea how you get paid to drink wine, but somewhere in my life, I chose the wrong career path. Thus, the goal here is to predict the rating based on the chemical properties of the wine, such as its pH or alcohol concentration.
Let’s load up the libraries we’ll need as well as a custom plotting function.
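The exact imports aren’t shown here, but based on the functions used below, something like the following should cover them. The my_plot_theme() helper referenced throughout is a custom ggplot2 theme that isn’t included in this post, so the version below is just a stand-in placeholder.

library(data.table)  # fread() for reading the CSVs
library(dplyr)       # pipes, bind_rows(), mutate(), select(), arrange()
library(janitor)     # clean_names()
library(forcats)     # fct_reorder() for ordering the importance plot
library(ggplot2)     # plotting
library(h2o)         # GBM, random forest, deep learning, and GLM models

# Stand-in for the custom plotting theme used later (the real one isn't shown here)
my_plot_theme = function(){
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 12),
        plot.title = element_text(size = 18))
}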
Here we’ll read the wine quality ratings directly from the source. We’ll also convert the rating system from a 0-10 scale to binary for the sake of simplicity. Any wine with a rating of seven or higher receives the highly coveted “IdDrinkThat” label, while all others receive the dreaded “NoThanks” label.
red_file = 'winequality-red.csv'
white_file = 'winequality-white.csv'
base_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/'

red_wine = fread(paste0(base_url, red_file)) %>%
  data.frame()

white_wine = fread(paste0(base_url, white_file)) %>%
  data.frame()

wine_df = bind_rows(red_wine,
                    white_wine) %>%
  clean_names() %>%
  mutate(wine_class = ifelse(quality >= 7,
                             "IdDrinkThat",
                             "NoThanks")) %>%
  mutate(wine_class = as.factor(wine_class)) %>%
  select(-quality)
As usual, let’s peek at the data to make sure there aren’t any unwanted surprises.
fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | wine_class |
---|---|---|---|---|---|---|---|---|---|---|---|
7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | NoThanks |
7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | NoThanks |
7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | NoThanks |
11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | NoThanks |
7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | NoThanks |
7.4 | 0.660 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | NoThanks |
7.9 | 0.600 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 | 0.46 | 9.4 | NoThanks |
7.3 | 0.650 | 0.00 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 | 0.47 | 10.0 | IdDrinkThat |
7.8 | 0.580 | 0.02 | 2.0 | 0.073 | 9 | 18 | 0.9968 | 3.36 | 0.57 | 9.5 | IdDrinkThat |
7.5 | 0.500 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 | 0.80 | 10.5 | NoThanks |
6.7 | 0.580 | 0.08 | 1.8 | 0.097 | 15 | 65 | 0.9959 | 3.28 | 0.54 | 9.2 | NoThanks |
7.5 | 0.500 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 | 0.80 | 10.5 | NoThanks |
5.6 | 0.615 | 0.00 | 1.6 | 0.089 | 16 | 59 | 0.9943 | 3.58 | 0.52 | 9.9 | NoThanks |
7.8 | 0.610 | 0.29 | 1.6 | 0.114 | 9 | 29 | 0.9974 | 3.26 | 1.56 | 9.1 | NoThanks |
8.9 | 0.620 | 0.18 | 3.8 | 0.176 | 52 | 145 | 0.9986 | 3.16 | 0.88 | 9.2 | NoThanks |
8.9 | 0.620 | 0.19 | 3.9 | 0.170 | 51 | 148 | 0.9986 | 3.17 | 0.93 | 9.2 | NoThanks |
8.5 | 0.280 | 0.56 | 1.8 | 0.092 | 35 | 103 | 0.9969 | 3.30 | 0.75 | 10.5 | IdDrinkThat |
8.1 | 0.560 | 0.28 | 1.7 | 0.368 | 16 | 56 | 0.9968 | 3.11 | 1.28 | 9.3 | NoThanks |
7.4 | 0.590 | 0.08 | 4.4 | 0.086 | 6 | 29 | 0.9974 | 3.38 | 0.50 | 9.0 | NoThanks |
7.9 | 0.320 | 0.51 | 1.8 | 0.341 | 17 | 56 | 0.9969 | 3.04 | 1.08 | 9.2 | NoThanks |
8.9 | 0.220 | 0.48 | 1.8 | 0.077 | 29 | 60 | 0.9968 | 3.39 | 0.53 | 9.4 | NoThanks |
7.6 | 0.390 | 0.31 | 2.3 | 0.082 | 23 | 71 | 0.9982 | 3.52 | 0.65 | 9.7 | NoThanks |
7.9 | 0.430 | 0.21 | 1.6 | 0.106 | 10 | 37 | 0.9966 | 3.17 | 0.91 | 9.5 | NoThanks |
8.5 | 0.490 | 0.11 | 2.3 | 0.084 | 9 | 67 | 0.9968 | 3.17 | 0.53 | 9.4 | NoThanks |
6.9 | 0.400 | 0.14 | 2.4 | 0.085 | 21 | 40 | 0.9968 | 3.43 | 0.63 | 9.7 | NoThanks |
This dataset is exceptionally clean. Typically we would do some exploratory analysis to check for outliers, missing values, incorrect codings, or any number of other problems that can sabotage our predictions. In this case, we can go straight into variable selection.
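For completeness, a quick pass along those lines might look something like this (not part of the original analysis):

colSums(is.na(wine_df))   # any missing values?
summary(wine_df)          # ranges and distributions of each predictor
sum(duplicated(wine_df))  # duplicate rows (a few are visible in the sample above)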
We’ll initially leverage the gradient boosted machine (GBM) algorithm in the h2o library for variable selection. One of the nice features of GBM is that it automatically tells you which variables are important. Recall that GBM is an iterative algorithm that fits many simple decision trees, where the errors from the prior tree become the dependent variable for each subsequent tree. A feature’s importance is determined by how much splitting on it reduces the error within a given tree, and that reduction is then averaged across all of the trees in which the feature appears. Thus, splitting on important variables should lead to larger reductions in error during training relative to less important variables.
With that in mind, let’s do an initial pass and see which variables are important. We’ll do a 70/30 train/test split, start h2o, and fit an initial model.
trainTestSplit = function(df, split_proportion){
  set.seed(123)
  out_list = list()
  data_split = sample(1:2,
                      size = nrow(df),
                      prob = split_proportion,
                      replace = TRUE)
  out_list$train = df[data_split == 1,]
  out_list$test = df[data_split == 2,]
  return(out_list)
}

split_proportion = c(0.7, 0.3)
df_split = trainTestSplit(wine_df, split_proportion)
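As a quick sanity check (not in the original workflow), we can confirm the split sizes and verify that the class balance is similar across train and test:

sapply(df_split, nrow)                                                   # rows in each split
lapply(df_split, function(x) round(prop.table(table(x$wine_class)), 2))  # class balance per split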
Next we’ll calculate our variable importance metric.
h2o.init(nthreads = -1)
train_h2o = as.h2o(df_split$train)
test_h2o = as.h2o(df_split$test)
y_var = "wine_class"
x_var = setdiff(names(train_h2o), y_var)
gbm_fit = h2o.gbm(x = x_var,
                  y = y_var,
                  distribution = "bernoulli",
                  training_frame = train_h2o,
                  stopping_rounds = 3,
                  stopping_metric = "AUC",
                  verbose = FALSE)
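As an aside, h2o exposes the same importance table through built-in accessors, which are handy for a quick look before building a custom plot:

h2o.varimp(gbm_fit)       # importance table (relative, scaled, and percentage)
h2o.varimp_plot(gbm_fit)  # default bar plot of the same information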
And let’s plot our results based on the relative importance of each feature.
data.frame(gbm_fit@model$variable_importances) %>%
  select(variable, relative_importance) %>%
  mutate(relative_importance = round(relative_importance),
         variable = factor(variable)) %>%
  mutate(variable = fct_reorder(variable,
                                relative_importance,
                                .desc = TRUE)) %>%
  ggplot(aes(x = variable,
             y = relative_importance,
             fill = variable,
             label = as.character(relative_importance))) +
  geom_bar(stat = "identity") +
  geom_label(label.size = 1, size = 5, color = "white") +
  theme_bw() +
  my_plot_theme() +
  theme(legend.position = "none") +
  ylab("Relative Importance") +
  xlab("Variable") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 12))
It looks like all of the variables have some degree of influence on classification, so we’ll continue with the full model. Next, we’re going to calculate AUC (Area Under the ROC Curve) for the following methods:
- GBM
- Random Forest (RF)
- Neural Network (NN)
- Logistic Regression (LR)
We’ll use this as a point of comparison as additional irrelevant predictor variables are added. Our goal is to see how the addition of such variables affects AUC on the test set. Let’s calculate our baseline numbers.
rf_fit = h2o.randomForest(x = x_var,
                          y = y_var,
                          training_frame = train_h2o,
                          stopping_rounds = 3,
                          stopping_metric = "AUC")

nn_fit = h2o.deeplearning(x = x_var,
                          y = y_var,
                          training_frame = train_h2o,
                          stopping_rounds = 3,
                          stopping_metric = "AUC")

glm_fit = h2o.glm(x = x_var,
                  y = y_var,
                  family = "binomial",
                  training_frame = train_h2o)
gbm_auc = h2o.auc(h2o.performance(gbm_fit, newdata = test_h2o))
rf_auc = h2o.auc(h2o.performance(rf_fit, newdata = test_h2o))
nn_auc = h2o.auc(h2o.performance(nn_fit, newdata = test_h2o))
glm_auc = h2o.auc(h2o.performance(glm_fit, newdata = test_h2o))
auc_df = data.frame(n_noise_vars = rep(0, 4),
                    method = c('gbm', 'rf', 'nn', 'glm'),
                    AUC = c(gbm_auc, rf_auc, nn_auc, glm_auc)) %>%
  arrange(desc(AUC))
n_noise_vars | method | AUC |
---|---|---|
0 | rf | 0.9186806 |
0 | gbm | 0.8889589 |
0 | nn | 0.8417844 |
0 | glm | 0.8179276 |
This provides us with a general baseline. In the following section, we’ll see what happens to AUC as we add in lots of irrelevant predictors. But first, let’s explore why we’re using AUC as our evaluation metric.
AUC as a Measure of Classification Accuracy
Measuring model performance when your dependent variable is binary is more “involved” than measuring performance with a continuous dependent variable. With a continuous DV, large residuals are worse than small residuals, and this can be quantified along a continuum. In contrast, you can have a classifier that’s extremely accurate – it provides the correct classification 99.99% of the time – yet tells you nothing useful. This situation typically arises when you have imbalanced classes, such as trying to predict whether someone has a disease that occurs in only 1 of every 100,000 people. A classifier that simply said “no one ever has the disease” would be right 99,999 times out of 100,000.
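To make that concrete, here’s a tiny simulation (not from the original analysis) showing that an “always negative” rule scores near-perfect accuracy while detecting nothing:

set.seed(123)
disease = rbinom(1e5, 1, 1/1e5)       # roughly 1 positive case in 100,000 people
prediction = rep(0, length(disease))  # a 'classifier' that always says healthy

mean(prediction == disease)           # accuracy is essentially 1
sum(prediction == 1 & disease == 1)   # true positives found: 0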
Let’s consider the class balance in our wine dataset:
print(round(table(wine_df$wine_class)/nrow(wine_df) * 100, 1))
##
## IdDrinkThat NoThanks
## 19.7 80.3
According to our refined, discerning palate, roughly 80% of wines fall into the “NoThanks” category. We could achieve 80% accuracy by simply assigning a label of “NoThanks” to every new wine we encountered. This would obviously be a terrible idea, because then we’d never get to drink any wine! Thus, in cases where your classes are unevenly distributed, accuracy might not be the best evaluation metric.
AUC isn’t thrown off by class imbalance because it considers both the True Positive Rate and the False Positive Rate. The True Positive Rate (TPR) captures the proportion of cases in which our model said “drink this wine” and it was, in fact, a wine we would want to drink; the False Positive Rate (FPR) captures the proportion of cases in which our model said “drink this wine” but it was actually a wine we would not want to drink. A higher TPR and a lower FPR push the AUC closer to one, which means our model is improving. Hopefully, this clarifies the rationale for using AUC rather than accuracy. Now let’s get back to the original question.
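For a simplified illustration of these pieces, the sketch below computes TPR and FPR at a single threshold on toy scores, then estimates AUC with the rank-based (Wilcoxon) formulation, i.e., the probability that a randomly chosen positive is scored above a randomly chosen negative:

# Toy scores: 1 = 'IdDrinkThat' (positive class), 0 = 'NoThanks'
set.seed(123)
labels = c(rep(1, 20), rep(0, 80))  # imbalanced, like the wine data
scores = c(rnorm(20, 0.7, 0.15),    # model scores for positives
           rnorm(80, 0.4, 0.15))    # model scores for negatives

threshold = 0.5
tpr = sum(scores >= threshold & labels == 1) / sum(labels == 1)  # true positive rate
fpr = sum(scores >= threshold & labels == 0) / sum(labels == 0)  # false positive rate

# AUC: probability that a random positive outranks a random negative
auc = mean(outer(scores[labels == 1], scores[labels == 0], ">"))
round(c(tpr = tpr, fpr = fpr, auc = auc), 3)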
Adding Irrelevant Predictors
A little disclaimer before moving on. There are a number of ways to reduce the influence of irrelevant features on predictive accuracy, most notably regularization. However, we’re interested in which algorithms are least susceptible to modeling the noise associated with irrelevant predictor variables without any additional regularization or parameter tuning. Accordingly, we’ll sample the irrelevant predictors from a normal distribution with a mean of zero and a standard deviation of one. Each iteration adds an additional 100 irrelevant predictors, for a total of 20 iterations. All of the algorithms use the default parameters that come with the h2o library.
n_noise_vars = seq(100, 2000, length.out = 20)

for(i in n_noise_vars){
  print(i)
  temp_noise_df_train = data.frame(placeholder = rep(NA, nrow(df_split$train)))
  temp_noise_df_test = data.frame(placeholder = rep(NA, nrow(df_split$test)))

  # add in i irrelevant predictors to train and test
  for(j in 1:i){
    temp_noise_df_train = cbind(temp_noise_df_train,
                                data.frame(noise.var = rnorm(nrow(df_split$train), 0, 1)))
    temp_noise_df_test = cbind(temp_noise_df_test,
                               data.frame(noise.var = rnorm(nrow(df_split$test), 0, 1)))
  }

  # format names of irrelevant variables
  temp_noise_df_train = temp_noise_df_train[, 2:dim(temp_noise_df_train)[2]]
  names(temp_noise_df_train) = gsub("\\.", "", names(temp_noise_df_train))
  temp_noise_df_train = as.h2o(cbind(temp_noise_df_train,
                                     df_split$train))

  temp_noise_df_test = temp_noise_df_test[, 2:dim(temp_noise_df_test)[2]]
  names(temp_noise_df_test) = gsub("\\.", "", names(temp_noise_df_test))
  temp_noise_df_test = cbind(temp_noise_df_test,
                             df_split$test)

  x_var = setdiff(names(temp_noise_df_train), y_var)

  gbm_fit = h2o.gbm(x = x_var,
                    y = y_var,
                    distribution = "bernoulli",
                    training_frame = temp_noise_df_train,
                    stopping_rounds = 3,
                    stopping_metric = "AUC")

  rf_fit = h2o.randomForest(x = x_var,
                            y = y_var,
                            training_frame = temp_noise_df_train,
                            stopping_rounds = 3,
                            stopping_metric = "AUC")

  nn_fit = h2o.deeplearning(x = x_var,
                            y = y_var,
                            training_frame = temp_noise_df_train,
                            stopping_rounds = 3,
                            stopping_metric = "AUC")

  glm_fit = h2o.glm(x = x_var,
                    y = y_var,
                    family = "binomial",
                    training_frame = temp_noise_df_train)

  temp_noise_df_test = as.h2o(temp_noise_df_test)

  gbm_auc = h2o.auc(h2o.performance(gbm_fit, newdata = temp_noise_df_test))
  rf_auc = h2o.auc(h2o.performance(rf_fit, newdata = temp_noise_df_test))
  nn_auc = h2o.auc(h2o.performance(nn_fit, newdata = temp_noise_df_test))
  glm_auc = h2o.auc(h2o.performance(glm_fit, newdata = temp_noise_df_test))

  auc_df = rbind(auc_df,
                 data.frame(n_noise_vars = i,
                            method = c('gbm', 'rf', 'nn', 'glm'),
                            AUC = c(gbm_auc, rf_auc, nn_auc, glm_auc)))
}
And now let’s plot the results.
ggplot(auc_df, aes(x = n_noise_vars,
                   y = AUC,
                   color = method)) +
  geom_point(size = 2) +
  stat_smooth(span = 1.75,
              se = FALSE,
              size = 2) +
  theme_bw() +
  my_plot_theme() +
  ylab("AUC") +
  xlab("N Noise Variables")
A few initial observations:
- AUC for all methods, with the exception of GLM, decreased as the number of irrelevant predictors in the model increased
- GBM was the most robust against the ill effects of irrelevant predictors, while NN was the most susceptible
Although we can’t generalize these outcomes to every situation, they do align with my experience. In fact, one of the reasons I like GBM is that it tends to ignore irrelevant predictors during modeling. Neural networks are extremely powerful and, with the right tuning/regularization, will often outperform any of the methods outlined here. However, their flexibility makes them susceptible to overfitting, which leads to noise being leveraged as signal during the model building process.

This goes to show that pre-processing and feature selection are an important part of the model building process when you are dealing with lots of potential features. Beyond reducing training time, pruning irrelevant features can vastly improve model performance. As illustrated above, including irrelevant features can wreak havoc on your model when using methods with lots of flexible parameters.

It would be interesting to see how these results change if we added some regularization measures to protect against overfitting. For example, one way to reduce overfitting with Random Forest is to limit the maximum depth of each tree (at the extreme, a depth of one means each tree makes only a single split). For Neural Networks, L2 regularization can be used, which discourages the weight vectors in the network from becoming too large. If you try any of these approaches, I’d love to hear how it goes!
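If you want to experiment along those lines, a sketch like the one below could serve as a starting point; the specific parameter values are illustrative guesses rather than recommendations:

x_base = setdiff(names(train_h2o), y_var)  # original predictors only

# Random Forest with a cap on tree depth
rf_reg = h2o.randomForest(x = x_base,
                          y = y_var,
                          training_frame = train_h2o,
                          max_depth = 5)

# Neural network with L2 weight decay
nn_reg = h2o.deeplearning(x = x_base,
                          y = y_var,
                          training_frame = train_h2o,
                          l2 = 1e-4)

# Elastic-net GLM: alpha mixes L1/L2, lambda_search tunes the penalty strength
glm_reg = h2o.glm(x = x_base,
                  y = y_var,
                  family = "binomial",
                  training_frame = train_h2o,
                  alpha = 0.5,
                  lambda_search = TRUE)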