Kickin' it with elastic net regression

With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the L2 norm (euclidean distance) of the coefficient vector which results in "shrinking" the beta coefficients. The aggressiveness of the penalty is controlled by a parameter $\lambda$.

Lasso regression is a related regularization method. Instead of using the L2 norm, though, it penalizes the L1 norm (manhattan distance) of the coefficient vector. Because it uses the L1 norm, some of the coefficients will shrink to zero while lambda increases. A similar effect would be achieved in Bayesian linear regression using a Laplacian prior (strongly peaked at zero) on each of the beta coefficients.

Because some of the coefficients shrink to zero, the lasso doubles as a crackerjack feature selection technique in addition to a solid shrinkage method. This property gives it a leg up on ridge regression. On the other hand, the lasso will occasionally achieve poor results when there's a high degree of collinearity in the features and ridge regression will perform better. Further, the L1 norm is underdetermined when the number of predictors exceeds the number of observations while ridge regression can handle this.

Elastic net regression is a hybrid approach that blends both penalization of the L2 and L1 norms. Specifically, elastic net regression minimizes the following...

$\lVert y - X\beta \rVert + \lambda[(1-\alpha)\lvert \beta \rvert_2^2 + \alpha\lvert \beta \rvert_1]$

the $\alpha$ hyper-parameter is between 0 and 1 and controls how much L2 or L1 penalization is used (0 is ridge, 1 is lasso).

The usual approach to optimizing the lambda hyper-parameter is through cross-validation—by minimizing the cross-validated mean squared prediction error—but in elastic net regression, the optimal lambda hyper-parameter also depends upon and is heavily dependent on the alpha hyper-parameter (hyper-hyper-parameter?).

This blog post takes a cross-validated approach that uses grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter for three different data sets. I also compare the performances against the stepwise regression and showcase some of the dangers of using stepwise feature selection.

mtcars

In this example, I try to predict “miles per gallon” from the other available attributes. The design matrix has 32 observations and 10 predictors and there is a high degree of collinearity (as measured by the variance inflation factors).

The left panel above shows the leave-one-out cross validation (LOOCV) mean squared error of the model with the optimal lambda (as determined again by LOOCV) for each alpha parameter from 0 to 1. This panel indicates that if our objective is to purely minimize MSE (with no regard for model complexity) than pure ridge regression outperforms any blended elastic-net model. This is probably because of the substantial collinearity. Interestingly, the lasso outperforms blended elastic net models that weight the lasso heavily.

The right panel puts things in perspective by plotting the LOOCV MSEs along with the MSE of the "kitchen sink" regression (the blue line) that includes all features in the model. As you can see, any degree of regularization offers a substantial improvement in model generalizability.

It is also plotted with two estimates of the MSE for models that blindly use the coefficients from automated bi-directional stepwise regression. The first uses the features selected by performing the stepwise procedure on the whole dataset and then assesses the model performance (the red line). The second estimate uses the step procedure and resulting features on only the training set for each fold of the cross validations. This is the estimate without the subtle but treacherous "knowledge leaking" eloquently described in this plot post. This should be considered the more correct assessment of the model. As you can see, if we weren't careful about interpreting the stepwise regression, we would have gotten an incredibly inflated and inaccurate view of the model performance.

Forest Fires

The second example uses a very-difficult-to-model dataset from University of California, Irvine machine learning repository. The task is to predict the burnt area from a forest fire given 11 predictors. It has 517 observations. Further, there is a relatively low degree of collinearity between predictors.

Again, highest performing model is the pure ridge regression. This time, the performance asymptotes as the alpha hyper-parameter increases. The variability in the MSE estimates is due to the fact that I didn't use LOOCV and used 400-k CV instead because I'm impatient.

As with the last example, the properly measured stepwise regression performance isn't so great, and the kitchen sink model outperforms it. However, in contrast to the previous example, there was a lot less variability in the selected features across folds—this is probably because of the significantly larger number of observations.

"QuickStartExample"

This dataset is a contrived one that is included with the excellent glmnet package (the one I'm using for the elastic net regression). This dataset has a relatively low degree of collinearity, has 20 features and 100 observations. I have no idea how the package authors created this dataset.

Finally, an example where the lasso outperforms ridge regression! I think this is because the dataset was specifically manufactured to have a small number of genuine predictors with large effects (as opposed to many weak predictors).

Interestingly, stepwise progression far outperforms both—probably for the very same reason. From fold to fold, there was virtually no variation in the features that the stepwise method automatically chose.

Conclusion

So, there you have it. Elastic net regression is awesome because it can perform at worst as good as the lasso or ridge and—though it didn’t on these examples—can sometimes substantially outperform both.

Also, be careful with step-wise feature selection!

PS: If, for some reason, you are interested in the R code I used to run these simulations, you can find it on this GitHub Gist.

Interactive visualization of non-linear logistic regression decision boundaries with Shiny

Model building is very often an iterative process that involves multiple steps of choosing an algorithm and hyperparameters, evaluating that model / cross validation, and optimizing the hyperparameters.

I find a great aid in this process, for classification tasks, is not only to keep track of the accuracy across models, but also to have some visual aid to note which data points are systematically misclassified and why. Is there a lot of noise? Does the model require a non-linear classifier?

My desire for visualizing the results are stymied by (a) high-dimensional data (for which we have no choice but to reduce dimensionality) and (b) the cost of task switching between tweaking the hyperparameters and re-running the plot. Unless I'm using two monitors, I can't even see the plots change in real-time.

Well... Enter Shiny.

Shiny is an R package from RStudio and other open source contributors that makes it incredibly easy to create interactive web applications from R analyses. With Shiny, I can add dropdown menus and sliders to choose algorithms or features and control hyperparameters and visualize the changes to the model in real-time right from a web browser (all in pure R and no Javascript or CSS).

Further, I can deploy this web app easily (and for free) so I can share it with my friends and colleagues.

For a first real foray into Shiny, I chose to visualize the decision boundaries of logistic regression classifiers. I chose logistic regression because I'm taking Andrew Ng's excellent Machine Learning course on Coursera, and reimplementing the algorithms in R (from GNU Octave / Matlab) and it was our last homework assignment.

The implementation of logistic regression and the visualization of the decision boundaries proved to be difficult for two reasons:

(a) The residuals of logistic regression aren't normally distributed and there exists no closed form solution that returns the coefficients that maximize the likelihood function. This means that we have to provide R's 'optim' higher-order function with a custom-written function to be minimized or maximized (we will be minimizing the cost function) and a function that returns the gradient (the differentiation of that function at that location). And...

(b) Although a linear combination of the predictor variables (a first degree polynomial hypothesis) has a linear decision boundary, adding ("faking") higher-degree polynomial features results in non-linear decision boundaries; awesome for classification, un-awesome for visualization.

crummy linear fit to circular data

great quadratic non-linear fit to circular data

The two datasets we will be using were generated using make_circles and make_moons from scikit-learn's 'datasets' module. These will both require non-linear hypothesis to achieve any kind of better-than-chance classification.

These are the supporting functions to add polynomial features, compute the hypothesis function, compute the cost function, and return the gradient:

add.poly.features <- function(x.mat, degree=2){
new.mat <- matrix(1, nrow=nrow(x.mat))
for (i in 1:degree){
for (j in 0:i){
new.mat <- cbind(new.mat, (x.mat[,1]^(i-j) * (x.mat[,2]^j)))
}
}
return(new.mat)
}

hypothesis.function <- function(param.vec, x.mat){
zed <- x.mat %*% matrix(param.vec)
return(1 / (1 + exp(-zed)))
}

get.gradient <- function(param.vec, x.mat, y.vec, lambda=0){
m <- nrow(x.mat)
modtheta <- param.vec
modtheta[1] <- 0
the.hyp <- hypothesis.function(param.vec, x.mat)
gradient <- (t(x.mat) %*% (the.hyp - y.vec) + lambda*modtheta) / m
}

cost.function <- function(param.vec, x.mat, y.vec, lambda=0){
m <- nrow(x.mat)
the.hyp <- hypothesis.function(param.vec, x.mat)
cost <- (((t(-y.vec) %*% log(the.hyp)) - (t(1-y.vec) %*% log(1-the.hyp))) / m) +
((lambda / (2*m)) * sum(param.vec[2:length(param.vec)] ^ 2))

return(cost)
}


Finally, this is the code that finds the optimal coefficients and plots the resulting hypothesis (this is wrapped in the reactive "renderPlot" Shiny function so it can be updated every time the Shiny controls are changed)

if(input$pattern=="moon") da.dataset <- moon else da.dataset <- circle da.lambda <- input$lambda
da.degree <- input$degree design.mat <- add.poly.features(da.dataset[,c(1,2)], degree=da.degree) result <- optim(par=rep(0, ncol(design.mat)), cost.function, get.gradient, x.mat=design.mat, y.vec=as.matrix(da.dataset[,3]), lambda=da.lambda, method=input$opt)

predictions <- hypothesis.function(result$par, design.mat) accuracy <- paste0(round(sum(round(predictions) == da.dataset[,3]) / 3, 2), "%") thex1 <- da.dataset[,1] thex2 <- da.dataset[,2] somex <- seq(min(thex1), max(thex1), by=.05) somex2 <- seq(min(thex2), max(thex2), length.out=length(somex)) z <- matrix(0, nrow=length(somex), ncol=length(somex)) for (i in 1:length(somex)){ for (j in 1:length(somex)){ keep <- add.poly.features(t(matrix(c(somex[i], somex2[j]))), da.degree) z[i, j] <- as.matrix(keep) %*% result$par
}
}

plot(da.dataset$X2 ~ da.dataset$X1,  pch=20,
col=c("red","green3")[da.dataset\$Y+1],
xlab="X1", ylab="X2")
title(paste("Degree:", da.degree,
" -  Lambda:", da.lambda,
"     -      Accuracy:", accuracy))

contour(somex, t(somex2), z, nlevels=1, add=TRUE, drawlabels=FALSE)


Notice that the classification dataset, the degree of the hypothesized polynomial, the regularization hyperparameter (lambda), and the optimization method are parameterized. We will control these options from the Shiny app.

Put all together, code looks like it does in this GitHub repo and yields this Shiny app.

Shiny app screeshot

Is it just me, or is what you can do with Shiny amazing?

In future iterations of my Shiny visualization of classification endeavors, I plan to:

• add support for more classification algorithms and their respective relevant hyper parameters
• use file upload to plot custom datasets
• and use dimensionality reduction automatically for datasets with more than two 'true' features

Until then, shine on you crazy diamond.