(skip to the shiny app)

Model building is very often an iterative process that involves multiple steps of choosing an algorithm and hyperparameters, evaluating that model / cross validation, and optimizing the hyperparameters.

I find a great aid in this process, for classification tasks, is not only to keep track of the accuracy across models, but also to have some visual aid to note *which* data points are systematically misclassified and why. Is there a lot of noise? Does the model require a non-linear classifier?

My desire for visualizing the results are stymied by (a) high-dimensional data (for which we have no choice but to reduce dimensionality) and (b) the cost of task switching between tweaking the hyperparameters and re-running the plot. Unless I'm using two monitors, I can't even see the plots change in real-time.

Well... Enter Shiny.

Shiny is an R package from RStudio and other open source contributors that makes it incredibly easy to create interactive web applications from R analyses. With Shiny, I can add dropdown menus and sliders to choose algorithms or features and control hyperparameters and visualize the changes to the model in real-time right from a web browser (all in pure R and no Javascript or CSS).

Further, I can deploy this web app easily (and for free) so I can share it with my friends and colleagues.

For a first real foray into Shiny, I chose to visualize the decision boundaries of logistic regression classifiers. I chose logistic regression because I'm taking Andrew Ng's excellent Machine Learning course on Coursera, and reimplementing the algorithms in R (from GNU Octave / Matlab) and it was our last homework assignment.

The implementation of logistic regression and the visualization of the decision boundaries proved to be difficult for two reasons:

(a) The residuals of logistic regression aren't normally distributed and there exists no closed form solution that returns the coefficients that maximize the likelihood function. This means that we have to provide R's 'optim' higher-order function with a custom-written function to be minimized or maximized (we will be minimizing the cost function) and a function that returns the gradient (the differentiation of that function at that location). And...

(b) Although a linear combination of the predictor variables (a first degree polynomial hypothesis) has a linear decision boundary, adding ("faking") higher-degree polynomial features results in non-linear decision boundaries; awesome for classification, un-awesome for visualization.

The two datasets we will be using were generated using make_circles and make_moons from scikit-learn's 'datasets' module. These will both require non-linear hypothesis to achieve any kind of better-than-chance classification.

These are the supporting functions to add polynomial features, compute the hypothesis function, compute the cost function, and return the gradient:

add.poly.features <- function(x.mat, degree=2){ new.mat <- matrix(1, nrow=nrow(x.mat)) for (i in 1:degree){ for (j in 0:i){ new.mat <- cbind(new.mat, (x.mat[,1]^(i-j) * (x.mat[,2]^j))) } } return(new.mat) } hypothesis.function <- function(param.vec, x.mat){ zed <- x.mat %*% matrix(param.vec) return(1 / (1 + exp(-zed))) } get.gradient <- function(param.vec, x.mat, y.vec, lambda=0){ m <- nrow(x.mat) modtheta <- param.vec modtheta[1] <- 0 the.hyp <- hypothesis.function(param.vec, x.mat) gradient <- (t(x.mat) %*% (the.hyp - y.vec) + lambda*modtheta) / m return(gradient) } cost.function <- function(param.vec, x.mat, y.vec, lambda=0){ m <- nrow(x.mat) the.hyp <- hypothesis.function(param.vec, x.mat) cost <- (((t(-y.vec) %*% log(the.hyp)) - (t(1-y.vec) %*% log(1-the.hyp))) / m) + ((lambda / (2*m)) * sum(param.vec[2:length(param.vec)] ^ 2)) return(cost) }

Finally, this is the code that finds the optimal coefficients and plots the resulting hypothesis (this is wrapped in the reactive "renderPlot" Shiny function so it can be updated every time the Shiny controls are changed)

if(input$pattern=="moon") da.dataset <- moon else da.dataset <- circle da.lambda <- input$lambda da.degree <- input$degree design.mat <- add.poly.features(da.dataset[,c(1,2)], degree=da.degree) result <- optim(par=rep(0, ncol(design.mat)), cost.function, get.gradient, x.mat=design.mat, y.vec=as.matrix(da.dataset[,3]), lambda=da.lambda, method=input$opt) predictions <- hypothesis.function(result$par, design.mat) accuracy <- paste0(round(sum(round(predictions) == da.dataset[,3]) / 3, 2), "%") thex1 <- da.dataset[,1] thex2 <- da.dataset[,2] somex <- seq(min(thex1), max(thex1), by=.05) somex2 <- seq(min(thex2), max(thex2), length.out=length(somex)) z <- matrix(0, nrow=length(somex), ncol=length(somex)) for (i in 1:length(somex)){ for (j in 1:length(somex)){ keep <- add.poly.features(t(matrix(c(somex[i], somex2[j]))), da.degree) z[i, j] <- as.matrix(keep) %*% result$par } } plot(da.dataset$X2 ~ da.dataset$X1, pch=20, col=c("red","green3")[da.dataset$Y+1], xlab="X1", ylab="X2") title(paste("Degree:", da.degree, " - Lambda:", da.lambda, " - Accuracy:", accuracy)) contour(somex, t(somex2), z, nlevels=1, add=TRUE, drawlabels=FALSE)

Notice that the classification dataset, the degree of the hypothesized polynomial, the regularization hyperparameter (lambda), and the optimization method are parameterized. We will control these options from the Shiny app.

Put all together, code looks like it does in this GitHub repo and yields this Shiny app.

Is it just me, or is what you can do with Shiny amazing?

In future iterations of my Shiny visualization of classification endeavors, I plan to:

- add support for more classification algorithms and their respective relevant hyper parameters
- use file upload to plot custom datasets
- and use dimensionality reduction automatically for datasets with more than two 'true' features

Until then, shine on you crazy diamond.

share this:
I felt immediately that this is not a coincidence that your post appears right after week 4 of dr. Ng's course has been completed :)

I am just curious as to why use optimization algorithms, as in R you have built-in functions to estimate logistic regression (e.g. glm()).

Oh, haha, the reason is I wanted to rewrite the Octave homework from the class in R and it would have been cheating to use glm :)

I feel like I understand a lot more about optimization algorithms now, even more than from just implementing the homework in Octave

Also, isn't the class great?!

Pingback :Decision Boundaries – Show Me Shiny