The Bayesian approach to ridge regression

In a previous post, we demonstrated that ridge regression (a form of regularized linear regression that attempts to shrink the beta coefficients toward zero) can be super-effective at combating overfitting and can lead to a far more generalizable model. That approach to regularization used penalized maximum likelihood estimation (for which we used the amazing glmnet package). There is, however, another approach... an equivalent approach... but one that allows us greater flexibility in model construction and lends itself more easily to an intuitive interpretation of the uncertainty of our beta coefficient estimates. I'm speaking, of course, of the bayesian approach.

As it turns out, careful selection of the type and shape of our prior distributions with respect to the coefficients can mimic different types of frequentist linear model regularization. For ridge regression, we use normal priors of varying width.

Though it can be shown analytically that shifting the width of normal priors on the beta coefficients is equivalent to L2 penalized maximum likelihood estimation, the math is scary and hard to follow. In this post, we are going to be taking a computational approach to demonstrating the equivalence of the bayesian approach and ridge regression.

This post is going to be a part of a multi-post series investigating other bayesian approaches to linear model regularization including lasso regression facsimiles and hybrid approaches.

mtcars

We are going to be using the venerable mtcars dataset for this demonstration because (a) its multicollinearity and high number of potential predictors relative to its sample size lend themselves fairly well to ridge regression, and (b) we used it in the elastic net blog post :)

Before you lose interest... here! have a figure! An explanation will follow.

[Figure: mtcars-loocv-mse, leave-one-out cross-validated MSE as a function of prior width]

After scaling the predictor variables to be 0-centered and have a standard deviation of 1, I described a model predicting mpg using all available predictors and placed normal priors on the beta coefficients, with standard deviations ranging from 0.05 to 5 (in increments of 0.025). To fit the models, instead of MCMC estimation via JAGS or Stan, I used quadratic approximation as implemented in the awesome rethinking package that Richard McElreath wrote for his excellent book, Statistical Rethinking. Quadratic approximation uses an optimization algorithm to find the maximum a posteriori (MAP) point of the posterior distribution and approximates the rest of the posterior with a normal distribution about the MAP estimate. I used this method chiefly for speed: as long as these simulations took with quadratic approximation, they would have taken many orders of magnitude longer with MCMC. Various spot checks confirmed that the quadratic approximation was comparable to the posterior as estimated by Stan.
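
To make the setup concrete, here is a minimal sketch of fitting one such model with rethinking's quadratic approximation. This is not the exact code from the linked repository; the intercept and sigma priors are just illustrative choices, and the 0.5 prior width on the betas is one value along the 0.05-to-5 grid described above.

library(rethinking)   # provides quap() (called map() in earlier versions)

# scale the 10 predictors; leave mpg on its original scale
d <- as.data.frame(scale(mtcars[, -1]))
d$mpg <- mtcars$mpg

fit <- quap(
  alist(
    mpg ~ dnorm(mu, sigma),
    mu <- a + b_cyl*cyl + b_disp*disp + b_hp*hp + b_drat*drat + b_wt*wt +
          b_qsec*qsec + b_vs*vs + b_am*am + b_gear*gear + b_carb*carb,
    a ~ dnorm(20, 10),
    # prior width on the betas; the post sweeps this from 0.05 to 5
    c(b_cyl, b_disp, b_hp, b_drat, b_wt, b_qsec,
      b_vs, b_am, b_gear, b_carb) ~ dnorm(0, 0.5),
    sigma ~ dunif(0, 10)
  ),
  data = d
)

precis(fit)   # MAP estimates and quadratic-approximation intervals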

As you can see from the figure, as the prior on the coefficients gets tighter, the model performance (as measured by the leave-one-out cross-validated mean squared error) improves—at least until the priors become too strong to be influenced sufficiently by the evidence. The ribbon about the MSE is the 95% credible interval (using a normal likelihood). I know, I know... it's pretty damn wide.

The dashed vertical line is at the prior width that minimizes the LOOCV MSE. The minimum MSE is, for all practical purposes, identical to that of the highest performing ridge regression model using glmnet. This is good.
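
For comparison, the glmnet side of that claim can be reproduced with something like the following sketch (not necessarily the exact code from the earlier post): alpha = 0 requests the ridge penalty, and nfolds = nrow(x) gives leave-one-out cross-validation.

library(glmnet)

x <- scale(as.matrix(mtcars[, -1]))   # same 0-centered, unit-sd predictors
y <- mtcars$mpg

# ridge regression (alpha = 0) with the penalty chosen by LOOCV
cv_fit <- cv.glmnet(x, y, alpha = 0, nfolds = nrow(x), grouped = FALSE)

cv_fit$lambda.min               # best-performing penalty
min(cv_fit$cvm)                 # its LOOCV mean squared error
coef(cv_fit, s = "lambda.min")  # the coefficient vector we compare against below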

Another really fun thing to do with the results is to visualize the movement of the beta coefficient estimates across the different penalties. The figure below depicts this. Again, the dashed vertical line marks the highest-performing prior width.

[Figure: mtcars-coef-shrinkage, beta coefficient estimates as a function of prior width]

One last thing: we've heretofore only demonstrated that the bayesian approach can perform as well as the L2 penalized MLE... but it's conceivable that it achieves this by finding a completely different coefficient vector. The figure below shows the same information as the one above, but with the coefficient estimates (for each predictor) of the top-performing glmnet model overlaid as dashed colored horizontal lines.

[Figure: mtcars-coef-shrinkage-net-overlay, coefficient paths with the glmnet estimates overlaid as dashed horizontal lines]

These results are pretty exciting! (if you're the type to not get invited to parties). Notice that, at the highest performing prior width, the coefficients of the bayesian approach and the glmnet approach are virtually identical.

Sooooo, not only did the bayesian variety produce an equivalently generalizable model (as evinced by equivalent cross-validated MSEs), but it also yielded a vector of beta coefficient estimates nearly identical to the one estimated by glmnet. This suggests that the bayesian approach and glmnet's approach, using different methods, regularize the model via the same underlying mechanism.

A drawback of the bayesian approach is that it takes many orders of magnitude longer to arrive at a solution. Two advantages of the Bayesian approach are (a) the ability to study the posterior distributions of the coefficient estimates and the ease of interpretation they allow, and (b) the enhanced flexibility in model design and the ease with which you can, for example, swap out likelihood functions or construct more complicated hierarchical models.

If you are even the least bit interested in this, I urge you to look at the code (in this git repository) because (a) I worked really hard on it and, (b) it demonstrates cool use of meta-programming, parallelization, and progress bars... if I do say so myself :)


Genre-based Music Recommendations Using Open Data (and the problem with recommender systems)

After a long 12 months of pouring my soul into it, my book, Data Analysis with R, was finally published. After the requisite 2-4 day breather, I started thinking about how I was going to get back into the swing of regular blog posts and decided that the easier and softer way is to cannibalize and expand on an example in the book.

In the chapter "Sources of Data" I show how to consume web data of different formats in R. The motivating example is to build a simple recommendation system that uses user-supplied "tags" (genres/labels) submitted to Last.fm and MusicBrainz to quantify musical artist "similarity". The example in the book stops at the construction and sorting of the similarity matrix but, in this post, we're going to make a really fly D3 visualization of the musical similarity network and provide recommendations in the tooltips. The code I used for this post, including the JavaScript and HTML, was hastily thrown into a git repo and is available here. If you're uninterested in the detailed methodology, I suggest you skip to the section labeled "Outcome".

Methodology

Although in the book tags from both Last.fm and MusicBrainz are used, we'll just be using Last.fm here. (In further contrast to the book, the pace here is, as you might imagine, substantially faster.)

The first step is to make a character vector of all the artists that you'd like to be included. If you were building a real system, you'd probably want all Last.fm artists. Since we're not, I just used 70 of my most played artists on my Last.fm. Since I got the list straight from the source, I didn't have to worry that any of the API requests would return "No Artist Found".
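
For concreteness, the artist vector is just something like this (a hypothetical subset for illustration; the real one held all 70 artists):

ARTIST_LIST <- c("KRS-One", "Boogie Down Productions", "The Smiths",
                 "Morrissey", "De La Soul", "A Tribe Called Quest",
                 "Joy Division", "New Order", "The Sisters of Mercy",
                 "Johann Sebastian Bach")   # ...and 60 more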

The following is a function that takes an artist and returns the properly formatted Last.fm API call to get the tags in JSON format.

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

This is an example of the JSON payload from my favorite merengue artist.

We only want the tag names--curiously, attempts to factor in degree of tag fit (the "count" attribute) resulted in (what I interpreted as) substantially poorer recommendations.

The following is a function that will return a vector of all the tags.

library(jsonlite)

get_tag_frame_lfm <- function(an_artist){
  print(paste0("Attempting to fetch: ", an_artist))
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(as.vector(json$toptags$tag[,"name"]))
}

Since the above function is referentially transparent, and it involves using resources that aren't yours, it's a good idea to memoize the function so that if you (accidentally or otherwise) call the function with the same artist, the function will return the cached result instead of making the web request again. This can be achieved quite easily with the memoise package.

library(memoise)
mem_get_tag_frame_lfm <- memoise(get_tag_frame_lfm)

To get the tags from all the artists in our custom ARTIST_LIST vector...

artists_tags <- sapply(ARTIST_LIST, mem_get_tag_frame_lfm)
names(artists_tags) <- ARTIST_LIST

To get a list of all pairs of artists to compute similarities for, we can use the combn function to create a 2 by 2,415 character matrix of all possible combinations (70 choose 2). Let's get that into a 2,415 by 2 data.frame with the column names "artist1" and "artist2"...

cmbs <- combn(ARTIST_LIST, 2)
comparisons <- data.frame(t(cmbs))
names(comparisons) <- c("artist1", "artist2")

The similarity metric we'll be using is simple as all get-out: the Jaccard index. Assuming we put the tags from both artists into two sets, it is the cardinality of the sets' intersection divided by the cardinality of their union...

jaccard_index <- function(tags1, tags2){
  length(intersect(tags1, tags2))/length(union(tags1, tags2))
}

comparisons$similarity <- apply(comparisons, 1,
  function(arow){
    jaccard_index(artists_tags[[unlist(arow[1])]],
                  artists_tags[[unlist(arow[2])]])
  }) 

Now we've added a new column, "similarity", to our previously 2,415 by 2 data.frame; it holds the Jaccard index for each pair of artists.

Our D3 visualization expects a JSON with two top-level attributes: "nodes" and "links". The "nodes" attribute is an array of x objects (where x is the number of nodes), each holding 5 key-value pairs. The 5 keys are "name" (the name of the artist), "group" (a number that affects the coloring of the node in the visualization, which we will be setting to "1"), and "first", "second", and "third", which are the top 3 most similar artists and will serve as the recommendations that pop up in a tooltip when you mouse over an artist node in the visualization.

This is some code to get the top 3 most similar artists. It takes the 2,415 by 3 comparisons data.frame, the number of "most similar artists" to return, an artist, and an arbitrary threshold for "similar-ness" as arguments. Any similarity below this threshold will not be considered a viable recommendation.

library(dplyr)
library(magrittr)   # provides the %<>% compound-assignment pipe used below
get_top_n <- function(comparisons, N, artist, threshold){
  comparisons %<>%
    filter(artist1==artist | artist2==artist) %>%
    arrange(desc(similarity))
  other_artist <- ifelse(comparisons$similarity>threshold,
                         ifelse(comparisons$artist1==artist,
                                comparisons$artist2, comparisons$artist1),
                         "None")
  return(other_artist[1:N])
}

The inner ifelse clause has to handle the fact that the "similar" artist can be in the first column or the second column. The outer ifelse returns "None" for every similarity value that is not above the threshold.

Let's make the data.frame that will serve as the "nodes" attribute in the final JSON...

nodes <- sapply(ARTIST_LIST, function(x) get_top_n(comparisons, 3, x, 0.25))
nodes <- data.frame(t(nodes))
names(nodes) <- c("first", "second", "third")
nodes$name <- row.names(nodes)
row.names(nodes) <- NULL
nodes$group <- 1

For the other top-level JSON attribute, "links", we need an array of y objects (where y is the number of sufficiently strong similarities between artists), each holding 5 key-value pairs. The 5 keys are "node1" (the name of the first artist), "source" (the 0-indexed position of that artist in the "nodes" array), "node2" (the name of the second artist), "target" (the 0-indexed position of the second artist), and "weight", which is the degree of similarity between the two artists; this will translate into thicker "edges" in the similarity graph.

# find the 0-indexed index
lookup_number <- function(name) which(name==ARTIST_LIST)-1

strong_links <- comparisons %>%
  filter(similarity > 0.25) %>%
  rename(node1 = artist1, node2 = artist2, weight=similarity)
strong_links$source <- sapply(strong_links$node1, lookup_number)
strong_links$target <- sapply(strong_links$node2, lookup_number)

Finally, we can create the properly formatted JSON and send it to the file "artists.json" thusly...

object <- list("nodes"=nodes,
               "links"=strong_links)

sink("artists.json")
toJSON(object, dataframe="rows", pretty=TRUE)
sink()

Outcome

Musical Similarity Network

Using "artists.json" and the "index.html" that can be found here, the similarity graph looks a little like this. (Make sure you scroll to see the whole thing.)

For illustrative purposes, I pre-labeled the artists' "group" with labels that correspond to what I view as the artist's primary genre. This is why the nodes in the linked visualization have different colors. Note that, independently, the genres that I indicated tend to cluster together in the network. For example, Reggae (light green), Hip-Hop (green), and Punk (orange) all form almost completely connected graphs, though unconnected to each other (disjoint subgraphs). Indie rock (blue), post-punk (light blue) and classic rock (light orange) together form a rather tightly-connected subgraph. Curiously, the Sex Pistols (that I labeled "Punk") are not part of the Punk cluster but part of the Indie-rock/post-punk/classic-rock component. There are three orphan nodes (no edges), "Johann Sebastian Bach", "P:ano", and "No Kids". Bach is orphaned because he's the only Baroque artist in my top 70 artists :( --P:ano and No Kids are obscure... you’ve probably never heard of them.

The recommendations, prima facie, appear to be on point. For example, without direct knowledge of association, "KRS-One" recommends "Boogie Down Productions" (the group that KRS-One comes from) most highly. Similarly, "The Smiths" and "Morrissey" recommend each other, and "De La Soul" and "A Tribe Called Quest" (part of a positive, Afrocentric hip-hop collective known as the Native Tongues together with Queen Latifah, et al.) recommend each other.

Appropriately, Joy Division and New Order, whose Jaccard index of band members is 0.6 but whose music style is somewhat distinct, don't recommend each other.
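
(That 0.6 is easy to verify with the jaccard_index function from above, treating the two bands' line-ups as the sets.)

jaccard_index(c("Ian Curtis", "Bernard Sumner", "Peter Hook", "Stephen Morris"),
              c("Bernard Sumner", "Peter Hook", "Stephen Morris", "Gillian Gilbert"))
# [1] 0.6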

Lastly, subgenred artists appear to recommend other artists in the subgenre. For example, goth band "The Sisters of Mercy" appropriately recommends other goth-esque bands "Bauhaus", "And Also The Trees", and "Joy Division".

Afterword

Using this similarity measure to drive recommendations seems successful. It should be noted, though, that my ability to assess the effectiveness of using the Jaccard index as the sole arbiter of musical similarity is hampered; judging an algorithm favorably because it recommends other bands that I happen to like is prejudicial, to say the least.

This stands even if the system makes good theoretical sense. This still stands even if the system, quite independently, indicates that associated acts—that are objectively and incontrovertibly similar—are good recommendations.

This raises a larger question on how to accurately measure the effectiveness of recommender systems; do you tell people what they want to hear, or do you pledge allegiance to a particular theoretical interpretation of similarity? If it's the latter, how do you iterate and improve the system? If it's the former, is your only criterion for success positive user-provided feedback?


Interactive visualization of non-linear logistic regression decision boundaries with Shiny

(skip to the shiny app)

Model building is very often an iterative process that involves multiple steps: choosing an algorithm and its hyperparameters, evaluating the model (typically via cross-validation), and tuning the hyperparameters.

For classification tasks, I find that a great aid in this process is not only keeping track of the accuracy across models, but also having some visual way of noting which data points are systematically misclassified and why. Is there a lot of noise? Does the model require a non-linear classifier?

My desire to visualize the results is stymied by (a) high-dimensional data (for which we have no choice but to reduce dimensionality) and (b) the cost of task-switching between tweaking the hyperparameters and re-running the plot. Unless I'm using two monitors, I can't even see the plots change in real time.

Well... Enter Shiny.

Shiny is an R package from RStudio and other open source contributors that makes it incredibly easy to create interactive web applications from R analyses. With Shiny, I can add dropdown menus and sliders to choose algorithms or features, control hyperparameters, and visualize the changes to the model in real time right from a web browser (all in pure R, with no JavaScript or CSS).

Further, I can deploy this web app easily (and for free) so I can share it with my friends and colleagues.

For a first real foray into Shiny, I chose to visualize the decision boundaries of logistic regression classifiers. I chose logistic regression because I'm taking Andrew Ng's excellent Machine Learning course on Coursera and reimplementing the algorithms in R (the course uses GNU Octave / Matlab), and logistic regression was our last homework assignment.

The implementation of logistic regression and the visualization of the decision boundaries proved to be difficult for two reasons:

(a) The residuals of logistic regression aren't normally distributed and there is no closed-form solution that returns the coefficients that maximize the likelihood function. This means that we have to provide R's 'optim' higher-order function with a custom-written function to be minimized or maximized (we will be minimizing the cost function) and a function that returns the gradient (the derivative of that function at a given point). And...

(b) Although a linear combination of the predictor variables (a first degree polynomial hypothesis) has a linear decision boundary, adding ("faking") higher-degree polynomial features results in non-linear decision boundaries; awesome for classification, un-awesome for visualization.

[Figure: crummy linear fit to circular data]

[Figure: great quadratic non-linear fit to circular data]

The two datasets we will be using were generated using make_circles and make_moons from scikit-learn's 'datasets' module. Both will require a non-linear hypothesis to achieve any kind of better-than-chance classification.
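
If you'd rather generate a comparable dataset without leaving R, something along these lines (my own rough stand-in for make_circles, not what the app actually uses) produces the same kind of X1/X2/Y data.frame that the plotting code below expects:

set.seed(1)

# two concentric noisy rings, labeled 0 (outer) and 1 (inner)
make_ring <- function(n, radius, noise, label){
  theta <- runif(n, 0, 2 * pi)
  data.frame(X1 = radius * cos(theta) + rnorm(n, sd = noise),
             X2 = radius * sin(theta) + rnorm(n, sd = noise),
             Y  = label)
}

circle <- rbind(make_ring(150, 1.0, 0.1, 0),
                make_ring(150, 0.5, 0.1, 1))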

These are the supporting functions to add polynomial features, compute the hypothesis function, compute the cost function, and return the gradient:

# builds a design matrix with an intercept column and all polynomial
# terms x1^(i-j) * x2^j up to the requested degree
add.poly.features <- function(x.mat, degree=2){
  new.mat <- matrix(1, nrow=nrow(x.mat))
  for (i in 1:degree){
    for (j in 0:i){
      new.mat <- cbind(new.mat, (x.mat[,1]^(i-j) * (x.mat[,2]^j)))
    }
  }
  return(new.mat)
}

# the logistic (sigmoid) function applied to the linear combination X * theta
hypothesis.function <- function(param.vec, x.mat){
  zed <- x.mat %*% matrix(param.vec)
  return(1 / (1 + exp(-zed)))
}

# gradient of the regularized cost function; the intercept is not penalized
get.gradient <- function(param.vec, x.mat, y.vec, lambda=0){
  m <- nrow(x.mat)
  modtheta <- param.vec
  modtheta[1] <- 0
  the.hyp <- hypothesis.function(param.vec, x.mat)
  gradient <- (t(x.mat) %*% (the.hyp - y.vec) + lambda*modtheta) / m
  return(gradient)
}

# regularized logistic (cross-entropy) cost
cost.function <- function(param.vec, x.mat, y.vec, lambda=0){
  m <- nrow(x.mat)
  the.hyp <- hypothesis.function(param.vec, x.mat)
  cost <- (((t(-y.vec) %*% log(the.hyp)) - (t(1-y.vec) %*% log(1-the.hyp))) / m) +
    ((lambda / (2*m)) * sum(param.vec[2:length(param.vec)] ^ 2))

  return(cost)
}

Finally, this is the code that finds the optimal coefficients and plots the resulting hypothesis (this is wrapped in the reactive "renderPlot" Shiny function so it can be updated every time the Shiny controls are changed)

if(input$pattern=="moon")
  da.dataset <- moon
else
  da.dataset <- circle

da.lambda <- input$lambda
da.degree <- input$degree

design.mat <- add.poly.features(da.dataset[,c(1,2)], degree=da.degree)

result <- optim(par=rep(0, ncol(design.mat)),
                cost.function, 
                get.gradient,
                x.mat=design.mat,
                y.vec=as.matrix(da.dataset[,3]),
                lambda=da.lambda,
                method=input$opt)

predictions <- hypothesis.function(result$par, design.mat)
# classification accuracy as a percentage of correctly predicted labels
accuracy <- paste0(round(100 * sum(round(predictions) ==
                                   da.dataset[,3]) / nrow(da.dataset), 2), "%")

thex1 <- da.dataset[,1]
thex2 <- da.dataset[,2]
somex <- seq(min(thex1), max(thex1), by=.05)
somex2 <- seq(min(thex2), max(thex2), length.out=length(somex))

z <- matrix(0, nrow=length(somex), ncol=length(somex))

for (i in 1:length(somex)){
  for (j in 1:length(somex)){
    keep <- add.poly.features(t(matrix(c(somex[i], somex2[j]))), da.degree)
    z[i, j] <- as.matrix(keep) %*% result$par
  }
}

plot(da.dataset$X2 ~ da.dataset$X1,  pch=20, 
     col=c("red","green3")[da.dataset$Y+1],
     xlab="X1", ylab="X2")
title(paste("Degree:", da.degree,
            " -  Lambda:", da.lambda,
            "     -      Accuracy:", accuracy))

contour(somex, t(somex2), z, nlevels=1, add=TRUE, drawlabels=FALSE)

Notice that the classification dataset, the degree of the hypothesized polynomial, the regularization hyperparameter (lambda), and the optimization method are parameterized. We will control these options from the Shiny app.
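
The UI side might look something like the sketch below. These aren't necessarily the exact widgets or layout used in the linked repo, but the input IDs line up with the input$pattern, input$degree, input$lambda, and input$opt references above (the "boundaryPlot" output ID is hypothetical).

library(shiny)

shinyUI(fluidPage(
  titlePanel("Logistic regression decision boundaries"),
  sidebarLayout(
    sidebarPanel(
      selectInput("pattern", "dataset", choices = c("moon", "circle")),
      sliderInput("degree", "polynomial degree", min = 1, max = 6, value = 2),
      sliderInput("lambda", "regularization (lambda)",
                  min = 0, max = 10, value = 0, step = 0.1),
      selectInput("opt", "optimization method",
                  choices = c("BFGS", "CG", "Nelder-Mead", "L-BFGS-B"))
    ),
    mainPanel(
      plotOutput("boundaryPlot")   # hypothetical output ID for the renderPlot above
    )
  )
))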

Put all together, the code looks like it does in this GitHub repo and yields this Shiny app.

[Figure: Shiny app screenshot]

Is it just me, or is what you can do with Shiny amazing?

In future iterations of my Shiny visualization of classification endeavors, I plan to:

  • add support for more classification algorithms and their respective relevant hyperparameters
  • use file upload to plot custom datasets
  • and use dimensionality reduction automatically for datasets with more than two 'true' features

Until then, shine on you crazy diamond.


Take a look, it's in a book: distribution of kindle e-book highlights

If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage of books that are discontinued, but the rise of e-reading (and the resulting erosion of privacy) affords us the opportunity to answer related questions.

The kindle e-reading devices allow readers to highlight salient passages of books and optionally share them with Amazon. Amazon then curates these highlights and displays them to readers who opt-in. These are called "popular highlights".

After reading a few books on the Kindle, it's hard not to notice a pattern with popular highlights: they become sparser the further you get into a book. Given my penchant for answering mildly interesting questions with statistics, I couldn't help but analyze and visualize the distribution of these popular highlights.

I organized the location of the 10 most popular highlights of 64 books (21 fiction and 43 non-fiction) along with the location of the end of the book (this doesn't include the index, notes, and references of non-fiction books) and loaded it into R:

library(dplyr)

ebook.frame <- read.csv("./ebooks.csv",
                        stringsAsFactors=FALSE)

ebook.frame <- ebook.frame %>%
  mutate(normalized=location/end.location)

In order to meaningfully compare locations across books, I needed to express each location as a percentage of the total length of the book. Let's use ggplot2 to visualize the distribution of where the popular highlights appear across all books:

library(ggplot2)
ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, fill="#0072B2", alpha=.8) +
  labs(title="Distribution of e-book highlights\n") +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlights.png")

[Figure: Distribution of E-Book Highlights]

Before we go on, it's important to express a few words of warning...
These books are not a proper sample of all kindle e-books; since these books came from my personal collection, books on science and philosophy are oversampled, books about vampires are woefully underrepresented, and there is far more Janet Evanovich than chance would dictate. Because of this, any insights gleaned from these data (to the extent that these data offer any) are only applicable to the reading habits of a certain type of e-reader, namely, boring ones that don't like to have fun.

The spreadsheet I loaded also contained a logical field representing whether the book was fiction. We can take a look at the differences in the highlight locations between fiction and non-fiction books thusly:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, aes(fill=factor(fiction)), 
                             alpha=.5) +
  labs(title="Distribution of e-book highlights\n") +
  scale_fill_discrete(labels=c("non-fiction",
                               "fiction")) +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlightsFictionDistinction.png")

[Figure: Distribution of E-Book Highlights, fiction vs. non-fiction]

It would appear as if non-fiction books have a less uniform distribution of popular highlights. There are likely many causes for this, but one explanation could be that the reader is less likely to make it to the end of a non-fiction book.

In order to make some quantifiable claims, let's look at the empirical cumulative distribution function:

ggplot(ebook.frame, aes(normalized, colour=factor(fiction))) +
  stat_ecdf() +
  labs(title="Cumulative distribution of e-book highlights\n") +
  scale_colour_discrete(labels=c("non-fiction", "fiction")) +
  xlab("location in book (percent)") +
  ylab("cumulative percentage of highlights") +
  theme(legend.title=element_blank())
ggsave(file="CumulativeDistributionofEbookHighlights.png")

[Figure: Cumulative Distribution of E-Book Highlights]

Interestingly, for non-fiction books, a full 75% of the highlights are contained in the first 25% of the book (not quite Pareto, but close).
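
(If you want the numbers rather than eyeballing the ECDF, a quick dplyr summary does it; this assumes the fiction field is a logical or 0/1 column, as described above.)

ebook.frame %>%
  group_by(fiction) %>%
  summarise(pct_in_first_quarter = 100 * mean(normalized <= 0.25))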

Before we come to any conclusions regarding the proportion of readers that make it through a book, let's check our assumptions:

  • e-readers that highlight passages (and choose to share them with Amazon) behave just like e-readers that don't
  • salient passages are uniformly distributed throughout a book and, thus, the distribution of highlights is uniform across the entire length of the read portion of a book
  • the fact that a passage was already highlighted by many e-readers has no bearing on the reader’s decision to highlight the same passage

These assumptions don't hold up to critical scrutiny. Nevertheless, these results serve as strong evidence that at least some e-books go unfinished. As for the percentage of books that go unfinished, perhaps Amazon is in a better position to answer that question.
