The Bayesian approach to ridge regression

In a previous post, we demonstrated that ridge regression (a form of regularized linear regression that shrinks the beta coefficients toward zero) can be super-effective at combating overfitting and can lead to a far more generalizable model. That approach to regularization used penalized maximum likelihood estimation (for which we used the amazing glmnet package). There is, however, another approach... an equivalent approach... but one that allows us greater flexibility in model construction and lends itself more easily to an intuitive interpretation of the uncertainty of our beta coefficient estimates. I'm speaking, of course, of the bayesian approach.

As it turns out, careful selection of the type and shape of our prior distributions with respect to the coefficients can mimic different types of frequentist linear model regularization. For ridge regression, we use normal priors of varying width.

Though it can be shown analytically that shifting the width of normal priors on the beta coefficients is equivalent to L2 penalized maximum likelihood estimation, the math is scary and hard to follow. In this post, we are going to take a computational approach to demonstrating the equivalence of the bayesian approach and ridge regression.
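
The punchline of that math, at least, fits on one line: with a normal likelihood (residual variance \sigma^2) and independent normal priors with mean 0 and standard deviation \tau on the beta coefficients, the MAP estimate minimizes

\lVert y - X\beta \rVert_2^2 + \frac{\sigma^2}{\tau^2}\lVert \beta \rVert_2^2

which is exactly the ridge objective with \lambda = \sigma^2 / \tau^2; tightening the prior (shrinking \tau) plays the same role as cranking up the L2 penalty.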

This post is going to be a part of a multi-post series investigating other bayesian approaches to linear model regularization including lasso regression facsimiles and hybrid approaches.

mtcars

We are going to be using the venerable mtcars dataset for this demonstration because (a) its multicollinearity and high number of potential predictors relative to its sample size lend themselves fairly well to ridge regression, and (b) we used it in the elastic net blog post :)

Before you lose interest... here! have a figure! An explanation will follow.

[Figure: leave-one-out cross-validated MSE as a function of prior width (mtcars)]

After scaling the predictor variables to be 0-centered and have a standard deviation of 1, I described a model predicting mpg from all available predictors and placed normal priors on the beta coefficients, varying the prior standard deviation from 0.05 to 5 (in increments of 0.025). To fit the models, instead of MCMC estimation via JAGS or Stan, I used quadratic approximation as implemented in the awesome rethinking package, written by Richard McElreath for his excellent book, Statistical Rethinking. Quadratic approximation uses an optimization algorithm to find the maximum a posteriori (MAP) point of the posterior distribution and approximates the rest of the posterior with a normal distribution about the MAP estimate. I used this method chiefly because, as long as these simulations took with quadratic approximation, they would have taken many orders of magnitude longer with MCMC. Various spot checks confirmed that the quadratic approximation was comparable to the posterior obtained from Stan.
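
To make that concrete, here is a minimal sketch of what a single one of these fits might look like with rethinking. I'm using quap() (older versions of the package call it map()); the priors on the intercept and sigma are placeholder choices of mine, and the 0.5 prior width on the coefficients is just one point along the grid that gets swept:

library(rethinking)

# scale the predictors to mean 0 / SD 1; leave mpg on its original scale
d <- as.data.frame(scale(mtcars[, -1]))
d$mpg <- mtcars$mpg

fit <- quap(
    alist(
        mpg ~ dnorm(mu, sigma),
        mu <- a + b_cyl*cyl + b_disp*disp + b_hp*hp + b_drat*drat + b_wt*wt +
              b_qsec*qsec + b_vs*vs + b_am*am + b_gear*gear + b_carb*carb,
        a ~ dnorm(20, 100),        # weak prior on the intercept (my choice)
        b_cyl ~ dnorm(0, 0.5),     # the prior width (0.5 here) is what gets swept
        b_disp ~ dnorm(0, 0.5),
        b_hp ~ dnorm(0, 0.5),
        b_drat ~ dnorm(0, 0.5),
        b_wt ~ dnorm(0, 0.5),
        b_qsec ~ dnorm(0, 0.5),
        b_vs ~ dnorm(0, 0.5),
        b_am ~ dnorm(0, 0.5),
        b_gear ~ dnorm(0, 0.5),
        b_carb ~ dnorm(0, 0.5),
        sigma ~ dunif(0, 20)       # placeholder prior on the residual SD
    ),
    data = d
)

precis(fit)   # MAP estimates and quadratic-approximation intervals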

As you can see from the figure, as the prior on the coefficients gets tighter, the model performance (as measured by the leave-one-out cross-validated mean squared error) improves—at least until the priors become too strong to be influenced sufficiently by the evidence. The ribbon about the MSE is the 95% credible interval (using a normal likelihood). I know, I know... it's pretty damn wide.

The dashed vertical line is at the prior width that minimizes the LOOCV MSE. The minimum MSE is, for all practical purposes, identical to that of the highest performing ridge regression model using glmnet. This is good.

Another really fun thing to do with the results is to visualize the movement of the beta coefficient estimates across the different penalties (prior widths). The figure below depicts this. Again, the dashed vertical line marks the highest performing prior width.

[Figure: shrinkage of the beta coefficient estimates as the prior width tightens (mtcars)]

One last thing: we've heretofore only demonstrated that the bayesian approach can perform as well as the L2 penalized MLE... but it's conceivable that it achieves this by finding a completely different coefficient vector. The figure below is the same as the one above, but with the coefficient estimates (for each predictor) from the top-performing glmnet model overlaid as dashed colored horizontal lines.

[Figure: coefficient shrinkage with the top-performing glmnet coefficient estimates overlaid as dashed horizontal lines (mtcars)]

These results are pretty exciting! (if you're the type to not get invited to parties). Notice that, at the highest performing prior width, the coefficients of the bayesian approach and the glmnet approach are virtually identical.

Sooooo, not only did the bayesian variety produce an equivalently generalizable model (as evinced by equivalent cross-validated MSEs), but it also yielded a vector of beta coefficient estimates nearly identical to those estimated by glmnet. This suggests that the bayesian approach and glmnet's approach, though they use different methods, regularize the model via the same underlying mechanism.

A drawback of the bayesian approach is that its solution takes many orders of magnitude more time to arrive at. Two advantages of the Bayesian approach are (a) the ability to study the posterior distributions of the coefficient estimates and the ease of interpretation that they allow, and (b) the enhanced flexibility in model design and the ease with which you can, for example, swap out likelihood functions or construct more complicated hierarchical models.

If you are even the least bit interested in this, I urge you to look at the code (in this git repository) because (a) I worked really hard on it and, (b) it demonstrates cool use of meta-programming, parallelization, and progress bars... if I do say so myself :)


Kickin' it with elastic net regression

With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the (squared) L2 norm (euclidean distance) of the coefficient vector, which results in "shrinking" the beta coefficients. The aggressiveness of the penalty is controlled by a parameter \lambda.
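
Concretely, ridge regression minimizes

\lVert y - X\beta \rVert_2^2 + \lambda\lVert \beta \rVert_2^2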

Lasso regression is a related regularization method. Instead of using the L2 norm, though, it penalizes the L1 norm (manhattan distance) of the coefficient vector. Because it uses the L1 norm, some of the coefficients will shrink all the way to zero as lambda increases. A similar effect would be achieved in Bayesian linear regression using a Laplacian prior (strongly peaked at zero) on each of the beta coefficients.
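
The lasso objective just swaps the penalty norm:

\lVert y - X\beta \rVert_2^2 + \lambda\lVert \beta \rVert_1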

Because some of the coefficients shrink to zero, the lasso doubles as a crackerjack feature selection technique in addition to a solid shrinkage method. This property gives it a leg up on ridge regression. On the other hand, the lasso will occasionally achieve poor results when there's a high degree of collinearity in the features, and ridge regression will perform better. Further, when the number of predictors exceeds the number of observations the lasso solution isn't uniquely determined (and it can select at most as many predictors as there are observations), whereas ridge regression handles this case gracefully.

Elastic net regression is a hybrid approach that blends penalization of both the L2 and L1 norms. Specifically, elastic net regression minimizes the following...

\lVert y - X\beta \rVert_2^2 + \lambda\left[(1-\alpha)\lVert \beta \rVert_2^2 + \alpha\lVert \beta \rVert_1\right]

The \alpha hyper-parameter is between 0 and 1 and controls how much L2 or L1 penalization is used (0 is ridge, 1 is lasso).

The usual approach to optimizing the lambda hyper-parameter is through cross-validation—by minimizing the cross-validated mean squared prediction error—but in elastic net regression, the optimal lambda hyper-parameter is heavily dependent on the alpha hyper-parameter (hyper-hyper-parameter?).

This blog post takes a cross-validated approach that uses grid search to find the optimal alpha hyper-parameter (while also optimizing the lambda hyper-parameter) for three different data sets. I also compare the performances against stepwise regression and showcase some of the dangers of using stepwise feature selection.
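
To give a flavor of what that looks like in code, here is a minimal sketch of the grid search with glmnet. This is an illustration rather than the exact code I used; the alpha grid, the fold count, and the use of mtcars here are my own choices:

library(glmnet)

# design matrix and response for mtcars
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop the intercept column
y <- mtcars$mpg

# for each candidate alpha, let cv.glmnet pick the best lambda by
# cross-validated MSE, then compare the alphas by that minimum MSE
alphas <- seq(0, 1, by = 0.1)
cv.mses <- sapply(alphas, function(a) {
    cvfit <- cv.glmnet(x, y, alpha = a,
                       nfolds = nrow(x),    # nfolds = n gives LOOCV
                       grouped = FALSE)     # needed for leave-one-out folds
    min(cvfit$cvm)                          # CV error at the best lambda for this alpha
})

best.alpha <- alphas[which.min(cv.mses)]
best.alpha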

mtcars

In this example, I try to predict “miles per gallon” from the other available attributes. The design matrix has 32 observations and 10 predictors and there is a high degree of collinearity (as measured by the variance inflation factors).

[Figure: mtcars and elastic net regression — LOOCV MSE by alpha (left) and comparison against the kitchen-sink and stepwise baselines (right)]

The left panel above shows the leave-one-out cross validation (LOOCV) mean squared error of the model with the optimal lambda (as determined, again, by LOOCV) for each alpha parameter from 0 to 1. This panel indicates that if our objective is purely to minimize MSE (with no regard for model complexity), then pure ridge regression outperforms any blended elastic-net model. This is probably because of the substantial collinearity. Interestingly, the lasso outperforms blended elastic net models that weight the lasso heavily.

The right panel puts things in perspective by plotting the LOOCV MSEs along with the MSE of the "kitchen sink" regression (the blue line) that includes all features in the model. As you can see, any degree of regularization offers a substantial improvement in model generalizability.

The plot also includes two estimates of the MSE for models that blindly use the coefficients from automated bi-directional stepwise regression. The first uses the features selected by performing the stepwise procedure on the whole dataset and then assesses the model performance (the red line). The second estimate runs the step procedure, and selects the features, on only the training set for each fold of the cross validation. This is the estimate without the subtle but treacherous "knowledge leaking" eloquently described in this blog post. This should be considered the more correct assessment of the model. As you can see, if we weren't careful about interpreting the stepwise regression, we would have gotten an incredibly inflated and inaccurate view of the model performance.
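
For clarity, here is a rough sketch of that second, "no knowledge leaking" estimate, where the stepwise selection is redone inside every LOOCV fold using only that fold's training data (illustrative only, not the exact code behind the figure):

# honest LOOCV estimate for bi-directional stepwise regression on mtcars
sq.errors <- sapply(seq_len(nrow(mtcars)), function(i) {
    train <- mtcars[-i, ]
    test  <- mtcars[i, , drop = FALSE]
    full  <- lm(mpg ~ ., data = train)
    chosen <- step(full, direction = "both", trace = 0)  # stepwise on the training fold only
    (predict(chosen, newdata = test) - test$mpg)^2
})
mean(sq.errors)   # cross-validated MSE without knowledge leaking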

Forest Fires

The second example uses a very-difficult-to-model dataset from the University of California, Irvine machine learning repository. The task is to predict the burned area of a forest fire from 11 predictors. The dataset has 517 observations, and there is a relatively low degree of collinearity between predictors.

[Figure: cross-validated MSE by alpha for the forest fires dataset]

Again, the highest performing model is the pure ridge regression. This time, the performance asymptotes as the alpha hyper-parameter increases. The variability in the MSE estimates is due to the fact that I didn't use LOOCV and used 400-fold CV instead, because I'm impatient.

As with the last example, the properly measured stepwise regression performance isn't so great, and the kitchen sink model outperforms it. However, in contrast to the previous example, there was a lot less variability in the selected features across folds—this is probably because of the significantly larger number of observations.

"QuickStartExample"

This dataset is a contrived one that is included with the excellent glmnet package (the one I'm using for the elastic net regression). It has 20 features, 100 observations, and a relatively low degree of collinearity. I have no idea how the package authors created this dataset.
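
If you want to poke at it yourself, it ships with the package; note that how you pull out x and y depends on your glmnet version (recent versions store the dataset as a list, older ones loaded x and y directly into the workspace):

library(glmnet)

# load the toy dataset bundled with glmnet
data(QuickStartExample)
x <- QuickStartExample$x   # 100 x 20 predictor matrix
y <- QuickStartExample$y   # response vector

dim(x)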

[Figure: cross-validated MSE by alpha for the QuickStartExample dataset]

Finally, an example where the lasso outperforms ridge regression! I think this is because the dataset was specifically manufactured to have a small number of genuine predictors with large effects (as opposed to many weak predictors).

Interestingly, stepwise regression far outperforms both—probably for the very same reason. From fold to fold, there was virtually no variation in the features that the stepwise method automatically chose.

Conclusion

So, there you have it. Elastic net regression is awesome because it can perform, at worst, as well as the lasso or ridge and—though it didn’t on these examples—can sometimes substantially outperform both.

Also, be careful with step-wise feature selection!

PS: If, for some reason, you are interested in the R code I used to run these simulations, you can find it on this GitHub Gist.


Using Last.fm to data mine my music listening history

I've (passively) been keeping meticulous records of almost every song I've listened to since January of 2008. In the six years since I opened my last.fm account, they've accumulated a massive, detailed dataset of the 107,222 songs I've listened to. The best thing is that they're willing to share this data with me!

I used the last.fm developer REST API to (over a very long period of time) retrieve my entire listening history, the date(s) that I've listened to each song, and the top three user-submitted "tags" for each song.

I want to glean every bit of insight that I can out of this data. For this post, I focused on:

  • total listening history over time
  • music "diversity" levels
  • trends in my musical genre listening habits

In future posts, I hope to explore other things like using PCA to determine "orthogonal" music genres, constructing similarity matrices, predicting trends, and performing acoustic analysis.

This has been one of my favorite pet-projects because it combines three things that I love:

  • data mining
  • music
  • navel-gazing

I used both R and Python in this analysis. Let’s get into it!

Obtaining data
Getting the data using the last.fm REST API was very straightforward; the only hiccups I encountered were the fault of Python 2's unicode snafus. For the web requests I used the urllib2 module, and to handle the XML responses I used the amazing lxml module. The code to get my whole listening history looked a little like this:

#!/usr/bin/env python -tt

import urllib2
import time
from lxml import etree
from StringIO import StringIO

baseurl = ''.join(["http://ws.audioscrobbler.com/2.0/",
                   "?method=user.getrecenttracks",
                   "&user=statethatiamin&api_key=XXXXXXX&limit=200"])

def clean_xml(the_xml):
    return "\n".join(the_xml.split("\n")[3:-2])

# let's get the first page so we know how many pages there are
response = urllib2.urlopen(baseurl+"&page=1", timeout=200)
html = response.read()

# parse the XML tree
doc = etree.parse(StringIO(html))

# use Xpath to query the number of pages
num_pages = int(doc.xpath("/lfm/recenttracks")[0].get("totalPages"))

# file to dump results
fh = open("all_the_tracks.xml", "a")

for page in xrange(0, num_pages+1):
    # I'm nice so I don't want to hit last.fm
    # with a bunch of requests a second.
    # Let's wait ten seconds between each request
    time.sleep(10)
    progress = "On page {} of {}...........  {}%"
    print progress.format(str(page),
                          str(num_pages),
                          str(round(float(page)/num_pages*100, 1)))
    response = urllib2.urlopen(baseurl+"&page="+str(page))
    html = response.read()
    the_xml = clean_xml(html)
    fh.write(the_xml)
fh.close()

I decided to make the requests for the user-submitted tags in another python script. The script is a little too long to post here, but it basically iterated over all "track" nodes in the output of the last script and parsed the results of a REST query for each track's tags. Since I'm considerate, I put a long wait between each request for the over 100,000 songs. Even though I handled repeated tracks gracefully, it took days to finish. I used the pickle module to serialize the accumulated data at regular intervals so that a failure during the night of day 2 wouldn't have been catastrophic.

XML transformations and XPath
There was still a little bit of cleanup to do... I used various shell commands to remove all unnecessary elements from the XML documents and escape the characters that I forgot to escape. Then I had to organize the data by date so that I could do time series analysis. The script I used to accomplish this is as follows:

#!/usr/bin/env python -tt

from lxml import etree
import codecs

# read cleaned up track history XML
doc = etree.parse("escaped_processed.xml")

fh = codecs.open("bydate.xml", "a", encoding="utf-8")

# get all the dates (previously restricted to just month and year)
udates = list(set([date.text for date in doc.xpath("//date")]))

# create a new DOM tree to hang the transformation upon
root = etree.Element("bydate")

for cdate in udates:
    # element tags can't start with a number
    # add a "d" to it
    this = etree.SubElement(root, 'd' + cdate)
    # get all tracks listened to on that date
    these_tracks = [node for node in
                    doc.xpath("/alltags/track[date='" + cdate + "']")]
    # add the tracks to the DOM
    for itrack in these_tracks:
        this.append(itrack)

fh.write(etree.tostring(root, pretty_print=True))

Finally, I whipped up a quick script to sum the number of listens on a particular tag for each time interval.

At this point we have a file "playnumbymonth.csv" with the dates and total tracks listened to for each month that looks like this...

date,numlistens
03-2008,1422
10-2008,1394
05-2008,923
12-2009,640
10-2009,630
..........

and ("melted") file called "longformat.csv" that holds dates, tag names, and the number of tracks (played in that month) that contained the tag. It looks like this...

date,tag,number
03-2008,folk rock,1
03-2008,summer,1
03-2008,spoken word,2
03-2008,cute,5
03-2008,dance,11
..........

R analytics and visualization
First, to visualize the number of songs I’ve listened to over time, I had to import the "playnumbymonth.csv" dataset, parse the date with the lubridate package, make a "zoo" time series object out of the dataframe, and plot it.

library(zoo)
library(lubridate)

plays <- read.csv("playnumbymonth.csv", stringsAsFactors=FALSE)

# parse dates
plays$date <- parse_date_time(plays$date, "my")

#make time series object
tsplays <- read.zoo(plays)

#plot it with a LOWESS smooth curve
loline <- lowess(tsplays, f=.5)
plot(tsplays, main="Plays per month since 2008", ylab="Number of plays", xlab="Date")
lines(index(tsplays), loline$y, col='red', lwd=2)

The resulting plot looks like this:
[Figure: plays per month since 2008, with LOWESS smooth]

While I was working with this data set, I wanted to check if there was any periodicity to my listening history (perhaps I listen to more music in the winter than I do in the summer). I briefly attempted to use seasonal decomposition and autocorrelation to try to detect this. No dice.
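
For the record, that attempt looked something like the sketch below; it assumes the monthly counts are sorted chronologically, start in January 2008, and have no missing months (stl() wants a regular ts object rather than a zoo one):

# regular monthly time series of play counts
sorted.plays <- plays[order(plays$date), ]
monthly <- ts(sorted.plays$numlistens, start = c(2008, 1), frequency = 12)

plot(stl(monthly, s.window = "periodic"))   # seasonal / trend / remainder decomposition
acf(monthly)                                # autocorrelation at monthly lags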

For the musical "diversity" and genre listening trends, I read in "longformat.csv" and used reshape to aggregate (pivot) by tags until I had a huge matrix where each row was a month between 2008 and 2014, and each column was a last.fm tag. Then I used the vegan (vegetation analysis) package to take the Shannon diversity index of each month with respect to the richness and evenness of the tags listened to:

library(reshape)
library(vegan)

long.tag.frame <- read.csv("longformat.csv", stringsAsFactors=FALSE)
long.tag.frame$date <- parse_date_time(long.tag.frame$date, "my")

wide.frame <- data.frame(cast(long.tag.frame, date~tag))
# convert all NAs to zero
wide.frame[is.na(wide.frame)] <- 0

new.frame <- data.frame(wide.frame[,1])
new.frame$diversity <- diversity(wide.frame[,-1])

After some cleanup, "zoo" object creation, and LOWESS curve fitting, the plot of the listening data and diversity indices looked like this:
[Figure: number of plays and tag diversity ("variety") over time]

Visualizing how my music tastes have (appeared to) change over time was the best part. I created a diagonal matrix from the multiplicative inverse of number of tracks that I listened to each month and matrix-multiplied this with the wide tag matrix. The result of this computation yielded the proportion of songs I listened to each month that contained each tag.
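
In code, that step might look something like this sketch, which assumes the rows of the wide tag matrix and the monthly play counts are in the same (chronological) order:

# proportion of each month's plays that carry each tag
tag.counts <- as.matrix(wide.frame[, -1])            # drop the date column
prop.plays <- diag(1 / plays$numlistens) %*% tag.counts
dates <- wide.frame[, 1]                             # month labels used below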

I took a few choice tags corresponding to some of my favorite musical genres, put them in a new data frame ("tag.interest"), and used the lattice package to visualize the trends.

tag.interest <- data.frame(dates)
tag.interest$Post.Punk <- prop.plays[,2227]
tag.interest$Indie <- prop.plays[,1413]
tag.interest$Punk <- prop.plays[,2270]
tag.interest$Coldwave <- prop.plays[,654]
tag.interest$Darkwave <- prop.plays[,762]
tag.interest$Twee <- prop.plays[,3003]
tag.interest$Indie.Pop <- prop.plays[,1422]
tag.interest$Hip.Hop <- prop.plays[,1337]

> names(tag.interest)
[1] "dates"     "Post.Punk" "Indie"     "Punk"      "Coldwave"  "Darkwave"  "Twee"      "Indie.Pop" "Hip.Hop"  

xyplot(read.zoo(tag.interest), type=c("l", "g"),
       ylab="Proportion of songs containing tag",
       main="Trends in musical genre listening habits",
       panel = function(x, y, col, ...) {
         panel.xyplot(x, y, col = "blue", ...)
         panel.loess(x, y, col = "red", lwd=3)
       })

This produced my favorite plot:
Genre listening trends

Looking at it, I remembered a period of time in 2009 when I listened almost exclusively to Hip-Hop music. I was also reminded that I got into the "coldwave" and "darkwave" genres rather recently, around the same time as each other, in the summer of 2011. Another neat result is that there is a fairly strong negative correlation between my "twee" listening and my "darkwave" listening, as these genres are almost musical 'opposites'.

This has been a fun trip down memory lane for me. My only regret is that I didn't open my last.fm account sooner... as long as it was after a period in my childhood music-listening that I would be embarrassed to have on digital record.
