Genre-based Music Recommendations Using Open Data (and the problem with recommender systems)

After a long 12 months of pouring my soul into it, my book, Data Analysis with R, was finally published. After the requisite 2-4 day breather, I started thinking about how I was going to get back into the swing of regular blog posts and decided that the easier and softer way is to cannibalize and expand on an example in the book.

In the chapter "Sources of Data" I show how to consume web data of different formats in R. The motivating example is to build a simple recommendation system that uses user-supplied "tags" (genres/labels) submitted to Last.fm and MusicBrainz to quantify musical artist "similarity". The example in the book stops at the construction and sorting of the similarity matrix but, in this post, we're going to make a really fly D3 visualization of the musical similarity network and provide recommendations in the tooltips. The code I used for this post, including the JavaScript and HTML, was hastily thrown into a git repo and is available here. If you're uninterested in the detailed methodology, I suggest you skip to the section labeled "Outcome".

Methodology

Although in the book tags from both Last.fm and MusicBrainz are used, we'll just be using Last.fm here. (In additional contrast to the book, the code here is, as you might imagine, substantially faster-paced.)

The first step is to make a character vector of all the artists that you'd like to be included. If you were building a real system, you'd probably want all Last.fm artists. Since we're not, I just used 70 of my most played artists on my Last.fm. Since I got the list straight from the source, I didn't have to worry that any of the API requests would return "No Artist Found".

The following is a function that takes an artist and returns the properly formatted Last.fm API call to get the tags in JSON format.

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  # reserved=TRUE also escapes characters like "&" and "/" in artist names
  encoded_artist <- URLencode(artist_name, reserved=TRUE)
  return(paste0(prefix, encoded_artist, postfix))
}
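To see why the artist name has to be encoded before it is pasted into the URL, here is a rough Python 3 sketch of the same idea. The function name simply mirrors the R one, the API key is a placeholder, and "Belle & Sebastian" is just an illustrative name containing a character ("&") that would otherwise be read as a query-string separator.

```python
from urllib.parse import quote

def create_artist_query_url_lfm(artist_name, api_key="YOUR_API_KEY"):
    prefix = "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
    postfix = "&api_key=" + api_key + "&format=json"
    # safe="" makes quote() escape every reserved character, including "/" and "&"
    return prefix + quote(artist_name, safe="") + postfix

url = create_artist_query_url_lfm("Belle & Sebastian")
```

Without the escaping, everything after the "&" in the artist name would be parsed as a separate (and meaningless) query parameter.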

This is an example of the JSON payload from my favorite merengue artist.

We only want the tag names--curiously, attempts to factor in degree of tag fit (the "count" attribute) resulted in (what I interpreted as) substantially poorer recommendations.
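As a rough illustration of what "only the tag names" means, here is a Python 3 sketch. The payload shape (toptags, then tag, then name and count) is an assumption inferred from the way the R code in this post indexes the response, and the tag values themselves are made up.

```python
import json

# Illustrative payload only: the shape is inferred from how the R code
# indexes the response; the tag names and counts are invented.
payload = """
{"toptags": {"tag": [
    {"name": "merengue", "count": 100},
    {"name": "latin", "count": 62},
    {"name": "tropical", "count": 15}
]}}
"""

def extract_tag_names(raw_json):
    parsed = json.loads(raw_json)
    # keep only the names; the "count" attribute is deliberately ignored
    return [tag["name"] for tag in parsed["toptags"]["tag"]]

tags = extract_tag_names(payload)
```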

The following is a function that will return a vector of all the tags.

library(jsonlite)

get_tag_frame_lfm <- function(an_artist){
  print(paste0("Attempting to fetch: ", an_artist))
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(as.vector(json$toptags$tag[,"name"]))
}

Since the above function is referentially transparent, and it involves using resources that aren't yours, it's a good idea to memoize the function so that if you (accidentally or otherwise) call the function with the same artist, the function will return the cached result instead of making the web request again. This can be achieved quite easily with the memoise package.

library(memoise)
mem_get_tag_frame_lfm <- memoise(get_tag_frame_lfm)
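If memoization is new to you, the effect can be sketched in Python 3 with functools.lru_cache. This is only an analogy to what memoise does in R; the function below fakes the web request and counts how many times it really runs.

```python
from functools import lru_cache

request_count = 0  # counts how often the "expensive" function really runs

@lru_cache(maxsize=None)
def get_tags(artist):
    # stand-in for the web request; a real version would call the API here
    global request_count
    request_count += 1
    return ("tag-for-" + artist,)

get_tags("KRS-One")
get_tags("KRS-One")    # served from the cache, no second "request"
get_tags("Morrissey")
```

Three calls, but only two simulated requests: the repeated artist never reaches the function body.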

To get the tags from all the artists in our custom ARTIST_LIST vector...

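# lapply always returns a list; sapply would collapse the result into a
# matrix if every artist happened to have the same number of tags
artists_tags <- lapply(ARTIST_LIST, mem_get_tag_frame_lfm)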
names(artists_tags) <- ARTIST_LIST

To get a list of all pairs of artists to compute the similarity for, we can use the combn function to create a 2 by 2,415 character matrix of all possible combinations (70 choose 2). Let’s get that into a 2,415 by 2 data.frame with the names "artist1" and "artist2"...

cmbs <- combn(ARTIST_LIST, 2)
# keep the artist names as character strings, not factors
comparisons <- data.frame(t(cmbs), stringsAsFactors=FALSE)
names(comparisons) <- c("artist1", "artist2")
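The 2,415 figure is just "70 choose 2". A quick Python 3 sanity check, using a made-up list of 70 placeholder names in place of the real ARTIST_LIST:

```python
from itertools import combinations
from math import comb

# hypothetical stand-in for the 70-artist ARTIST_LIST
artist_list = ["artist_%d" % i for i in range(70)]

# every unordered pair exactly once, as (artist1, artist2) rows
comparisons = list(combinations(artist_list, 2))

n_pairs = len(comparisons)
```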

The similarity metric we’ll be using is simple as all get-out: the Jaccard index. Assuming we put the tags from both artists into two sets, it is the cardinality of the sets' intersection divided by the cardinality of the sets' union...

jaccard_index <- function(tags1, tags2){
  length(intersect(tags1, tags2))/length(union(tags1, tags2))
}

comparisons$similarity <- apply(comparisons, 1,
  function(arow){
    jaccard_index(artists_tags[[unlist(arow[1])]],
                  artists_tags[[unlist(arow[2])]])
  }) 

Our previously 2,415 by 2 data.frame now has a third column, "similarity", that contains the Jaccard index of each pair of artists.
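For a concrete feel for the numbers, here is a small worked example in Python 3. The two tag lists are invented for illustration, and returning 0 for two empty tag sets is my own assumption (the R version would return NaN in that case).

```python
def jaccard_index(tags1, tags2):
    set1, set2 = set(tags1), set(tags2)
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two artists with no tags at all count as dissimilar
    return len(set1 & set2) / len(union)

# made-up tag lists, purely for illustration
a = ["hip-hop", "rap", "old school", "east coast"]
b = ["hip-hop", "rap", "jazz rap", "east coast", "native tongues"]

similarity = jaccard_index(a, b)  # 3 shared tags out of 6 distinct tags
```

Here three tags are shared out of six distinct tags overall, so the index is 0.5.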

Our D3 visualization expects a JSON object with two top-level attributes: "nodes" and "links". The "nodes" attribute is an array of x objects (where x is the number of nodes), each with 5 key-value pairs. The 5 keys are "name" (the name of the artist), "group" (a number that affects the coloring of the node in the visualization that we will be setting to "1"), and "first", "second", and "third", which are the top 3 most similar artists and will serve as the recommendations that pop up in a tooltip when you mouse over an artist node in the visualization.

This is some code to get the top 3 most similar artists. It takes the 2,415 by 3 comparisons data.frame, the number of "most similar artists" to return, an artist, and an arbitrary threshold for "similar-ness" as arguments. Any similarity below this threshold will not be considered a viable recommendation.

library(dplyr)
library(magrittr)  # provides the %<>% compound assignment pipe
get_top_n <- function(comparisons, N, artist, threshold){
  comparisons %<>%
    filter(artist1==artist | artist2==artist) %>%
    arrange(desc(similarity))
  other_artist <- ifelse(comparisons$similarity>threshold,
                         ifelse(comparisons$artist1==artist,
                                comparisons$artist2, comparisons$artist1),
                         "None")
  return(other_artist[1:N])
}

The inner ifelse clause has to handle the fact that the "similar" artist can be in the first column or the second column. The outer ifelse returns "None" for every similarity value that is not above the threshold.
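The same two-step logic (pick the "other" artist, then blank out weak matches) can be sketched in Python 3. This is not a line-for-line port: the comparisons are plain tuples with made-up similarities, and I pad short results with "None" where the R version would yield NA.

```python
def get_top_n(comparisons, n, artist, threshold):
    # comparisons: list of (artist1, artist2, similarity) tuples
    mine = [row for row in comparisons if artist in (row[0], row[1])]
    mine.sort(key=lambda row: row[2], reverse=True)
    picks = []
    for artist1, artist2, similarity in mine[:n]:
        # the "similar" artist can sit in either column
        other = artist2 if artist1 == artist else artist1
        picks.append(other if similarity > threshold else "None")
    # pad so the result always has exactly n entries
    return picks + ["None"] * (n - len(picks))

# made-up similarities for four hypothetical artists
comparisons = [("A", "B", 0.5), ("A", "C", 0.1), ("B", "C", 0.3), ("A", "D", 0.4)]
top_for_a = get_top_n(comparisons, 3, "A", 0.25)
```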

Let's make the data.frame that will serve as the "nodes" attribute in the final JSON...

nodes <- sapply(ARTIST_LIST, function(x) get_top_n(comparisons, 3, x, 0.25))
nodes <- data.frame(t(nodes))
names(nodes) <- c("first", "second", "third")
nodes$name <- row.names(nodes)
row.names(nodes) <- NULL
nodes$group <- 1

For the other top-level JSON attribute, "links", we need an array of y objects, each with 5 key-value pairs, where y is the number of sufficiently strong similarities between the artists. The 5 keys are "node1" (the name of the first artist), "source" (the 0-indexed index of the artist with respect to the array in the "nodes" attribute), "node2" (the name of the second artist), "target" (the index of the second artist) and "weight", which is the degree of similarity between the two artists; this will translate into thicker "edges" in the similarity graph.

# find the 0-indexed index
lookup_number <- function(name) which(name==ARTIST_LIST)-1

strong_links <- comparisons %>%
  filter(similarity > 0.25) %>%
  rename(node1 = artist1, node2 = artist2, weight=similarity)
strong_links$source <- sapply(strong_links$node1, lookup_number)
strong_links$target <- sapply(strong_links$node2, lookup_number)

Finally, we can create the properly formatted JSON and send it to the file "artists.json" thusly...

object <- list("nodes"=nodes,
               "links"=strong_links)

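# write the JSON string directly; unlike sink(), this does not rely on
# auto-printing, so it also works when the script is run via source()
writeLines(toJSON(object, dataframe="rows", pretty=TRUE), "artists.json")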
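To make the target structure concrete, here is a tiny Python 3 sketch that builds the same "nodes"/"links" layout for three hypothetical artists with made-up similarities. The recommendation keys ("first", "second", "third") are left out for brevity, and the 0.25 cutoff is the same threshold used above.

```python
import json

artist_list = ["A", "B", "C"]  # hypothetical stand-in for ARTIST_LIST
comparisons = [("A", "B", 0.5), ("A", "C", 0.1), ("B", "C", 0.3)]  # made-up similarities

# recommendation keys omitted for brevity
nodes = [{"name": name, "group": 1} for name in artist_list]

links = [
    {"node1": a1, "node2": a2, "weight": sim,
     # list.index() is already 0-based, so no "-1" is needed as in R
     "source": artist_list.index(a1), "target": artist_list.index(a2)}
    for a1, a2, sim in comparisons if sim > 0.25
]

graph_json = json.dumps({"nodes": nodes, "links": links}, indent=2)
```

The "source" and "target" numbers are positions in the "nodes" array, which is why the order of the artist list has to be the same in both places.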

Outcome

Musical Similarity Network

Using "artists.json" and the "index.html" that can be found here, the similarity graph looks a little like this. (Make sure you scroll to see the whole thing.)

For illustrative purposes, I pre-labeled the artists' "group" with labels that correspond to what I view as the artist's primary genre. This is why the nodes in the linked visualization have different colors. Note that, independently, the genres that I indicated tend to cluster together in the network. For example, Reggae (light green), Hip-Hop (green), and Punk (orange) all form almost completely connected graphs, though unconnected to each other (disjoint subgraphs). Indie rock (blue), post-punk (light blue) and classic rock (light orange) together form a rather tightly-connected subgraph. Curiously, the Sex Pistols (whom I labeled "Punk") are not part of the Punk cluster but part of the Indie-rock/post-punk/classic-rock component. There are three orphan nodes (no edges): "Johann Sebastian Bach", "P:ano", and "No Kids". Bach is orphaned because he's the only Baroque artist in my top 70 artists :( --P:ano and No Kids are obscure... you’ve probably never heard of them.

The recommendations, prima facie, appear to be on point. For example, without direct knowledge of association, "KRS-One" recommends "Boogie Down Productions" (the group that KRS-One comes from) most highly. Similarly, "The Smiths" and "Morrissey" recommend each other, and "De La Soul" and "A Tribe Called Quest" (part of a positive, Afrocentric hip-hop collective known as the Native Tongues together with Queen Latifah, et al.) recommend each other.

Appropriately, Joy Division and New Order, whose Jaccard index of band members is 0.6 but whose music style is somewhat distinct, don't recommend each other.

Lastly, subgenred artists appear to recommend other artists in the subgenre. For example, goth band "The Sisters of Mercy" appropriately recommends other goth-esque bands "Bauhaus", "And Also The Trees", and "Joy Division".

Afterword

Using this similarity measure to drive recommendations seems successful. It should be noted, though, that my ability to assess the effectiveness of using the Jaccard index as the sole arbiter of musical similarity is hampered; judging an algorithm on the basis that the system recommends other bands that I necessarily like is prejudicial, to say the least.

This stands even if the system makes good theoretical sense. This still stands even if the system, quite independently, indicates that associated acts—that are objectively and incontrovertibly similar—are good recommendations.

This raises a larger question on how to accurately measure the effectiveness of recommender systems; do you tell people what they want to hear, or do you pledge allegiance to a particular theoretical interpretation of similarity? If it's the latter, how do you iterate and improve the system? If it's the former, is your only criterion for success positive user-provided feedback?


Using Last.fm to data mine my music listening history

I've (passively) been keeping meticulous records of almost every song I've listened to since January of 2008. Since I opened my last.fm account 6 years ago, they've accumulated a massive detailed dataset of the 107,222 songs I've listened to since then. The best thing is that they're willing to share this data with me!

I used the last.fm developer REST API to (over a very long period of time) retrieve my entire listening history, the date(s) that I've listened to each song, and the top three user-submitted "tags" for each song.

I want to glean every bit of insight that I can out of this data. For this post, I focused on:

  • total listening history over time
  • music "diversity" levels
  • trends in my musical genre listening habits

In future posts, I hope to explore other things like using PCA to determine "orthogonal" music genres, constructing similarity matrices, predicting trends, and performing acoustic analysis.

This has been one of my favorite pet-projects because it combines three things that I love:

  • data mining
  • music
  • navel-gazing

I used both R and Python in this analysis. Let’s get into it!

Obtaining data
Getting the data using the last.fm REST API was very straightforward; the only hiccups I encountered were the fault of Python2's unicode snafus. For the web requests I used the urllib2 module and to handle the XML responses I used the amazing lxml module. The code to get my whole listening history looked a little like this:

#!/usr/bin/env python -tt

import urllib2
import time
from lxml import etree
from StringIO import StringIO

baseurl = ''.join(["http://ws.audioscrobbler.com/2.0/",
                   "?method=user.getrecenttracks",
                   "&user=statethatiamin&api_key=XXXXXXX&limit=200"])

def clean_xml(the_xml):
    return "\n".join(the_xml.split("\n")[3:-2])

# let's get the first page so we know how many pages there are
response = urllib2.urlopen(baseurl+"&page=1", timeout=200)
html = response.read()

# parse the XML tree
doc = etree.parse(StringIO(html))

# use Xpath to query the number of pages
num_pages = int(doc.xpath("/lfm/recenttracks")[0].get("totalPages"))

# file to dump results
fh = open("all_the_tracks.xml", "a")

# last.fm pages are 1-indexed, so start at page 1 (there is no page 0)
for page in xrange(1, num_pages+1):
    # I'm nice so I don't want to hit last.fm
    # with a bunch of requests a second.
    # Let's wait ten seconds between each request
    time.sleep(10)
    progress = "On page {} of {}...........  {}%"
    print progress.format(str(page),
                          str(num_pages),
                          str(round(float(page)/num_pages*100, 1)))
    response = urllib2.urlopen(baseurl+"&page="+str(page))
    html = response.read()
    the_xml = clean_xml(html)
    fh.write(the_xml)
fh.close()

I decided to make the requests for the user-submitted tags in another python script. The script is a little too long to post here, but it basically iterated over all "track" nodes in the output of the last script, and parsed the results from a REST query of tags. Since I'm considerate, I put a long wait between each request for the over 100,000 songs. Even though I handled repeated tracks gracefully, it took days to finish. I used the pickle module to serialize the sum of data I got at regular intervals so a failure during the night of day 2 wouldn't have been catastrophic.
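The periodic pickling is worth a quick sketch, since it is what makes a days-long scrape survivable. The following Python 3 version is not the script I ran; the file name, the track names, and the save interval are all made up, and the REST query is replaced by a stand-in.

```python
import os
import pickle
import tempfile

def save_checkpoint(data, path):
    # write to a temporary file first, then rename, so a crash mid-write
    # never leaves a half-written checkpoint behind
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        pickle.dump(data, fh)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as fh:
        return pickle.load(fh)

checkpoint = os.path.join(tempfile.mkdtemp(), "tags.pickle")  # hypothetical file name
tags_by_track = load_checkpoint(checkpoint)         # empty on the first run
tracks = ["track one", "track two", "track three"]  # made-up track names
for i, track in enumerate(tracks, start=1):
    if track in tags_by_track:
        continue                     # already fetched before a restart
    tags_by_track[track] = ["tag"]   # stand-in for the real REST query
    if i % 2 == 0:                   # checkpoint every 2 tracks (arbitrary interval)
        save_checkpoint(tags_by_track, checkpoint)
restored = load_checkpoint(checkpoint)
```

On a restart, everything already in the checkpoint is skipped, so only the tracks fetched since the last save have to be requested again.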

XML transformations and XPath
There was still a little bit of cleanup to do... I used various shell commands to remove all unnecessary elements from the XML documents and escape the characters that I forgot to escape. Then I had to organize the data by date so that I could do time series analysis. The script I used to accomplish this is as follows:

#!/usr/bin/env python -tt

from lxml import etree
import codecs

# read cleaned up track history XML
doc = etree.parse("escaped_processed.xml")

fh = codecs.open("bydate.xml", "a", encoding="utf-8")

# get all the dates (previously restricted to just month and year)
udates = list(set([date.text for date in doc.xpath("//date")]))

# create a new DOM tree to hang the transformation upon
root = etree.Element("bydate")

for cdate in udates:
    # element tags can't start with a number
    # add a "d" to it
    this = etree.SubElement(root, 'd' + cdate)
    # get all tracks listened to on that date
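    # quote the date so XPath compares it as a string; unquoted, "03-2008"
    # would be evaluated as the subtraction 3 - 2008
    these_tracks = doc.xpath("/alltags/track[date='" + cdate + "']")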
    # add the tracks to the DOM
    for itrack in these_tracks:
        this.append(itrack)

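# encoding=unicode makes lxml return a unicode string, which the
# codecs writer expects (a byte string with non-ASCII text would fail)
fh.write(etree.tostring(root, pretty_print=True, encoding=unicode))
fh.close()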

Finally, I whipped up a quick script to sum the number of listens on a particular tag for each time interval.

At this point we have a file "playnumbymonth.csv" with the dates and total tracks listened to for that month that looks like this...

date,numlistens
03-2008,1422
10-2008,1394
05-2008,923
12-2009,640
10-2009,630
..........

and a ("melted") file called "longformat.csv" that holds dates, tag names, and the number of tracks (played in that month) that contained the tag. It looks like this...

date,tag,number
03-2008,folk rock,1
03-2008,summer,1
03-2008,spoken word,2
03-2008,cute,5
03-2008,dance,11
..........
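The summing script itself isn't shown, but the idea behind both files can be sketched in a few lines of Python 3. The records below are made up, and counting each tag at most once per track is my assumption about how "tracks that contained the tag" should be read.

```python
from collections import Counter

# made-up (date, tags) records standing in for the parsed track history
listens = [
    ("03-2008", ["folk rock", "cute"]),
    ("03-2008", ["cute", "dance"]),
    ("05-2008", ["dance"]),
]

plays_per_month = Counter()
tag_counts = Counter()
for month, tags in listens:
    plays_per_month[month] += 1
    for tag in set(tags):            # count each tag at most once per track
        tag_counts[(month, tag)] += 1

# rows in the same "long" layout as longformat.csv: date,tag,number
long_rows = sorted((month, tag, n) for (month, tag), n in tag_counts.items())
```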

R analytics and visualization
First, to visualize the number of songs I’ve listened to over time, I had to import the "playnumbymonth.csv" dataset, parse the date with the lubridate package, make a "zoo" time series object out of the dataframe, and plot it.

library(zoo)
library(lubridate)

plays <- read.csv("playnumbymonth.csv", stringsAsFactors=FALSE)

# parse dates
plays$date <- parse_date_time(plays$date, "my")

#make time series object
tsplays <- read.zoo(plays)

#plot it with a LOWESS smooth curve
loline <- lowess(tsplays, f=.5)
plot(tsplays, main="Plays per month since 2008", ylab="Number of plays", xlab="Date")
lines(index(tsplays), loline$y, col='red', lwd=2)

The resulting plot looks like this:
Plays per month

While I was working with this data set, I wanted to check if there was any periodicity to my listening history (perhaps I listen to more music in the winter than I do in the summer). I briefly attempted to use seasonal decomposition and autocorrelation to try to detect this. No dice.
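For reference, this is roughly what the autocorrelation check looks for. The Python 3 sketch below uses the usual sample autocorrelation estimator on a synthetic series with a built-in 12-month cycle (not my listening data): a seasonal series shows a strong positive value at lag 12 and a strong negative one at lag 6.

```python
def autocorrelation(series, lag):
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

# synthetic series with a perfect 12-month cycle: six "high" months, six "low"
seasonal = [100 if month % 12 < 6 else 50 for month in range(72)]
r12 = autocorrelation(seasonal, 12)
r6 = autocorrelation(seasonal, 6)
```

My real series showed nothing this clear at any lag.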

For the musical "diversity" and genre listening trends, I read in "longformat.csv" and used the reshape package to pivot by tags until I had a huge matrix where each row was a month between 2008 and 2014, and each column was a last.fm tag. Then I used the vegan (vegetation analysis) package to take the Shannon diversity index of each month with respect to the richness and evenness of the tags listened to:

library(reshape)  # cast()
library(vegan)    # diversity()

long.tag.frame <- read.csv("longformat.csv", stringsAsFactors=FALSE)
long.tag.frame$date <- parse_date_time(long.tag.frame$date, "my")

wide.frame <- data.frame(cast(long.tag.frame, date~tag, value="number"))
# convert all NAs to zero
wide.frame[is.na(wide.frame)] <- 0

new.frame <- data.frame(wide.frame[,1])
new.frame$diversity <- diversity(wide.frame[,-1])
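If the Shannon index is unfamiliar: for one month it is H = -sum(p * ln p), where p is the share of tag occurrences belonging to each tag. As far as I know, this (with the natural log) is what vegan's diversity() computes by default. A small Python 3 sketch with made-up counts shows how evenness drives the number:

```python
from math import log

def shannon_diversity(counts):
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: a month with no tags has zero diversity
    # H = -sum(p * ln p) over tags with a non-zero count
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

even_month = [10, 10, 10, 10]   # made-up counts: four tags, equally played
skewed_month = [37, 1, 1, 1]    # same four tags, but one dominates

h_even = shannon_diversity(even_month)
h_skewed = shannon_diversity(skewed_month)
```

Both months have the same number of tags (richness), but the evenly spread one scores ln(4), the maximum for four tags, while the skewed one scores much lower.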

After some cleanup and "zoo" object creation, and LOWESS curve creation, the plot of the listening data and diversity indices looked like this:
Number of plays and variety

Visualizing how my music tastes have (apparently) changed over time was the best part. I created a diagonal matrix from the multiplicative inverse of the number of tracks that I listened to each month and matrix-multiplied this with the wide tag matrix. The result of this computation was the proportion of songs I listened to each month that contained each tag.
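The diagonal-matrix trick is just a compact way of dividing every row of the tag matrix by that month's total. A toy Python 3 sketch with made-up numbers (two months, two tags) and hand-rolled helpers, so nothing beyond the standard library is assumed:

```python
def diag_inverse(totals):
    # diagonal matrix with 1/total on the diagonal and 0 elsewhere
    n = len(totals)
    return [[1.0 / totals[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

plays_per_month = [4, 10]        # made-up monthly totals
tag_counts = [[1, 2], [5, 10]]   # rows are months, columns are tags

prop_plays = matmul(diag_inverse(plays_per_month), tag_counts)
```

Row one becomes 1/4 and 2/4, row two becomes 5/10 and 10/10: each entry is the share of that month's plays carrying the tag.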

I took a few choice tags corresponding to some of my favorite musical genres, put them in a new data frame ("tag.interest") and used the lattice package to visualize the trends.

tag.interest <- data.frame(dates)
tag.interest$Post.Punk <- prop.plays[,2227]
tag.interest$Indie <- prop.plays[,1413]
tag.interest$Punk <- prop.plays[,2270]
tag.interest$Coldwave <- prop.plays[,654]
tag.interest$Darkwave <- prop.plays[,762]
tag.interest$Twee <- prop.plays[,3003]
tag.interest$Indie.Pop <- prop.plays[,1422]
tag.interest$Hip.Hop <- prop.plays[,1337]

> names(tag.interest)
[1] "dates"     "Post.Punk" "Indie"     "Punk"      "Coldwave"  "Darkwave"  "Twee"      "Indie.Pop" "Hip.Hop"  

xyplot(read.zoo(tag.interest), type=c("l", "g"),
       ylab="Proportion of songs containing tag",
       main="Trends in musical genre listening habits",
       panel = function(x, y, col, ...) {
         panel.xyplot(x, y, col = "blue", ...)
         panel.loess(x, y, col = "red", lwd=3)
       })

This produced my favorite plot:
Genre listening trends

Looking at it, I remembered a period of time in 2009 when I listened almost exclusively to Hip-Hop music. I was also reminded that I got into the "coldwave" and "darkwave" genres rather recently and around the same time as each other, in the summer of 2011. Another neat result is that there is a fairly strong negative correlation between my "twee" music listening and my "darkwave" music listening history, as these genres are almost musical 'opposites'.

This has been a fun trip down memory lane for me. My only regret is that I didn't open my last.fm account sooner... as long as it was after a period in my childhood music-listening that I would be embarrassed to have on digital record.
