Genre-based Music Recommendations Using Open Data (and the problem with recommender systems)

After a long 12 months of pouring my soul into it, my book, Data Analysis with R, was finally published. After the requisite 2-4 day breather, I started thinking about how I was going to get back into the swing of regular blog posts and decided that the easier and softer way is to cannibalize and expand on an example in the book.

In the chapter "Sources of Data" I show how to consume web data of different formats in R. The motivating example is to build a simple recommendation system that uses user-supplied "tags" (genres/labels) submitted to Last.fm and MusicBrainz to quantify musical artist "similarity". The example in the book stops at the construction and sorting of the similarity matrix but, in this post, we're going to make a really fly D3 visualization of the musical similarity network and provide recommendations in the tooltips. The code, including the Javascript and HTML, I used for this post was hastily thrown into a git repo and is available here. If you're uninterested in the detailed methodology, I suggest you skip to the section labeled "Outcome".

Methodology

Although in the book tags from both Last.fm and MusicBrainz are used, we'll just be using Last.fm here. (In additional contrast to the book, the code here is, as you might imagine, substantially faster-paced.)

The first step is to make a character vector of all the artists that you'd like to be included. If you were building a real system, you'd probably want all Last.fm artists. Since we're not, I just used 70 of my most played artists on my Last.fm. Since I got the list straight from the source, I didn't have to worry that any of the API requests would return "No Artist Found".

The following is a function that takes an artist and returns the properly formatted Last.fm API call to get the tags in JSON format.

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

This is an example of the JSON payload from my favorite merengue artist.

We only want the tag names--curiously, attempts to factor in degree of tag fit (the "count" attribute) resulted in (what I interpreted as) substantially poorer recommendations.

The following is a function that will return a vector of all the tags.

library(jsonlite)

get_tag_frame_lfm <- function(an_artist){
  print(paste0("Attempting to fetch: ", an_artist))
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(as.vector(json$toptags$tag[,"name"]))
}

Since the above function is referentially transparent, and it involves using resources that aren't yours, it's a good idea to memoize the function so that if you (accidentally or otherwise) call the function with the same artist, the function will return the cached result instead of making the web request again. This can be achieved quite easily with the memoise package.

library(memoise)
mem_get_tag_frame_lfm <- memoise(get_tag_frame_lfm)

To get the tags from all the artists in our custom ARTIST_LIST vector...

artists_tags <- sapply(ARTIST_LIST, mem_get_tag_frame_lfm)
names(artists_tags) <- ARTIST_LIST

To get a list of all pairs of artists to compute the similarity for, we can use the combn function to create a 2 by 2,415 character matrix of all possible combinations (choose 2). Let’s get that into a 2,415 by 2 data.frame with the name "artist1" and "artist2"...

cmbs <- combn(ARTIST_LIST, 2)
comparisons <- data.frame(t(cmbs))
names(comparisons) <- c("artist1", "artist2")

The similarity metric we’ll be using is simple as all get-out: the Jaccard index. Assuming we put the tags from both artists into two sets, it is the cardinality of the sets' intersection divided by the sets' union...

jaccard_index <- function(tags1, tags2){
  length(intersect(tags1, tags2))/length(union(tags1, tags2))
}

comparisons$similarity <- apply(comparisons, 1,
  function(arow){
    jaccard_index(artists_tags[[unlist(arow[1])]],
                  artists_tags[[unlist(arow[2])]])
  }) 

Now we've added a new column to our previously 2,415 by 2 data.frame, "similarity" that contains the Jaccard index.

Our D3 visualization expects a JSON with two top level attributes: "nodes" and "links". The "nodes" attribute is an array of x number of 5 key-value pairs (where x is the number of nodes). The 5 keys are "name" (the name of the artist) "group" (a number that affects the coloring of the node in the visualization that we will be setting to "1"), and "first", "second", and "third", which are the top 3 most similar artists and will serve as the recommendations that pop-up in a tool-tip when you mouse over an artist node in the visualization.

This is some code to get the top 3 most similar artists. It takes the 2,415 by 3 comparisons data.frame, the number of "most similar artists" to return, an artist, and an arbitrary threshold for "similar-ness" as arguments. Any similarity below this threshold will not be considered a viable recommendation.

library(dplyr)
get_top_n <- function(comparisons, N, artist, threshold){
  comparisons %<>%
    filter(artist1==artist | artist2==artist) %>%
    arrange(desc(similarity))
  other_artist <- ifelse(comparisons$similarity>threshold,
                         ifelse(comparisons$artist1==artist,
                                comparisons$artist2, comparisons$artist1),
                         "None")
  return(other_artist[1:N])
}

The inner ifelse clause has to handle the fact that the "similar" artist can be in the first column or the second column. The outer ifelse returns "None" for every similarity value that is not above the threshold.

Let's make the data.frame that will serve as the "nodes" attribute in the final JSON...

nodes <- sapply(ARTIST_LIST, function(x) get_top_n(comparisons, 3, x, 0.25))
nodes <- data.frame(t(nodes))
names(nodes) <- c("first", "second", "third")
nodes$name <- row.names(nodes)
row.names(nodes) <- NULL
nodes$group <- 1

For the other top-level JSON attribute, "links", we need an array of y number of 5 key-value pairs where y is the number of sufficiently strong similarities between the artists. The 5 keys are "node1" (the name of the first artist), "source" (the 0-indexed index of the artist with respect to the array in the "nodes" attribute), "node2" (the name of the second artist), "target" (the index of the second artist) and "weight", which is the degree of similarity between the two artists; this will translate into thicker "edges" in the similarity graph.

# find the 0-indexed index
lookup_number <- function(name) which(name==ARTIST_LIST)-1

strong_links <- comparisons %>%
  filter(similarity > 0.25) %>%
  rename(node1 = artist1, node2 = artist2, weight=similarity)
strong_links$source <- sapply(strong_links$node1, lookup_number)
strong_links$target <- sapply(strong_links$node2, lookup_number)

Finally, we can create the properly formatted JSON and send it to the file "artists.json" thusly...

object <- list("nodes"=nodes,
               "links"=strong_links)

sink("artists.json")
toJSON(object, dataframe="rows", pretty=TRUE)
sink()

Outcome

Musical Similarity Network

Using "artists.json" and the "index.html" that can be found here, the similarity graph looks a little like this. (Make sure you scroll to see the whole thing.)

For illustrative purposes, I pre-labeled the artists' "group" with labels that correspond to what I view as the artist's primary genre. This is why the nodes in the linked visualization have different colors. Note that, independently, the genres that I indicated tend to cluster together in the network. For example, Reggae (light green), Hip-Hop (green), and Punk (orange) all form almost completely connected graphs, though unconnected to each other (disjoint subgraphs). Indie rock (blue), post-punk (light blue) and classic rock (light orange) together form a rather tightly-connected subgraph. Curiously, the Sex Pistols (that I labeled "Punk") are not part of the Punk cluster but part of the Indie-rock/post-punk/classic-rock component. There are three orphan nodes (no edges), "Johann Sebastian Bach", "P:ano", and "No Kids". Bach is orphaned because he's the only Baroque artist in my top 70 artists :( --P:ano and No Kids are obscure... you’ve probably never heard of them.

The recommendations, prima facie, appear to be on point. For example, without direct knowledge of association, "KRS-One" recommends "Boogie Down Productions" (the group that KRS-One comes from) most highly. Similarly, "The Smiths" and "Morrissey" recommend each other, and "De La Soul" and "A Tribe Called Quest" (part of a positive, Afrocentric hip-hop collective known as the Native Tongues together with Queen Latifah, et al.) recommend each other.

Appropriately, Joy Division and New Order, whose Jaccard index of band members is 0.6 but whose music style is somewhat distinct, don't recommend each other.

Lastly, subgenred artists appear to recommend other artists in the subgenre. For example, goth band "The Sisters of Mercy" appropriately recommends other goth-esque bands "Bauhaus", "And Also The Trees", and "Joy Division".

Afterword

Using this similarity measure to drive recommendations seems successful. It should be noted, though, that my ability to assess the effectiveness of using the Jaccard index as the sole arbiter of musical similarity is hampered; judging an algorithm on the basis that the system recommends other bands that I necessarily like is prejudicial, to say the least.

This stands even if the system makes good theoretical sense. This still stands even if the system, quite independently, indicates that associated acts—that are objectively and incontrovertibly similar—are good recommendations.

This raises a larger question on how to accurately measure the effectiveness of recommender systems; do you tell people what they want to hear, or do you pledge allegiance to a particular theoretical interpretation of similarity? If it's the latter, how do you iterate and improve the system? If it's the former, is your only criterion for success positive user-provided feedback?

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

The hardest thing about teaching statistics

(Note: this post should probably be titled "Quantitative Methods of Curricula Planning" but I thought the current title would draw more interest–though they would both lose out to "These Weird Approaches To Lesson Planning Will Leave You Speechless")

Suppose you were tasked with teaching a course about a field of study. There would be, of course, several topics that you are expected to cover by the course end date; how would you decide the order in which to teach them?

Most people would say that the topics should build on one another, with monotonically increasing levels of difficulty. Further, no topic should be brought up that requires comprehension of another topic yet unlearned.

Planning the syllabus under these constraints would, perhaps, come naturally to skilled and empathetic lecturers. But,

  • not all lecturers are skilled and empathetic
  • even satisfying all of these constraints, there are objectively superior and inferior lesson plans
  • there are some subjects for which these constraints cannot be satisfied (statistics)

For these reasons, having a suite of quantitative methods for choosing the best order of topics in teaching a field of study would be valuable to pedagogy (not to mention providing challenging problems for me to focus on instead of writing).

--

I started thinking about this topic as I began to plan writing my book about learning introductory statistics with R. There are, of course, myriad other very good books on this very topic, so I figured that one way I can stand out is to organize the topics in a way that best facilitates mastering the material. I thought that this would be especially appreciated with a field of study that is notoriously scary and difficult to the uninitiated (like statistics is.)

Anyone, anywhere, teaching introductory statistics will be expected to touch on the common topics: measures of central tendency, measures of dispersion, probability, the central limit theorem, sampling theory, etc… I know how everyone else have arranged the topics, but what's the best way?

It might seem strange, but answering that question was probably the hardest thing about putting together this book and in all of my (admittedly limited) experience designing statistics curricula.

Let's speak of graph theory

To explore optimal paths through the topics, we can represent the subject of statistics as a big graph, or network. Each topic would be a node and there would be directed edges indicating when knowledge of a particular topic is a prerequisite to understanding another. Specifically, if there is a edge connecting topic "a" to topic "b", topic "b" requires an understanding of "a"–like long division requires knowledge of subtraction.

This is what a topic network of an excerpt of introductory stats topics might look like.

statistics topics knowledge dependency diagram

In graph theory, this is known as a directed acyclic graph (DAG). DAGs have the property that there exists at least one ordering of nodes such that no node in the ordering is connected to ("pointing to") a node earlier in the ordering. This is called a topological sort. For most DAGs, there are a number of different orderings that satisfy the ‘dependency’ constraints.

Now that I have your attention, let's now speak of monads

To get a list of all of them, I wrote a small library and set of algorithms in Haskell. You can view it here but the "meat" of the algorithm is in the following snippet that recursively adds all nodes with no children (topics that have no topics that depend on them) to a list of possible alternatives and removes the childless nodes. This is repeated until there are no nodes left to remove. A potential snag is that the function only takes one path but each function call may generate multiple alternate paths. However, if we view the output of the "gatherAllChildless" function as a non-deterministic computation, we can exploit the fact that the path of nodes is a monad and have the function recursively call itself inside of a monadic bind.

This has a sub-quadratic time complexity (< O(n^2))… not too bad. There are 26 possible orderings of the topics that satisfy these “knowledge dependencies” including:

probability -> central tendency -> measures of dispersion -> sampling theory -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST

central tendency -> probability -> measures of dispersion -> probability distributions -> sampling theory -> sampling distributions -> central limit theorem -> statistical inference -> NHST

There are a few of the ordering that intuitively seem like poor choices. Taking the first one, for example: it might be strange to start a book on statistics with probability when readers may want to get starting with univariate analysis right away. Looking at the second one, it seems strange to stick "probability" in between "central tendency" and "measures of dispersion", even though it can technically be done, because most people expect highly related topics to be positioned next to each other.

One way of cutting down on the list is to label each topic node with a difficulty level, and choose the ordering which causes the fewest backwards jumps in difficulty level. This should represent the path that has the most gentle level-of-difficulty slope.

Given the algorithms from lines 67 to 78 of TopoSort.hs and the following (subjective) difficulty mapping:

"central tendency": "1"
"measures of dispersion": "2"
"sampling theory": "3"
"sampling distributions": "3"
"central limit theorem": "5"
"probability": "4"
"probability distributions": "3"
"statistical inference": "5"
"NHST": "5"

the “optimal” ordering is:

central tendency -> measures of dispersion -> sampling theory -> probability -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST

Yay! This is pretty close to the ordering I chose.

--

The most truly difficult thing about sorting this out is that the statistics topic network diagram is not a DAG. This means that there is no ordering possible that doesn’t appeal to topics yet unlearned. For example, explaining why sample standard deviation divides by n-1 instead of n requires appealing to sampling theory, which requires a good foundation in measures of dispersion to understand. There are a few more of these cyclical relationships in the field.

All of these instances require some hand-waving on the part of the writer or lecturer ("don't worry about why we divide by 'n-1', we’ll get to that later") and adds to the learner's perceived difficulty of grasping the field.

The best way to reconcile these circular knowledge dependencies is to introduce weight to the edges that represent the extent to which a topic requires knowledge of another. Then, a cycle detection algorithm can be run on the graph. Once all the cycles are detected, the edges in the cycles with the lowest weight can be systematically removed until there are no more cycles and the graph is a DAG. At that point, the specialized topo sort from above may be used. I plan on implementing this when I have more time :)

--

It's my hope that these and other qualitative methods for planning curricula can be applied to other legendarily confusing fields of study. These methods can even be applied to entire undergraduate course catalogues and major requirements to guide students over 4+ years of undergraduate study.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Visualizing data analysis pipelines using NetworkX

In complicated data analysis pipelines and scientific workflows, it's often difficult to keep track of which tasks have to be performed before others. Even with informal forms of documentation (my personal favorite is 'notes.txt'), as the size of a project grows, and more dependencies are introduced, a formal documentation process has to be put in place, or else the project will become unsustainable.

I'm writing a automated system for statisticians and scientists for carrying out large multistep analytics processes. I'll discuss this more in later posts, but the details pertinent for this post are that each step of a analytics pipeline is detailed in a YAML document called a "Sakefile" (not-so-clever play on Makefile) with sections explicitly defining dependencies and resulting output files.

Given dependency resolution's usage of concepts from graph theory (topological sorting) I thought it would be easy and neat to write a tool to visualize the components and dependencies that go into an analytics workflow as a directed graph.

I've rustled up a simple example examining correlates of DUI arrests with various adolescent-related data by state. I chose these data sets because they’re very small and freely available on the net.

The "Sakefile" looks like this:

---
format dui stats:
    help: format raw (copy and pasted) dui/state data using perl
    dependencies:
        - rawdata.txt
    formula: >
        perl -pe 's/^(\D+)\s+([\d,]+)\s+([\d,]+)\s*/\1\t\2\t\3\n/'
        rawdata.txt | sed 's/,//g' > duistats.tsv;
    output:
        - duistats.tsv

fetch teen stats:
    help: fetches various teen statstics from the web
    # no dependencies
    formula: >
        curl -o teenstats.xls http://mathforum.org/workshops/sum96/data.collections/datalibrary/US_TeenStats.XL.zip.xls;
    output:
        - teenstats.xls

convert teen stats to csv:
    help: uses gnumerics ssconvert to convert ugly xls to csv and cleans it
    dependencies:
        - teenstats.xls
    formula: >
        ssconvert teenstats.xls messyteenstats.csv;
        cat <(echo -n "state") <(< teenstats.csv sed '55,$d' |
        sed '1,2d') | sed 's/,,/,/g' > teenstats.csv;
        rm messyteenstats.csv;
    output:
        - teenstats.csv

find correlates:
    help: calls R script that finds correlated of DUI arrest in various teen statistics
    dependencies:
        - duistats.tsv
        - teenstats.csv
    formula: >
        ./dui-correlates.R
    output:
        - corrogram.png
        - table.csv

all:
    - format dui stats
    - fetch teen stats
    - convert teen stats to csv
    - find correlates
...

A short description of each of the steps appears in the "help" field on each entry. Basically, there are two source data files: one exists and raw text copy and pasted from a website, and the other is fetched from the web using curl. The former is cleaned and formatted using perl and sed; the latter has to go through a process that converts downloaded excel file into a CSV and strips useless lines. Both of these source data files then get read by an R script which, ultimately, outputs a corrogram graphic and a summarization table.

Below is the small python program that parses the "Sakefile" and created the visualization. It uses the great NetworkX module to create the graph and render it as an image.

#!/usr/bin/env python -tt

import matplotlib.pyplot as plt
import networkx as nx
import pudb
import yaml

sakefile = yaml.load(open("Sakefile.yaml").read())

G = nx.DiGraph()

def check_for_dep_in_outputs(dep):
    print "checking dep {}".format(dep)
    ret_list = []
    for node in G.nodes(data=True):
        if "output" not in node[1]:
            continue
        if dep in node[1]['output']:
            ret_list.append(node[0])
    return ret_list

# make graph nodes for each target
for target in sakefile:
    if target == "all":
        # we don't want this node
        continue
    G.add_node(target, sakefile[target])


for node in G.nodes(data=True):
    print "checking node {} for dependencies".format(node[0])
    if "dependencies" not in node[1]:
        continue
    print "it has dependencies"
    connects = []
    for dep in node[1]['dependencies']:
        matches = check_for_dep_in_outputs(dep)
        if not matches:
            continue
        for match in matches:
            connects.append(match)
    if connects:
        for connect in connects:
            G.add_edge(connect, node[0])


nx.draw(G, node_color="pink", node_size=10000)
plt.savefig("dependency-visualization.png")

The resulting visualization looks like this:

dependency-visualization

Sure, the arrows look weird and this is a really simple example, but it's easy to see that, even for the most byzantine of pipelines, that a visualization like this can really help get a sense all the actions involved in a workflow.

I'll go over the actual running and results of this example in a later post, when I get the "sake" system working properly. :)

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail