The hardest thing about teaching statistics

(Note: this post should probably be titled "Quantitative Methods of Curricula Planning" but I thought the current title would draw more interest–though they would both lose out to "These Weird Approaches To Lesson Planning Will Leave You Speechless")

Suppose you were tasked with teaching a course about a field of study. There would be, of course, several topics that you are expected to cover by the course end date; how would you decide the order in which to teach them?

Most people would say that the topics should build on one another, with monotonically increasing levels of difficulty. Further, no topic should be brought up that requires comprehension of another topic yet unlearned.

Planning the syllabus under these constraints would, perhaps, come naturally to skilled and empathetic lecturers. But,

  • not all lecturers are skilled and empathetic
  • even satisfying all of these constraints, there are objectively superior and inferior lesson plans
  • there are some subjects for which these constraints cannot be satisfied (statistics)

For these reasons, having a suite of quantitative methods for choosing the best order of topics in teaching a field of study would be valuable to pedagogy (not to mention providing challenging problems for me to focus on instead of writing).


I started thinking about this topic as I began to plan writing my book about learning introductory statistics with R. There are, of course, myriad other very good books on this very topic, so I figured that one way I can stand out is to organize the topics in a way that best facilitates mastering the material. I thought that this would be especially appreciated with a field of study that is notoriously scary and difficult to the uninitiated (like statistics is.)

Anyone, anywhere, teaching introductory statistics will be expected to touch on the common topics: measures of central tendency, measures of dispersion, probability, the central limit theorem, sampling theory, etc… I know how everyone else have arranged the topics, but what's the best way?

It might seem strange, but answering that question was probably the hardest thing about putting together this book and in all of my (admittedly limited) experience designing statistics curricula.

Let's speak of graph theory

To explore optimal paths through the topics, we can represent the subject of statistics as a big graph, or network. Each topic would be a node and there would be directed edges indicating when knowledge of a particular topic is a prerequisite to understanding another. Specifically, if there is a edge connecting topic "a" to topic "b", topic "b" requires an understanding of "a"–like long division requires knowledge of subtraction.

This is what a topic network of an excerpt of introductory stats topics might look like.

statistics topics knowledge dependency diagram

In graph theory, this is known as a directed acyclic graph (DAG). DAGs have the property that there exists at least one ordering of nodes such that no node in the ordering is connected to ("pointing to") a node earlier in the ordering. This is called a topological sort. For most DAGs, there are a number of different orderings that satisfy the ‘dependency’ constraints.

Now that I have your attention, let's now speak of monads

To get a list of all of them, I wrote a small library and set of algorithms in Haskell. You can view it here but the "meat" of the algorithm is in the following snippet that recursively adds all nodes with no children (topics that have no topics that depend on them) to a list of possible alternatives and removes the childless nodes. This is repeated until there are no nodes left to remove. A potential snag is that the function only takes one path but each function call may generate multiple alternate paths. However, if we view the output of the "gatherAllChildless" function as a non-deterministic computation, we can exploit the fact that the path of nodes is a monad and have the function recursively call itself inside of a monadic bind.

This has a sub-quadratic time complexity (< O(n^2))… not too bad. There are 26 possible orderings of the topics that satisfy these “knowledge dependencies” including:

probability -> central tendency -> measures of dispersion -> sampling theory -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST

central tendency -> probability -> measures of dispersion -> probability distributions -> sampling theory -> sampling distributions -> central limit theorem -> statistical inference -> NHST

There are a few of the ordering that intuitively seem like poor choices. Taking the first one, for example: it might be strange to start a book on statistics with probability when readers may want to get starting with univariate analysis right away. Looking at the second one, it seems strange to stick "probability" in between "central tendency" and "measures of dispersion", even though it can technically be done, because most people expect highly related topics to be positioned next to each other.

One way of cutting down on the list is to label each topic node with a difficulty level, and choose the ordering which causes the fewest backwards jumps in difficulty level. This should represent the path that has the most gentle level-of-difficulty slope.

Given the algorithms from lines 67 to 78 of TopoSort.hs and the following (subjective) difficulty mapping:

"central tendency": "1"
"measures of dispersion": "2"
"sampling theory": "3"
"sampling distributions": "3"
"central limit theorem": "5"
"probability": "4"
"probability distributions": "3"
"statistical inference": "5"
"NHST": "5"

the “optimal” ordering is:

central tendency -> measures of dispersion -> sampling theory -> probability -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST

Yay! This is pretty close to the ordering I chose.


The most truly difficult thing about sorting this out is that the statistics topic network diagram is not a DAG. This means that there is no ordering possible that doesn’t appeal to topics yet unlearned. For example, explaining why sample standard deviation divides by n-1 instead of n requires appealing to sampling theory, which requires a good foundation in measures of dispersion to understand. There are a few more of these cyclical relationships in the field.

All of these instances require some hand-waving on the part of the writer or lecturer ("don't worry about why we divide by 'n-1', we’ll get to that later") and adds to the learner's perceived difficulty of grasping the field.

The best way to reconcile these circular knowledge dependencies is to introduce weight to the edges that represent the extent to which a topic requires knowledge of another. Then, a cycle detection algorithm can be run on the graph. Once all the cycles are detected, the edges in the cycles with the lowest weight can be systematically removed until there are no more cycles and the graph is a DAG. At that point, the specialized topo sort from above may be used. I plan on implementing this when I have more time :)


It's my hope that these and other qualitative methods for planning curricula can be applied to other legendarily confusing fields of study. These methods can even be applied to entire undergraduate course catalogues and major requirements to guide students over 4+ years of undergraduate study.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

I'm all about that bootstrap ('bout that bootstrap)

As some of my regular readers may know, I'm in the middle of writing a book on introductory data analysis with R. I'm at the point in the writing of the book now where I have to make some hard choices about how I'm going to broach to topic of statistical inference and hypothesis testing.

Given the current climate against NHST (the journal Basic and Applied Social Psychology banned it) and my own personal preferences, I wasn't sure just how much to focus on classical hypothesis testing.

I didn't want to burden my readers with spending weeks trying to learn the intricacies of NHST just to have them being told to forget everything they know about it and not be able to use it without people making fun of them.

So I posed a question to twitter: "Is it too outlandish to not include the topic of parametric HTs in an intro book about data analysis. Asking for a friend.. named Tony…. You know, in favor of bootstrapped CIs, permutation tests, etc…"

To which my friend Zach Jones (@JonesZM) replied: "they could at least be better integrated with monte-carlo methods. i think they'd make it easier to understand". I agreed, which is why I'm proceeding with my original plan to introduce classical tests after and within the context of Monte Carlo bootstrapping (as opposed to exhaustive bootstrapping).

Even though I'm a huge fan of the bootstrap, I want to be careful not to further any misconceptions about it—chiefly, that bootstrapping is a cure-all for having a small sample size. To be able to show how this isn’t the case, I wrote an R script to take 1,000 samples from a population, calculate 95% confidence intervals using various methods and record the proportion of times the population mean was within the CIs.

The four ways I created the CIs were:

  • the z interval method: which assumes that the sampling distribution is normal around the sample mean (1.96 * the standard error)
  • the t interval method: which assumes that the population is normally distributed and the sampling distribution is normally distributed around the sample mean (t-distribution quantile at .975 [with appropriate degrees of freedom] * standard error)
  • basic bootstrap CI estimation (with boot() and boot.CI() from the boot R package)
  • adjusted percentile CI estimation (with boot() and boot.CI() from the boot R package)

I did this for various sample sizes and two different distributions, the normal and the very non-normal beta distribution (alpha=0.5, beta=0.5). Below is a plot depicting all of this information.

Accuracy of different CIs

So, clearly the normal (basic) boot doesn’t make up for small sample sizes.

It's no surprise the the t interval method blows everything else out of the water when sampling from a normal distribution. It even performs reasonably well with the beta distribution, although the adjusted bootstrap wins out for most sample sizes.

In addition to recording the proportion of the times the population mean was within the confidence intervals, I also kept track of the range of these intervals. All things being equal, narrower intervals are far preferable to wide ones. Check out this plot depicting the mean ranges of the estimated CIs:

Mean ranges for difference CIs

The t interval method always produces huge ranges.

The adjusted bootstrap produces ranges that are more or less on par with the other three methods BUT it outperforms the t interval method for non-normal populations. This suggests the the adjustments to the percentiles of the bootstrap distribution do a really good job at correcting for bias. It also shows that, if we are dealing with a non-normal population (common!), we should use adjusted percentile bootstrapped CIs.

Some final thoughts:

  • The bootstrap is not a panacea for small sample sizes
  • The bootstrap is cool because it doesn’t assume anything about the population distribution, unlike the z and t interval methods
  • Basic bootstrap intervals are whack. They’re pathologically narrow for small sample sizes.
  • Adjusted percentile intervals are great! You should always use them instead. Thanks Bradley Efron!

Also, if you're not using Windows, you can parallelize your bootstrap calculations really easily in R; below is the way I bootstrapped the mean for this project:

dasboot <- boot(a.sample, function(x, i){mean(x[i])}, 10000,
                           parallel="multicore", ncpus=4)

which uses 4 cores to perform the bootstrap in almost one fourth the time.

In later post, I plan to further demonstrate the value of the bootstrap by testing difference in means and show why permutation tests comparing means between two samples is always better than t-testing.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

How dplyr replaced my most common R idioms

Having written a lot of R code over the last few years, I've developed a set of constructs for my most common tasks. Like an idiom in a natural language (e.g. "break a leg"), I automatically grasp their meaning without having to think about it. Because they allow me to become more and more productive in R, these idioms have become ingrained in my head (and muscle memory) and, in large part, inform how I approach problems.

It's no wonder, then, why I'm hesitant to embrace new packages that threaten to displace these idioms; switching is tantamount to learning a new dialect after investing a lot of time becoming fluent in another.

On occasion, though, a package comes along whose benefits are so compelling (here's looking at you, Hadley Wickham, Dirk Eddelbuettel, and Romain François) that it incites me to take the plunge and employ new patterns and learn new idioms. The most recent package to accomplish this is the dplyr package. This package (among other things) reimplements 5 of my most common R data manipulation idioms--often, in blazing fast C++.

This post serves as both an advocation for dplyr (by comparing form and speed) but also as a rosetta stone--to serve as a personal reference for translating my old R idioms.

This uses a dataset documenting crimes in the US by state available here

library(dplyr) <- read.csv("CrimeStatebyState.csv")

Filtering rows

# base R
crime.ny.2005 <-[$Year==2005 &
                      $State=="New York", ]

# dplyr
crime.ny.2005 <- filter(, State=="New York", Year==2005)

There is a lot going on with my base R solution. It uses logical subsetting to extract choice rows from Specifically, it creates a two boolean vectors: one that is true only when the "Year" column's value is 2005, and one that is true only when the "State" column's value is "New York". It then logical "AND"s these vectors, so that the resulting boolean vector is true only where the year was 2005 and the state was New York. This vector then is used to subset, and includes all columns. In contrast, the dplyr solution reads much more naturally, and in far fewer characters. According to my (crude) benchmarks the dplyr solution appears to be twice as fast.

A quick note before moving on, we could've drastically cut down on the number of characters in the base R solution by "attaching" the crime.ny.2005 dataset, eliminating the need to preface the "Year" and "State" names with "$", but there are two reasons why I don't do this. (1) I consider it to be bad form in a lot of circumstances (for example, it can become confusing when more than one dataset is loaded), and (2) RStudio will tab-auto-complete a column name after prefacing it with "name-of-dataframe$" and that drastically increases my coding speed. My only complaint(?) about dplyr is that it disallows this prefacing syntax and requires me to lookup the column names (and spell them correctly).

Arranging and ordering

# base R
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]

# dplyr
crime.ny.2005 <- arrange(crime.ny.2005, desc(Count))

The base R solution ranks each row by value of "Count" in decreasing order, and uses the rank vector to subset the "crime.ny.2005" data frame. The dplyr solution appears to be about 20% faster.

Selecting columns

# base R
crime.ny.2005 <- crime.ny.2005[, c("Type.of.Crime", "Count")]

# dplyr
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime, Count)

This example is relatively self-explanatory. Here the base R solution appears to be faster, by about 30%.

Creating new columns

# base R
crime.ny.2005$Proportion <- crime.ny.2005$Count /

# dplyr
crime.ny.2005 <- mutate(crime.ny.2005, 

Very often, I have to create a new column that is a function of one or more existing columns. Here, we are creating a new column, that represents the proportion that a particular crime claims from the total number of crimes, among all types. Incredibly, base R beats dplyr in this task--it is about 18 times faster.

If I had to guess, I think this is because of the nuances of R's vectorization. In the base R solution, a vector of crime counts is extracted. R recognizes that it is being divided by a scalar (the sum of the counts), and automatically creates a vector with this scalar repeated so that the length of the vectors match. Both of the vectors are stored contiguously and the resulting element-wise division is blindingly fast. In contrast, I think that in the dplyr solution, the sum of the counts column is actually evaluated for each element in the count vector, although I am not sure.

Aggregation and summarization

# base R
summary1 <- aggregate(Count ~ Type.of.Crime,
summary2 <- aggregate(Count ~ Type.of.Crime,
summary.crime.ny.2005 <- merge(summary1, summary2,

# dplyr
by.type <- group_by(crime.ny.2005, Type.of.Crime)
summary.crime.ny.2005 <- summarise(by.type,
                                   num.types = n(),
                                   counts = sum(Count))

This is the arena in which dplyr really shines over base R. In the original dataset, crime was identified by specific names ("Burglary", "Aggravated assault") and by a broader category ("Property Crime" and "Violent Crime")

Before this point, the data frame we are working with looks like this:

 Type.of.Crime  Count
 Violent Crime    874
 Violent Crime   3636
 Violent Crime  35179
 Violent Crime  46150
Property Crime  68034
Property Crime 302220
Property Crime  35736

In this pedagogical example we want to aggregate by the type of crime and (a) get the number of specific crimes that fall into each category, and (b) get the sum of all crimes committed in those categories. Base R makes it very easy to do one of these aggregations, but to get two values, it requires that we make two calls to aggregate and then merge the results. Dplyr's solution, on the other hand, is relatively intuitive, and requires just two function calls.

All together now

We haven't showcased the best part of dplyr yet... it presents itself when combining all of these statements:

# base R <- read.csv("CrimeStatebyState.csv")
crime.ny.2005 <-[$Year==2005 &
                        $State=="New York", 
                                c("Type.of.Crime", "Count")]
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]
crime.ny.2005$Proportion <- crime.ny.2005$Count /
summary1 <- aggregate(Count ~ Type.of.Crime,
summary2 <- aggregate(Count ~ Type.of.Crime,
final <- merge(summary1, summary2,

# dplyr <- read.csv("CrimeStatebyState.csv")
final <- %>%
           filter(State=="New York", Year==2005) %>%
           arrange(desc(Count)) %>%
           select(Type.of.Crime, Count) %>%
           mutate(Proportion=Count/sum(Count)) %>%
           group_by(Type.of.Crime) %>%
           summarise(num.types = n(), counts = sum(Count))

When all combined, the base R solution took 60 seconds (over 10000 iterations) and the dplyr solution took 30 seconds. Perhaps more importantly, the dplyr code to uses many fewer lines and assignments and is more terse and probably more readable with its neat-o "%>%" operator.

I would be remiss if I didn't mention at least one of the other benefits of dplyr...

Dplyr's functions are generalized to handle more than just data.frames (like we were using here). As easily as dplyr handles the data frame, dplyr can also handle data.tables, remote (and out-of-memory) databases like MySQL; Postgres; Lite; and BigQuery by translating to the appropriate SQL on the fly.

There are still other neat features of dplyr but perhaps these are reason enough to give dplyr a shot. I know my own code may never look the same.

edit (9/17/14): I changed the pipe operator from the deprecated "%.%" to the preferred "%>%".

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Using to data mine my music listening history

Indie Rock
I've (passively) been keeping meticulous records of almost every song I've listened to since January of 2008. Since I opened my account 6 years ago, they've accumulated a massive detailed dataset of the 107,222 songs I've listened to since then. The best thing is that they're willing to share this data with me!

I used the developer REST API to (over a very long period of time) retrieve my entire listening history, the date(s) that I've listened to each song, and the top three user-submitted "tags" for each song.

I want to glean every bit of insight that I can out of this data. For this post, I focused on:

  • total listening history over time
  • music "diversity" levels
  • trends in my musical genre listening habits

In future posts, I hope to explore other things like using PCA to determine "orthogonal" music genres, construct similarity matrices, predict trends, and perform acoustic analysis.

This has been one of my favorite pet-projects because it combines three things that I love:

  • data mining
  • music
  • navel-gazing

I used both R and Python in this analysis. Let’s get into it!

Obtaining data
Getting the data using the REST API was very straightforward; the only hiccups I encountered were the fault of Python2's unicode snafus. For the web requests I used the urllib2 module and to handle the XML responses I used the amazing lxml module. The code to get my whole listening history looked a little like this:

#!/usr/bin/env python -tt

import urllib2
import time
from lxml import etree
from StringIO import StringIO

baseurl = ''.join(["",

def clean_xml(the_xml):
    return "\n".join(the_xml.split("\n")[3:-2])

# let's get the first page so we know how many pages there are
response = urllib2.urlopen(baseurl+"&page=1", timeout=200)
html =

# parse the XML tree
doc = etree.parse(StringIO(html))

# use Xpath to query the number of pages
num_pages = int(doc.xpath("/lfm/recenttracks")[0].get("totalPages"))

# file to dump results
fh = open("all_the_tracks.xml", "a")

for page in xrange(0, num_pages+1):
    # I'm nice so I don't want to hit
    # with a bunch of requests a second.
    # Let's wait ten seconds between each request
    progress = "On page {} of {}...........  {}%"
    print progress.format(str(page),
                          str(round(float(page)/num_pages*100, 1)))
    response = urllib2.urlopen(baseurl+"&page="+str(page))
    html =
    the_xml = clean_xml(html)

I decided to make the requests for the user-submitted tags in another python script. The script is a little too long to post here, but it basically iterated over all "track" nodes in the output of the last script, and parsed the results from a REST query of tags. Since I'm considerate, I put a long wait between each request for the over 100,000 songs. Even though I handled repeated tracks gracefully, it took days to finish. I used the pickle module to serialize the sum of data I got at regular intervals so a failure during the night of day 2 wouldn't have been catastrophic.

XML transformations and XPath
There is still a little bit of cleanup to do... I used various shell commands to remove all unnecessary elements from the XML documents and escape the characters that I forgot to escape. Then I had to organize the data by date so that I can do time series analysis. The script I used to accomplish this is as follows:

#!/usr/bin/env python -tt

from lxml import etree
import codecs

# read cleaned up track history XML
doc = etree.parse("escaped_processed.xml")

fh ="bydate.xml", "a", encoding="utf-8")

# get all the dates (previously restricted to just month and year)
udates = list(set([date.text for date in doc.xpath("//date")]))

# create a new DOM tree to hang the transformation upon
root = etree.Element("bydate")

for cdate in udates:
    # element tags can't start with a number
    # add a "d" to it
    this = etree.SubElement(root, 'd' + cdate)
    # get all tracks listened to on that date
    these_tracks = [node for node in
                    doc.xpath("/alltags/track[date=" + cdate + "]")]
    # add the tracks to the DOM
    for itrack in these_tracks:

fh.write(etree.tostring(root, pretty_print=True))

Finally, I whipped up a quick script to sum the number of listens on a particular tag for each time interval.

At this time we have a file "playnumbymonth.csv" with the dates and total tracks listened to for that month that looks like this...


and ("melted") file called "longformat.csv" that holds dates, tag names, and the number of tracks (played in that month) that contained the tag. It looks like this...

03-2008,folk rock,1
03-2008,spoken word,2

R analytics and visualization
First, to visualize the number of songs I’ve listened to over time, I had to import the "playnumbymonth.csv" dataset, parse the date with the lubridate package, make a "zoo" time series object out of the dataframe, and plot it.


plays <- read.csv("playnumbymonth.csv", stringsAsFactors=FALSE)

# parse dates
plays$date <- parse_date_time(plays$date, "my")

#make time series object
tsplays <- read.zoo(plays)

#plot it with a LOWESS smooth curve
loline <- lowess(tsplays, f=.5)
plot(tsplays, main="Plays per month since 2008", ylab="Number of plays", xlab="Date")
lines(index(tsplays), loline$y, col='red', lwd=2)

The resulting plot looks like this:
Plays per month

While I was working with this data set, I wanted to check if there was any periodicity to my listening history (perhaps I listen to more music in the winter than I do in the summer). I briefly attempted to use seasonal decomposition and autocorrelation to try to detect this. No dice.

For the musical "diversity" and genre listening trends, I read in "longformat.csv", used reshape to aggregate (pivot) by tags until I had a huge matrix where each row was a month between 2008 and 2014, and each column was a tag. Then I used the vegan (vegetation analysis) package to take the Shannon diversity index of each month with respect to wealth and evenness of tags listened to:

long.tag.frame <- read.csv("longformat.csv", stringsAsFactors=FALSE)
long.tag.frame$date <- parse_date_time(long.frame$date, "my")

wide.frame <- data.frame(cast(long.tags.frame, date~tag))
# convert all NAs to zero
wide.frame[] <- 0

new.frame <- data.frame(wide.frame[,1])
new.frame$diversity <- diversity(wide.frame[,-1])

After some cleanup and "zoo" object creation, and LOWESS curve creation, the plot of the listening data and diversity indices looked like this:
Number of plays and variety

Visualizing how my music tastes have (appeared to) change over time was the best part. I created a diagonal matrix from the multiplicative inverse of number of tracks that I listened to each month and matrix-multiplied this with the wide tag matrix. The result of this computation yielded the proportion of songs I listened to each month that contained each tag.

I took a few choice tags corresponding to some of my favorite musical genres, put it in a new data frame ("tag.interest") and used the lattice package to visualize the trends.

tag.interest <- data.frame(dates)
tag.interest$Post.Punk <- prop.plays[,2227]
tag.interest$Indie <- prop.plays[,1413]
tag.interest$Punk <- prop.plays[,2270]
tag.interest$Coldwave <- prop.plays[,654]
tag.interest$Darkwave <- prop.plays[,762]
tag.interest$Twee <- prop.plays[,3003]
tag.interest$Indie.Pop <- prop.plays[,1422]
tag.interest$Hip.Hop <- prop.plays[,1337]

> names(tag.interest)
[1] "dates"     "Post.Punk" "Indie"     "Punk"      "Coldwave"  "Darkwave"  "Twee"      "Indie.Pop" "Hip.Hop"  

xyplot(read.zoo(tag.interest), type=c("l", "g"),
       ylab="Proportion of songs containing tag",
       main="Trends in musical genre listening habits",
       panel = function(x, y, col, ...) {
         panel.xyplot(x, y, col = "blue", ...)
         panel.loess(x, y, col = "red", lwd=3)

This produced my favorite plot:
Genre listening trends

Looking at it, I remembered a period of time in 2009 that I listened to almost exclusively Hip-Hop music. I was also reminded that I got into the "coldwave" and "darkwave" genres rather recently and around the same time as each other in summer of 2011. Another neat result is that there is a fairly strong negative correlation between my "twee" music listening and my "darkwave" music listening history, as these genres are almost musical 'opposites'.

This had been a fun trip down memory lane for me. My only regret is that I didn't open my account sooner... as long as it was after a period in my childhood music-listening that I would be embarrassed to have on digital record.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Visualizing data analysis pipelines using NetworkX

In complicated data analysis pipelines and scientific workflows, it's often difficult to keep track of which tasks have to be performed before others. Even with informal forms of documentation (my personal favorite is 'notes.txt'), as the size of a project grows, and more dependencies are introduced, a formal documentation process has to be put in place, or else the project will become unsustainable.

I'm writing a automated system for statisticians and scientists for carrying out large multistep analytics processes. I'll discuss this more in later posts, but the details pertinent for this post are that each step of a analytics pipeline is detailed in a YAML document called a "Sakefile" (not-so-clever play on Makefile) with sections explicitly defining dependencies and resulting output files.

Given dependency resolution's usage of concepts from graph theory (topological sorting) I thought it would be easy and neat to write a tool to visualize the components and dependencies that go into an analytics workflow as a directed graph.

I've rustled up a simple example examining correlates of DUI arrests with various adolescent-related data by state. I chose these data sets because they’re very small and freely available on the net.

The "Sakefile" looks like this:

format dui stats:
    help: format raw (copy and pasted) dui/state data using perl
        - rawdata.txt
    formula: >
        perl -pe 's/^(\D+)\s+([\d,]+)\s+([\d,]+)\s*/\1\t\2\t\3\n/'
        rawdata.txt | sed 's/,//g' > duistats.tsv;
        - duistats.tsv

fetch teen stats:
    help: fetches various teen statstics from the web
    # no dependencies
    formula: >
        curl -o teenstats.xls;
        - teenstats.xls

convert teen stats to csv:
    help: uses gnumerics ssconvert to convert ugly xls to csv and cleans it
        - teenstats.xls
    formula: >
        ssconvert teenstats.xls messyteenstats.csv;
        cat <(echo -n "state") <(< teenstats.csv sed '55,$d' |
        sed '1,2d') | sed 's/,,/,/g' > teenstats.csv;
        rm messyteenstats.csv;
        - teenstats.csv

find correlates:
    help: calls R script that finds correlated of DUI arrest in various teen statistics
        - duistats.tsv
        - teenstats.csv
    formula: >
        - corrogram.png
        - table.csv

    - format dui stats
    - fetch teen stats
    - convert teen stats to csv
    - find correlates

A short description of each of the steps appears in the "help" field on each entry. Basically, there are two source data files: one exists and raw text copy and pasted from a website, and the other is fetched from the web using curl. The former is cleaned and formatted using perl and sed; the latter has to go through a process that converts downloaded excel file into a CSV and strips useless lines. Both of these source data files then get read by an R script which, ultimately, outputs a corrogram graphic and a summarization table.

Below is the small python program that parses the "Sakefile" and created the visualization. It uses the great NetworkX module to create the graph and render it as an image.

#!/usr/bin/env python -tt

import matplotlib.pyplot as plt
import networkx as nx
import pudb
import yaml

sakefile = yaml.load(open("Sakefile.yaml").read())

G = nx.DiGraph()

def check_for_dep_in_outputs(dep):
    print "checking dep {}".format(dep)
    ret_list = []
    for node in G.nodes(data=True):
        if "output" not in node[1]:
        if dep in node[1]['output']:
    return ret_list

# make graph nodes for each target
for target in sakefile:
    if target == "all":
        # we don't want this node
    G.add_node(target, sakefile[target])

for node in G.nodes(data=True):
    print "checking node {} for dependencies".format(node[0])
    if "dependencies" not in node[1]:
    print "it has dependencies"
    connects = []
    for dep in node[1]['dependencies']:
        matches = check_for_dep_in_outputs(dep)
        if not matches:
        for match in matches:
    if connects:
        for connect in connects:
            G.add_edge(connect, node[0])

nx.draw(G, node_color="pink", node_size=10000)

The resulting visualization looks like this:


Sure, the arrows look weird and this is a really simple example, but it's easy to see that, even for the most byzantine of pipelines, that a visualization like this can really help get a sense all the actions involved in a workflow.

I'll go over the actual running and results of this example in a later post, when I get the "sake" system working properly. :)

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

qstats - quick and dirty statistics tool for the Unix pipeline

Back when 200MB hard drives were the size of washing machines and programs had no choice but to be as efficient as possible, Unix was born. In a serendipitous twist of fate, the same programs that were borne of this era of 4MB RAM and 16 bit processors are useful to data analysts with 2,000 times the amount of RAM and 64 bit multicore processors, processing data files over several GBs large.

Like all good things, Unix was started at Bell Labs in the late 60s. It has been honed over 40 years and now runs, if not on your computer, on the vast majority of the web servers you visit, a lot of phones, embedded devices you use, and a toaster near you.


Since near everything in Unix is a text file, it grew up to be… very good at processing text. This is why Unix tools are a great addition to the data analyst's toolbox. There are a few great posts on how to get started using these tools in your workflow (here, here, here, and here) which you should read. By the way, when I talk about tools here, I’m talking about pipeline-able tools that take raw text input from standard input like sed; awk; and grep, not perl; tcl; or python.

There are tools to select columns, filter text for regular expressions, join files on a key, and reshape arrays, but I felt like there was one that was missing. After chaining tool after tool together and finally cajoling the data into a format and subset that I want to process and explore, I'd have to redirect the stream to a text file and read it from R. Clearly, if I’m to perform some complicated machine learning algorithm with this data, this is the best way to go. But if I just want to take a peek at the spread of the data, or quickly compare means, this is overkill.

Introducing qstats

Inspired by this gap in the Unix toolchain, I wrote a tool, qstats, that computes simple summary statistics from the command-line. It also includes data-binning and simple bar chart functionality. I designed it, in C, specifically to be as fast as possible, and bare-bones enough to work on any POSIX-compliant system without having to deal with outside dependencies. Let’s see it in action…

By default, qstats will print R-like summary statistics on the given data. This includes the minimum value, the 1st quartile, the median, the mean, the third quartile, the maximum value, the range, and the standard deviation. You can use the -m flag to just get the mean. This will be faster because the data does not have to be sorted.

In addition to these statistics, qstats can also produce a frequency tabulation with an arbitrary number of "bins". Calling qstats with the -f10 flag will create 10 equal intervals and -f20 will create 20. Just calling it with -f will use Sturge's rule to come up with a reasonable number of bins in most cases.

Finally, with the -b flag, qstats will output a histogram-like horizontal bar-chart. Much like with the -f flag, you can supply the number of intervals to create. We will see an example of the bar-chart at work in the next section.

Rudimentary spread visualization

To view the spread with a bar-chart, let's output samplings from two distributions, the normal and the chi-square...

# one million normally distributed with a mean of 100 and a standard deviation of 10
millnorm <- rnorm(1000000, mean=100, sd=10)
write.table(millnorm, "millnorm.dat", col.names=FALSE, row.names=FALSE)

# one million values sampled from the chi-square distribution with two degrees of freedom
millchi <- rchisq(1000000, df=2)
write.table(millchi, "millchi.dat", col.names=FALSE, row.names=FALSE)
Normal distribution

Visualization of normal distribution

Chi-square distribution

Visualization of chi-square distribution (two degrees of freedom)

Speed comparisons
Let’s create a file of 100,000,000 floating point numbers to test speeds with R…

# sample from normal distribution with a mean of 100 and a standard deviation of 10
one.h.m <- rnorm(100000000, mean=100, sd=10)
write.table(one.h.m, “one_hundred_million.dat”, row.names=FALSE, col.names=FALSE)

The resulting file is 1.7 GBs large.

  • R
    The R script that we’ll time will look like this…

    #!/usr/bin/rscript —vanilla
    frame <- scan(“one_hundred_million.dat”)

    and the timing...

    $ time ./rtest.R
    Read 100000000 items
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      44.95   93.26  100.00  100.00  106.70  157.00 
    ./rtest.R  210.66s user 3.57s system 99% cpu 3:35.08 total

    3.5 minutes

  • Awk

    $ time awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat
    awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat  128.34s user 0.56s system 99% cpu 2:09.01 total

    2 minutes.

    Note that this only computes the mean, not any of the other summary statistics. Some of these require sorting, which takes more time.

  • sort command

    $ time sort -n one_hundred_milion.dat > /dev/null
    sort -n one_hundred_milion.dat > /dev/null  151.89s user 3.46s system 99% cpu 2:35.72 total

    2.5 minutes

  • qstats

    $ time qstats one_hundred_milion.dat
    Min.     44.947
    1st Qu.  93.2553
    Median   100.001
    Mean     100.001
    3rd Qu.  106.747
    Max.     156.997
    Range    112.05
    Std Dev. 10.0002
    Length   100000000
    qstats one_hundred_milion.dat  53.62s user 1.04s system 99% cpu 54.722 total

    a little less than a minute

  • I show these comparisons here not to compare this small program with these great, veteran tools. Instead, I just want to underscore the point that smaller, very-few-trick-pony, specialized programs can afford to be faster than their more capable and robust counterparts. When these small tools will do the trick, they can not only be faster and simpler to use, but they also comport more with the Rule of Parsimony from the Unix philosophy.

Final words
The source code for this project is on Github with installation instructions. You can also download and install from this tarball. I've tested it on OS X, Debian, and NetBSD, but it should compile without any issue on any POSIX system with a reasonably recent C compiler. Please let me know if there are any installation issues for your system.

Please fork me and feel free to send a pull request or add an issue to the repo. I hope you like it!

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

On the treachery of point-and-click "black-box" data analysis

*in this post, by "black-box" I'm referring to software whose methods are undisclosed and un-audible rather than black-box math models

EZ Analysis

There’s a certain expectation that the analyses that inform not only business and stock trading, but public health and social welfare decisions, are carefully thought out and performed with painstaking attention to detail. However, the growing reliance of point-and-click 'black-box' data analysis solutions, while helpful for getting people into the field, are jeopardizing the reliability of these verdicts.

Lest readers think this is an exercise in idiotic reminiscence and fetishizing the past, I'm not romanticizing pen-and-paper analysis and F-table lookups; I have however personally witnessed decisions about to be made with analysis results that are dubious at best.

My issue with these tools comes down to the fact that they provide the ability to choose from a smörgåsbord of statistical tests with no mention of the assumptions they make or the conditions that have to be met in order for them to work properly. At worst, it gives the unscrupulous user the ability to perform test after test until they see the results that they want.

I was working on a project where we were encouraged to use an unnamed enterprise data miner trying to predict ****** (binary classification) from a series of *****. Under the "regression" tab, there were a lot of different tests to choose from. In the absence of a "logistic regression" button, I figured the "GLM" might automagically detect what I was trying to do. When, I finally got the results from the GLM, I decided to check it against R. The results were completely incompatible. I know what R was doing; I can type the name of any function (without the parentheses) and get the source code. But nowhere in this tool’s documentation of this function could I find a comprehensive list of the steps it took. For example, I couldn't find:

  • how the miner aggregated the data
  • whether it correctly deduced that I wanted to use logistic regression
  • whether it checked for colinearity and homogeneity of variance, and, if it did, whether it made any decisions for me that it didn’t tell me about
  • whether it included the non-numeric data points in the model, and if it did, what coding scheme it used
  • how it could possibly know if I didn't want to use a probit-model (if my closest option was only GLM)
  • what (if any) dimensionality techniques it used


The aggregation in R yielded over 200 dimensions. Since I was trying to replicate (reverse-engineer is such a harsh term) this tool’s steps, I decided to try (a) not reducing the dimensions, and (b) trying every dimensionality reduction technique under the sun and see if I got closer to the miner’s results. I made it through PCA and LDA until I gave up. I decided, then, to lobby for use of the tool whose algorithms are audit-able and well-defined. I’ll also vouch for the tool where tests and steps have to be explicitly performed over a point-and-click solution any day.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

On the misinterpretation of p-values:

First, let me start of by saying I'm a classical statistics and p-value apologist--I think it's the cats pajamas. It was mine (and many others') first introduction to statistics. So, in spite of my being a card-carrying member of The Bayesian Consipiracy, there will always be a place in my heart (grinch-sized though it is) for classical statistics.

That being said, I think that, often, the classical statistics' approach to hypothesis testing lends itself to misinterpretation and encourages academic dishonesty. There has already been much written about the controversial p-value, but I thought I'd weigh in with my ideas and experience.

One of the problems I see that encourages misinterpretation is how statisticians communicate with their superiors. Since Bayesian inference is almost certainly how people actually reason, the non-intuitive linguistic gymnastics that frequentist hypothesis testing forces us to use to make a true statement ("we fail to reject the null hypothesis") meets with confusion, or worse, from non-statisticians. When some people ask me about the results of a test they are interested in, if the result was statistically significant at p<.05, I might carefully say something to the effect of "an unlikely event occured or we can reject the null hypothesis". After what I assume is a mental eye-roll, I get asked "Is there an effect, or not?" At this point I have two options: (a) I can pendantically assert that I can't answer that question, or (b) I can say "yep".

When we first learned about frequentist hypothesis testing at school, many of my classmates remarked that .05 significance "cutoff" was too demanding (or, occasionally "too liberal"). It is then usually explained that increasing the significance level will necessarily result in more Type II errors, which are frequently (but not always) more damaging than the opposite error). Still, something doesn't seem right to the neophyte (perhaps it is the artifical dichomomy between "significance" and "non-significance" that has no basis in reality).

That raises an interesting point: newcomers to statistics have an important and un-tainted perspective because they can see the arbitrariness of significance cutoffs more readily than after it's beaten into their heads by some research advisors that they have to "publish or perish" and that not significant results won't attract attention.

[edit] Just after I wrote this article, I read this germane blog post entitled "Use The Wrong P-value, Go To Jail: Not A Joke"
[another edit] I keep finding great articles and blog posts on this subject. Here's one from John Myles White: Criticism 1 of Null Hypothesis Significance Testing

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail