Lessons learned in high-performance R

On this blog, I've had a long running investigation/demonstration of how to make a "embarrassingly-parallel" but computationally intractable (on commodity hardware, at least) R problem more performant by using parallel computation and Rcpp.

The example problem is to find the mean distance between every airport in the United States. This silly example was chosen because it exhibits polynomial growth in running time as a function of the number of airports and, thus, quickly becomes intractable without sampling. It is also easy to parallelize.

The first post used the (now-deprecated in favor of 'parallel') multicore package to achieve a substantial speedup. The second post used Rcpp to achieve a statistically significant but, functionally, trivial speedup by replacing the inner loop (the distance calculation using the Haversine formula) with a version written in C++ using Rcpp. Though I was disappointed in the results, it should be noted that porting the function to C++ took virtually no extra work.

By necessity, I've learned a lot more about high-performance R since writing those two posts (part of this is by trying to make my own R package as performant as possible). In particular, I did the Rcpp version all wrong, and I'd like to rectify that in this post. I also compare the running times of approaches that use both parallelism and Rcpp.

Lesson 1: use Rcpp correctly
The biggest lesson I learned, is that it isn’t sufficient to just replace inner loops with C++ code; the repeated transferring of data from R to C++ comes with a lot of overhead. By actually coding the loop in C++, the speedups to be had are often astounding.

In this example, the pure R version, that takes a matrix of longitude/latitude pairs and computed the mean distance between all combinations, looks like this...

just.R <- function(dframe){
  numrows <- nrow(dframe)
  combns <- combn(1:nrow(dframe), 2)
  numcombs <- ncol(combns)
  combns %>%
  {mapply(function(x,y){
          haversine(dframe[x,1], dframe[x,2],
                    dframe[y,1], dframe[y,2]) },
          .[1,], .[2,])} %>%
  sum %>%
  (function(x) x/(numrows*(numrows-1)/2))
}

The naive usage of Rcpp (and the one I used in the second blog post on this topic) simply replaces the call to "haversine" with a call to "haversine_cpp", which is written in C++. Again, a small speedup was obtained, but it was functionally trivial.

The better solution is to completely replace the combinations/"mapply" construct with a C++ version. Mine looks like this...

double all_cpp(Rcpp::NumericMatrix& mat){
    int nrow = mat.nrow();
    int numcomps = nrow*(nrow-1)/2;
    double running_sum = 0;
    for( int i = 0; i < nrow; i++ ){
        for( int j = i+1; j < nrow; j++){
            running_sum += haversine_cpp(mat(i,0), mat(i,1),
                                         mat(j,0), mat(j,1));
        }
    }
    return running_sum / numcomps;
}

The difference is incredible…

res <- benchmark(R.calling.cpp.naive(air.locs[,-1]),
                 just.R(air.locs[,-1]),
                 all_cpp(as.matrix(air.locs[,-1])),
                 columns = c("test", "replications", "elapsed", "relative"),
                                  order="relative", replications=10)
res
#                                   test replications elapsed relative
# 3  all_cpp(as.matrix(air.locs[, -1]))           10   0.021    1.000
# 1 R.calling.cpp.naive(air.locs[, -1])           10  14.419  686.619
# 2              just.R(air.locs[, -1])           10  15.068  717.524

The properly written solution in Rcpp is 718 times faster than the native R version and 687 times faster than the naive Rcpp solution (using 200 airports).

Lesson 2: Use mclapply/mcmapply
In the first blog post, I used a messy solution that explicitly called two parallel processes. I’ve learned that using mclapply/mcmapply is a lot cleaner and easier to intregrate into idiomatic/functional R routines. In order to parallelize the native R version above, all I had to do is replace the call to "mapply" to "mcmapply" and set the number of cores (now I have a 4-core machine!).

Here are the benchmarks:

                                           test replications elapsed relative
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  10.433    1.000
4              just.R.parallel(air.locs[, -1])           10  11.809    1.132
1          R.calling.cpp.naive(air.locs[, -1])           10  15.855    1.520
3                       just.R(air.locs[, -1])           10  17.221    1.651

Lesson 3: Smelly combinations of Rcpp and parallelism are sometimes counterproductive

Because of the nature of the problem and the way I chose to solve it, the solution that uses Rcpp correctly is not easily parallelize-able. I wrote some *extremely* smelly code that used explicit parallelism to use the proper Rcpp solution in a parallel fashion; the results were interesting:

                                          test replications elapsed relative
5           all_cpp(as.matrix(air.locs[, -1]))           10   0.023    1.000
4              just.R.parallel(air.locs[, -1])           10  11.515  500.652
6             all.cpp.parallel(air.locs[, -1])           10  14.027  609.870
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  17.580  764.348
1          R.calling.cpp.naive(air.locs[, -1])           10  21.215  922.391
3                       just.R(air.locs[, -1])           10  32.907 1430.739

The parallelized proper Rcpp solution (all.cpp.parallel) was outperformed by the parallelized native R version. Further the parallelized native R version was much easier to write and was idiomatic R.

How does it scale?

Comparing performance of different HP methods

Two quick things...

  • The "all_cpp" solution doesn't appear to exhibit polynomial growth; it does, it's just so much faster than the rest that it looks completely flat
  • It's hard to tell, but that's "just.R.parallel" that is tied with "R.calling.cpp.naive.parallel"

Too long, didn’t read:
If you know C++, try using Rcpp (but correctly). If you don't, try multicore versions of lapply and mapply, if applicable, for great good. If it’s fast enough, leave well enough alone.

PS: I way overstated how "intractable" this problem is. According to my curve fitting, the vanilla R solution would take somewhere between 2.5 and 3.5 hours. The fastest version of these methods, the non-parallelized proper Rcpp one, took 9 seconds to run. In case you were wondering, the answer is 1,869.7 km (1,161 miles). The geometric mean might have been more meaningful in this case, though.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Playing around with #rstats twitter data

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag "#rstats" that denotes that a tweeter is tweeting about R.

As a warning, I don't know much about how these data were collected, whether it was collected and random times during the day or whether it was biased toward particular times and, therefore, locations. I wouldn't really read too much into this.

Most common co-occuring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occuring hashtags, I used the following perl script:

#!/usr/bin/perl

while(<>){
    chomp;
    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#\w+)/;
    print join "\n" => @matches if @matches;
}

which uses the regular expression "(#\w+)" to search for hashtags after removing "#rstats" from every tweet.

On the unix command-line, I put these other hashtags into a file and sorted via these commands:

cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt

sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a numbered list of co-occuring hashtags, sorted in descending order. The top 10 co-occuring hashtags were as follows (you can see the rest here :

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of "#python" and "#ggplot2" in the top 10 made me wonder what the top 10 programming language and R package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss  (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2  (plus 298 for ggplot)  (plus for 6 #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis   (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here.

I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about--at least as far as hashtags-applied-to-a-package go.

#strangehashtags
Before moving on to the next section, I wanted to share my favorite co-occuring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters
One of the first things I did with these data is a simple aggregation and sort to find the tweeters that used the hashtag most often:

library(dplyr)
THE_DATA %>%
  group_by(User) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> prolific.rstats.tweeters

Here is the top 10 (you can see the rest here.)

@Rbloggers	1081
@hadleywickham	498
@timelyportfolio	427
@recology_	419
@revodavid	210
@chlalanne	209
@adolfoalvarez	199
@RLangTip	175
@jmgomez	160

Nothing terribly surprising here.

Normalizing by total tweets
In a twitter discussion about these data, a twitter friend Tim Hopper posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would have a higher number of #rstats tweets if you control for total tweet volume. I wondered how this sorting would look.

Answering this question gave me an excuse to use Hadley Wickham's new package, rvest (I literally just got why the package is named as much while typing this out) which makes web scraping easier--in part by leveraging the expressive power of the magrittr package.

To get the total number of tweets for a particular tweeter, I wrote the following function:

library(rvest)
library(magrittr)
get.num.tweets <- function(handle){
  tryCatch({
    unraw <- function(raw_str){
      raw_str <- sub(",", "", raw_str)    # remove commas if any
      if(grepl("K", raw_str)){
        return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
      }
      return(as.numeric(raw_str))
    }
    html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text() %>%
      unraw
    },
    error=function(cond){return(NA)})
}

The real logic (and beauty) of which is contained only in the last few lines:

    html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text()

The CSS element that houses the number of total tweets from a useR's twitter page was found easily using SelectorGadget.

After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I'm considerate) I divided number of #rstats tweets by the total number of tweets to come up with a normalized value.

The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio 
1     @medzihorsky     9            28 0.3214286 
2        @statworx     5            16 0.3125000 
3    @LearnRinaDay   114           404 0.2821782 
4  @RforExcelUsers     4            15 0.2666667 
5     @showmeshiny    27           102 0.2647059 
6           @tcrug     6            25 0.2400000 
7   @DailyRpackage   155           666 0.2327327 
8   @R_Programming    49           250 0.1960000 
9        @hexadata     8            41 0.1951220 
10     @Deep_RHelp    11            58 0.1896552 

In case you were wondering, Trey Causey still "won" by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]   
Source: local data frame [1 x 4]                 
                                                 
       User count num.of.tweets        ratio     
1 @tdhopper     8         26700 0.0002996255     
> tweeters[which(tweeters$User=="@treycausey"),] 
Source: local data frame [1 x 4]                 
                                                 
         User count num.of.tweets      ratio     
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R "thought leaders" (whatever that is). Most tweets about R don't use the #rstats hashtag, anyway.

Again, I would't read too much into this :)

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Assertive R programming in dplyr/magrittr pipelines

A lot of my job–and side projects, for that matter–involve running R scripts on updates of open government data. While I’m infinitely grateful to have access to any interesting open datasets in the first place, I can’t ignore that dealing with open data is often a messy affair. In fact this seems to be characteristic of most data sets I work with, open access or otherwise.

So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses that return components that are components of still other analyses.

Then there’s an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happen:

  • In the best case scenario, everything works because there were no errors in the data.
  • In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it’s not clear what datum caused this error.
  • In the worst case scenario, nothing breaks and the error is only caught when the results–or part of them–are nonsensical.
  • In the worst worst case scenario, the results or some of the results are wrong but it looks ok and it goes undetected.

In an effort to help solve this common problem–and inspired by the elegance of dplyr/magrittr pipelines–I created a R package called assertr.

assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs, over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.

Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R’s Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.

mtcars %>%
  verify(am %in% c(0,1)) %>%
  group_by(cyl) %>%
  summarise(mean.mpg=mean(mpg))

#   am     mean.mpg
#   0      17.14737
#   1      24.39231

Let’s say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read obsessive/compulsive) tendencies than yours truly. If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check if the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon was a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:

mtcars %>%
  assert(in_set(4, 6, 8), cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%
  summarise(mean.mpg=mean(mpg))

#  cyl   mean.mpg
#   4     26.66364
#   6     19.74286
#   8     15.10000

Perhaps one day there will be cars that have more than 8 cylinders or less than 2. We might want to only check if there are an even number of cylinders (since it has to be even, I think); we can change the first assert line to:

assert(function(x) x%%2==0, cyl) %>%

assertr subscribes to the general idea that it is better to fail fast to spot data errors early. The benefit of assertr’s particular approach is that it’s friendly to the pipeline paradigm used by magrittr and dplyr.

The best thing about assertr’s approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.

To learn more about assertr and the kinds of assertions that you can make with it, visit its page on github.

You can also read the vignette here.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail