## Lessons learned in high-performance R

On this blog, I've had a long running investigation/demonstration of how to make a "embarrassingly-parallel" but computationally intractable (on commodity hardware, at least) R problem more performant by using parallel computation and Rcpp.

The example problem is to find the mean distance between every airport in the United States. This silly example was chosen because it exhibits polynomial growth in running time as a function of the number of airports and, thus, quickly becomes intractable without sampling. It is also easy to parallelize.

The first post used the (now-deprecated in favor of 'parallel') multicore package to achieve a substantial speedup. The second post used Rcpp to achieve a statistically significant but, functionally, trivial speedup by replacing the inner loop (the distance calculation using the Haversine formula) with a version written in C++ using Rcpp. Though I was disappointed in the results, it should be noted that porting the function to C++ took virtually no extra work.

By necessity, I've learned a lot more about high-performance R since writing those two posts (part of this is by trying to make my own R package as performant as possible). In particular, I did the Rcpp version all wrong, and I'd like to rectify that in this post. I also compare the running times of approaches that use both parallelism and Rcpp.

Lesson 1: use Rcpp correctly
The biggest lesson I learned, is that it isn’t sufficient to just replace inner loops with C++ code; the repeated transferring of data from R to C++ comes with a lot of overhead. By actually coding the loop in C++, the speedups to be had are often astounding.

In this example, the pure R version, that takes a matrix of longitude/latitude pairs and computed the mean distance between all combinations, looks like this...

just.R <- function(dframe){
numrows <- nrow(dframe)
combns <- combn(1:nrow(dframe), 2)
numcombs <- ncol(combns)
combns %>%
{mapply(function(x,y){
haversine(dframe[x,1], dframe[x,2],
dframe[y,1], dframe[y,2]) },
.[1,], .[2,])} %>%
sum %>%
(function(x) x/(numrows*(numrows-1)/2))
}


The naive usage of Rcpp (and the one I used in the second blog post on this topic) simply replaces the call to "haversine" with a call to "haversine_cpp", which is written in C++. Again, a small speedup was obtained, but it was functionally trivial.

The better solution is to completely replace the combinations/"mapply" construct with a C++ version. Mine looks like this...

double all_cpp(Rcpp::NumericMatrix& mat){
int nrow = mat.nrow();
int numcomps = nrow*(nrow-1)/2;
double running_sum = 0;
for( int i = 0; i < nrow; i++ ){
for( int j = i+1; j < nrow; j++){
running_sum += haversine_cpp(mat(i,0), mat(i,1),
mat(j,0), mat(j,1));
}
}
return running_sum / numcomps;
}


The difference is incredible…

res <- benchmark(R.calling.cpp.naive(air.locs[,-1]),
just.R(air.locs[,-1]),
all_cpp(as.matrix(air.locs[,-1])),
columns = c("test", "replications", "elapsed", "relative"),
order="relative", replications=10)
res
#                                   test replications elapsed relative
# 3  all_cpp(as.matrix(air.locs[, -1]))           10   0.021    1.000
# 1 R.calling.cpp.naive(air.locs[, -1])           10  14.419  686.619
# 2              just.R(air.locs[, -1])           10  15.068  717.524


The properly written solution in Rcpp is 718 times faster than the native R version and 687 times faster than the naive Rcpp solution (using 200 airports).

Lesson 2: Use mclapply/mcmapply
In the first blog post, I used a messy solution that explicitly called two parallel processes. I’ve learned that using mclapply/mcmapply is a lot cleaner and easier to intregrate into idiomatic/functional R routines. In order to parallelize the native R version above, all I had to do is replace the call to "mapply" to "mcmapply" and set the number of cores (now I have a 4-core machine!).

Here are the benchmarks:

                                           test replications elapsed relative
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  10.433    1.000
4              just.R.parallel(air.locs[, -1])           10  11.809    1.132
1          R.calling.cpp.naive(air.locs[, -1])           10  15.855    1.520
3                       just.R(air.locs[, -1])           10  17.221    1.651


Lesson 3: Smelly combinations of Rcpp and parallelism are sometimes counterproductive

Because of the nature of the problem and the way I chose to solve it, the solution that uses Rcpp correctly is not easily parallelize-able. I wrote some *extremely* smelly code that used explicit parallelism to use the proper Rcpp solution in a parallel fashion; the results were interesting:

                                          test replications elapsed relative
5           all_cpp(as.matrix(air.locs[, -1]))           10   0.023    1.000
4              just.R.parallel(air.locs[, -1])           10  11.515  500.652
6             all.cpp.parallel(air.locs[, -1])           10  14.027  609.870
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  17.580  764.348
1          R.calling.cpp.naive(air.locs[, -1])           10  21.215  922.391
3                       just.R(air.locs[, -1])           10  32.907 1430.739


The parallelized proper Rcpp solution (all.cpp.parallel) was outperformed by the parallelized native R version. Further the parallelized native R version was much easier to write and was idiomatic R.

How does it scale?

Two quick things...

• The "all_cpp" solution doesn't appear to exhibit polynomial growth; it does, it's just so much faster than the rest that it looks completely flat
• It's hard to tell, but that's "just.R.parallel" that is tied with "R.calling.cpp.naive.parallel"

If you know C++, try using Rcpp (but correctly). If you don't, try multicore versions of lapply and mapply, if applicable, for great good. If it’s fast enough, leave well enough alone.

PS: I way overstated how "intractable" this problem is. According to my curve fitting, the vanilla R solution would take somewhere between 2.5 and 3.5 hours. The fastest version of these methods, the non-parallelized proper Rcpp one, took 9 seconds to run. In case you were wondering, the answer is 1,869.7 km (1,161 miles). The geometric mean might have been more meaningful in this case, though.

## Squeezing more speed from R for nothing, Rcpp style

In a previous post we explored how you can greatly speed up certain types of long-running computations in R by parallelizing your code using multicore package*. I also mentioned that there were a few other ways to speed up R code; the one I will be exploring in this post is using Rcpp to replace time-critical inner-loops with C++.

In general, good C++ code almost always runs faster than equivalent R code. Higher level language affordances like garbage collection, dynamic typing, and bounds checking can add a lot of computational overhead. Further, C/C++ compiles down to machine code, whereas R byte-code has to be interpreted.

On the other hand, I would hate to do all my statistics programming in a language like C++, precisely because of those higher-level language affordances I mentioned above. When development time (as opposed to execution time) is taken into account, programming in R is much faster for me (and makes me a very happy programmer).

On occasion, though, there are certain sections of R code that I wish I could rewrite in C/C++. They may be simple computations that get called thousands or millions of times in a loop. If I could just write these time-critical snippets with C/C++ and not have to throw the proverbial baby out with the bath water (and rewrite everything in C), my code would run much faster.

Though there have been packages to make this sort of thing since the early 2000s, Rcpp (and the Rcpp family**) has made this even easier; now interfacing with R objects is seamless.

To show an example of how you might use Rcpp, I’ve used the same example from my post "Parallel R (and air travel)". In this example, we use longitude and latitude info from all US airports to derive the average (mean) distance between every two US airports. The function I will be replacing with C++ is the function to compute the distance between two longitude latitude pairs (Haversine's formula) on a sphere (which is just an approximation).

The R functions to do this look like this:

to.radians<-function(degrees){
degrees * pi / 180
}

haversine <- function(lat1, long1, lat2, long2, unit="km"){
term1 <- sin(delta.phi/2) ^ 2
term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
the.terms <- term1 + term2
delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1-the.terms))
if(unit=="km") return(distance)
if(unit=="miles") return(0.621371*distance)
}


While the C++ functions look like this:

#include <iostream>
#include <math.h>
#include <Rcpp.h>

// [[Rcpp::export]]
return(degrees * 3.141593 / 180);
}

// [[Rcpp::export]]
double haversine_cpp(double lat1, double long1,
double lat2, double long2,
std::string unit="km"){
double delta_phi = to_radians_cpp(lat2 - lat1);
double delta_lambda = to_radians_cpp(long2 - long1);
double term1 = pow(sin(delta_phi / 2), 2);
double term2 = cos(phi1) * cos(phi2) * pow(sin(delta_lambda/2), 2);
double the_terms = term1 + term2;
double delta_sigma = 2 * atan2(sqrt(the_terms), sqrt(1-the_terms));
double distance = radius * delta_sigma;

/* if it is anything *but* km it is miles */
if(unit != "km"){
return(distance*0.621371);
}

return(distance);
}


Besides for the semicolons, other assignment operator and the type declarations, these codes are almost identical.

Next, we put the C++ code above in a C++ source file. We will call it, and automatically compile and link to it from our driver R code thusly***:

calc.distance.two.rows <- function(ind1, ind2,
version=haversine){
return(version(air.locs[ind1, 2],
air.locs[ind1, 3],
air.locs[ind2, 2],
air.locs[ind2, 3]))
}

combos <- combn(1:nrow(air.locs), 2, simplify=FALSE)
num.of.comps <- length(combos)

mult.core <- function(version=haversine_cpp){
the.sum <- sum(unlist(mclapply(combos,
function(x){
calc.distance.two.rows(x[1], x[2],
version)
},
mc.cores=4)))
result <- the.sum / num.of.comps
return(result)
}

mult.core(version=haversine_cpp)


Comparing the R version against the C++ version over a range of sample sizes yielded a chart like this:

To run this to completion would have taken 4 hours but, if my math is correct, rewriting the distance function shaved of over 15 minutes from the completion time.

It is not uncommon for the Rcpp to speed up R code by *orders of magnitude*. In this link, Dirk Eddelbuettel demonstrates an 80-fold speed increase (albeit with a contrived example).

So why did we not get an 80-fold increase?

I could have (and will) rewrite more of this program in Rcpp to avoid some of the overhead with repeated calls to compiled C++. My point here was more to show that we can use Rcpp to speed up this program with very little work–almost for nothing. Again, besides for certain syntactical differences and type declarations, the R and C++ functions are virtually the same.

As you become more comfortable with it–and use it more within the same scripts–Rcpp will likely pay higher and higher dividends.

The next time we revisit this contrived airport example, we will be profiling it expanding the C++ and eventually, use distributed computing to get it as fast as we can.

* the 'multicore' package is now deprecated in favor of 'parallel'
** RCpp11 (for modern C++), RccpEigen (for use of the Eigen C++ linear algebra template library), RCppArmadillo (for use of the Eigen C++ linear algebra template library), and a few others
*** this code is a little bit different than the code in the first airport distance blog post because I switched from using the 'multicore' package to the 'parallel' package

## The performance gains from switching R's linear algebra libraries

What is often forgotten in the so-called data analysis "language wars” is that, across most of these languages, many common computations are performed using outsourced dynamically linked math libraries. For example, R; Python's Numpy; Julia; Matlab; and Mathematica all make heavy use of the BLAS linear algebra API. As a result, R can't be properly faulted (or praised) for how slowly (or rapidly) it performs Cholesky decomposition since (a) the R core team wasn't responsible for the algorithm's implementation, and (b) neither are other languages' contributors for theirs.

For code that deals predominately with numerical computing and constructs from linear algebra, language battles become more a petty and vacuous squabble over subjective preferences in syntax rather than substantive discourse that encourages true innovation and improvement. That being said, R is the best, and if you disagree you should feel bad.

Two posts ago, I asserted that, for speed-up purposes, recompilation of R was usually unnecessary and that other lower-hanging fruit should be taken before resorting to recompilations. We've already seen that for certain problems, parallelizing your code is a great and relatively easy-to-implement speed up option. Another great option that's available is the ability to swap out R's linear algebra libraries for faster ones. Since these libraries are linked to at run-time—as opposed to being included statically at compile-time—employing the use of these alternative libraries do not require recompilation.

The topic of swapping BLAS implementations has already been covered well in these blog posts by Nathan VanHoudnos and Zachary Mayer, as well as in this paper by Dirk Eddelbuettel, but I thought I’d throw in my thoughts and results, too.

For my comparisons, I pitted OpenBLAS and Apple’s Accelerate Framework's implementation of BLAS against each other and the BLAS that comes with R by default. I wanted to try others too, but I either had an extraordinarily difficult time compiling them, or was unwilling to shell out money for a propriety library (here's looking at you, Intel). Of particular interest to me was trying out the ATLAS library.

I originally thought that testing ATLAS would be redundant because I was given to understand that Apple Accelerate’s "vecLib" was a hand-tuned version of ATLAS for Apple processors. After looking further into it, I discovered that this is no longer the case. Apple asserts in line 632 of cblas.h that "The Apple BLAS is no longer based on ATLAS". Unfortunately, nothing I tried would get ATLAS to compile. C'est la vie.

The benchmarking script I used to test these implementations can be furnished from http://r.research.att.com/benchmarks/R-benchmark-25.R
It records the time to completion of a large series of various computations including matrix creation, inversion, and multiplication. By default, it performs each test 3 times and takes the trimmed geometric mean of the time to completions.

These are the results of the total time elapsed (for all tests) among the three libraries.

As you can see, both Accelerate and OpenBLAS blew R's default BLAS implementation out of the water, with Accelerate marginally outperforming OpenBLAS. The last line of the output from the UNIX "time" command gives us a clue as to why this might be the case:

R --slave 56.10 user 1.65s system 122% cpu 47.162 total


Judging by above-100-percent CPU usage of both Accelerate and OpenBLAS (and how hot my laptop got), I'd wager that the primary source of the improvement is Accelerate's and OpenBLAS's ability to use multiprocessing. If you have more than one core available, this is something that you might want to look into.

## Parallel R (and air travel)

My heart sinks a little when I check on my laptop in the morning and the computation I started the night before still hasn’t finished. Even when the data I’m playing with isn’t particularly.... large... (I’m not going to say it), I have a knack for choosing expensive algorithms late at night.

Because of my reluctance to get remote servers and tmux involved at ungodly hours, I’ve been exploring ways to speed up my code without too much trouble (I take the first of the Three virtues of programming very seriously) and without having to translate from my native tongue: R.

• writing idiomatic R
• taking advantage of the way R stores numerics in contiguous blocks of RAM and uses C code for vectorized functions

• using Rcpp or inline
• Identify bottlenecks and replace time critical/innermost-loops with C or C++ code

• distribute processing over machines
• It’s relatively easy to set up a computing cluster using Amazon Web Services and use the package snow, or a similar package.

• taking advantage of multiple cores
• Besides for writing idiomatic R (which is the subject of a future post), I’ve found that the easiest way to get a speedup is to use parallel processing on a single machine using the multicore package. This usually gives me a very substantial speedup, even if the code is less than elegant, as we’ll see.

Testing it:
Just for the sake of choosing a problem that exhibits combinatorial explosion, let's use the multicore package to get the mean distance between all airports in the U.S.

If we think of a network of air-travel as a complete graph where the vertices are airports and a bee-line between any two airports as an edge, we want to find the average length of the edges. The number of edges in a complete graph is $\frac{n(n-1)}{2}$ where n is the number of vertices. If this looks familiar, it’s because it’s equivalent to the binomial coefficient $n \choose 2$. Intuitively, this makes sense, since every edge connecting two vertices is a unique pairing of two airports. The fact that this number exhibits polynomial growth and that the computation of the distance between two airports does not depend on the distance between any other two, makes this a prime candidate for parallelizing.

The dataset of US airport codes and their longitude and latitude is available here. There are 13,429 airport codes, which makes the total number of combinations 90,162,306. Clearly, extreme measures have to be taken to make this a tractable problem on commodity hardware.

Since the Earth isn’t flat (it's not even, strictly speaking, a perfect sphere) the distance between longitude and latitude degrees is not constant. Luckily, there’s a great package, Imap, to handle conversion from two long/lat points to miles or kilometers.

The code:

library(multicore)
library(Imap)

calc.distance.two.rows <- function(ind1, ind2){
return(gdist(air.locs[ind1, 3],
air.locs[ind1, 2],
air.locs[ind2, 3],
air.locs[ind2, 2], units="km"))
}

sum.of.distances <- function(a.mat){
return(sum(apply(a.mat, 2, function(x){
calc.distance.two.rows(x[1], x[2])
})))
}

# read airport long/lat data set

args <- commandArgs(TRUE)

# read the number of airports to use (sample size) from the command-line
smp.size <- as.numeric(args[1])

# choose a random sample of airports
sampling <- sample((1:nrow(air.locs)), smp.size)

# shrink dataframe
air.locs <- air.locs[sampling, ]

# create a matrix of unique subsets of indices from the
# data frame that stores the airport information
combos <- combn(1:nrow(air.locs), 2)
num.of.comps <- ncol(combos)

# use single core
single <- function(){
the.sum <- sum.of.distances(combos)
result <- the.sum / num.of.comps
return(result)
}

# use two cores
mult <- function(){
half <- floor(num.of.comps/2)
f1 <- parallel(sum.of.distances(combos[, 1:half]))
f2 <- parallel(sum.of.distances(combos[, (half+1):num.of.comps]))
the.sum <- sum(as.vector(unlist(collect(list(f1, f2)))))
result <- the.sum / num.of.comps
return(result)
}

# compare the execution times (in time elapsed)
perform <- function(){
first <- system.time(res1 <- single())
second <- system.time(res2 <- mult())
cat(smp.size); cat(",first,"); cat(first[3]); cat(","); cat(res1); cat("\n");
cat(smp.size); cat(",second,"); cat(second[3]); cat(","); cat(res2); cat("\n");
}

perform()


Then, I wrote a wrapper program in python that compared the speeds using sample sizes from 10 to 800 in increments of ten.

The results:

Time comparison between single and multicore execution time in seconds. The curves were smoothed using LOESS.

The parallelized solution is much faster but it is not twice as fast, as one might expect. This is because, not only is there overhead involved in the process of forking and collecting the result, but part of the program (namely the reading of the dataset) is not parallelized. (For more information, check out Amdehl’s Law).

Some cursory curve-fitting suggests that the single core solution’s execution time is fairly well-modeled by the function $.00014(n)(n-1)$ and the dual core solution is well modeled by $.00009(n)(n-1)$. This would make the total time to completion about 7 hours and 4.5 hours, respectively.

After about 1 hour, I psychosomatically smelled melting plastic from my laptop so I called off the actual distance calculation. But repeated trials with sample sizes of 300 suggested that the sampling distribution of the sample mean (whew) had a mean of about 1,874 km and a standard deviation of 64 km (this is the standard error of the mean). This would suggest that the mean distance between any two US airports is about 1,874 km +- 125 (543 miles +- 78) with a standard deviation of around 1109 km (689 miles).

As this example shows, even a kludge-y solution that uses more than one thread can give you pretty substantial speed increases. In future posts, I plan to cover applying Rcpp to this problem and setting up a cluster to perform this calculation. In the meantime, I'll still be starting things at night and regretting it in the morning.

Footnote: If you're interested in this topic, I suggest you check out this site from CRAN. Also, everything I know about parallel R, I learned from this book.