How dplyr replaced my most common R idioms

Having written a lot of R code over the last few years, I've developed a set of constructs for my most common tasks. As with idioms in a natural language (e.g. "break a leg"), I grasp their meaning automatically, without having to think about it. Because they make me more and more productive in R, these idioms have become ingrained in my head (and muscle memory) and, in large part, inform how I approach problems.

It's no wonder, then, that I'm hesitant to embrace new packages that threaten to displace these idioms; switching is tantamount to learning a new dialect after investing a lot of time becoming fluent in another.

On occasion, though, a package comes along whose benefits are so compelling (here's looking at you, Hadley Wickham, Dirk Eddelbuettel, and Romain François) that it compels me to take the plunge, employ new patterns, and learn new idioms. The most recent package to accomplish this is dplyr. This package (among other things) reimplements five of my most common R data-manipulation idioms--often in blazing-fast C++.

This post serves both as advocacy for dplyr (comparing form and speed) and as a Rosetta Stone--a personal reference for translating my old R idioms.

The examples use a dataset documenting crimes in the US by state, available here.

library(dplyr)
crime.by.state <- read.csv("CrimeStatebyState.csv")

Filtering rows

# base R
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York", ]

# dplyr
crime.ny.2005 <- filter(crime.by.state, State=="New York", Year==2005)

There is a lot going on in my base R solution. It uses logical subsetting to extract the chosen rows from crime.by.state. Specifically, it creates two logical vectors: one that is TRUE only where the "Year" column's value is 2005, and one that is TRUE only where the "State" column's value is "New York". It then ANDs these vectors element-wise, so that the resulting logical vector is TRUE only where the year was 2005 and the state was New York. This vector is then used to subset crime.by.state, keeping all columns. In contrast, the dplyr solution reads much more naturally, and in far fewer characters. According to my (crude) benchmarks, the dplyr solution appears to be about twice as fast.
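To make that concrete, here is the same base R subsetting decomposed into its intermediate steps (the intermediate variable names are mine):

# the base R logical subsetting, one step at a time
year.matches  <- crime.by.state$Year == 2005          # logical vector: TRUE where Year is 2005
state.matches <- crime.by.state$State == "New York"   # logical vector: TRUE where State is "New York"
keep.row      <- year.matches & state.matches         # element-wise AND of the two vectors
crime.ny.2005 <- crime.by.state[keep.row, ]           # TRUE rows are kept; all columns are included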

A quick note before moving on: we could have drastically cut down on the number of characters in the base R solution by "attaching" the crime.by.state data frame, eliminating the need to preface the "Year" and "State" names with "crime.by.state$". There are two reasons why I don't do this, though: (1) I consider it bad form in a lot of circumstances (for example, it can become confusing when more than one dataset is loaded), and (2) RStudio will tab-auto-complete a column name after prefacing it with "name-of-dataframe$", which drastically increases my coding speed. My only complaint(?) about dplyr is that it disallows this prefacing syntax and requires me to look up the column names (and spell them correctly).
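For illustration, the attach()-based shortcut I avoid would look something like this (a sketch; note the easy-to-forget detach()):

# attach() puts the data frame's columns on the search path
attach(crime.by.state)
crime.ny.2005 <- crime.by.state[Year==2005 & State=="New York", ]
detach(crime.by.state)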

Arranging and ordering

# base R
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]

# dplyr
crime.ny.2005 <- arrange(crime.ny.2005, desc(Count))

The base R solution uses order() to compute the row indices that would sort "Count" in decreasing order, and uses that index vector to subset the crime.ny.2005 data frame. The dplyr solution appears to be about 20% faster.
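A tiny, self-contained example of what order() actually returns (a permutation of indices, not the sorted values themselves):

# order() returns the indices that would sort the vector
counts <- c(10, 50, 30)
order(counts, decreasing=TRUE)           # 2 3 1
counts[order(counts, decreasing=TRUE)]   # 50 30 10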

Selecting columns

# base R
crime.ny.2005 <- crime.ny.2005[, c("Type.of.Crime", "Count")]

# dplyr
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime, Count)

This example is relatively self-explanatory. Here the base R solution appears to be faster, by about 30%.
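As an aside, select() also accepts ranges of adjacent columns, which can be handy on wide data frames. A sketch (this assumes "Type.of.Crime" and "Count" happen to sit next to each other, which may not hold for this dataset):

# select a contiguous range of columns by name
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime:Count)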

Creating new columns

# base R
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                            sum(crime.ny.2005$Count)

# dplyr
crime.ny.2005 <- mutate(crime.ny.2005, 
                        Proportion=Count/sum(Count))

Very often, I have to create a new column that is a function of one or more existing columns. Here, we are creating a new column that represents the proportion of the total number of crimes (among all types) that each particular crime accounts for. Incredibly, base R beats dplyr at this task--it is about 18 times faster.

If I had to guess, I'd say this is because of the nuances of R's vectorization. In the base R solution, a vector of crime counts is extracted. R recognizes that it is being divided by a scalar (the sum of the counts) and recycles that scalar so that the lengths of the two operands match. Both vectors are stored contiguously, and the resulting element-wise division is blindingly fast. In contrast, I suspect that in the dplyr solution the sum of the counts column is re-evaluated for each element of the count vector, although I am not sure.
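A quick demonstration of the recycling behavior I'm describing:

# R "recycles" a length-one vector across a longer one
counts <- c(1, 2, 3, 4)
counts / sum(counts)   # sum(counts) is 10, so this yields 0.1 0.2 0.3 0.4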

Aggregation and summarization

# base R
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=length)
summary.crime.ny.2005 <- merge(summary1, summary2,
                               by="Type.of.Crime")

# dplyr
by.type <- group_by(crime.ny.2005, Type.of.Crime)
summary.crime.ny.2005 <- summarise(by.type,
                                   num.types = n(),
                                   counts = sum(Count))

This is the arena in which dplyr really shines over base R. In the original dataset, each crime was identified both by a specific name ("Burglary", "Aggravated assault") and by a broader category ("Property Crime" or "Violent Crime").

Just before this aggregation step, the data frame we are working with looks like this:

 Type.of.Crime  Count
 Violent Crime    874
 Violent Crime   3636
 Violent Crime  35179
 Violent Crime  46150
Property Crime  68034
Property Crime 302220
Property Crime  35736

In this pedagogical example, we want to aggregate by the type of crime and (a) get the number of specific crimes that fall into each category, and (b) get the sum of all crimes committed in those categories. Base R makes it very easy to do one of these aggregations, but to get both values it requires two calls to aggregate followed by a merge of the results. The dplyr solution, on the other hand, is relatively intuitive and requires just two function calls.
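For the seven rows shown above, both solutions should produce the same two summary rows (the merged base R version just names its columns differently):

 Type.of.Crime num.types counts
Property Crime         3 405990
 Violent Crime         4  85839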

All together now

We haven't showcased the best part of dplyr yet... it reveals itself when we combine all of these statements:

# base R
crime.by.state <- read.csv("CrimeStatebyState.csv")
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                  crime.by.state$State=="New York", 
                                c("Type.of.Crime", "Count")]
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                            sum(crime.ny.2005$Count)
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=length)
final <- merge(summary1, summary2,
               by="Type.of.Crime")


# dplyr
crime.by.state <- read.csv("CrimeStatebyState.csv")
final <- crime.by.state %>%
           filter(State=="New York", Year==2005) %>%
           arrange(desc(Count)) %>%
           select(Type.of.Crime, Count) %>%
           mutate(Proportion=Count/sum(Count)) %>%
           group_by(Type.of.Crime) %>%
           summarise(num.types = n(), counts = sum(Count))

When everything is combined, the base R solution took 60 seconds (over 10,000 iterations) and the dplyr solution took 30 seconds. Perhaps more importantly, the dplyr code uses far fewer lines and assignments, and it is terser and probably more readable with its neat-o "%>%" operator.
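For reference, by "crude benchmarks" I mean simple repetition-based timing along these lines (a sketch, shown here for just the filtering step; the wrapper functions are mine):

# crude timing: repeat each idiom many times and compare elapsed times
run.base  <- function() crime.by.state[crime.by.state$Year==2005 &
                                       crime.by.state$State=="New York", ]
run.dplyr <- function() filter(crime.by.state, State=="New York", Year==2005)
system.time(for (i in 1:10000) run.base())
system.time(for (i in 1:10000) run.dplyr())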

I would be remiss if I didn't mention at least one of the other benefits of dplyr...

dplyr's functions are generalized to handle more than just data.frames (like we were using here). Just as easily, dplyr can handle data.tables and remote (out-of-memory) databases like MySQL, PostgreSQL, SQLite, and BigQuery, by translating your verbs into the appropriate SQL on the fly.
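For example, pointing the dplyr of that era at a SQLite database looked something like this (a sketch using src_sqlite() and tbl(); the file and table names are made up):

# connect to an on-disk SQLite database and use the same verbs as before
crime.db  <- src_sqlite("crime.sqlite3")      # hypothetical database file
crime.tbl <- tbl(crime.db, "crime_by_state")  # hypothetical table name
filter(crime.tbl, State=="New York", Year==2005)  # translated to SQL and run inside the database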

There are still other neat features of dplyr, but perhaps these are reason enough to give dplyr a shot. I know my own code may never look the same.

edit (9/17/14): I changed the pipe operator from the deprecated "%.%" to the preferred "%>%".

18 Comments

  1. Bob Muenchen February 11, 2014 10:56 am

    I just love an example that builds slowly, piece by piece. Superb job!

    Bob

    http://r4stats.com

    • [email protected] February 11, 2014 10:57 am

      That's what I was aiming for. Thank you very much!

  2. Leon Di Stefano February 12, 2014 12:08 am

    Fantastic overview. Hadley Wickham has done such extraordinary things for the R platform.

  3. G. Grothendieck February 21, 2014 6:26 pm

    dplyr is great but to be fair the base solution could have used subset and transform which shortens it:

    • [email protected] February 21, 2014 7:10 pm

      That's a really good point. Thanks!
      This post was more about how my personal idioms can be rewritten with dplyr (I don't use transform much), but I should have made that more clear.

  4. Joe Thorley February 21, 2014 7:49 pm

    Very useful post which clarified a bunch of things for me very quickly - much appreciated. Minor thing, but for what it's worth, your post was much easier to read (font-size, width, colors) on my machine on http://www.statsblogs.com/.

    http://www.poissonconsulting.ca

    • [email protected] February 21, 2014 9:31 pm

      Thanks for the kind words about the post.
      I'm always interested in making this blog easier to read. Is the font size too big and the width too small? Is there something I can do to make it easier to read?

  5. Jim Hester February 22, 2014 11:27 am

    The reason mutate is slower than just adding a new column to the data frame is that mutate returns a copy of the entire data frame with the new columns added, so it has to copy the entire thing, whereas the base R example just adds the new column to the existing object in place.

    Also as others have said most of your base R examples would be significantly shortened by using subset and transform.

  6. myschizobuddy March 21, 2014 10:15 am

    Please explain this:
    summary.crime.ny.2005 <- summarise(by.type,
                                       num.types = n(),
                                       counts = sum(Count))
    What does num.types = n() do?

    • myschizobuddy March 21, 2014 10:19 am

      Also, by.type is a data frame with state, crime, and year columns. How does summarise know that they are not needed?

      • [email protected] March 21, 2014 9:01 pm

        by.type doesn't have state or year columns...
        this is the line that declared it from the code in the post:
        by.type <- group_by(crime.ny.2005, Type.of.Crime)
        it is made from the crime.ny.2005 dataframe (and there is only one state and year in that dataframe)

        Does that answer your question?

    • [email protected] March 21, 2014 8:58 pm

      n() returns the count of occurrences. In 'by.type', there are 4 occurrences of "Violent Crime" and 3 occurrences of "Property Crime", so num.types will hold both of these values.

      If it helps, this is an SQL statement that will perform the same thing:
      SELECT Type.of.Crime, count(*) AS num.types, sum(Count)
      FROM crime.ny.2005
      GROUP BY Type.of.Crime

      I don't know if you're familiar with SQL or not, but I hope that helps. Also, the equivalent base R command is shown in the post; it's:
      summary2 <- aggregate(Count ~ Type.of.Crime,
                            data=crime.ny.2005,
                            FUN=length)

  7. Arun July 4, 2014 6:33 pm

    The link to the .csv file is broken. It'd be great if you could upload it somewhere and provide a link to get it from.

    • [email protected] July 7, 2014 10:16 am

      Good catch, Arun! I guess the original author took it down. I saved it, though, and updated the link to a self-hosted copy–everything should work now. Thanks for pointing that out!

  8. Steve O July 23, 2014 1:07 pm

    Very well done! Short and sweet, yet helpful. Thanks for sharing.

    http://about.me/carnagua

    • [email protected] July 23, 2014 1:13 pm

      Thank you very much for the kind words
