How dplyr replaced my most common R idioms

Having written a lot of R code over the last few years, I've developed a set of constructs for my most common tasks. Like an idiom in a natural language (e.g. "break a leg"), I automatically grasp their meaning without having to think about it. Because they allow me to become more and more productive in R, these idioms have become ingrained in my head (and muscle memory) and, in large part, inform how I approach problems.

It's no wonder, then, why I'm hesitant to embrace new packages that threaten to displace these idioms; switching is tantamount to learning a new dialect after investing a lot of time becoming fluent in another.

On occasion, though, a package comes along whose benefits are so compelling (here's looking at you, Hadley Wickham, Dirk Eddelbuettel, and Romain François) that it incites me to take the plunge and employ new patterns and learn new idioms. The most recent package to accomplish this is the dplyr package. This package (among other things) reimplements 5 of my most common R data manipulation idioms--often, in blazing fast C++.

This post serves both as an advocacy piece for dplyr (comparing form and speed) and as a Rosetta Stone--a personal reference for translating my old R idioms.

This post uses a dataset documenting crimes in the US by state, available here

library(dplyr)

crime.by.state <- read.csv("CrimeStatebyState.csv")

Filtering rows

# base R
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York", ]

# dplyr
crime.ny.2005 <- filter(crime.by.state, State=="New York", Year==2005)

There is a lot going on with my base R solution. It uses logical subsetting to extract choice rows from crime.by.state. Specifically, it creates two boolean vectors: one that is true only when the "Year" column's value is 2005, and one that is true only when the "State" column's value is "New York". It then logically "AND"s these vectors, so that the resulting boolean vector is true only where the year was 2005 and the state was New York. This vector is then used to subset crime.by.state, keeping all columns. In contrast, the dplyr solution reads much more naturally, and in far fewer characters. According to my (crude) benchmarks, the dplyr solution appears to be about twice as fast.
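My benchmarks here are crude; for the curious, the kind of timing I mean can be sketched like this, using a hypothetical toy data frame standing in for the real CSV (just system.time() and a loop--nothing fancier):

```r
# hypothetical stand-in for the real crime data, purely for illustration
crime.by.state <- data.frame(
  State = rep(c("New York", "Ohio"), each = 4),
  Year  = rep(c(2004, 2005), times = 4),
  Count = 1:8
)

# time many iterations of the base-R logical-subsetting idiom
base.time <- system.time(
  for (i in 1:10000) {
    crime.ny.2005 <- crime.by.state[crime.by.state$Year == 2005 &
                                    crime.by.state$State == "New York", ]
  }
)["elapsed"]
```

The dplyr version would be timed the same way, swapping the subsetting line for the filter() call.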

A quick note before moving on: we could have drastically cut down on the number of characters in the base R solution by "attaching" the crime.by.state dataset, eliminating the need to preface the "Year" and "State" column names with "crime.by.state$", but there are two reasons why I don't do this. (1) I consider it to be bad form in a lot of circumstances (for example, it can become confusing when more than one dataset is loaded), and (2) RStudio will tab-auto-complete a column name after prefacing it with "name-of-dataframe$", and that drastically increases my coding speed. My only complaint(?) about dplyr is that it disallows this prefacing syntax and requires me to look up the column names (and spell them correctly).

Arranging and ordering

# base R
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]

# dplyr
crime.ny.2005 <- arrange(crime.ny.2005, desc(Count))

The base R solution uses order() to compute the permutation of row indices that sorts "Count" in decreasing order, and uses that index vector to subset the "crime.ny.2005" data frame. The dplyr solution appears to be about 20% faster.
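A detail worth spelling out, since it trips people up: order() returns the indices that would sort the vector, not the sorted values or the ranks. A tiny self-contained illustration:

```r
x <- c(10, 30, 20)

# order() gives the row indices that would sort x, largest first
idx <- order(x, decreasing = TRUE)   # c(2, 3, 1)

# subsetting by that permutation yields the values in sorted order
x[idx]                               # c(30, 20, 10)
```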

Selecting columns

# base R
crime.ny.2005 <- crime.ny.2005[, c("Type.of.Crime", "Count")]

# dplyr
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime, Count)

This example is relatively self-explanatory. Here the base R solution appears to be faster, by about 30%.

Creating new columns

# base R
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                              sum(crime.ny.2005$Count)

# dplyr
crime.ny.2005 <- mutate(crime.ny.2005,
                        Proportion=Count/sum(Count))
Very often, I have to create a new column that is a function of one or more existing columns. Here, we are creating a new column that represents the proportion of the total number of crimes, among all types, that a particular crime claims. Incredibly, base R beats dplyr in this task--it is about 18 times faster.

If I had to guess, I think this is because of the nuances of R's vectorization. In the base R solution, a vector of crime counts is extracted. R recognizes that it is being divided by a scalar (the sum of the counts), and automatically creates a vector with this scalar repeated so that the length of the vectors match. Both of the vectors are stored contiguously and the resulting element-wise division is blindingly fast. In contrast, I think that in the dplyr solution, the sum of the counts column is actually evaluated for each element in the count vector, although I am not sure.
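Whatever the cause of the timing difference, the recycling behavior described above is easy to demonstrate with a toy vector:

```r
counts <- c(2, 4, 6)
total  <- sum(counts)            # a length-1 vector: 12

# R recycles the length-1 'total' to the length of 'counts',
# so the division happens element-wise in one vectorized pass
proportions <- counts / total    # c(1/6, 1/3, 1/2)
```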

Aggregation and summarization

# base R
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005, FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005, FUN=length)
summary.crime.ny.2005 <- merge(summary1, summary2,
                               by="Type.of.Crime")

# dplyr
by.type <- group_by(crime.ny.2005, Type.of.Crime)
summary.crime.ny.2005 <- summarise(by.type,
                                   num.types = n(),
                                   counts = sum(Count))

This is the arena in which dplyr really shines over base R. In the original dataset, crime was identified both by specific names ("Burglary", "Aggravated assault") and by a broader category ("Property Crime" and "Violent Crime").

Before this point, the data frame we are working with looks like this:

 Type.of.Crime  Count
 Violent Crime    874
 Violent Crime   3636
 Violent Crime  35179
 Violent Crime  46150
Property Crime  68034
Property Crime 302220
Property Crime  35736

In this pedagogical example we want to aggregate by the type of crime and (a) get the number of specific crimes that fall into each category, and (b) get the sum of all crimes committed in those categories. Base R makes it very easy to do one of these aggregations, but getting both values requires two calls to aggregate and then a merge of the results. Dplyr's solution, on the other hand, is relatively intuitive and requires just two function calls.

All together now

We haven't showcased the best part of dplyr yet... it presents itself when combining all of these statements:

# base R
crime.by.state <- read.csv("CrimeStatebyState.csv")
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York",
                                c("Type.of.Crime", "Count")]
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                              sum(crime.ny.2005$Count)
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005, FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005, FUN=length)
final <- merge(summary1, summary2,
               by="Type.of.Crime")

# dplyr
crime.by.state <- read.csv("CrimeStatebyState.csv")
final <- crime.by.state %>%
           filter(State=="New York", Year==2005) %>%
           arrange(desc(Count)) %>%
           select(Type.of.Crime, Count) %>%
           mutate(Proportion=Count/sum(Count)) %>%
           group_by(Type.of.Crime) %>%
           summarise(num.types = n(), counts = sum(Count))

When all combined, the base R solution took 60 seconds (over 10,000 iterations) and the dplyr solution took 30 seconds. Perhaps more importantly, the dplyr code uses far fewer lines and assignments, and is more terse and probably more readable with its neat-o "%>%" operator.

I would be remiss if I didn't mention at least one of the other benefits of dplyr...

Dplyr's functions are generalized to handle more than just data.frames (like we were using here). Just as easily as dplyr handles the data frame, it can also handle data.tables, and remote (and out-of-memory) databases like MySQL, PostgreSQL, SQLite, and BigQuery, by translating to the appropriate SQL on the fly.

There are still other neat features of dplyr but perhaps these are reason enough to give dplyr a shot. I know my own code may never look the same.

edit (9/17/14): I changed the pipe operator from the deprecated "%.%" to the preferred "%>%".


27 Responses

  1. Bob Muenchen February 11, 2014 / 10:56 am

    I just love an example that builds slowly, piece by piece. Superb job!


  2. Leon Di Stefano February 12, 2014 / 12:08 am

    Fantastic overview. Hadley Wickham has done such extraordinary things for the R platform.

  3. G. Grothendieck February 21, 2014 / 6:26 pm

    dplyr is great but to be fair the base solution could have used subset and transform which shortens it:
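    A sketch of what that shortened subset()/transform() version might look like, using a small made-up data frame in place of the post's CSV (the column names are taken from the post; the values are invented):

    ```r
    # toy stand-in for the post's crime.by.state data frame
    crime.by.state <- data.frame(
      State = c("New York", "New York", "Ohio"),
      Year = c(2005, 2005, 2005),
      Type.of.Crime = c("Violent Crime", "Property Crime", "Violent Crime"),
      Count = c(10, 30, 5)
    )

    # subset() replaces the bracket-and-$ idiom for filtering and selecting
    crime.ny.2005 <- subset(crime.by.state,
                            State == "New York" & Year == 2005,
                            select = c(Type.of.Crime, Count))

    # transform() replaces the $<- idiom for adding a derived column
    crime.ny.2005 <- transform(crime.ny.2005,
                               Proportion = Count / sum(Count))
    ```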

    • [email protected] February 21, 2014 / 7:10 pm

      That's a really good point. Thanks!
      This post was more about how my personal idioms can be rewritten with dplyr (I don't use transform much), but I should have made that more clear.

      • Dominique Muller June 18, 2016 / 11:54 am

        I think I read that the subset function was to be avoided. Not sure why and I wonder whether the filter function is more similar to the subset function or to the bracket solution that people say we should favor over the subset function.

  4. Joe Thorley February 21, 2014 / 7:49 pm

    Very useful post which clarified a bunch of things for me very quickly - much appreciated. Minor thing but for what it's worth your post was much easier to read (font-size, width, colors) on my machine on

    • [email protected] February 21, 2014 / 9:31 pm

      Thanks for the kind words about the post.
      I'm always interested in making this blog easier to read. Is the font size too big and the width too small? Is there something I can do to make it easier to read?

  5. Jim Hester February 22, 2014 / 11:27 am

    The reason mutate is slower than just adding a new column to the data frame is that mutate returns a copy of the entire data frame with the new columns added, so it has to copy the entire thing, whereas the base R example just adds the new column to the existing object in place.

    Also as others have said most of your base R examples would be significantly shortened by using subset and transform.

  6. myschizobuddy March 21, 2014 / 10:15 am

    please explain this
    summary.crime.ny.2005 <- summarise(by.type,
    num.types = n(),
    counts = sum(Count))
    what does num.types = n() do?

    • myschizobuddy March 21, 2014 / 10:19 am

      Also by.type is a data frame with state, crime and year columns. How does summarise know that they are not needed.

      • [email protected] March 21, 2014 / 9:01 pm

        by.type doesn't have state or year columns...
        this is the line that declared it from the code in the post:
        by.type <- group_by(crime.ny.2005, Type.of.Crime)
        It is made from the crime.ny.2005 dataframe (and there is only one state and one year in that dataframe). Does that answer your question?

    • [email protected] March 21, 2014 / 8:58 pm

      n() returns the count of occurrences. In 'by.type', there are 4 occurrences of "Violent Crime" and 3 occurrences of "Property Crime", so num.types will hold both of these values.

      If it helps, this is an SQL statement that will perform the same thing:
      SELECT Type.of.Crime, count(*) AS num.types, sum(Count)
      FROM crime.ny.2005
      GROUP BY Type.of.Crime

      I don't know if you're familiar with SQL or not, but I hope that helps. Also, the equivalent base R command is shown, it's:
      summary2 <- aggregate(Count ~ Type.of.Crime, data=crime.ny.2005, FUN=length)

  7. Arun July 4, 2014 / 6:33 pm

    The link to the .csv file is broken. It'd be great if you could upload it somewhere and provide a link to get it from.

    • [email protected] July 7, 2014 / 10:16 am

      Good catch, Arun! I guess the original author took it down. I saved it, though, and updated the link to a self-hosted copy–everything should work now. Thanks for pointing that out!

  8. Steve O July 23, 2014 / 1:07 pm

    Very well done! Short and sweet, yet helpful. Thanks for sharing.

  9. Jarsen March 19, 2015 / 12:29 pm

    You have three lines because of aggregate, and that is really unnecessary. This one line takes care of all that: final <- aggregate(Count ~ Type.of.Crime, crime.ny.2005, function(x) c(sum=sum(x), len=length(x)))

    Some people might not like using that because it will give you two columns, Count.sum and Count.len, where the Count column of the data.frame is really a matrix with two columns, 'sum' and 'len'. That part would easily confuse the uninitiated.

    • JARSEN August 11, 2017 / 8:01 am

      it seems the default behavior of this has changed to where simplify = TRUE. This no longer results in a single matrix column but in two distinct columns

  10. Hannah January 28, 2016 / 10:43 pm

    How can I calculate proportion of the total values for a set of columns? I see how using the mutate_each I can get the sum of the cumulative sum, but what function gives the proportion?

      • Hannah January 29, 2016 / 9:58 pm

        Thanks. I was looking for a dplyr solution.

        • Vinay July 24, 2017 / 4:27 pm

          add %>% mutate(prop = prop.table(n)) to the end of your dplyr chain.
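          For anyone unfamiliar with prop.table(): applied to a plain vector, it divides each element by the grand total, so it is equivalent to n / sum(n):

          ```r
          n <- c(2, 4, 6)

          # prop.table() divides each element by the total (here, 12)
          props <- prop.table(n)   # c(1/6, 1/3, 1/2), same as n / sum(n)
          ```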

  11. helmingstay December 2, 2016 / 9:39 pm

    I came across this blog post searching for the difference between subset() and filter().
    It's a nice post (thanks!), but it *amazes* me to see so many R users unfamiliar with subset() and with(). Honestly, this is something of a failure of the documentation :)

    subset(), with(), and within() are generic functions that do the right thing almost all of the time. So, subset() works just fine (and stupidly fast) with data.table. with() is trickier and quirkier, but ridiculously useful, especially as a conceptual tool.

    Here's Hadley's comment on filter - no difference except for backend (so why not just write a new generic function for subset??):!topic/manipulatr/rXViiihNO0g
