Assertive R programming in dplyr/magrittr pipelines

A lot of my job (and side projects, for that matter) involves running R scripts on updates of open government data. While I’m infinitely grateful to have access to any interesting open datasets in the first place, I can’t ignore that dealing with open data is often a messy affair. In fact, this seems to be characteristic of most data sets I work with, open access or otherwise.

So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses whose results feed into still other analyses.

Then there’s an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happens:

  • In the best case scenario, everything works because there were no errors in the data.
  • In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it’s not clear what datum caused this error.
  • In the worst case scenario, nothing breaks and the error is only caught when the results–or part of them–are nonsensical.
  • In the worst worst case scenario, some or all of the results are wrong but look OK, and the error goes undetected.

In an effort to help solve this common problem, and inspired by the elegance of dplyr/magrittr pipelines, I created an R package called assertr.

assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, the pipeline carries on. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.

Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R’s Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.

library(dplyr)
library(assertr)

mtcars %>%
  verify(am %in% c(0,1)) %>%   # stop here unless every am is 0 or 1
  group_by(am) %>%
  summarise(mean.mpg=mean(mpg))

#   am     mean.mpg
#   0      17.14737
#   1      24.39231
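
For comparison, here is a rough sketch of how the same check might look with base R's stopifnot. This is only an illustration (the name by_am is mine): because stopifnot does not hand the data back, the check typically ends up as a separate statement outside the chain.

```r
library(dplyr)

# stopifnot() returns NULL invisibly, so it cannot pass the data along;
# the check sits in its own statement before the pipeline starts
stopifnot(all(mtcars$am %in% c(0, 1)))

by_am <- mtcars %>%
  group_by(am) %>%
  summarise(mean.mpg = mean(mpg))
```

With only one check this is no great hardship, but each extra check in the middle of a long chain means splitting the chain again and naming another intermediate result.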

Let’s say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read obsessive/compulsive) tendencies than yours truly. If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check if the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon was a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:

mtcars %>%
  assert(in_set(4, 6, 8), cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%
  summarise(mean.mpg=mean(mpg))

#  cyl   mean.mpg
#   4     26.66364
#   6     19.74286
#   8     15.10000
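
To see what a failure looks like, here is a small sketch using a deliberately corrupted copy of the data. The bad_cars data frame and the tenfold typo are made up for illustration, and tryCatch is there only so the script can keep running and show that the summarise step was never reached.

```r
library(dplyr)
library(assertr)

# Hypothetical data entry error: the last car's mpg is inflated tenfold
bad_cars <- mtcars
bad_cars$mpg[nrow(bad_cars)] <- bad_cars$mpg[nrow(bad_cars)] * 10

outcome <- tryCatch(
  bad_cars %>%
    assert(in_set(4, 6, 8), cyl) %>%
    assert(within_bounds(10, 40), mpg) %>%   # fails on the inflated value
    group_by(cyl) %>%
    summarise(mean.mpg = mean(mpg)),
  error = function(e) "stopped before summarise"
)

print(outcome)
```

In a real script you would usually leave the tryCatch out and let the error halt the run, which is the whole point.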

Perhaps one day there will be cars that have more than 8 cylinders or fewer than 4. We might want to check only that the number of cylinders is even (since it has to be even, I think); to do that, we can change the first assert line to:

assert(function(x) x%%2==0, cyl) %>%
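
Dropped into the full pipeline, that might look like the sketch below. The predicate name is_even and the one-row odd_car data frame are hypothetical, added here to show what the check lets through and what it rejects.

```r
library(dplyr)
library(assertr)

# Hypothetical named predicate; equivalent to the anonymous function above
is_even <- function(x) x %% 2 == 0

by_cyl <- mtcars %>%
  assert(is_even, cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%
  summarise(mean.mpg = mean(mpg))

# A made-up five-cylinder car is rejected by the same check
odd_car <- data.frame(cyl = 5, mpg = 25)
rejected <- tryCatch(
  { odd_car %>% assert(is_even, cyl); FALSE },
  error = function(e) TRUE
)
```

Note that this rule would stop the pipeline on any odd cylinder count, so it is only as good as the assumption behind it.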

assertr subscribes to the general idea that it is better to fail fast to spot data errors early. The benefit of assertr’s particular approach is that it’s friendly to the pipeline paradigm used by magrittr and dplyr.

The best thing about assertr’s approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.

To learn more about assertr and the kinds of assertions that you can make with it, visit its page on GitHub.

You can also read the vignette here.


18 Responses

  1. WMC January 24, 2015 / 12:25 pm

    Terrific. Thank you for a valuable validation tool. I hope to use it in an automated web-scraping pipeline where, as you point out, stuff can always go wrong.

    For the record, car engines needn't have an even number of cylinders. Notable semi-recent examples include the Acura Vigor (5 cylinders) and Audi A2 (3). Many cars have had twelve cylinder engines (eg, BMW and Jaguar), and the recent Dodge Viper had ten. Many motorcycles and lawnmowers have only one.

    • [email protected] January 24, 2015 / 11:04 pm

      You're welcome!
      And, haha, thanks! I know close to nothing about cars, apparently. Oops! Thanks for pointing that out!

  2. Brad January 24, 2015 / 12:45 pm

    How would you compare this to the ensurer package? I've used ensurer successfully (and with immense benefit) in magrittr pipelines previously. I completely agree with your advice on assertive programming in analysis pipelines!

    • [email protected] January 24, 2015 / 11:15 pm

      Good question! I didn't know about ensurer when I was writing this package but, looking it over, there are some key differences. I think assertr makes it easier to check attributes of data.frames and easier to check for violations at the element level of vectors/data.frames. Ensurer is awesome in its own right because of the custom failing functions and cool function return value ensurers. And its use of the dot placeholder is super cool.

      So, mainly, I think data.frames, and checking their contents, are more important to assertr.

  3. Kasi January 24, 2015 / 1:38 pm

    Very nice. Thank you for this package! Will try it on Monday. It simplifies my code by a lot.

  4. Daniel Hadley January 24, 2015 / 2:55 pm

    Cool. I use R for a lot of open-gov data and I think this will be handy indeed.

    • [email protected] January 24, 2015 / 11:05 pm

      Haha, I barely know anything about cars–I'm not sure why I thought they had to be an even number. Oh well

  5. Han January 25, 2015 / 12:45 pm

    Well done!

  6. Leo Godin January 25, 2015 / 3:03 pm

    Fantastic! Some of your predicates would work well in the testthat package.

  7. Karthik Ram March 30, 2015 / 9:02 pm

    Cool package, Tony.
    This is related to a project I began a year ago called testdat (like testthat for data). Very similar idea, and I should now see whether it's worth finishing that or porting ideas over here.

    https://github.com/ropensci/testdat

    • [email protected] March 31, 2015 / 10:49 am

      Hi Karthik,
      Cool package yourself! (and cool name for the package).
      Sorry, when I was searching for data integrity testing R packages before I built this one, I didn't see testdat.
      That's up to you, but I'm open to collaborating :)

      • Karthik Ram April 2, 2015 / 6:29 pm

        Hi Tony, that would be fantastic! It makes more sense for me to contribute to your effort since you are further along. I'll follow up over email. Cheers!
