A lot of my job (and side projects, for that matter) involves running R scripts on updates of open government data. While I’m infinitely grateful to have access to any interesting open datasets in the first place, I can’t ignore that dealing with open data is often a messy affair. In fact, this seems to be characteristic of most datasets I work with, open access or otherwise.
So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses whose results feed into still other analyses.
Then there’s an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happens:
- In the best case scenario, everything works because there were no errors in the data.
- In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it’s not clear what datum caused this error.
- In the worst case scenario, nothing breaks and the error is only caught when the results, or part of them, are nonsensical.
- In the worst worst case scenario, some or all of the results are wrong but look OK, and the error goes undetected.
In an effort to help solve this common problem, and inspired by the elegance of dplyr/magrittr pipelines, I created an R package called assertr.
assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.
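To make the contrast with stopifnot concrete, here is a minimal sketch. The `bad_cars` data frame is a hypothetical corrupted copy of mtcars invented for illustration, and the `tryCatch`/`try` wrappers are only there so the script keeps running after the expected failure.

```r
library(dplyr)
library(assertr)

# Hypothetical bad update: one transmission code arrives as 2 instead of 0 or 1
bad_cars <- mtcars
bad_cars$am[3] <- 2

# Base R: the check is a separate statement that sits outside the pipeline
base_check <- try(stopifnot(all(bad_cars$am %in% c(0, 1))), silent = TRUE)
base_failed <- inherits(base_check, "try-error")

# assertr: the same check is just one more step in the pipe;
# when it fails, the grouping and summarising steps never run
piped <- tryCatch(
  bad_cars %>%
    verify(am %in% c(0, 1)) %>%
    group_by(am) %>%
    summarise(mean.mpg = mean(mpg)),
  error = function(e) {
    message("pipeline stopped: ", conditionMessage(e))
    NULL
  }
)
```

Both checks fail on the bad row, but only the verify version reads as part of the pipeline itself. The exact wording of the error message may differ between assertr versions.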
Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R’s Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.
    mtcars %>%
      verify(am %in% c(0, 1)) %>%
      group_by(am) %>%
      summarise(mean.mpg = mean(mpg))

    #  am mean.mpg
    #   0 17.14737
    #   1 24.39231
Let’s say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read: obsessive-compulsive) tendencies than yours truly. If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check that the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon is a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:
    mtcars %>%
      assert(in_set(4, 6, 8), cyl) %>%
      assert(within_bounds(10, 40), mpg) %>%
      group_by(cyl) %>%
      summarise(mean.mpg = mean(mpg))

    #  cyl mean.mpg
    #    4 26.66364
    #    6 19.74286
    #    8 15.10000
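As a rough illustration of why the bounds check matters for a non-robust estimator like the mean, the sketch below plants a single hypothetical typo (21.0 mpg keyed in as 210 for the first car, which has 6 cylinders) in a copy of mtcars. The `typo_cars` name and the typo itself are assumptions made up for this example.

```r
library(dplyr)
library(assertr)

# Hypothetical data-entry error: 21.0 mpg entered as 210 for the first row
typo_cars <- mtcars
typo_cars$mpg[1] <- 210

# Without a check, the 6-cylinder mean is silently inflated by the one bad value
unchecked <- typo_cars %>%
  group_by(cyl) %>%
  summarise(mean.mpg = mean(mpg))

# With the bounds assertion, the pipeline halts before any summarising happens
checked <- tryCatch(
  typo_cars %>%
    assert(within_bounds(10, 40), mpg) %>%
    group_by(cyl) %>%
    summarise(mean.mpg = mean(mpg)),
  error = function(e) NULL
)
```

Here `unchecked` still comes back looking like a perfectly ordinary summary table, which is the "worst worst case" scenario, while `checked` is never computed.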
Perhaps one day there will be cars that have more than 8 cylinders or fewer than 4. We might want to check only that there is an even number of cylinders (since it has to be even, I think); we can change the first assert line to:
      assert(function(x) x %% 2 == 0, cyl) %>%
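The same idea can be sketched with a named predicate, which makes the assumption a little easier to read in the pipeline. The name `is_even` and the `odd_cars` data frame (with an invented 5-cylinder entry) are hypothetical, chosen only for illustration.

```r
library(dplyr)
library(assertr)

# Any function that returns TRUE/FALSE for a value can serve as a predicate
is_even <- function(x) x %% 2 == 0

by_cyl <- mtcars %>%
  assert(is_even, cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%
  summarise(mean.mpg = mean(mpg))

# Hypothetical odd cylinder count: the assertion stops the pipeline
odd_cars <- mtcars
odd_cars$cyl[1] <- 5

odd_result <- tryCatch(
  odd_cars %>% assert(is_even, cyl),
  error = function(e) NULL
)
```

On the untouched mtcars data the custom predicate passes and the summary is computed as before; on the modified copy the pipeline stops at the first assert.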
assertr subscribes to the general idea that it is better to fail fast to spot data errors early. The benefit of assertr’s particular approach is that it’s friendly to the pipeline paradigm used by magrittr and dplyr.
The best thing about assertr’s approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.
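One way to keep those assumptions stated in a single place is to bundle them into a reusable step. The sketch below leans on magrittr's functional-sequence syntax (`. %>% ...`), which builds a function out of a pipeline; the name `check_cars` is hypothetical, and this is just one possible arrangement rather than anything assertr prescribes.

```r
library(dplyr)
library(assertr)

# All assumptions about the data, stated once and up front
check_cars <- . %>%
  verify(am %in% c(0, 1)) %>%
  assert(in_set(4, 6, 8), cyl) %>%
  assert(within_bounds(10, 40), mpg)

# Every analysis that depends on these assumptions starts with the same step
by_cyl <- mtcars %>%
  check_cars() %>%
  group_by(cyl) %>%
  summarise(mean.mpg = mean(mpg))

by_am <- mtcars %>%
  check_cars() %>%
  group_by(am) %>%
  summarise(mean.mpg = mean(mpg))
```

When the data are updated, any violated assumption then surfaces at the top of each analysis rather than somewhere deep in the workflow.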