In data analysis workflows that depend on un-sanitized data sets from external sources, it’s very common that errors in data bring an analysis to a screeching halt. Oftentimes, these errors occur late in the analysis and provide no clear indication of which datum caused the error.

On occasion, the error resulting from bad data won’t even appear to be a data error at all. Still worse, errors in data will pass through analysis without error, remain undetected, and produce inaccurate results.

The solution to the problem is to provide as much information as you can about how you expect the data to look up front so that any deviation from this expectation can be dealt with immediately. This is what the assertr package tries to make dead simple.

Essentially, assertr provides a suite of functions designed to verify assumptions about data early in an analysis pipeline. assertr is meant to be used with the piping constructs of the `magrittr`

package and fits right in with the structure and data manipulation verbs of the `dplyr`

package.

Let’s say, for example, that the R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding.

In particular, the mtcars dataset looks like this:

`head(mtcars)`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

But let’s pretend that the data we got accidentally negated the 5th mpg value:

```
our.data <- mtcars
our.data$mpg[5] <- our.data$mpg[5] * -1
our.data[4:6,]
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout -18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

Whoops!

If we wanted to find the average miles per gallon for each number of engine cylinders, we might do so like this:

```
library(dplyr)
our.data %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 12.42857
```

This indicates that the average miles per gallon for a 8 cylinder car is a lowly 12.43. However, in the correct dataset it’s really just over 15. Data errors like that are extremely easy to miss because it doesn’t cause an error, and the results look reasonable.

To combat this, we might want to use assertr’s `verify`

function to make sure that `mpg`

is a positive number:

```
library(assertr)
our.data %>%
verify(mpg >= 0) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

`## Error in verify(., mpg >= 0): verification failed! (1 failure)`

If we had done this, we would have caught this data error.

The `verify`

function takes a data frame (its first argument is provided by the `%>%`

operator), and a logical (boolean) expression. Then, `verify`

evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression’s result are `FALSE`

, `verify`

will raise an error that terminates any further processing of the pipeline.

We could have also written this assertion using `assertr`

’s `assert`

function…

```
our.data %>%
assert(within_bounds(0,Inf), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

`## Error: Assertion 'within_bounds' violated at index 5 of vector 'mpg' (value: -18.7)`

The `assert`

function takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and will raise an error when it finds the first violation.

Internally, the `assert`

function uses `dplyr`

’s `select`

function to extract the columns to test the predicate function on. This allows for complex assertions. Let’s say we wanted to make sure that all values in the dataset are *greater* than zero (except `mpg`

):

```
library(assertr)
our.data %>%
assert(within_bounds(0,Inf, include.lower=FALSE), -mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

`## Error: Assertion 'within_bounds' violated at index 1 of vector 'vs' (value: 0)`

The first noticable difference between `verify`

and `assert`

is that `verify`

takes an expression, and `assert`

takes a predicate and columns to apply it to. This might make the `verify`

function look more elegant–but there’s an important drawback.

`verify`

has to evaluate the entire expression first, and *then* check if there were any violations. Because of this, `verify`

can’t tell you the offending datum. That brings us to the second difference.

Because `assert`

applies the predicate function to each datum, one at a time, it can stop immediately after finding the first violation, and specify the location and the value of the offending element.

This also means that `assert`

will fail sooner than `verify`

, potentially making it a faster, less time-consuming affair for data that are assumed to have errors.

One important drawback to `assert`

, and a consequence of its application of the predicate to *columns*, is that `assert`

can’t confirm assertions about the data structure *itself*. For example, let’s say we were reading a dataset from disk that we know has more than 100 observations; we could write a check of that assumption like this:

```
dat <- read.csv("a-data-file.csv") %>%
verify(nrow(dat) > 100) %>%
....
```

This is a powerful advantage over `assert`

… but `assert`

has one more advantage of its own that we heretofore ignored.

`assertr`

’s predicates, both built-in and custom, make `assert`

very powerful. The three predicates that are built in to `assertr`

are

`not_na`

- that checks if an element is not NA`within_bounds`

- that returns a predicate function that checks if a numeric value falls within the bounds supplied, and`in_set`

- that returns a predicate function that checks if an element is a member of the set supplied.

We’ve already seen `within_bounds`

in action… let’s use the `in_set`

function to make sure that there are only 0s and 1s (automatic and manual, respectively) values in the `am`

column…

```
our.data %>%
assert(in_set(0,1), am) %>%
...
```

If we were reading a dataset that contained a column representing boroughs of New York City (named `BORO`

), we can verify that there are no mis-spelled or otherwise unexpected boroughs like so…

```
boroughs <- c("Bronx", "Manhattan", "Queens", "Brooklyn", "Staten Island")
read.csv("a-dataset.csv") %>%
assert(in_set(boroughs), BORO) %>%
...
```

Rad!

A convenient feature of `assertr`

is that it makes the construction of custom predicate functions easy.

In order to make a custom predicate, you only have to specify cases where the predicate should return FALSE. Let’s say that a dataset has an ID column (named `ID`

) that we want to check is not an empty string. We can create a predicate like this:

`not.empty.p <- function(x) if(x=="") return(FALSE)`

and apply it like this:

```
read.csv("another-dataset.csv") %>%
assert(not.empty.p, ID) %>%
...
```

Let’s say that the ID column is always a 7-digit number. We can confirm that all the IDs are 7-digits by defining the following predicate:

`seven.digit.p <- function(x) nchar(x)==7`

A powerful consequence of this easy creation of predicates is that the `assert`

function lends itself to use with lambda predicates (unnamed predicates that are only used once). The check above might be better written as

```
read.csv("another-dataset.csv") %>%
assert(function(x) nchar(x)==7, ID) %>%
...
```

Neat-o!

`insist`

and predicate ‘generators’Very often, there is a need to dynamically determine the predicate function to be used based on the vector being checked.

For example, to check to see if every element of a vector is within *n* standard deviations of the mean, you need to create a `within_bounds`

predicate *after* dynamically determining the bounds by reading and computing on the vector itself.

To this end, the `assert`

function is no good; it just applies a raw predicate to a vector. We need a function like `assert`

that will apply predicate *generators* to vectors, return predicates, and *then* perform `assert`

-like functionality by checking each element of the vectors with its respective custom predicate. This is precisely what `insist`

does.

This is all much simpler than it may sound. Hopefully, the examples will clear up any confusion.

The primary use case for `insist`

is in conjunction with the `within_n_sds`

predicate generator.

Suppose we wanted to check that every `mpg`

value in the `mtcars`

data set was within 3 standard deviations of the mean before finding the average miles per gallon for each number of engine cylinders. We could write something like this:

```
mtcars %>%
insist(within_n_sds(3), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

Notice what happens when we drop that z-score to 2 stardard deviations from the mean

```
mtcars %>%
insist(within_n_sds(2), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

`## Error: Assertion 'within_n_sds' violated at index 18 of vector 'mpg' (value: 32.4)`

Execution of the pipeline was halted. But now we know exactly which data point (and column) violated the predicate that `within_n_sds(3)(mtcars$mpg)`

returned.

Now that’s an efficient car!

After the predicate generator, `insist`

takes an arbitrary number of columns just like `assert`

using the syntax of `dplyr`

’s `select`

function. If you wanted to check that everything in mtcars is within 10 standard deviations of the mean (of each column vector), you can do so like this:

```
mtcars %>%
insist(within_n_sds(10), mpg:carb) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

Aces!

Let’s say that as part of an automated pipeline that grabs mtcars from an untrusted source and finds the average miles per gallon for each number of engine cylinders, we want to perform the following checks…

- that the dataset contains more than 10 observations
- that the column for ‘miles per gallon’ (mpg) is a positive number
- that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean, and
- that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only

This could be written thusly:

```
mtcars %>%
verify(nrow(mtcars) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

Ew, there are four lines of assertions before the real fun starts. We can make look much better by abstracting out all the assertions:

```
check_me <- . %>%
verify(nrow(mtcars) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs)
mtcars %>%
check_me %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

Awesome! Now we can add an arbitrary number of assertions, as the need arises, without touching the real logic.

`insist`

`assertr`

is build with robustness, correctness, and extensibility in mind. Just like `assertr`

makes it easy to create your own custom predicates, so too does this package make it easy to create your own custom predicate generators.

Okay… so its, perhaps, not *easy* because predicate generators by nature are functions that return functions. But it’s possible!

Let’s say you wanted to create a predicate generator that checks if all elements of a vector are within 3 times the vector’s interquartile range from the median. We need to create a function that looks like this

```
within_3_iqrs <- function(a_vector){
the_median <- median(a_vector)
the_iqr <- IQR(a_vector)
within_bounds((the_median-the_iqr*3), (the_median+the_iqr*3))
}
```

Now, we can use it on `mpg`

from `mtcars`

like so:

```
mtcars %>%
insist(within_3_iqrs, mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

There are two problems with this, though…

- We may want to abstract this so that we can supply an arbitrary number of IQRs to create the bounds with
- We lose the ability to choose what optional arguments (if any) that we give to the returned
`within_bounds`

predicate.

Now we have to write a function that returns a function that returns a function…

```
within_n_iqrs <- function(n, ...){
function(a_vector){
the_median <- median(a_vector)
the_iqr <- IQR(a_vector)
within_bounds((the_median-the_iqr*n), (the_median+the_iqr*n), ...)
}
}
```

Much better! Now, if we want to check that every `mpg`

from `mtcars`

is within 5 IQRs of the median and *not allow NA values* we can do so like this:

```
mtcars %>%
insist(within_n_iqrs(5), mpg) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```

```
## Source: local data frame [3 x 2]
##
## cyl avg.mpg
## 1 4 26.66364
## 2 6 19.74286
## 3 8 15.10000
```

Super!