Data validation with the assertr package

Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.

I can (and do) go on and on about the treachery of messy/bad datasets. Though it’s substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail: either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).

assertr is an R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alerts you to any deviation from those assumptions.

I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package: assertr is not (and can never be) a “one-stop shop” for all of your data validation needs. The specific kinds of checks individuals or teams have to perform on any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although the assertive meta-package may come very close!). But all of these checks will reuse motifs and follow the same patterns. So, instead, I’m trying to sell assertr as a way of thinking about dataset validations: a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.

In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:

  • For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
  • For every element in a column, you want to make sure certain criteria are met, but the criteria can only be decided after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires a computation on the whole column first, to determine the criteria to check against. assertr calls this verb insist.
  • For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n missing values, or that a group of columns is jointly unique and never duplicated. assertr calls this verb assert_rows.
  • For every row of a dataset, you want to make sure certain assumptions hold, but the criteria can only be decided after looking at all of the rows together. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking that the Mahalanobis distance between each row and all other rows is within n standard deviations of the mean distance. assertr calls this verb insist_rows.
  • You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.
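
To make the five verbs concrete, here is a minimal sketch that applies each of them once to the built-in mtcars dataset. The particular thresholds (3 standard deviations, 2 NAs per row, and so on) are illustrative choices, not recommendations, and dplyr is loaded here only to supply %>% and everything().

```r
library(dplyr)    # supplies %>% and everything()
library(assertr)

# assert: a fixed predicate applied to every element of the chosen columns
checked <- mtcars %>% assert(within_bounds(0, Inf), mpg, wt)

# insist: the predicate is generated from the column itself (mean +/- 3 sd)
checked <- checked %>% insist(within_n_sds(3), mpg)

# assert_rows: reduce each row to a single value, then apply a fixed predicate
checked <- checked %>%
  assert_rows(num_row_NAs, within_bounds(0, 2), everything())

# insist_rows: like assert_rows, but the bounds are computed from all the
# reduced row values (here, Mahalanobis distances)
checked <- checked %>%
  insist_rows(maha_dist, within_n_mads(10), everything())

# verify: a logical expression evaluated against the dataset as a whole
checked <- checked %>% verify(nrow(.) > 10)
```

When a check passes, each verb hands the data on to the next step, which is what lets the calls be strung together in a pipeline.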

Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…

Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…

  • that it has the columns mpg, vs, am, and wt
  • that the dataset contains more than 10 observations
  • that the column for ‘miles per gallon’ (mpg) contains only positive numbers
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • that each row contains at most 2 NAs
  • that each row is unique jointly between the mpg, am, and wt columns
  • that each row’s Mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

library(dplyr)    # provides %>%, group_by(), summarise(), and everything()
library(assertr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.
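
As a rough sketch of what a chain looks like, the example below runs two assertions over a small made-up data frame and collects the failures instead of stopping at the first one. The orders data and its column names are hypothetical, invented purely for illustration. Passing error_return to chain_end is one way to get the accumulated errors back as a list rather than halting with the default report.

```r
library(dplyr)    # supplies %>%
library(assertr)

# Hypothetical toy data with two deliberate problems:
# a negative price and a status outside the allowed set.
orders <- data.frame(
  price  = c(9.99, -1.00, 25.50),
  status = c("paid", "paid", "unknown"),
  stringsAsFactors = FALSE
)

errs <- orders %>%
  chain_start() %>%
  assert(within_bounds(0, Inf), price) %>%       # fails on row 2
  assert(in_set("paid", "refunded"), status) %>% # fails on row 3
  chain_end(error_fun = error_return)            # return errors, don't halt

# Both assertions ran even though the first one failed
length(errs)
```

Leaving error_fun at its default in chain_end would instead stop execution and print the full report described above.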

Another major improvement since the last version of assertr on CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct assertr to just give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions").
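
Here is a small sketch of those two customizations, using the success_fun and error_fun arguments with functions documented on that help page. The bound of 20 on mpg is chosen only because mtcars is known to violate it.

```r
library(dplyr)    # supplies %>%
library(assertr)

# Return TRUE/FALSE instead of the data (on success) or an error (on failure)
ok <- mtcars %>%
  verify(nrow(.) > 10,
         success_fun = success_logical, error_fun = error_logical)

bad <- mtcars %>%
  assert(within_bounds(0, 20), mpg,
         success_fun = success_logical, error_fun = error_logical)

# Warn instead of halting; the data is passed along so a pipeline can continue
still_data <- mtcars %>%
  assert(within_bounds(0, 20), mpg, error_fun = just_warn)
```

The logical form is handy inside if statements or unit tests, while just_warn suits exploratory work where a violation should be noticed but not fatal.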

Beyond these examples

Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using assertr to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!). In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There are, however, two good pieces of news.

The first is that there is another package, assertive (vignette here), that greatly enhances the assertr experience. The predicates (functions that start with “is_”) from this (meta)package can be used in assertr pipelines just as easily as assertr’s own predicates. And assertive has an enormous number of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!

The second is that if assertive doesn’t have what you’re looking for, with just a little bit of studying the assertr grammar, you can whip up your own predicates with relative ease. Using some of these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.
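
As an illustration of how little is needed, the sketch below defines a custom predicate and hands it to assert. The name is_whole_number is hypothetical (it is not part of assertr or assertive); the only assumption is that a predicate is a function that takes a single element and returns a single TRUE or FALSE.

```r
library(dplyr)    # supplies %>%
library(assertr)

# Hypothetical custom predicate: TRUE only for non-missing whole numbers
is_whole_number <- function(x) {
  if (!is.numeric(x) || is.na(x)) return(FALSE)
  x == round(x)
}

# cyl, gear and carb are counts, so every element should pass
checked <- mtcars %>% assert(is_whole_number, cyl, gear, carb)

# wt holds fractional weights, so the same predicate fails there;
# error_logical turns the failure into FALSE instead of halting
fails <- mtcars %>%
  assert(is_whole_number, wt, error_fun = error_logical)
```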

If this package interests you, I urge you to read the most recent package vignette here. If you’re an assertr old-timer, I point you to this NEWS file that lists the changes from the previous version.


14 Responses

  1. mstanley March 22, 2017 / 12:07 pm

    I have two main bits of feedback on this package. The first is that conceptually, I really like this idea. It's interesting and useful.

    The second is that this seems like a perfect opportunity to implement some of Hadley's essay Beyond Exception Handling:

    One of the things this package is trying to do is indicate that different conditions have arisen (that is to say, different checks have failed) and provide both a selection of recovery strategies (raise errors, raise warnings, add to results data, etc) and allow the possibility of custom recovery strategies. This is exactly what the condition system is designed for so it would probably be useful for the methods in that system to be used.

    • [email protected] March 25, 2017 / 2:30 pm

      That's really good feedback, thanks!
      Yeah, I'll have to reread that chapter again.
      The error system in R is heavily inspired by Common Lisp's error system which is the most powerful I've ever seen but also *extremely* complicated.

      • mstanley March 27, 2017 / 4:38 am

        In internal packages I have three (kinds of) functions:

        1) raise_condition, which Hadley defines
        2) Specific conditions, which indicate the types of errors that I anticipate happening; for example, I have a function called wrong_type_error that uses raise_condition to create a condition that indicates the input had some kind of incorrect class and contains enough information about the input and expectations to recover from the error
        3) Functions that check criteria and raise conditions if the checks fail. For example I have check_type_is and check_type_is_not that both accept an input and a list of types. They simply check the class list against the expectation and run wrong_type_condition if the check fails.

        There is a fourth type of niche function, too:
        Because wrong_type_condition contains all the information necessary to recover from the error (the original object, its classes as received, and the expectation of what it should have been) I can use withCallingHandlers to define custom recovery strategies if I need to. These could be kept with the above code but that hasn't been necessary so far; these strategies are usually project-specific.

        • mstanley March 27, 2017 / 4:39 am

          It's called wrong_type_condition, not wrong_type error, whoops.

  2. Tobias March 23, 2017 / 3:31 pm

    How would I apply assertive::is_matching_regex() with a particular pattern to multiple columns? Or should I assert regex matching in some other way?

    • [email protected] March 25, 2017 / 2:49 pm

  3. Apoorv April 26, 2017 / 7:16 am

    Is it possible to check for a conditional constraint? For example, a column’s value may depend on the values of other columns, e.g. pregnancy depends on gender. Similarly, a column can depend on more than one column, which happens often in survey data. It would be helpful if these cases could also be flagged.

    • [email protected] April 26, 2017 / 10:13 am

      Good question! You can do that with

      and the row reduction functions.

      For example:

      Also, as a side-note, it's possible for people who identify as male to be pregnant :)

  4. Francisco May 4, 2017 / 6:35 am


    I would like to know whether it is possible to get the indices and values from the error report. For instance, in:

    Column 'X' violates assertion 'within_n_iqrs(7)' 2 times
    index value
    1 10 1500000
    2 11 1500000
    Error: assertr stopped execution

    I would like to create an R data table with these indices (10 and 11), the values (1500000) and other data provided from another source.


    • [email protected] May 4, 2017 / 10:29 am

      Thanks for reaching out! You can do this using the error_fun parameter. Like this

      Does that answer your question?

      • Francisco May 9, 2017 / 2:03 am

        I tried this with error_append() and error_report() as:

        dt %>%
        chain_start() %>%
        chain_end %>%
        error_report() -> this

        And it did not work. Any guess?


        • [email protected] May 15, 2017 / 10:49 am

          is supposed to be used inside the

          function, not by itself

  5. Anonymous October 2, 2017 / 2:28 pm

    Can I find any similar library with Python or Anaconda3?
