qstats - quick and dirty statistics tool for the Unix pipeline

Back when 200MB hard drives were the size of washing machines and programs had no choice but to be as efficient as possible, Unix was born. In a serendipitous twist of fate, the same programs that were borne of this era of 4MB RAM and 16 bit processors are useful to data analysts with 2,000 times the amount of RAM and 64 bit multicore processors, processing data files over several GBs large.

Like all good things, Unix was started at Bell Labs in the late 60s. It has been honed over 40 years and now runs, if not on your computer, on the vast majority of the web servers you visit, a lot of phones, embedded devices you use, and a toaster near you.

Unices?

Since near everything in Unix is a text file, it grew up to be… very good at processing text. This is why Unix tools are a great addition to the data analyst's toolbox. There are a few great posts on how to get started using these tools in your workflow (here, here, here, and here) which you should read. By the way, when I talk about tools here, I’m talking about pipeline-able tools that take raw text input from standard input like sed; awk; and grep, not perl; tcl; or python.

There are tools to select columns, filter text for regular expressions, join files on a key, and reshape arrays, but I felt like there was one that was missing. After chaining tool after tool together and finally cajoling the data into a format and subset that I want to process and explore, I'd have to redirect the stream to a text file and read it from R. Clearly, if I’m to perform some complicated machine learning algorithm with this data, this is the best way to go. But if I just want to take a peek at the spread of the data, or quickly compare means, this is overkill.

Introducing qstats

Inspired by this gap in the Unix toolchain, I wrote a tool, qstats, that computes simple summary statistics from the command-line. It also includes data-binning and simple bar chart functionality. I designed it, in C, specifically to be as fast as possible, and bare-bones enough to work on any POSIX-compliant system without having to deal with outside dependencies. Let’s see it in action…

Functionality
By default, qstats will print R-like summary statistics on the given data. This includes the minimum value, the 1st quartile, the median, the mean, the third quartile, the maximum value, the range, and the standard deviation. You can use the -m flag to just get the mean. This will be faster because the data does not have to be sorted.

In addition to these statistics, qstats can also produce a frequency tabulation with an arbitrary number of "bins". Calling qstats with the -f10 flag will create 10 equal intervals and -f20 will create 20. Just calling it with -f will use Sturge's rule to come up with a reasonable number of bins in most cases.

Finally, with the -b flag, qstats will output a histogram-like horizontal bar-chart. Much like with the -f flag, you can supply the number of intervals to create. We will see an example of the bar-chart at work in the next section.

Rudimentary spread visualization

To view the spread with a bar-chart, let's output samplings from two distributions, the normal and the chi-square...

# one million normally distributed with a mean of 100 and a standard deviation of 10
millnorm <- rnorm(1000000, mean=100, sd=10)
write.table(millnorm, "millnorm.dat", col.names=FALSE, row.names=FALSE)

# one million values sampled from the chi-square distribution with two degrees of freedom
millchi <- rchisq(1000000, df=2)
write.table(millchi, "millchi.dat", col.names=FALSE, row.names=FALSE)
Normal distribution

Visualization of normal distribution

Chi-square distribution

Visualization of chi-square distribution (two degrees of freedom)

Speed comparisons
Let’s create a file of 100,000,000 floating point numbers to test speeds with R…

# sample from normal distribution with a mean of 100 and a standard deviation of 10
one.h.m <- rnorm(100000000, mean=100, sd=10)
write.table(one.h.m, “one_hundred_million.dat”, row.names=FALSE, col.names=FALSE)

The resulting file is 1.7 GBs large.

  • R
    The R script that we’ll time will look like this…

    #!/usr/bin/rscript —vanilla
    frame <- scan(“one_hundred_million.dat”)
    summary(frame)
    

    and the timing...

    $ time ./rtest.R
    Read 100000000 items
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      44.95   93.26  100.00  100.00  106.70  157.00 
    ./rtest.R  210.66s user 3.57s system 99% cpu 3:35.08 total
    

    3.5 minutes

  • Awk

    $ time awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat
    100.001
    awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat  128.34s user 0.56s system 99% cpu 2:09.01 total
    

    2 minutes.

    Note that this only computes the mean, not any of the other summary statistics. Some of these require sorting, which takes more time.

  • sort command

    $ time sort -n one_hundred_milion.dat > /dev/null
    sort -n one_hundred_milion.dat > /dev/null  151.89s user 3.46s system 99% cpu 2:35.72 total
    

    2.5 minutes

  • qstats

    $ time qstats one_hundred_milion.dat
    Min.     44.947
    1st Qu.  93.2553
    Median   100.001
    Mean     100.001
    3rd Qu.  106.747
    Max.     156.997
    Range    112.05
    Std Dev. 10.0002
    Length   100000000
    qstats one_hundred_milion.dat  53.62s user 1.04s system 99% cpu 54.722 total
    

    a little less than a minute

  • I show these comparisons here not to compare this small program with these great, veteran tools. Instead, I just want to underscore the point that smaller, very-few-trick-pony, specialized programs can afford to be faster than their more capable and robust counterparts. When these small tools will do the trick, they can not only be faster and simpler to use, but they also comport more with the Rule of Parsimony from the Unix philosophy.

Final words
The source code for this project is on Github with installation instructions. You can also download and install from this tarball. I've tested it on OS X, Debian, and NetBSD, but it should compile without any issue on any POSIX system with a reasonably recent C compiler. Please let me know if there are any installation issues for your system.

Please fork me and feel free to send a pull request or add an issue to the repo. I hope you like it!

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail