Data validation with the assertr package

Version 2.0 of my data set validation package assertr hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using assertr, the text below will point out the improvements.

I can (and have) go on and on about the treachery of messy/bad datasets. Though its substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, it’s success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious).

assertr is a R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alert you of any deviation from those assumptions.

I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; assertr is not (and can never be) a “one-stop shop” for all of your data validation needs. The specific kind of checks individuals or teams have to perform any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although, the assertive meta-package may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I’m trying to sell assertr as a way of thinking about dataset validations—a set of common dataset validation actions. If we think of these actions as verbs, you could say that assertr attempts to impose a grammar of error checking for datasets.

In my experience, the overwhelming majority of data validation tasks fall into only five different patterns:

  • For every element in a column, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. assertr calls this verb assert.
  • For every element in a column, you want to make sure certain criteria are met but the criteria can only be decided only after looking at the entire column as a whole. For example, testing whether each element is within n standard deviations of the mean of the elements requires computation on the elements prior to inform the criteria to check for. assertr calls this verb insist.
  • For every row of a dataset, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than n number of missing values, or that a group of columns are jointly unique and never duplicated. assertr calls this verb assert_rows.
  • For every row of a dataset, you want to make sure certain assumptions hold but the criteria can only be decided only after looking at the entire column as a whole. This closely mirrors the distinction between assert and insist, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows are within n number of standard deviations of the mean distance. assertr calls this verb insist_rows.
  • You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than n columns, making sure the dataset has some specified column names, etc… assertr calls this last verb verify.

Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…

Pretend that, before finding the average miles per gallon for each number of engine cylinders in the mtcars dataset, we wanted to confirm the following dataset assumptions…

  • that it has the columns mpg, vs, and am
  • that the dataset contains more than 10 observations
  • that the column for 'miles per gallon' (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • each row contains at most 2 NAs
  • each row is unique jointly between the mpg, am, and wt columns
  • each row's mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)
library(dplyr)
library(assertr)

mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
  assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
  insist_rows(maha_dist, within_n_mads(10), everything()) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

Before assertr version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the chain_start and chain_end functions; all assumptions within a chain (below a call to chain_start and above chain_end) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable.

Another major improvement since the last version of assertr of CRAN is that assertr errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell assertr to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct assertr to just give a warning instead of throwing a fatal error. For more information on this, see help("success_and_error_functions")

Beyond these examples

Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using assertr to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!) In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There is, however, two good pieces of news.

The first is that there is another package, assertive (vignette here) that greatly enhances the assertr experience. The predicates (functions that start with “is_”) from this (meta)package can be used in assertr pipelines just as easily as assertr’s own predicates. And assertive has an enormous amount of them! Some specialized and exciting examples include is_hex_color, is_ip_address, and is_isbn_code!

The second is if assertive doesn’t have what you’re looking for, with just a little bit of studying the assertr grammar, you can whip up your own predicates with relative ease. Using some these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.

If this package interests you, I urge you to read the most recent package vignette here. If you're a assertr old-timer, I point you to this NEWS file that list the changes from the previous version.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Using Python decorators to be a lazy programmer: a case study

Decorators are considered one of the more advanced features of python and it will often be the last topic in a python class or introductory book. It will, unfortunately, also be one that trips up many beginning or even intermediate python programmers. Those who stick it out and work through it, though, will be handsomely rewarded for their hard work.

Known by those in-the-know, decorators are tools to make your python code beautiful, more concise, well-written, and elegant--but did you know you can use decorators to be a lazy bum of a programmer?!

I just recently whipped up a script that I'm using to help my expand and organize my photo library of my favorite pieces of art. This script that takes the URL of an art piece from artsy.net, downloads the image of the artwork and renames the downloaded file to follow a filename template (given as a CLI arg) based on the artist's name, the title of the piece, and the date of completion. For example, if we wanted to download an image of the Fountain, and have the file automatically named: Marcel Duchamp - Fountain - 1917.jpg, one can use the following command:

python artsy-dl.py "https://www.artsy.net/artwork/marcel-duchamp-fountain-1" 
                   "%a - %t - %d"

because %a is automatically replaced with the artist's name (that is extracted from the webpage), %t is automatically replaced with the title of the piece, and %d is the date of the piece.

In this post, I'll be demonstrating the use of decorators to the end doing the bare minimum and how I saved myself from having to write tedious (but important) error-checking code. But before that, you, dear reader, have to be clear on what a decorator is. The section that follows it perhaps the greatest intro to decorators ever.

(If you are already familiar with decorators, you can skip to the section called "The problem"--though you may want to at least skim this section.)

What are decorators?

Here's the skinny on Python decorators. Grokking decorators necessitates an intuitive understanding of three concepts:

  • People often speak of functions as being "first-class citizens" in Python. By this they mean that functions are values that can be assigned to variables, returned from (other) functions, and passed as an argument to (still other) functions.
  • When a function (we’ll call it outer) returns another function (we'll call this inner), the inner function "closes over" (remembers) variables defined in the enclosing scope (outer's scope). If the returned function is stored in a variable and called at a later time, it still remembers the variable(s) from the enclosing scope—even if it is called long after the outer function finishes running and it's variables otherwise lost. The inner function that is returned is known as a "closure".
  • A language that supports closures affords us the unique opportunity to easily add or modify the behavior of a function a by creating a function b that takes function a as an argument, and returns a function c which does something, and then calls function a. Function c can now be used in place of function a--it is essentially function a plus some extra functionality: it is a "decorated" version of function a.

In order to concretize these concepts, let's see an example of a decorator complete with an illustration of the motivation behind creating it and the cognitive steps taken toward it's finished state. Unlike some decorator tutorials, this lesson will not patronize you, dear reader, by designing a overly-simple decorator with no practical worth (it's always been my thought that this pedagogical strategy most often backfires). Instead, we'll create, you and I, a decorator of actual utility.

Suppose we wanted to time the execution of a function. Wanting something with a little more precision than a stopwatch, we decide to use the time module:

import time

def sleep_for_a_second():
    time.sleep(1)

start_time = time.time()
sleep_for_a_second()
end_time = time.time()

print("It took {0:.2f} seconds".format(end_time-start_time))
#> It took 1.00 seconds

This is ok, but if we want to time the execution of many different functions, this will result in a lot of repeated code. Being champions of the DRY principle, we decide it would be better to put this in a function:

def time_a_function(func):
    start_time = time.time()
    func()
    end_time = time.time()
    print("It took {0:.2f} seconds".format(end_time-start_time))

def sleep_for_a_second():
    time.sleep(1)

def sleep_for_two_seconds():
    time.sleep(2)

time_a_function(sleep_for_a_second)
#> It took 1.00 seconds
time_a_function(sleep_for_two_seconds)
#> It took 2.00 seconds

time_a_function is a function that takes the function we want to time as an argument.

We just timed two functions. Notice how we can now time an arbitrary number of functions with no extra code.

But there's an issue with this approach. We've hitherto been timing functions that take no arguments. How would we time a function that takes one or more arguments?

def sleep_for_n_seconds(n):
    time.sleep(n)

time_a_function(sleep_for_n_seconds(5))
#> TypeError: 'NoneType' object is not callable

Nope. Before, we were passing the variable that holds the function to time_a_function, but the above incantation evaluates sleep_for_n_seconds(5), passes it's None return value to time_a_function and because time_a_function can't call it (because it's not a function), we get an error. So how are we going to time sleep_for_n_seconds?

The solution is to make a function that takes a function that returns a function that takes an argument (n) and performs the function and times it and use the returned function in place of the original (whew!).

In other words:

def timer_decoration(func):
    def new_fn(n):
        start_time = time.time()
        func(n)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
    return new_fn

def sleep_for_n_seconds(n):
    time.sleep(n)

sleep_for_n_seconds = timer_decoration(sleep_for_n_seconds)
sleep_for_n_seconds(3)
#> It took 3.01 seconds

Study this code carefully. You've just wrote a decorator.

If you are confused it--as always with programming--helps if you type the code out yourself (no copy-and-pasting!) and play around with it.

Though it's not terrible unwieldy otherwise, python gives us a nice elegant way to tag a particular function with a decorator so that the function is automatically decorated (i.e. doesn't require us to replace the original function).

@timer_decoration
def sleep_for_n_seconds(n):
    time.sleep(n)

# now the reassignment of 'sleep_for_n_seconds' is unnecessary
sleep_for_n_seconds(3)
#> It took 3.00 seconds

But what happens if we try to decorate the original sleep_for_a_second function?

@timer_decoration
def sleep_for_a_second():
    time.sleep(1)

# sleep_for_a_second()
#> TypeError: new_fn() missing 1 required positional argument: 'n'

sleep_for_a_second is now expecting 1 argument :(. We can generalize our decorator to handle functions that take an arbitrary number of arguments with *args and **kargs...

def timer_decoration(func):
    def new_fn(*args, **kargs):
        start_time = time.time()
        func(*args, **kargs)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
    return new_fn

@timer_decoration
def sleep_for_n_seconds(n):
    time.sleep(n)

@timer_decoration
def sleep_for_a_second():
    time.sleep(1)

@timer_decoration
def sleep_for_k_seconds(k=1):
    time.sleep(k)

sleep_for_n_seconds(3)
#> It took 3.00 seconds
sleep_for_a_second()
#> It took 1.00 seconds
sleep_for_k_seconds(k=4)
#> It took 4.00 seconds

Ace!

Finally, let's rewrite our decorator to support returning the return value of the decorated function.

def timer_decoration(func):
    def new_fn(*args, **kargs):
        start_time = time.time()
        ret_val = func(*args, **kargs)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
        return ret_val
    return new_fn

@timer_decoration
def sleep_for_a_second_p():
    time.sleep(1)
    return True

print(sleep_for_a_second_p())
#> It took 1.01 seconds
#> True

Note that our decorator is now generalized enough to be used with any function... no matter what it's return type is... no matter what arguments it takes...

It doesn't matter what the function is, the function's behavior remains the same except now it is "decorated" with functionality that times it.

This is just one example of a decorator with obvious generalized utility. You can also use decorators to perform memoization, static-typing-like type enforcement of function signatures, automatically retry functions that failed, and simulate non-strict evaluation.

The problem

To review, I wrote a script that takes the URL of an artwork on artsy.net, downloads the image, and then names the file in accordance with a user-supplied format string that uses info about the artwork. With the help of the requests, lxml, and wget modules, a script to do this can be coded relatively quickly. The problem, though--which is common for scripts that talk to the web that don't do error-checking--is that the script is brittle. Without error-checking, any malformed URL, network interruption, invalid output path, or weird edge case like an artwork without a title, will result in an unsightly error message and lengthy stack trace. Besides being aesthetically objectionable, if anyone else is using your script, you will look like an incompetent software engineer. So you have to bite the bullet and error-check.

The problem with error-checking is

  • If all possible errors are checked for separately and individually and handled appropriately (this is good practice), it will result in code often many times longer than the original code. Only a small fraction of the code will be the actual interesting logic of the program--most of it will now be mindless conditionals.
  • It's difficult for someone without training (me) to anticipate every possible error.
  • It takes a lot of work and I'm lazy

So my usual M.O. is to wrap each component in a try/except block (with no specificity in the exception), print an error message, and terminate execution...

try:
    <brittle code>
except:
    sys.exit("<brittle code> broke")

Except I don't even do that. Instead of wrapping each component in a try/except with its own error message, I just wind up try/excepting main once. This cuts down typing and carpal tunnel is a real thing...

def main():
    <literally everything>

try:
    main()
except:
    sys.exit("whoopsie daisy")

Decorators to the rescue

Okkkaaaayyyyy... if we absolutely must do some modicum of error checking around each component (so the user has some kind of clue as to why it the script's usage failed) we can write a decorator to do this for us. The following is an excerpt of the script as of commit d6be4956543:

...
# the decorator
def cop_out(f):
    def inner(*args, **kargs):
        try:
            return f(*args, **kargs)
        except:
            sys.exit("\nThe function <{}> failed\n".format(f.__name__))
    return inner

@cop_out
def get_command_line_arguments():
    return sys.argv[1], sys.argv[2]

@cop_out
def download_webpage(url):
    r = requests.get(url)
    if r.status_code != 200:
        raise Exception
    return r

@cop_out
def parse_webpage(requests_object):
    return lxml.html.fromstring(requests_object.text)
....

Note how we use f.__name__ to get the name of the function that was decorated. This allows us to add the support for specialized (at the function level, at least) error messages for free!

Now, if the user calls the script with too few arguments, the program will print The function <get_command_line_arguments> failed. If you give it a real URL but not to an artsy.net artwork, it'll say The function <extract_artist_name_from_webpage> failed. If you give it a made-up URL, it'll say The function <download_webpage> failed, etc...

Sure, beyond the function level, you don't know why it failed, but anything is better than nothing and your users shouldn't be so bossy and entitled.

But one more thing... if you looked at the code, you'll notice that my function names are descriptive... maybe too long and descriptive. The use of prose-like descriptive function names (certainly by the standards of Haskell programmers) was no accident. Although it may seem like an uncharacteristically diligent and conscientious decision on my part, it was actually to facilitate further laziness. Consider the following tweak to the decorator:

def cop_out(f):
    def inner(*args, **kargs):
        try:
            return f(*args, **kargs)
        except:
            message = f.__name__.replace("_", " ")
            sys.exit("\nFailed to {}\n".format(message))
    return inner

Consider how this generates error messages that appear to individualized...

$ python artsy-dl.py "https://www.artsy.net/artwork/jean-michel-basquiat-untitled-33211237"
Failed to get command line arguments

$ python3.4 artsy-dl.py "https://www.artsy.net/artwork/jean-michel-BAHHSKIIIAAHT-untitled-33211237"
                        "%a/%t - %d"
Failed to download webpage

$ python2.7 artsy-dl.py "https://www.artsy.net/artist/jean-michel-basquiat"
                         "%a/%t - %d"
Failed to extract artist name from webpage

So there you have it! Decorators can be used for legitimate, elegant solutions but can also be employed--virtually for free--to give the illusion that you are a caring software engineer and meticulous with your error checking.

PS

If you're a potential employer or client, I'm just kidding--I'm very diligent about error checking. This piece is satire. I promise.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Assertive R programming in dplyr/magrittr pipelines

A lot of my job–and side projects, for that matter–involve running R scripts on updates of open government data. While I’m infinitely grateful to have access to any interesting open datasets in the first place, I can’t ignore that dealing with open data is often a messy affair. In fact this seems to be characteristic of most data sets I work with, open access or otherwise.

So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses that return components that are components of still other analyses.

Then there’s an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happen:

  • In the best case scenario, everything works because there were no errors in the data.
  • In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it’s not clear what datum caused this error.
  • In the worst case scenario, nothing breaks and the error is only caught when the results–or part of them–are nonsensical.
  • In the worst worst case scenario, the results or some of the results are wrong but it looks ok and it goes undetected.

In an effort to help solve this common problem–and inspired by the elegance of dplyr/magrittr pipelines–I created a R package called assertr.

assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs, over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.

Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R’s Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.

mtcars %>%
  verify(am %in% c(0,1)) %>%
  group_by(cyl) %>%
  summarise(mean.mpg=mean(mpg))

#   am     mean.mpg
#   0      17.14737
#   1      24.39231

Let’s say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read obsessive/compulsive) tendencies than yours truly. If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check if the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon was a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:

mtcars %>%
  assert(in_set(4, 6, 8), cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%
  summarise(mean.mpg=mean(mpg))

#  cyl   mean.mpg
#   4     26.66364
#   6     19.74286
#   8     15.10000

Perhaps one day there will be cars that have more than 8 cylinders or less than 2. We might want to only check if there are an even number of cylinders (since it has to be even, I think); we can change the first assert line to:

assert(function(x) x%%2==0, cyl) %>%

assertr subscribes to the general idea that it is better to fail fast to spot data errors early. The benefit of assertr’s particular approach is that it’s friendly to the pipeline paradigm used by magrittr and dplyr.

The best thing about assertr’s approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.

To learn more about assertr and the kinds of assertions that you can make with it, visit its page on github.

You can also read the vignette here.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Why is my OS X Yosemite install taking so long?: an analysis

Why?
Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang, it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process and wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

Cleaning
The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl

use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with blank messages
    next if $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}

We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./clean-log.pl >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.

Analysis

Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s "parse_date_time()" function

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
  mutate(lub.time=parse_date_time(nice.date, 
                                  "%b %d! %Y! %H!:%M!:%S!", 
                                  tz="EST"))

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]

head(yos.log)


##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                    Message
## 1                   opendirectoryd (build 382.0) launched - installer mode
## 2                                  Logging level limit changed to 'notice'
## 3                                               Initialize trigger support
## 4 created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5                                set default handler for RPC 'reset_cache'
## 6                           set default handler for RPC 'reset_statistics'
##              nice.date            lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))
## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting each row's time by the start time

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log and what services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentange of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
## 
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and to the total time of the installation. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process' as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service",
                                       "percent.n",
                                       "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position = "dodge",) +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))

Breakdown

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%
  head(n=5)


##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                    System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                       kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                                   Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the newsystem directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc..). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")

density

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
  if(time > start && time < end)
    return(TRUE)
  return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in,
                         the.start,
                         the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function (x){cat("\n");cat(x);cat("\n")})

## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## WARNING : ensureParentPathExists: Created  `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log files writes by process. This might give us a sense of what steps go on as the installation progresses by showing us with processes are most active.

# reduce number of service to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd",
                        "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes( y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")

facet

This shows that no one process runs consistently throughout the entire installation process, but rather that the process run in spurts.

the answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed '/usr/local' and '/Library' directories that wasn't Apple-supplied needed to be moved and moved back.

/usr/local/
/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom compliled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink since it had over 123,491 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

/Library/
Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl uses. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and it will appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you either download and install yourself, or use MacPorts to install.

---

You can find all the code and logs that I used for this analysis in this git repository

This post is also available as a RMarkdown report here

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Damn the torpedoes, full speed ahead: making the switch to Python 3

Python 3 has been out since 2008 (and realistically usable since 2009).
In spite of this four year availability period, Python 3 use has yet to see widespread adoption, particularly among groups in the scientific community. In the company of data scientists/statisticians, when someone says they've written their own Python code to perform some task, it's usually assumed that they are talking about Python 2; it is Python 3 that requires the version number qualification.

There are members of the community (I used to include myself in this category) that are really happy with Python(2) and hope that if they ignore Python 3 it will just go away.

It won't though. The fact of the matter is that "Python 2.x is legacy, Python 3.x is the present and future of the language."

So how do we get people to adopt Python 3? In my opinion, there are three key strategies:

  • go softer on Python 3 denialists, perhaps with a Python 2.8 (Guido said this will not happen)
  • go harder on Python 3 denialists by discontinuing 2.7 maintenance
  • serve as an example to programmers (especially new ones) by switching your default python interpreter to python3.

As you, dear reader, can probably tell from my wording, I personally favor strategy 3.

Part of that solution involves vendors shipping Python 3 by default. We are making some progress in this regard (Arch GNU/Linux now has python sym-linked to python3, and Fedora and Ubuntu have stated that they will follow suit), but we still have a lot of work to do. A huge step forward would be if Apple ships macs with Python 3. Current macs use 2.7 which wasn't, finally, released until 2010. This means that they could have used Python 3 instead. That would have really shook things up because a lot of my friends and colleagues in my field just use the Apple-supplied Python interpreter for analytics (vis-a-vis SciPy Superpack). The extension of 2.7 support until 2020 will unfortunately afford Apple the opportunity to be lackadaisical in its porting to Python 3 because they might only do so when upstream maintenance ends (and maybe not even then).

The other part of strategy 3 involves personally serving as an example by using Python 3 as your default interpreter.

But I can't rationally will that more people do this if I am unwilling to do this myself. While it's true that my big open-source Python project meant for widespread public consumption was very carefully made Python 3 compatible, I noticed that the code on this blog is often Python 3 incompatible. This is primarily because the python code I quickly whip up is run through my default Python 2 interpreter. My obstinance to switch to Python 3 by default is helping to contribute to Python 3's slow adoption and implicitly serving notice that it's ok to still use Python 2.

But I no longer want to be party to this transition quagmire (and the ASCII-normative cultural hegemony). Because of this, I recently took the plunge and switched my default Python stack to Python 3. Damn the torpedoes, full speed ahead!

I was, perhaps, in a better position to do this than some because all of my most used third-party Python packages have already been ported; this includes the SciPy ecosystem (NumPy, SciPy, pandas, scikit-learn), IPython, lxml, networks, BeautifulSoup, and requests. It was really easy for me to ditch the Apple-supplied Python interpreter in favor for MacPorts' build of Python 3.4. I was even able to install most of my favorite third-party packages using MacPorts (the ones that weren't available I pip-installed as "user" to not muck up the MacPorts installation prefix). The only hard part about the switch was that almost all of my system python code stopped working; everything from my own system utilities to the battery-life indicator in my tmux panes.

While fixing all of this wayward code, I took notice of the incompatibilities that caused me the most trouble:

  • changes to exception handling
  • changing xrange() to range()
  • changing raw_input() to input()
  • changing my print statements into functions
  • explicitly requesting relative module imports
  • wrapping map() function calls in "list()" because map() now returns an iterator (I actually just re-wrote the code to use list comprehensions.)

But by far the incompatibility that caused the most heart-ache were I/O changes, and this is mostly due to the way Python 3 handles unicode.

There is far too much to say about Unicode in Python 3 to provide a detailed explanation in this post (if you're interested, I've put some great learning resources at the end of this post) but, essentially, Python no longer allows you to willy-nilly mix 8-bit string data and text objects. If you find yourself asking "Why am I having such trouble with my porting?" than the answer may be that you were playing fast-and-loose with mixing (what probably always should have been) incompatible types: unicode strings and 8-bit string data.

For this reason Python 2 was easier to deal with for programmers/scientists who dealt with largely numbers or ASCII-only text, but this won't cut it anymore. It's unreasonable and culturally chauvinistic to require that non-English speaking programmers misspell (or transliterate) their names and variables just to contribute to a non-internationalized codebase.

Before you learn anymore hacks to get unicode working in Python 2.x, consider switching to Python 3. Basically, the new I/O/string paradigm in Python 3 comes down to

  • understanding unicode and utf-8
  • decoding input as early as you can
  • encoding output as late as you can
  • only working with unicode strings within the program (no bytes!)



If you're reading this blog, you're probably a data scientist/statistician (or my parents) and you can almost assuredly make the switch to Python 3 since virtually all of our most prized packages have been ported (even nltk has a branch that works with Python 3). Just to make sure, you can use this tool that tracks the Python 3-readiness of some popular packages.

Final notes:
This post was in no way meant to shame programmers/scientists who still choose to use Python 2.x. I completely understand and sympathize with those that have very large Python 2 codebases to maintain or are locked into a particular Python version because of their company policy or their clients' needs.

If you are interested using Python 3, though, but want to approach it cautiously, consider using the 2to3 conversion tool to see how the code needs to change. Another great strategy is to use the various __future__ imports to ease the transition. Something like:

from __future__ import division, absolute_import, print_function, unicode_literals

At the least, you call python using the "-3" flag to see possible problems and incompatibilities. You can do this by changing your shebang line to something like

#!/usr/bin/env python -3


or create the following shell alias

alias python="python -3"

Resources for learning Python 3 I/O and unicode:

Other notes from Python 3 apologists:

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Compiling R from source and why you shouldn't do it

I’ve always thought that it’s silly, in most cases, source compiling software that’s already available in binary form. To the end of making more binary packages available to Mac users, I just started contributing to a project that is creating a repository of 64 bit builds of pkgsrc’s (NetBSD's portable package manager) over 12,000 packages. This means having to get my hands dirty compiling packages myself. After contributing Vim, the next logical thing for me is to provide a R build.

Compiling R from source (again and again) has been tremendously enlightening for me. Not only do I feel like I understand a lot more about R’s internals, but I’ve also come to the conclusion that if the CRAN provides a binary build for your system, you should never really compile R yourself. This, most definitely, includes Mac users.

Before I go into how to build it, let’s explore some of the reasons someone might want to build R themselves and why, in most cases, this is unnecessary.

  • I want a faster R.
  • It’s sometimes assumed that if you build something from source yourself, it’s customized to your particular system and, therefore, runs faster. In practice this requires a lot of intervention (and heartache) at the configuration step of the compilation process. In the case of R on OS X, no amount of compiler optimization and configuration (using the stock linear algebra libraries) I’ve attempted was able to outperform R from CRAN. You don’t know R better than the R Core Team, and they know what’s good for you. Just use theirs.

  • I can compile against other linear algebra libraries and get a speedup that way.
  • You don’t need to compile R against these other libraries in order to use them. I’ll go into how you can use them from your current R installation in another post.

  • I’m on a system for which there is no binary available.
  • Yikes! You’re probably used to heartache. You don’t have a choice than to build R yourself. Have a ball!

  • I just want to.
  • As I’ve discovered, it is a great way to learn more about R’s internals. If you fancy yourself an R ‘guru’ and want to build R yourself, I can’t really blame you—so long as you don’t use your likely botched build in a production environment.

  • I’m a gentoo user.
  • I’m so sorry.

  • I’m a Windows user and a masochist.
  • Compiling R is an excellent choice. The safe word is “GNU”.

  • I’m helping to build a repo of 64 bit binaries for pkgsrc or I’m writing a blog post about compiling R.
  • You’re exempt from criticism or ridicule.

If at this point, you’re still interested in compiling R, in spite of my attesting to it being, for most cases, completely unnecessary, please read on. I also strongly recommend that you read the following guide from CRAN.

Dependencies
Users of most GNU/Linux systems can build the dependencies necessary by running:

sudo apt-get build-dep r-base-dev

or the equivalent command for your system.

On OS X you need

  • Xcode and Xcode command-line tools: Xcode is available from the App Store. The command-line tools have to be downloaded separately from the ‘Preferences’ menu.
  • gfortran: or another compliant Fortran compiler. You need this to chiefly compile the linear algebra libraries.
  • Java: You can grab the Java for OS X developers package from the Apple Developers page or grab another JDK. You need this for the JNI headers.
  • XQuartz: This includes the X11 headers and cairo.
  • MacTex: This isn’t strictly necessary but you will need it to generate R’s PDF documentation. If you don’t want to download this over 2 GB package, there are other recourses available. If you want this package, you have to add "/usr/texbin" to your PATH environment variable. Yay, now you have LaTeX!

Other dependencies are unnecessary because the R source ships with fallback versions of them. These include pcre, zlib, xdr, and a few others. Still other dependencies will be present on any POSIX-compliant system.

Configuration and build
After downloading the source here , you have a few decisions to make. The first is where you want to install R. You don’t have to install R anywhere per se because it can be run straight from the build directory, you can just place the R script (which contains the prefix hardcoded) in the bin subdirectory anywhere on your PATH. If you do not specify the prefix, it will default to the build directory.

It’s customary to set your prefix for user compiled software to /usr/local, so that’s what we’ll do here.

The other decisions that have to be made are very platform/system specific. You can see all the configuration options by running

./configure --help

The auto-configuration is very good at setting sane defaults for most of these options. For example, if you’re building on OS X, it will by default build R as a framework and shared library, which you would need if you want to use R.app. This is a separate install.

On OS X, I ran my pre-configuration and configuration thusly:

export CC="clang"
export CXX="clang"
export F77="gfortran-4.2 -arch x86_64"
export FC=$F77
export OBJC="clang"
./configure -prefix=/usr/local

Assuming everything goes well, you can now start building with

make

If it successfully builds, you can install R to the prefix with

make install

Now you have R.

If your on a Mac, you may have noticed that you have a crippled R install. This is for a few reasons.

  • The binary from CRAN comes with R.app. If you want that, you have to build that yourself.
  • You can no longer download binary builds of your favorite R packages. It has to build them from source now.

As an R user on a Mac, you then realize how good you’ve had it. The binary build from CRAN comes with R.app, a fast R framework, and it installs binary R packages by default. Now you no longer have those options.

Additionally, dear Mac-user, you also have the benefit of using RStudio’s new Cocoa interface. Count your lucky stars, install CRAN’s binary build, and read my next post about how to switch out the linear algebra libraries that R uses for a few other faster alternatives.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

The state of package management on Mac OS X

It's that time again; I suspect that Mavericks will be released in the next few weeks, so I get the once-every-year-(or-so) chance to experiment and modify the hell out of my OS X installation because I'll just do a fresh install soon anyway. This time around I'm experimenting with package managers.

I've actually tried really hard to avoid ever having to use them. I started using Slackware in high school, and after some brief experimentation (in college) with Ubuntu, I took up OS X as my main OS. But, since building from source code is somewhat of a nightmare on a Mac--at least compared to what I was used to--I started to look into package management solutions.

The terrain was difficult to navigate. It seemed like people had some really strong opinions on which one was the best and which ones were on their way out. Since I didn't know who to believe, I just stuck to manual building. But, since I'm going to get a tabula rasa in a few weeks, I thought I'd take this opportunity to document this terrain exploring and present my finding in the most impartial manner that I'm capable of.

Before I start, I want to make a few things clear. (1) There is some disagreement on what actually constitutes a package manager. Here, I'm referring broadly to any centralized software installation framework that tracks or resolves dependencies, whether it builds from source or not. (2) I haven't had the time to become an expert on all of the managers I audited, so keep that in mind. (3) Not only are all of these package managers open source, but many of them have robust configuration options, so I'll be talking mostly about default behavior from the perspective of a new user.

If my old editions of O'Reilly books discussing Mac software are any indication, MacPorts and Fink were the two best options available. Then Homebrew came on the scene and a lot of people seem to be raving about it. I started off with the intention of only trying out these three but in the course of my research, I learned about two others that I wanted to give a chance.

To see a table summary of my findings, you can just scroll down to the end of this post.

Rudix
Rudix is a binary-only package manager that attempts a "hassle-free" way of getting Unix programs on a Mac. It doesn't have many packages available yet, but it has no trouble at all installing and uninstalling the ones that it does offer. For example, their 'Go' installation was the most painless installation of a language that I've ever experienced. My complaints are that (a) the binaries go directly to /usr/bin, so they are not sandboxed, and (b) the man files for these tools were not installed with the binaries.

MacPorts
MacPorts was one of the most recommended package management solutions that I came across in my research. It also probably attracted the most flak. It was built with the likeness of FreeBSD's Ports system, so it's a source building manager. What I liked about MacPorts was the fact that the installation was painless (it updated my PATH for me!), the compiled binaries were sandboxed in /opt/local, and the wealth of packages available was hard not to love.

An interesting thing about MacPorts is that it eschews Apple-supplied libraries and links sources against its own. A benefit of this is that it can ensure a consistent experience across OS X versions and whatever whimsical decisions Apple may choose to make in the future. The drawback to this approach is that building what appears, prima facie, to be a small package may require an extraordinarily large amount of huge programs and libraries to be built as dependencies.

Fink
Fink is modeled after Debian's dpkg and apt-get. Having used Debian-based distros in the past, I was excited to see what Fink had to offer. Like apt-get, Fink can install binaries or build from source. What wasn't like apt-get was that a completely different command was used to build from source ("fink") than to install the binaries. This was somewhat confusing. Furthermore, there is no binary installer for 10.6 to 10.8, so installation was a bit harrowing. Once it was installed, though, and I got used to the separate commands and its differences to "apt-get", I was pleased that my PATH was automatically updated and that the installed binaries were appropriately sandboxed.

Homebrew
Like I mentioned above, a lot of people are really excited about Homebrew. It is being developed with the intention to correct (what it perceived to be) MacPorts' shortcomings. From what I can tell, it tries really hard to work with OS X's existing framework/libraries. For this reason, Homebrew is probably a good choice for someone who is using it to install the occasional tool on a single user system.

A neat thing about Homebrew is that it is written very simply in ruby. Its "recipes" to install packages are easy-to-read ruby scripts. They are also very easy to modify and the community encourages upstream development.

Something not-so-neat about Homebrew is that it is publicly antagonistic towards MacPorts. This is probably something that only I care about, though.

pkgsrc/pkgin
Again, I started with the intention of only auditing Fink, Homebrew and MacPorts. When I learned about pkgsrc, I thought that it was too obscure to be a serious contender and I was considering not looking into it further. I am so glad that, for completeness' sake, I decided to try it out because I virtually have only good things to say about it.

pkgsrc started as NetBSD's package management solution. Given NetBSD's dedication to portability, it is perhaps not a surprise that their package manager would attempt to follow suit. It has now been adapted for use on over a dozen different operating systems. Among these are AIX, Solaris, HP-UX, GNU/Linux, Windows (via Cygwin and Interix) and, of course, OS X. It is the default manager on DragonflyBSD and was even the default manager on a now-discontinued GNU/Linux distro, Bluewall Linux. It is similar to (and, indeed, was forked from) FreeBSD's ports system.

I don't think many Mac power-users know that this is an option for them which is a shame because it turned out to be my favorite. After following some fairly simple steps, a mature and sophisticated package manager with over 8,000 packages is at your disposal.

Probably the best thing about pkgsrc from the perspective of Mac users is a tool called pkgin. It's an apt-like tool for installing binaries from pkgsrc. Installing strange Unix tools on OS X *can not* be easier.

The only caveat I should mention is that I haven't tested installing Python with it because I'm still too far away from Mavericks to risk botching my environment that badly. I suspect that it would cause issues because pkgsrc, being a NetBSD project, can't be as aware of OS X framework idiosyncracies as a Mac-specific package manager can.

I'd like to write more on this topic, but this post is getting unwieldy. I plan to talk more about pkgsrc and OS X in another post but, for this one, I'll conclude with the "too-long-didn't-read" version of my journey through package-manager-land.

categoryRudixMacPortsFinkHomebrewpkgsrc / pkgin
Homepagerudix.org
MacPorts.orgfink.thetis.ig42.orgbrew.shpkgsrc.org and pkgin.net
Twitter@rudix4mac (updates often)@macports (last tweet in July)@finkmac (hasn't had update since 2010)@machomebrew (very active)@pkgsrc (last tweet in September)
Year project started2005200220012009Support for Darwin added in 2001
Number of packages488 (but `rudix available | wc -l` says 351)17,680 (but `port list | wc -l` says 17,686)7,951. `apt-cache search . | wc -l` says 209 stable binary .deps)2,498. `brew search | wc -l` says 2,591. This is not counting various extra "taps"8,884 binaries for OS X (according to `pkgin available | wc -l`)
Source/binary/both?Binary onlyTraditionally only sourceOption for bothSource, but also binaries through "bottles"Both. Traditional pkgsrc will do both but using only pkgin will grab the binaries
Language written inPythonTclPerl (front-end)RubyC
LicenseBSDBSDGPL :(BSDBSD
Gui optionsNot really... but there's an internet package browsing optionCurrently threeTwo: fink commander, and phynchronicityNope, but online package browser at Braumeister.orgOnline package browser at pkgsrc.se but none others that I can find
Default prefixDirectly to /usr/local/opt/local/sw/usr/local/Cellar. Programs symlink to /usr/local/bin/usr/pkg
Power-PC supportNot anymoreYes because it is built from sourceYesNot traditionally, but there are forks available that might provide this functionalityNot unless you build from source
Lastest GCC availableNot available4.8.14.84.9No binary available but pkgsrc has 4.8
Python stuffNot availablePy27 and 33 and a lot of great packagesPy23 and 33 and a lot of great packagesPy27 and 33. I couldn't find any packages but the python installs pip and easy_installPy27 and 33 and a lot of great packages. (see warning above)
Installation of package managerVery easy and fastVery easy and fastNightmarish (no binary installer for 10.6 - 10.8)Easy as pieVery easy and fast with these instructions
Uninstallation of package managerEasy and painlessHell-ishVery easy and fastRelatively easy if you follow this gist: https://gist.github.com/mxcl/1173223Not sure, probably just a rm -rf-ing the /usr/pkg and /usr/pkgsrc directories
Installation of packagesExtremely easySlow, since it builds from sourceThe source builds are understandably slow, but the binaries are (obviously) quickSource compilation is obviously slow. I've had some linking issues sometimes.Trivially easy
Uninstallation of packagesEasy and painlessEasyEasy and fastVery easyTrivially easy
Community supportNot very much is requiredGreatNot so greatVery very goodA few websites have some great documentation but some other information it is hard to find OS X-specific info.
DevelopmentGit. Primarily lead by one person. 5 contributors.Subversion. Very happening. Many many developers.Git. 14 GitHub contributors. Commits are infrequentGit. Most vibrant. Over 3,000 contributors. "Recipes" for compilation are easily modified and you are encouraged to submit pull requests. This project is very easy to contribute to.Pkgsrc is CVS. Pkgin is Git. pkgsrc is well backed by the NetBSD Foundation
share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

On the treachery of point-and-click "black-box" data analysis

*in this post, by "black-box" I'm referring to software whose methods are undisclosed and un-audible rather than black-box math models

EZ Analysis

There’s a certain expectation that the analyses that inform not only business and stock trading, but public health and social welfare decisions, are carefully thought out and performed with painstaking attention to detail. However, the growing reliance of point-and-click 'black-box' data analysis solutions, while helpful for getting people into the field, are jeopardizing the reliability of these verdicts.

Lest readers think this is an exercise in idiotic reminiscence and fetishizing the past, I'm not romanticizing pen-and-paper analysis and F-table lookups; I have however personally witnessed decisions about to be made with analysis results that are dubious at best.

My issue with these tools comes down to the fact that they provide the ability to choose from a smörgåsbord of statistical tests with no mention of the assumptions they make or the conditions that have to be met in order for them to work properly. At worst, it gives the unscrupulous user the ability to perform test after test until they see the results that they want.

I was working on a project where we were encouraged to use an unnamed enterprise data miner trying to predict ****** (binary classification) from a series of *****. Under the "regression" tab, there were a lot of different tests to choose from. In the absence of a "logistic regression" button, I figured the "GLM" might automagically detect what I was trying to do. When, I finally got the results from the GLM, I decided to check it against R. The results were completely incompatible. I know what R was doing; I can type the name of any function (without the parentheses) and get the source code. But nowhere in this tool’s documentation of this function could I find a comprehensive list of the steps it took. For example, I couldn't find:

  • how the miner aggregated the data
  • whether it correctly deduced that I wanted to use logistic regression
  • whether it checked for colinearity and homogeneity of variance, and, if it did, whether it made any decisions for me that it didn’t tell me about
  • whether it included the non-numeric data points in the model, and if it did, what coding scheme it used
  • how it could possibly know if I didn't want to use a probit-model (if my closest option was only GLM)
  • what (if any) dimensionality techniques it used

 

The aggregation in R yielded over 200 dimensions. Since I was trying to replicate (reverse-engineer is such a harsh term) this tool’s steps, I decided to try (a) not reducing the dimensions, and (b) trying every dimensionality reduction technique under the sun and see if I got closer to the miner’s results. I made it through PCA and LDA until I gave up. I decided, then, to lobby for use of the tool whose algorithms are audit-able and well-defined. I’ll also vouch for the tool where tests and steps have to be explicitly performed over a point-and-click solution any day.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail