On the treachery of point-and-click "black-box" data analysis

*In this post, by "black-box" I'm referring to software whose methods are undisclosed and unauditable, rather than to black-box mathematical models.

EZ Analysis

There’s a certain expectation that the analyses that inform not only business and stock trading but also public health and social welfare decisions are carefully thought out and performed with painstaking attention to detail. However, the growing reliance on point-and-click 'black-box' data analysis solutions, while helpful for getting people into the field, is jeopardizing the reliability of those conclusions.

Lest readers think this is an exercise in idiotic reminiscence and fetishizing the past: I'm not romanticizing pen-and-paper analysis and F-table lookups. I have, however, personally witnessed decisions about to be made on the basis of analysis results that were dubious at best.

My issue with these tools comes down to the fact that they offer a smörgåsbord of statistical tests to choose from, with no mention of the assumptions those tests make or the conditions that have to be met for them to work properly. At worst, they give the unscrupulous user the ability to run test after test until they see the results they want.

I was working on a project where we were encouraged to use an unnamed enterprise data miner to predict ****** (binary classification) from a series of *****. Under the "regression" tab there were a lot of different tests to choose from. In the absence of a "logistic regression" button, I figured the "GLM" option might automagically detect what I was trying to do. When I finally got the results from the GLM, I decided to check them against R. The two sets of results were completely incompatible. I know what R was doing; I can type the name of any function (without the parentheses) and get its source code. But nowhere in this tool’s documentation for this function could I find a comprehensive list of the steps it took. For example, I couldn't find:

  • how the miner aggregated the data
  • whether it correctly deduced that I wanted to use logistic regression
  • whether it checked for collinearity and homogeneity of variance, and, if it did, whether it made any decisions for me that it didn’t tell me about
  • whether it included the non-numeric data points in the model, and if it did, what coding scheme it used
  • how it could possibly know whether I wanted a probit model instead (if my closest option was only "GLM")
  • what (if any) dimensionality reduction techniques it used
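
For contrast, here's roughly what the explicit route looks like in R. This is a minimal sketch of my own; the data frame and column names (df, outcome, some_factor) are placeholders, not anything from the actual project:

    # Fit the logistic regression explicitly: the link function, the data,
    # and the model formula are all visible and reproducible.
    fit <- glm(outcome ~ ., data = df, family = binomial(link = "logit"))
    summary(fit)               # coefficients, standard errors, deviance

    # The coding scheme for non-numeric predictors isn't hidden either;
    # the contrast matrix for a factor column can be inspected (or changed).
    contrasts(df$some_factor)

    # And the implementation itself is readable: typing a function's name
    # without parentheses prints its source.
    glm
    stats::glm.fit

With the explicit route, those questions don't come up: every choice is spelled out in the call or readable in the source.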


The aggregation in R yielded over 200 dimensions. Since I was trying to replicate (reverse-engineer is such a harsh term) this tool’s steps, I decided to try (a) not reducing the dimensions at all, and (b) every dimensionality reduction technique under the sun, to see if I got closer to the miner’s results. I made it through PCA and LDA before I gave up. I decided, then, to lobby for the tool whose algorithms are auditable and well-defined. I’ll also vouch for a tool where tests and steps have to be performed explicitly over a point-and-click solution any day.
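
For concreteness, that replication attempt looked roughly like the following. Again, a sketch only: the matrix X, the outcome y, and the 20-component cutoff are hypothetical stand-ins, not anything the miner disclosed.

    # (a) No reduction: fit on all ~200 aggregated columns.
    fit_full <- glm(y ~ ., data = data.frame(y = y, X), family = binomial)

    # (b) PCA first, then fit on the leading components.
    pc      <- prcomp(X, center = TRUE, scale. = TRUE)
    scores  <- pc$x[, 1:20]    # arbitrary cutoff, purely for illustration
    fit_pca <- glm(y ~ ., data = data.frame(y = y, scores), family = binomial)

    # LDA (via MASS::lda) was the next attempt, and that's where I stopped.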
