Visualizing data analysis pipelines using NetworkX

In complicated data analysis pipelines and scientific workflows, it's often difficult to keep track of which tasks have to be performed before others. Even with informal forms of documentation (my personal favorite is 'notes.txt'), as the size of a project grows, and more dependencies are introduced, a formal documentation process has to be put in place, or else the project will become unsustainable.

I'm writing an automated system for statisticians and scientists for carrying out large multistep analytics processes. I'll discuss this more in later posts, but the details pertinent for this post are that each step of an analytics pipeline is detailed in a YAML document called a "Sakefile" (not-so-clever play on Makefile) with sections explicitly defining dependencies and resulting output files.

Since dependency resolution borrows concepts from graph theory (topological sorting, for one), I thought it would be easy and neat to write a tool to visualize the components and dependencies that go into an analytics workflow as a directed graph.
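
To make the topological-sorting connection concrete, here is a minimal sketch that orders four targets so that every step comes after the steps it depends on. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) rather than anything from the tool described in this post, and the hard-coded `depends_on` mapping is an assumption standing in for whatever would be derived from a real Sakefile.

```python
from graphlib import TopologicalSorter

# Hypothetical target -> prerequisite-targets mapping (illustrative only).
depends_on = {
    "format dui stats": set(),
    "fetch teen stats": set(),
    "convert teen stats to csv": {"fetch teen stats"},
    "find correlates": {"format dui stats", "convert teen stats to csv"},
}

# static_order() yields each target only after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())

for step, target in enumerate(order, start=1):
    print(step, target)
```

Several orderings are valid here, since the two source steps are independent of each other. If the mapping contained a circular dependency, `static_order()` would raise `graphlib.CycleError` instead of returning an order.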

I've rustled up a simple example examining correlates of DUI arrests with various adolescent-related data by state. I chose these data sets because they’re very small and freely available on the net.

The "Sakefile" looks like this:

---
format dui stats:
    help: format raw (copy and pasted) dui/state data using perl
    dependencies:
        - rawdata.txt
    formula: >
        perl -pe 's/^(\D+)\s+([\d,]+)\s+([\d,]+)\s*/\1\t\2\t\3\n/'
        rawdata.txt | sed 's/,//g' > duistats.tsv;
    output:
        - duistats.tsv

fetch teen stats:
    help: fetches various teen statistics from the web
    # no dependencies
    formula: >
        curl -o teenstats.xls http://mathforum.org/workshops/sum96/data.collections/datalibrary/US_TeenStats.XL.zip.xls;
    output:
        - teenstats.xls

convert teen stats to csv:
    help: uses gnumeric's ssconvert to convert ugly xls to csv and cleans it
    dependencies:
        - teenstats.xls
    formula: >
        ssconvert teenstats.xls messyteenstats.csv;
        cat <(echo -n "state") <(< messyteenstats.csv sed '55,$d' |
        sed '1,2d') | sed 's/,,/,/g' > teenstats.csv;
        rm messyteenstats.csv;
    output:
        - teenstats.csv

find correlates:
    help: calls R script that finds correlates of DUI arrests in various teen statistics
    dependencies:
        - duistats.tsv
        - teenstats.csv
    formula: >
        ./dui-correlates.R
    output:
        - corrogram.png
        - table.csv

all:
    - format dui stats
    - fetch teen stats
    - convert teen stats to csv
    - find correlates
...

A short description of each step appears in the "help" field of each entry. Basically, there are two source data files: one already exists as raw text copied and pasted from a website, and the other is fetched from the web using curl. The former is cleaned and formatted using perl and sed; the latter has to go through a process that converts the downloaded Excel file into a CSV and strips useless lines. Both of the resulting files then get read by an R script which, ultimately, outputs a corrogram graphic and a summary table.
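
The split between raw inputs, intermediate files, and final products can be read straight off the dependency and output lists. The sketch below is one way to do that with plain sets. The `targets` dict is copied by hand from the "dependencies" and "output" fields rather than parsed from disk, and the variable names are made up for illustration.

```python
# Hand-copied "dependencies" and "output" fields from the Sakefile
# (assumption: a real tool would get these by parsing the YAML).
targets = {
    "format dui stats": {"dependencies": ["rawdata.txt"],
                         "output": ["duistats.tsv"]},
    "fetch teen stats": {"output": ["teenstats.xls"]},
    "convert teen stats to csv": {"dependencies": ["teenstats.xls"],
                                  "output": ["teenstats.csv"]},
    "find correlates": {"dependencies": ["duistats.tsv", "teenstats.csv"],
                        "output": ["corrogram.png", "table.csv"]},
}

consumed = set()
produced = set()
for spec in targets.values():
    consumed.update(spec.get("dependencies", []))
    produced.update(spec.get("output", []))

raw_inputs = sorted(consumed - produced)      # must already exist on disk
final_products = sorted(produced - consumed)  # nothing downstream reads them
intermediates = sorted(consumed & produced)   # made by one step, read by another

print("raw inputs:    ", raw_inputs)
print("intermediates: ", intermediates)
print("final products:", final_products)
```

Only rawdata.txt has to be present before anything runs; every other file is produced by some step in the pipeline.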

Below is the small Python program that parses the "Sakefile" and creates the visualization. It uses the great NetworkX module to create the graph and render it as an image.

#!/usr/bin/env python3

import matplotlib.pyplot as plt
import networkx as nx
import yaml

# parse the Sakefile into a dict keyed by target name
with open("Sakefile.yaml") as fh:
    sakefile = yaml.safe_load(fh)

G = nx.DiGraph()

def check_for_dep_in_outputs(dep):
    """Return the names of all targets that list `dep` as an output."""
    print("checking dep {}".format(dep))
    ret_list = []
    for name, attrs in G.nodes(data=True):
        if "output" not in attrs:
            continue
        if dep in attrs["output"]:
            ret_list.append(name)
    return ret_list

# make graph nodes for each target
for target in sakefile:
    if target == "all":
        # we don't want this node
        continue
    # store the target's fields (help, formula, ...) as node attributes
    G.add_node(target, **sakefile[target])


# add an edge from each target that produces a file
# to every target that depends on that file
for name, attrs in list(G.nodes(data=True)):
    print("checking node {} for dependencies".format(name))
    if "dependencies" not in attrs:
        continue
    print("it has dependencies")
    for dep in attrs["dependencies"]:
        for match in check_for_dep_in_outputs(dep):
            G.add_edge(match, name)


# with_labels=True so the target names are drawn on recent NetworkX versions
nx.draw(G, node_color="pink", node_size=10000, with_labels=True)
plt.savefig("dependency-visualization.png")

The resulting visualization looks like this:

[Image: dependency-visualization]

Sure, the arrows look weird and this is a really simple example, but it's easy to see that, even for the most byzantine of pipelines, a visualization like this can really help get a sense of all the actions involved in a workflow.
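
For larger pipelines, one option for a tidier picture is to place each target in a column according to how many steps precede it. The following is a rough sketch of that idea in plain Python, not something the script above does. The `edges` list is hand-written to mirror the example graph, and `layered_positions` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written (producer, consumer) edges mirroring the example graph.
edges = [
    ("fetch teen stats", "convert teen stats to csv"),
    ("convert teen stats to csv", "find correlates"),
    ("format dui stats", "find correlates"),
]

def layered_positions(edges):
    """Map each node to (x, y): x is its longest distance from a source node."""
    parents = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        parents[dst].add(src)
        nodes.update((src, dst))

    depth = {}

    def depth_of(node):
        # Longest path from any node without parents (assumes no cycles).
        if node not in depth:
            depth[node] = 1 + max((depth_of(p) for p in parents[node]), default=-1)
        return depth[node]

    columns = defaultdict(list)
    for node in sorted(nodes):
        columns[depth_of(node)].append(node)

    # Stack the nodes of each column downward along the y axis.
    return {node: (x, -row)
            for x, column in columns.items()
            for row, node in enumerate(column)}

pos = layered_positions(edges)
for node, xy in sorted(pos.items(), key=lambda item: item[1]):
    print(xy, node)
```

The resulting dict has the shape NetworkX accepts for the `pos` argument, so it could be passed along as `nx.draw(G, pos=pos, ...)`, provided its keys match the graph's node names.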

I'll go over the actual running and results of this example in a later post, when I get the "sake" system working properly. :)
