Sending text messages at random times using python

Python and SMS

Given my interest for applying statistics and analytics to most (if not all of the) quantifiable aspects of my life, when I learned about self-tracking, and the associated 'Quantified Self' movement, it should come as no surprise to anyone that knows me that I wanted to get started right away.
And...
Given my interest in making life harder than it needs to be, it makes sense that I would eschew existing self-tracking tools and build my own. A neat side-effect of this obstinance is getting to learn new things.

The basic idea is at random times during the day, I fill out a survey that I designed for myself including questions such as: "How happy are you right now?", "How much energy would you say that you have right now?", and "Where are you right now?".

The most reliable and fastest way to get in touch with me is to send a text message. So, sending myself text messages at random times during the day is the best way to prompt me to fill out this self-tracking survey.

To make it easier (and, therefore, more likely that I'll fill it out) the content of the text message should be a link to the survey on the web. And in order to add flexibility to when I have to fill out the survey form but also preserve the randomness of the sampling, the timestamp of the time the text message was sent should be included as a url parameter so that it can be stored in the database along with the answers to the survey.

The service that sends these text messages runs on a Debian GNU/Linux EC2 instance that also hosts the form I fill out and the database that the answers are dumped to.

Before we get to the code, I should explain the modules that we will need for this task, and my rationale for choosing them.

logging
Trying to debug a scheduled task or workflow is a living hell without proper and verbose logging. Since this must be run in the background (and not tied to a particular terminal emulator) simple print statements will not do. The more elegant, scalable, and extensible solution is to use Python's excellent 'logging' module.

smtplib
While there are a few different ways to send text messages (SMS) using Python, the solution I settled on is to use the 'smtplib' standard library module to send an email to an SMS gateway. This gateway will convert the email into a text message sent to my phone. smtplib is needed to send the email message.

apscheduler
Although cron (or equivalently [?] Windows Scheduling Service) should be the tool of choice when scheduling commands to be run at specific times that never change, the fact that the text messages have to be sent at different times everyday requires another solution. Probably the most elegant and cross-platform solution is to use the advanced python scheduling library, apscheduler. The Python standard library comes with a similar module, sched, but apscheduler is more advanced with its scheduling capability and its ability to persistently store tasks in a database that survives process restart. (It supports storage in SQLite, PostgreSQL, MongoDB, Redis, MySQL, Oracle, MS-SQL, Firebird, and Sybase). But, unlike its standard library counterpart, it needs to be pip installed.

We will divide this task up into two python scripts, one that gets run once a day, computes n random times, schedules to send a text message those times, and then sends the message (we will call this send_daily_texts.py), and one script that runs once that calls send_daily_texts at midnight everyday (we will call this run_everyday.py).

send_daily_texts.py

#!/usr/bin/python -tt

import random
import sys
import logging
import smtplib
import email.utils
from email.mime.text import MIMEText
from datetime import datetime, timedelta, date
from apscheduler.schedulers.blocking import BlockingScheduler

# create logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('send_daily_texts.log')
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.info("[{}] - send_daily_texts was run".format(datetime.now()))

# the number of times to schedule and send text messages
# are provided as a command line argument
n = int(sys.argv[1])

logger.info("[{}] - going to choose {} random times".format(datetime.now(), n))

# we need to parse today's state to properly
# schedule the text message sending
dadate = datetime.now()
year = dadate.year
month = dadate.month
day = dadate.day

# the lower bound is 8 o' clock
lower_bound = datetime(year, month, day, 8, 0, 0)
logger.info("[{}] - the lower bound is {}".format(datetime.now(), lower_bound))

# the upper bound is 11 o' clock PM
upper_bound = datetime(year, month, day, 23, 0, 0)
logger.info("[{}] - the upper bound is {}".format(datetime.now(), upper_bound))

sched = BlockingScheduler()
logger.info("[{}] - Created blocking scheduler".format(datetime.now()))

wherefrom = 'YOUEMAILACCOUNTYOCREATE AT gmail DOT com'
whereto = 'YOURPHONENUMBER AT YOURSMSGATEWAY DOT com'
gmail_pw = 'YOURGMAILPASSWORD'

def encode_timestamp(timestamp):
    return str(timestamp).replace(" ", "+").replace(":", "%3A")

def make_message(timestamp, wherefrom, whereto):
    slug = "http://THELINKURL/?timestamp={}".format(encode_timestamp(timestamp))
    msg = MIMEText(slug)
    msg['To'] = email.utils.formataddr(('Recipient', whereto))
    msg['From'] = email.utils.formataddr(('Author', wherefrom))
    msg['Subject'] = 'Time for the survey!'
    return msg

def send_text(should_exit=False):
    logger.info('[{}] - trigger triggered, going to send text'.format(datetime.now()))
    logger.info('[{}] - attempting to connect to gmail'.format(datetime.now()))
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login(wherefrom, gmail_pw)
    logger.info('[{}] - successfully connected to gmail'.format(datetime.now()))
    timestamp = datetime.now()
    msg = make_message(timestamp, wherefrom, whereto)
    logger.info('[{}] - going to send message {} to {}'.format(datetime.now(),
                                                               damsg.replace('\n', '<br>'),
                                                               whereto))
    ret = server.sendmail(wherefrom, [whereto], damsg)
    server.quit()
    if should_exit:
        logger.info('[{}] - finished... going to exit'.format(datetime.now()))
        sched.shutdown(wait=False)

def random_time(start, end):
    sec_diff = int((end-start).total_seconds())
    secs_to_add = random.randint(0, sec_diff)
    return start + timedelta(seconds=secs_to_add)

def get_n_random_times(n, start, end):
    times = []
    for i in range(0, n):
        times.append(random_time(start, end))
    times.sort()
    return times

times = get_n_random_times(n, lower_bound, upper_bound)
logger.info("[{}] - Received {} times to schedule".format(datetime.now(),
                                                         len(times)))

for ind, atime in enumerate(times):
    if ind == (n-1):
        sched.add_job(send_text, 'date', run_date=atime,
                      args={"should_exit": True})
        logger.info("[{}] - added last task at {}".format(datetime.now(),
                                                         atime))
    else:
        sched.add_job(send_text, 'date', run_date=atime)
        logger.info("[{}] - added task at {}".format(datetime.now(),
                                                     atime))

sched.start()
logger.info("[{}] - everything is done".format(datetime.now()))

Before I describe "run_everyday.py" there are a few things I should note about the snippet above.

When I originally wrote this script, the text messages wouldn’t send even though the logger indicated that it had. I assumed this was because gmail rejected it because it didn't look enough like an email message. In order to correct this, I needed to use the email.mime.text module to add the standard email headers to the message to be send.

Since I am only interested in experience sampling my waking life, I didn't want to fill out the survey during hours that I am normally sleep. I had to make sure the I set 8 o' clock and 23 (11pm) o' clock as my lower and upper bound, respectively.

Third, if you decide to cannibalize this code, make sure you change the values for 'wherefrom', 'whereto', and 'gmail_pw'. The format the SMS gateway you should use depends upon your mobile carrier. My particular SMS gateway is my 10 digit phone number @vtext.com. Your’s will likely be different–consult this list.

run_everyday.py

#!/usr/bin/python -tt

import sys
import logging
from datetime import datetime
from subprocess import Popen, PIPE
from apscheduler.schedulers.blocking import BlockingScheduler

def run_daily_surveys(thelogger):
    thelogger.info("[{}] - Trigger triggered".format(datetime.now()))
    thelogger.info("[{}] - Going to run daily script".format(datetime.now()))
    p = Popen('./send_daily_texts.py 3', shell=True, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if p.returncode:
        thelogger.error("[{}] - Failed to run daily script".format(datetime.now()))
        sys.exit("Failed to run daily script")
    thelogger.info("[{}] - Ran daily script".format(datetime.now()))
    if p.returncode:
        sys.exit("Command failed to run")

def main():
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler('run_everyday.log')
    handler.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.info("[{}] - run_everyday.py was run".format(datetime.now()))
    
    sched = BlockingScheduler()
    logger.info("[{}] - blocking scheduler was created".format(datetime.now()))
    sched.add_job(run_daily_surveys, 'interval', days=1, args=[logger])
    logger.info("[{}] - everyday task added, going to start the scheduler".format(datetime.now()))
    sched.start()
    return 0

if __name__ == '__main__':
    STATUS = main()
    sys.exit(STATUS)

I've been running these tasks for about a week now, and its working great!

My next couple of blog posts will be about server-side code and architecture to support my self-tracking project.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Why is my OS X Yosemite install taking so long?: an analysis

big fat lie

Why?
Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang, it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process and wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

Cleaning
The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl

use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with blank messages
    next if $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}

We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./clean-log.pl >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.

Analysis

Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s "parse_date_time()" function

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
  mutate(lub.time=parse_date_time(nice.date, 
                                  "%b %d! %Y! %H!:%M!:%S!", 
                                  tz="EST"))

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]

head(yos.log)


##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                    Message
## 1                   opendirectoryd (build 382.0) launched - installer mode
## 2                                  Logging level limit changed to 'notice'
## 3                                               Initialize trigger support
## 4 created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5                                set default handler for RPC 'reset_cache'
## 6                           set default handler for RPC 'reset_statistics'
##              nice.date            lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))
## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting each row's time by the start time

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log and what services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentange of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
## 
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and to the total time of the installation. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process' as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service",
                                       "percent.n",
                                       "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position = "dodge",) +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))

Breakdown

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%
  head(n=5)


##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                    System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                       kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                                   Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the newsystem directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc..). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)


Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")

density

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
  if(time > start && time < end)
    return(TRUE)
  return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in,
                         the.start,
                         the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function (x){cat("\n");cat(x);cat("\n")})

## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## WARNING : ensureParentPathExists: Created  `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log files writes by process. This might give us a sense of what steps go on as the installation progresses by showing us with processes are most active.

# reduce number of service to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd",
                        "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes( y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")

facet

This shows that no one process runs consistently throughout the entire installation process, but rather that the process run in spurts.

the answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed '/usr/local' and '/Library' directories that wasn't Apple-supplied needed to be moved and moved back.

/usr/local/
/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom compliled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink since it had over 123,491 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

/Library/
Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl uses. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and it will appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you either download and install yourself, or use MacPorts to install.

---

You can find all the code and logs that I used for this analysis in this git repository

This post is also available as a RMarkdown report here

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Fun with .Rprofile and customizing R startup

Over the years, I've meticulously compiled–and version controlled–massive and extensive configuration files for virtually all of my most used utilities, most notably vim, tmux, and zsh.

In fact, one of the only configurable utilities for which I had no special configuration schema was R. This is extremely surprising, given that I use R everyday.

One reason I think that this was the case is because I came to R from using general-purpose programming languages for which there is no provision to configure the language in a way that would actually change results or program output.

I only vaguely knew that .Rprofile was a configuration file that some people used, and that others warned against using, but it never occurred to me to actually use it for myself.

Because I never used it, I developed odd habits and rituals in my interactive R programming including adding "stringsAsFactors=FALSE" to all of my "read.csv" function calls and making frequent calls to the "option()" function.

Since I actually began to use and expand my R configuration, though, I've realized how much I've been missing. I pre-set all my preferred options (saving time) and I've even made provisions for some cool tricks and hacks.

That being said, there's a certain danger in using a custom R profile but we'll talk about how to thwart that later.

The R Startup Process

In the absence of any command-line flags being used, when R starts up, it will "source" (run) the site-wide R startup configuration file/script if it exists. In a fresh install of R, this will rarely exist, but if it does, it will usually be in '/Library/Frameworks/R.framework/Resources/etc/' on OS X, 'C:\Program Files\R\R-***\etc\' on Windows, or '/etc/R/' on Debian. Next, it will check for a .Rprofile hidden file in the current working directory (the directory where R is started on the command-line) to source. Failing that, it will check your home directory for the .Rprofile hidden file.

You can check if you have a site-wide R configuration script by running

R.home(component = "home")

in the R console and then checking for the presence of Rprofile.site in that directory. The presence of the user-defined R configuration can be checked for in the directory of whichever path

path.expand("~")

indicates.

More information on the R startup process can be found here and here.

The site-wide R configuration script
For most R installations on primarily single-user systems, using the site-wide R configuration script should be given up in favor of the user-specific configuration. That being said, a look at the boilerplate site-wide R profile that Debian furnishes (but comments out by default) provides some good insight into what might be a good idea to include in this file, if you choose to use it.

##                      Emacs please make this -*- R -*-
## empty Rprofile.site for R on Debian
##
## Copyright (C) 2008 Dirk Eddelbuettel and GPL'ed
##
## see help(Startup) for documentation on ~/.Rprofile and Rprofile.site

# ## Example of .Rprofile
# options(width=65, digits=5)
# options(show.signif.stars=FALSE)
# setHook(packageEvent("grDevices", "onLoad"),
#         function(...) grDevices::ps.options(horizontal=FALSE))
# set.seed(1234)
# .First <- function() cat("\n   Welcome to R!\n\n")
# .Last <- function()  cat("\n   Goodbye!\n\n")

# ## Example of Rprofile.site
# local({
#  # add MASS to the default packages, set a CRAN mirror
#  old <- getOption("defaultPackages"); r <- getOption("repos")
#  r["CRAN"] <- "http://my.local.cran"
#  options(defaultPackages = c(old, "MASS"), repos = r)
#})

Two things you might want to do in a site-wide R configuration file is add other packages to the default packages list and set a preferred CRAN mirror. Other things that the above snippet indicates you can do is set various width and number display options, setting a random-number seed (making random number generation deterministic for reproducible analysis), and hiding the stars that R shows for different significance levels (ostensibly because of their connection to the much-maligned NHST paradigm).

The user-specific .Rprofile
In contrast to the site-wide config (that will be used for all users on the system), the user-specific R configuration file is a place to put more personal preferences, shortcuts, aliases, and hacks. Immediately below is my .Rprofile.

local({r <- getOption("repos")
      r["CRAN"] <- "http://cran.revolutionanalytics.com"
      options(repos=r)})

options(stringsAsFactors=FALSE)

options(max.print=100)

options(scipen=10)

options(editor="vim")

# options(show.signif.stars=FALSE)

options(menu.graphics=FALSE)

options(prompt="> ")
options(continue="... ")

options(width = 80)

q <- function (save="no", ...) {
  quit(save=save, ...)
}

utils::rc.settings(ipck=TRUE)

.First <- function(){
  if(interactive()){
    library(utils)
    timestamp(,prefix=paste("##------ [",getwd(),"] ",sep=""))

  }
}

.Last <- function(){
  if(interactive()){
    hist_file <- Sys.getenv("R_HISTFILE")
    if(hist_file=="") hist_file <- "~/.RHistory"
    savehistory(hist_file)
  }
}

if(Sys.getenv("TERM") == "xterm-256color")
  library("colorout")

sshhh <- function(a.package){
  suppressWarnings(suppressPackageStartupMessages(
    library(a.package, character.only=TRUE)))
}

auto.loads <-c("dplyr", "ggplot2")

if(interactive()){
  invisible(sapply(auto.loads, sshhh))
}

.env <- new.env()


.env$unrowname <- function(x) {
  rownames(x) <- NULL
  x
}

.env$unfactor <- function(df){
  id <- sapply(df, is.factor)
  df[id] <- lapply(df[id], as.character)
  df
}

attach(.env)

message("\n*** Successfully loaded .Rprofile ***\n")

[Lines 1-3]: First, because I don't have a site-wide R configuration script, I set my local CRAN mirror here. My particular choice of mirror is largely arbitrary.

[Line 5]: If stringsAsFactors hasn't bitten you yet, it will.

[Line 9]: Setting 'scipen=10' effectively forces R to never use scientific notation to express very small or large numbers.

[Line 13]: I included the snippet to turn off significance stars because it is a popular choice, but I have it commented out because ever since 1st grade I've used number of stars as a proxy for my worth as a person.

[Line 15]: I don't have time for Tk to load; I'd prefer to use the console, if possible.

[Lines 17-18]: Often, when working in the interactive console I forget a closing brace or paren. When I start a new line, R changes the prompt to "+" to indicate that it is expecting a continuation. Because "+" and ">" are the same width, though, I often don't notice and really screw things up. These two lines make the R REPL mimic the Python REPL by changing the continuation prompt to the wider "...".

[Lines 22-24]: Change the default behavior of "q()" to quit immediately and not save workspace.

[Line 26]: This snippet allows you to tab-complete package names for use in "library()" or "require()" calls. Credit for this one goes to @mikelove.

[Lines 28-34]: There are three main reasons I like to have R save every command I run in the console into a history file.

  • Occasionally I come up with a clever way to solve a problem in an interactive session that I may want to remember for later use; instead of it getting lost in the ether, if I save it to a history file, I can look it up later.
  • Sometimes I need an alibi for what I was doing at a particular time
  • I ran a bunch of commands in the console to perform an analysis not realizing that I would have to repeat this analysis later. I can retrieve the commands from a history file and put it into a script where it belongs.

These lines instruct R to, before anything else, echo a timestamp to the console and to my R history file. The timestamp greatly improves my ability to search through my history for relevant commands.

[Lines 36-42]: These lines instruct R, right before exiting, to write all commands I used in that session to my R command history file. I usually have this set in an environment variable called "R_HISTFILE" on most of my systems, but in case I don't have this defined, I write it to a file in my home directory called .Rhistory.

[Line 44]: Enables the colorized output from R (provided by the colorout package) on appropriate consoles.

[Lines 47-50]: This defines a function that loads a libary into the namespace without any warning or startup messages clobbering my console.

[Line 52]: I often want to autoload the 'dplyr' and 'ggplot2' packages (particularly 'dplyr' as it is now an integral part of my R experience).

[Lines 54-56]: This loads the packages in my "auto.loads" vector if the R session is interactive.

[Lines 58-59]: This creates a new hidden namespace that we can store some functions in. We need to do this in order for these functions to survive a call to "rm(list=ls())" which will remove everything in the current namespace. This is described wonderfully in this blog post.

[Lines 61-64]: This defines a simple function to remove any row names a data.frame might have. This was stolen from Stephen Turner (which was in turn stolen from plyr).

[Lines 66-70]: This defines a function to sanely undo a "factor()" call. This was stolen from Dason Kurkiewicz.

Warnings about .Rprofile
There are some compelling reasons to abstain from using an R configuration file at all. The most persuasive argument against using it is the portability issue: As you begin to rely more and more on shortcuts and options you define in your .Rprofile, your R scripts will depend on them more and more. If a script is then transferred to a friend or colleague, often it won't work; in the worst case scenario, it will run without error but produce wrong results.

There are several ways this pitfall can be avoided, though:

  • For R sessions/scripts that might be shared or used on systems without your .Rprofile, make sure to start the R interpreter with the --vanilla option, or add/change your shebang lines to "#!/usr/bin/Rscript --vanilla". The "--vanilla" option will tell R to ignore any configuration files. Writing scripts that will conform to a vanilla R startup environment is a great thing to do for portability.
  • Use your .Rprofile everywhere! This is a bit of an untenable solution because you can't put your .Rprofile everywhere. However, if you put your .Rprofile on GitHub, you can easily clone it on any system that needs it. You can find mine here.
  • Save your .Rprofile to another file name and, at the start of every R session where you want to use your custom configuration, manually source the file. This will behave just as it would if it were automatically sourced by R but removes the potential for the .Rprofile to be sourced when it is unwanted.

    A variation on this theme is to create a shell alias to use R with your special configuration. For example, adding a shell alias like this:

    alias aR="R_PROFILE_USER=~/.myR/aR.profile R"
    

    will make it so that when "R" is run, it will run as before (without special configuration). In order to have R startup and auto-source your configuration you now have to run "aR". When 'aR' is run, the shell temporarily assigns an environment variable that R will follow to source a config script. In this case, it will source a config file called aR.profile in a hidden .myR subdirectory of my home folder. This path can be changed to anything, though.

This is the solution I have settled on because it is very easy to live with and invalidates concerns about portability.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Interactive visualization of non-linear logistic regression decision boundaries with Shiny

fourdegreemoon

(skip to the shiny app)

Model building is very often an iterative process that involves multiple steps of choosing an algorithm and hyperparameters, evaluating that model / cross validation, and optimizing the hyperparameters.

I find a great aid in this process, for classification tasks, is not only to keep track of the accuracy across models, but also to have some visual aid to note which data points are systematically misclassified and why. Is there a lot of noise? Does the model require a non-linear classifier?

My desire for visualizing the results are stymied by (a) high-dimensional data (for which we have no choice but to reduce dimensionality) and (b) the cost of task switching between tweaking the hyperparameters and re-running the plot. Unless I'm using two monitors, I can't even see the plots change in real-time.

Well... Enter Shiny.

Shiny is an R package from RStudio and other open source contributors that makes it incredibly easy to create interactive web applications from R analyses. With Shiny, I can add dropdown menus and sliders to choose algorithms or features and control hyperparameters and visualize the changes to the model in real-time right from a web browser (all in pure R and no Javascript or CSS).

Further, I can deploy this web app easily (and for free) so I can share it with my friends and colleagues.

For a first real foray into Shiny, I chose to visualize the decision boundaries of logistic regression classifiers. I chose logistic regression because I'm taking Andrew Ng's excellent Machine Learning course on Coursera, and reimplementing the algorithms in R (from GNU Octave / Matlab) and it was our last homework assignment.

The implementation of logistic regression and the visualization of the decision boundaries proved to be difficult for two reasons:

(a) The residuals of logistic regression aren't normally distributed and there exists no closed form solution that returns the coefficients that maximize the likelihood function. This means that we have to provide R's 'optim' higher-order function with a custom-written function to be minimized or maximized (we will be minimizing the cost function) and a function that returns the gradient (the differentiation of that function at that location). And...

(b) Although a linear combination of the predictor variables (a first degree polynomial hypothesis) has a linear decision boundary, adding ("faking") higher-degree polynomial features results in non-linear decision boundaries; awesome for classification, un-awesome for visualization.

crummy linear fit to circular data

crummy linear fit to circular data


great quadratic non-linear fit to circular data

great quadratic non-linear fit to circular data

The two datasets we will be using were generated using make_circles and make_moons from scikit-learn's 'datasets' module. These will both require non-linear hypothesis to achieve any kind of better-than-chance classification.

These are the supporting functions to add polynomial features, compute the hypothesis function, compute the cost function, and return the gradient:

add.poly.features <- function(x.mat, degree=2){
  new.mat <- matrix(1, nrow=nrow(x.mat))
  for (i in 1:degree){
    for (j in 0:i){
      new.mat <- cbind(new.mat, (x.mat[,1]^(i-j) * (x.mat[,2]^j)))
    }
  }
  return(new.mat)
}

hypothesis.function <- function(param.vec, x.mat){
  zed <- x.mat %*% matrix(param.vec)
  return(1 / (1 + exp(-zed)))
}

get.gradient <- function(param.vec, x.mat, y.vec, lambda=0){
  m <- nrow(x.mat)
  modtheta <- param.vec
  modtheta[1] <- 0
  the.hyp <- hypothesis.function(param.vec, x.mat)
  gradient <- (t(x.mat) %*% (the.hyp - y.vec) + lambda*modtheta) / m
  return(gradient)
}

cost.function <- function(param.vec, x.mat, y.vec, lambda=0){
  m <- nrow(x.mat)
  the.hyp <- hypothesis.function(param.vec, x.mat)
  cost <- (((t(-y.vec) %*% log(the.hyp)) - (t(1-y.vec) %*% log(1-the.hyp))) / m) +
    ((lambda / (2*m)) * sum(param.vec[2:length(param.vec)] ^ 2))
  
  return(cost)
}

Finally, this is the code that finds the optimal coefficients and plots the resulting hypothesis (this is wrapped in the reactive "renderPlot" Shiny function so it can be updated every time the Shiny controls are changed)

if(input$pattern=="moon")
  da.dataset <- moon
else
  da.dataset <- circle

da.lambda <- input$lambda
da.degree <- input$degree

design.mat <- add.poly.features(da.dataset[,c(1,2)], degree=da.degree)

result <- optim(par=rep(0, ncol(design.mat)),
                cost.function, 
                get.gradient,
                x.mat=design.mat,
                y.vec=as.matrix(da.dataset[,3]),
                lambda=da.lambda,
                method=input$opt)

predictions <- hypothesis.function(result$par, design.mat)
accuracy <- paste0(round(sum(round(predictions) ==
                                   da.dataset[,3]) / 3, 2), "%")

thex1 <- da.dataset[,1]
thex2 <- da.dataset[,2]
somex <- seq(min(thex1), max(thex1), by=.05)
somex2 <- seq(min(thex2), max(thex2), length.out=length(somex))

z <- matrix(0, nrow=length(somex), ncol=length(somex))

for (i in 1:length(somex)){
  for (j in 1:length(somex)){
    keep <- add.poly.features(t(matrix(c(somex[i], somex2[j]))), da.degree)
    z[i, j] <- as.matrix(keep) %*% result$par
  }
}

plot(da.dataset$X2 ~ da.dataset$X1,  pch=20, 
     col=c("red","green3")[da.dataset$Y+1],
     xlab="X1", ylab="X2")
title(paste("Degree:", da.degree,
            " -  Lambda:", da.lambda,
            "     -      Accuracy:", accuracy))

contour(somex, t(somex2), z, nlevels=1, add=TRUE, drawlabels=FALSE)

Notice that the classification dataset, the degree of the hypothesized polynomial, the regularization hyperparameter (lambda), and the optimization method are parameterized. We will control these options from the Shiny app.

Put all together, code looks like it does in this GitHub repo and yields this Shiny app.

Shiny app screeshot

Shiny app screeshot

Is it just me, or is what you can do with Shiny amazing?

In future iterations of my Shiny visualization of classification endeavors, I plan to:

  • add support for more classification algorithms and their respective relevant hyper parameters
  • use file upload to plot custom datasets
  • and use dimensionality reduction automatically for datasets with more than two 'true' features

Until then, shine on you crazy diamond.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Squeezing more speed from R for nothing, Rcpp style

R vs Rcpp speed

In a previous post we explored how you can greatly speed up certain types of long-running computations in R by parallelizing your code using multicore package*. I also mentioned that there were a few other ways to speed up R code; the one I will be exploring in this post is using Rcpp to replace time-critical inner-loops with C++.

In general, good C++ code almost always runs faster than equivalent R code. Higher level language affordances like garbage collection, dynamic typing, and bounds checking can add a lot of computational overhead. Further, C/C++ compiles down to machine code, whereas R byte-code has to be interpreted.

On the other hand, I would hate to do all my statistics programming in a language like C++, precisely because of those higher-level language affordances I mentioned above. When development time (as opposed to execution time) is taken into account, programming in R is much faster for me (and makes me a very happy programmer).

On occasion, though, there are certain sections of R code that I wish I could rewrite in C/C++. They may be simple computations that get called thousands or millions of times in a loop. If I could just write these time-critical snippets with C/C++ and not have to throw the proverbial baby out with the bath water (and rewrite everything in C), my code would run much faster.

Though there have been packages to make this sort of thing since the early 2000s, Rcpp (and the Rcpp family**) has made this even easier; now interfacing with R objects is seamless.

To show an example of how you might use Rcpp, I’ve used the same example from my post "Parallel R (and air travel)". In this example, we use longitude and latitude info from all US airports to derive the average (mean) distance between every two US airports. The function I will be replacing with C++ is the function to compute the distance between two longitude latitude pairs (Haversine's formula) on a sphere (which is just an approximation).

The R functions to do this look like this:

to.radians<-function(degrees){
  degrees * pi / 180
}

haversine <- function(lat1, long1, lat2, long2, unit="km"){
  radius <- 6378      # radius of Earth in kilometers
  delta.phi <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  term1 <- sin(delta.phi/2) ^ 2
  term2 <- cos(phi1) * cos(phi2) * sin(delta.lambda/2) ^ 2
  the.terms <- term1 + term2
  delta.sigma <- 2 * atan2(sqrt(the.terms), sqrt(1-the.terms))
  distance <- radius * delta.sigma
  if(unit=="km") return(distance)
  if(unit=="miles") return(0.621371*distance)
}

While the C++ functions look like this:

#include <iostream>
#include <math.h>
#include <Rcpp.h>

// [[Rcpp::export]]
double to_radians_cpp(double degrees){
    return(degrees * 3.141593 / 180);
}

// [[Rcpp::export]]
double haversine_cpp(double lat1, double long1,
                     double lat2, double long2,
                     std::string unit="km"){
    int radius = 6378;
    double delta_phi = to_radians_cpp(lat2 - lat1);
    double delta_lambda = to_radians_cpp(long2 - long1);
    double phi1 = to_radians_cpp(lat1);
    double phi2 = to_radians_cpp(lat2);
    double term1 = pow(sin(delta_phi / 2), 2);
    double term2 = cos(phi1) * cos(phi2) * pow(sin(delta_lambda/2), 2);
    double the_terms = term1 + term2;
    double delta_sigma = 2 * atan2(sqrt(the_terms), sqrt(1-the_terms));
    double distance = radius * delta_sigma;

    /* if it is anything *but* km it is miles */
    if(unit != "km"){
        return(distance*0.621371);
    }

    return(distance);
}

Besides for the semicolons, other assignment operator and the type declarations, these codes are almost identical.

Next, we put the C++ code above in a C++ source file. We will call it, and automatically compile and link to it from our driver R code thusly***:

calc.distance.two.rows <- function(ind1, ind2,
                                   version=haversine){
  return(version(air.locs[ind1, 2],
                 air.locs[ind1, 3],
                 air.locs[ind2, 2],
                 air.locs[ind2, 3]))
}

air.locs <- read.csv("airportcodes.csv", stringsAsFactors=FALSE)


combos <- combn(1:nrow(air.locs), 2, simplify=FALSE)
num.of.comps <- length(combos)


mult.core <- function(version=haversine_cpp){
  the.sum <- sum(unlist(mclapply(combos, 
                                 function(x){ 
                                   calc.distance.two.rows(x[1], x[2],
                                                          version)
                                 },
                                 mc.cores=4)))
  result <- the.sum / num.of.comps
  return(result)
}

mult.core(version=haversine_cpp)

Comparing the R version against the C++ version over a range of sample sizes yielded a chart like this:
R vs Rcpp speed

To run this to completion would have taken 4 hours but, if my math is correct, rewriting the distance function shaved of over 15 minutes from the completion time.

It is not uncommon for the Rcpp to speed up R code by *orders of magnitude*. In this link, Dirk Eddelbuettel demonstrates an 80-fold speed increase (albeit with a contrived example).

So why did we not get an 80-fold increase?

I could have (and will) rewrite more of this program in Rcpp to avoid some of the overhead with repeated calls to compiled C++. My point here was more to show that we can use Rcpp to speed up this program with very little work–almost for nothing. Again, besides for certain syntactical differences and type declarations, the R and C++ functions are virtually the same.

As you become more comfortable with it–and use it more within the same scripts–Rcpp will likely pay higher and higher dividends.

The next time we revisit this contrived airport example, we will be profiling it expanding the C++ and eventually, use distributed computing to get it as fast as we can.


* the 'multicore' package is now deprecated in favor of 'parallel'
** RCpp11 (for modern C++), RccpEigen (for use of the Eigen C++ linear algebra template library), RCppArmadillo (for use of the Eigen C++ linear algebra template library), and a few others
*** this code is a little bit different than the code in the first airport distance blog post because I switched from using the 'multicore' package to the 'parallel' package

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Damn the torpedoes, full speed ahead: making the switch to Python 3

Python 3 has been out since 2008 (and realistically usable since 2009).
In spite of this four year availability period, Python 3 use has yet to see widespread adoption, particularly among groups in the scientific community. In the company of data scientists/statisticians, when someone says they've written their own Python code to perform some task, it's usually assumed that they are talking about Python 2; it is Python 3 that requires the version number qualification.

There are members of the community (I used to include myself in this category) that are really happy with Python(2) and hope that if they ignore Python 3 it will just go away.

It won't though. The fact of the matter is that "Python 2.x is legacy, Python 3.x is the present and future of the language."

So how do we get people to adopt Python 3? In my opinion, there are three key strategies:

  • go softer on Python 3 denialists, perhaps with a Python 2.8 (Guido said this will not happen)
  • go harder on Python 3 denialists by discontinuing 2.7 maintenance
  • serve as an example to programmers (especially new ones) by switching your default python interpreter to python3.



As you, dear reader, can probably tell from my wording, I personally favor strategy 3.

Part of that solution involves vendors shipping Python 3 by default. We are making some progress in this regard (Arch GNU/Linux now has python sym-linked to python3, and Fedora and Ubuntu have stated that they will follow suit), but we still have a lot of work to do. A huge step forward would be if Apple ships macs with Python 3. Current macs use 2.7 which wasn't, finally, released until 2010. This means that they could have used Python 3 instead. That would have really shook things up because a lot of my friends and colleagues in my field just use the Apple-supplied Python interpreter for analytics (vis-a-vis SciPy Superpack). The extension of 2.7 support until 2020 will unfortunately afford Apple the opportunity to be lackadaisical in its porting to Python 3 because they might only do so when upstream maintenance ends (and maybe not even then).

The other part of strategy 3 involves personally serving as an example by using Python 3 as your default interpreter.

But I can't rationally will that more people do this if I am unwilling to do this myself. While it's true that my big open-source Python project meant for widespread public consumption was very carefully made Python 3 compatible, I noticed that the code on this blog is often Python 3 incompatible. This is primarily because the python code I quickly whip up is run through my default Python 2 interpreter. My obstinance to switch to Python 3 by default is helping to contribute to Python 3's slow adoption and implicitly serving notice that it's ok to still use Python 2.

But I no longer want to be party to this transition quagmire (and the ASCII-normative cultural hegemony). Because of this, I recently took the plunge and switched my default Python stack to Python 3. Damn the torpedoes, full speed ahead!

I was, perhaps, in a better position to do this than some because all of my most used third-party Python packages have already been ported; this includes the SciPy ecosystem (NumPy, SciPy, pandas, scikit-learn), IPython, lxml, networks, BeautifulSoup, and requests. It was really easy for me to ditch the Apple-supplied Python interpreter in favor for MacPorts' build of Python 3.4. I was even able to install most of my favorite third-party packages using MacPorts (the ones that weren't available I pip-installed as "user" to not muck up the MacPorts installation prefix). The only hard part about the switch was that almost all of my system python code stopped working; everything from my own system utilities to the battery-life indicator in my tmux panes.

While fixing all of this wayward code, I took notice of the incompatibilities that caused me the most trouble:

  • changes to exception handling
  • changing xrange() to range()
  • changing raw_input() to input()
  • changing my print statements into functions
  • explicitly requesting relative module imports
  • wrapping map() function calls in "list()" because map() now returns an iterator (I actually just re-wrote the code to use list comprehensions.)



But by far the incompatibility that caused the most heart-ache were I/O changes, and this is mostly due to the way Python 3 handles unicode.

There is far too much to say about Unicode in Python 3 to provide a detailed explanation in this post (if you're interested, I've put some great learning resources at the end of this post) but, essentially, Python no longer allows you to willy-nilly mix 8-bit string data and text objects. If you find yourself asking "Why am I having such trouble with my porting?" than the answer may be that you were playing fast-and-loose with mixing (what probably always should have been) incompatible types: unicode strings and 8-bit string data.

For this reason Python 2 was easier to deal with for programmers/scientists who dealt with largely numbers or ASCII-only text, but this won't cut it anymore. It's unreasonable and culturally chauvinistic to require that non-English speaking programmers misspell (or transliterate) their names and variables just to contribute to a non-internationalized codebase.

Before you learn anymore hacks to get unicode working in Python 2.x, consider switching to Python 3. Basically, the new I/O/string paradigm in Python 3 comes down to

  • understanding unicode and utf-8
  • decoding input as early as you can
  • encoding output as late as you can
  • only working with unicode strings within the program (no bytes!)



If you're reading this blog, you're probably a data scientist/statistician (or my parents) and you can almost assuredly make the switch to Python 3 since virtually all of our most prized packages have been ported (even nltk has a branch that works with Python 3). Just to make sure, you can use this tool that tracks the Python 3-readiness of some popular packages.

Final notes:
This post was in no way meant to shame programmers/scientists who still choose to use Python 2.x. I completely understand and sympathize with those that have very large Python 2 codebases to maintain or are locked into a particular Python version because of their company policy or their clients' needs.

If you are interested using Python 3, though, but want to approach it cautiously, consider using the 2to3 conversion tool to see how the code needs to change. Another great strategy is to use the various __future__ imports to ease the transition. Something like:

from __future__ import division, absolute_import, print_function, unicode_literals

At the least, you call python using the "-3" flag to see possible problems and incompatibilities. You can do this by changing your shebang line to something like

#!/usr/bin/env python -3


or create the following shell alias

alias python="python -3"

Resources for learning Python 3 I/O and unicode:


Other notes from Python 3 apologists:


share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Take a look, it's in a book: distribution of kindle e-book highlights

Distribution Of E-Book Highlights Fiction Distinction

If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage books that are discontinued, but the rise of e-reading (and resulting circumvention of privacy) affords us the opportunity to answer related questions.

The kindle e-reading devices allow readers to highlight salient passages of books and optionally share them with Amazon. Amazon then curates these highlights and displays them to readers who opt-in. These are called "popular highlights".

After reading a few books on the Kindle, it's hard not to notice a pattern with popular highlights: they become sparser the further you get into a book. Given my penchant for answering mildly interesting questions with statistics, I couldn't help but analyze and visualize the distribution of these popular highlights.

I organized the location of the 10 most popular highlights of 64 books (21 fiction and 43 non-fiction) along with the location of the end of the book (this doesn't include the index, notes, and references of non-fiction books) and loaded it into R:

library(dplyr)

ebook.frame <- read.csv("./ebooks.csv",
                        stringsAsFactors=FALSE)

ebook.frame <- ebook.frame %.%
  mutate(normalized=location/end.location)

In order to meaningfully compare locations across books, I needed to express each location as a percentage of the total length of the book. Let's use ggplot2 to visualize the distribution of where the popular highlights appear across all books:

library(ggplot2)
ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, fill="#0072B2", alpha=.8) +
  labs(title="Distribution of e-book highlights\n") +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlights.png")

Distribution Of E-Book Highlights

Distribution Of E-Book Highlights

Before we go on, it's important to express a few words of warning...
These books are not a proper sample of all kindle e-books; since these books came from my personal collection, books on science and philosophy are oversampled, books about vampires are woefully underrepresented, and there is far more Janet Evanovich than chance would dictate. Because of this, any insights gleaned from these data (to the extent that these data offer any) are only applicable to the reading habits of a certain type of e-reader, namely, boring ones that don't like to have fun.

The spreadsheet I loaded also contained a logical field representing whether the book was fiction. We can take a look at the differences in the highlight locations between fiction and non-fiction books thusly:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, aes(fill=factor(fiction)), 
                             alpha=.5) +
  labs(title="Distribution of e-book highlights\n") +
  scale_fill_discrete(labels=c(“non-fiction",
                               "fiction")) +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlightsFictionDistinction.png")

Distribution Of E-Book Highlights Fiction Distinction

Distribution Of E-Book Highlights Fiction Distinction

It would appear as if non-fiction books have a less uniform distribution of popular highlights. There are likely many causes for this, but one explanation could be that the reader is less likely to make it to the end of a non-fiction book.

In order to make some quantifiable claims, let's look at the empirical cumulative distribution function:

ggplot(ebook.frame, aes(normalized, colour=factor(fiction))) +
  stat_ecdf() +
  labs(title="Cumulative distribution of e-book highlights\n") +
  scale_colour_discrete(labels=c("non-fiction", "fiction")) +
  xlab("location in book (percent)") +
  ylab("cumulative percentage of highlights") +
  theme(legend.title=element_blank())
ggsave(file="CumulativeDistributionofEbookHighlights.png")

Cumulative Distribution of E-book Highlights

Cumulative Distribution of E-book Highlights

Interestingly, for non-fiction books, a full 75% of the highlights are contained in the first 25% percent of the book; not quite pareto, but close).

Before we come to any conclusions regarding the proportion of readers that make it through a book, let's check our assumptions:

  • e-readers that highlight passages (and choose to share them with Amazon) behave just like e-readers that don't
  • salient passages are uniformly distributed throughout a book and, thus, the distribution of highlights is uniform across the entire length of the read portion of a book
  • the fact that a passage was already highlighted by many e-readers has no bearing on the reader’s decision to highlight the same passage

These assumptions don't hold up to critical scrutiny. Nevertheless, these results serve as strong evidence that at least some e-books go unfinished. As for the percentage of books that go unfinished, perhaps Amazon is in a better position to answer that question.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

How to make an absurd twitter bot in python

In my last post, I outlined the steps I took to programmatically mimic the wine reviews of a dilettante sommelier. In this post, I'll explain the steps I took to create the twitter bot @HorseWineReview which combines a random wine with a random computer-generated review. I'll keep it short and sweet–the steps are as follows:

  • get a list of wines (from Freebase)
  • create a twitter account and application
  • write script to create and post the tweet
  • automate it with a cron job



Get a list of wines (from Freebase)
Freebase is a collaborative knowledge base that uses a graph database to store semantic information. Information can be retrieved by running MQL queries on their web interface or through a google-powered API. We'll test a query first on their query editor online, but move to accessing the result from the API once we can verify that our query is constructed properly.
The query is very simple, it looks like this:

[{
  "name": null,
  "type": "/wine/wine"
}]

This will return a list of names of all entities of type "/wine/wine" in JSON format. This is an excerpt:

{
  "result": [
    {
      "type": "/wine/wine",
      "name": "1999 Domaine Romanee Conti La Tache"
    },
     .............

Now that we’ve confirmed that this query works and the syntax is correct, let's get the results by using the Google Freebase API. If you don't have one already, you need to get a Google Developers account. From the Google Developers Console, you need to grab a Freebase API key. After that, we can write and run a python script to retrieve and dump out all the wine names. You need the python module "freebase" which can be installed via pip. The code goes thusly:

#!/usr/bin/python
import freebase
import json
import urllib

api_key = "YOUR KEY HERE"
service_url = 'https://www.googleapis.com/freebase/v1/mqlread'

freebase_query = [{'name': None,
                   'limit': 999999999999999,
                   "type": "/wine/wine"}]

params = {"query": json.dumps(freebase_query),
          "key": api_key}

url = service_url + "?" + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())

for item in response['result']:
    print item['name'].encode('utf-8')

Run this script and redirect its contents to a file...

This stores a wine name on each line of the file.

Create a twitter account for your bot and register an application
After following the standard procedure of setting up a twitter account, head over to dev.twitter.com and set-up a developers account on behalf of your bot. Then head over to apps.twitter.com to create a new application; this will be the conduit with which we programmatically update your twitter bot's status. Make sure you read the Terms of Service and don't violate them. Make sure you also allow your application read and write access. After this, if you navigate to the "API Keys" tab, you should record the following information: the API key, API secret, access token, and access token secret. We'll need this to authenticate from our auto-posting python script.

Write script to tweet
We'll use the Twython package to authenticate and serve as an interface to twitter.

A bare bones script to perform this task looks like this:

#!/usr/bin/python

import subprocess
import random
import re
import os
from twython import Twython

twitter = Twython("YOUR API KEY",
                  "YOUR API SECRET",
                  "YOUR ACCESS TOKEN",
                  "YOUR ACCESS TOKEN SECRET")

def output_tweet(text):
    twitter.update_status(status=text)
    os._exit(0)

lengthoftweet = 999
# string to build tweet
tweet = ""

while lengthoftweet > 140:
    tweet = ""
    all_names = [wine.rstrip() for wine
                  in open("ABSOLUTE PATH TO WINE LIST FILE").read().split("\n")]
    wine_name = random.choice(all_names)
    tweet += wine_name + ": "
    rev = subprocess.check_output("ABSOLUTE PATH TO REVIEW GENERATOR")
    # we don't want a review to imply that the wine is either
    # red or white
    if "red" in rev:
        continue
    if "white" in rev:
        continue
    if len(tweet + rev.rstrip()) < 141:
        output_tweet(tweet + rev.rstrip())
    # if it is too long, try to use the first
    # sentence only (using regex)
    rev = re.search("(.+?\.).*", rev).group(1)
    tweet += rev.rstrip()
    lengthoftweet = len(tweet)
    
output_tweet(tweet)
os._exit(0)

A lot of the code is to ensure that the tweet doesn't exceed 140 characters. You might want to add a logging feature to the script–it will help with debugging the cron job.

Automate it with a cron job
If you're on a Unix system we can set up this script to run at specified times automatically using a cron job. Windows users can use Windows Task Scheduler but I've never used it so you're on your own.

Cron jobs are notoriously hard to debug. The #1 problem is encountered is not using absolute paths. When cron calls your script, the working directory is not where the script resides (unlike when you call the script on the command-line from the same directory). Because of this, any file IO in the script that uses relative paths will fail (quietly, if you don't add logging to the script).

The #2 problem you may encounter with cron jobs is that, because it runs as a detached process outside the login environment, the shell it executes it from may not be the one you normally use. Furthermore, it may not have the directories in the PATH that you need, or any other environment variables that you depend on.

The third problem that occurs often in botched cron jobs are bad permissions. Make sure you have all correct permissions (e.g. chmod +x YOUR_SCRIPT.py)

One final problem that I encountered was not putting an empty line after your cron entry–learn from my mistakes!

To add a cron entry, execute

in the shell

In the editor, I wrote the entry like this:

Your paths will obviously depend on what you are running and from where. Make sure you have an empty line after the entry!

To briefly explain this entry…

The first two numbers (00) specify the minute (0-60) to run the job. I chose to run it on the first (zeroth) minute of the hour. The second section (9,14,21) specifies the hour and tells cron to run the job at 9:00 am, 2:00 pm, and 9:00 pm. The next three sections (the asterisks) specify "day of the month" (1-31), "month of the year" (1-12), and "day of the week" (0-7), respectively. The asterisks indicate that this is to run every day of the month and every month of the year.

The next two strings instruct cron to run the script (which was explained in my last post), and the rest of the line redirects any output from logging to a text file called CRON.log

First endnote: Why @HorseWineReviews?
The name pays homage to the infamous twitter bot @Horse_ebooks that (used to) post (unintentionally hilarious) context-free excerpts and Markov chain clumps from books about horses in an (successful) effort to avoid looking like a spam account whilst occasionally tweeting links to promote the sales of e-books.

Final endnote
While the task of tweeting fake wine reviews has already been taken, if you are looking for ideas of a twitter bot of your own, dear reader, you might want to explore the following ideas that I think would a hit:

  • Train a Markov chain on equal parts famous philosophical works and vacuous and decidedly un-philosophical ramblings (a la KimKierkegaardashian and Kantye West). Philosophical corpora can be grabbed from Project Gutenburg. Vacuous babble can probably be obtained from choice subreddits or most of the trending hashtags on twitter.
  • Web-scrape a huge corpus of episode descriptions from various television shows, train a Markov chain on them and let it loose on Twitter. You can get episode descriptions from tvrage.com (example)
  • Train a Markov chain on the abstracts of academic papers.

You’re welcome :)

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

How to fake a sophisticated knowledge of wine with Markov Chains

How soon is now
How soon is now

Markov chain

To the untrained (like me), wine criticism may seem like an exercise in pretentiousness. It may seem like anybody following a set of basic rules and knowing the proper descriptors can feign sophistication (at least when it comes to wine).

In this post, we will be exploiting the formulaic nature of wine reviews to automatically generate our own reviews that appear (at least to the untrained) to be legitimate.

Markov Chains
A Markov chain is a system that transitions between states using a random, memoryless process. The transition from one state to another is determined by a single random sample from a (usually discrete) probability distribution. Additionally, the current state wanders aimlessly through the chain according to these random transitions with no regard to its previous states.

A roll-of-the-dice board game can be likened to a Markov chain; the dice determine how many squares you move, and is in no way is influenced by your previous rolls. Scores in basketball games appear to act in this way as well (in spite of the myth of the 'hot hand') and a gambler's earnings almost certainly hold the Markov property (see the Monte Carlo Fallacy)

Many more phenomena can be appropriately modeled by a Markovian process... but language isn't one of them.

Markov Chains and Text Generation
The image below shows a Markov chain that is built from the following lyrics:

How Soon is Now

Markov chain of The Smiths lyrics


I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

Here, each word is a state, and the transitions are based on the the number of times a word appears after another one. For example, "I" always precedes "am" in the text, so the transition from "I" to "am" occurs with certainty (p=1). Following the word "the", however, "son" occurs twice and "heir" occurs once, so the probability of the transitions are .66 and .33, respectively.

Text can be generated using this Markov chain by changing state until an "absorbing state" is reached, where there are no longer any transitions possible.

Given "I" as an initial state, two possible text generations are

  • I am the heir of nothing in particular
  • I am the son and the son and the son and the son and the son of a shyness that is criminally vulgar.

Very often, the generated text violates basic rules of grammar; after all, the transitions are "dumb" stochastic processes without knowledge of grammar and semantics.

Instead of a memoryless chain, though, we can build a chain where the next state depends on the last n states. This can still satisfy the Markov property if we view each state as holding n words. When using these 'higher order' chains to generate text, something very interesting happens. Since the states are now made up of clauses and phrases (instead of words) the generated text seems to magically follow (some of) the rules of grammar, while still being devoid of semantic sense.

The higher order the chain, more text needs to be fed into the chain to achieve the same level of 'arbitrariness'–but the more the generated text seems to conform to actual correct English. In order to fake our wine reviews, we are going to train an order-two Markov chain on a web-scraped corpus of almost 9,000 wine reviews.

The scraping
The corpus of wine reviews I chose to use was from www.winespectator.com. If you go to this site, you'll see that there 709 pages of reviews. I used SelectorGadget to determine the XPath selector for the content I wanted and wrote a few python scripts along these lines:

#!/usr/bin/env python -tt
 
import urllib2
from lxml.html import fromstring
import sys
import time
 
urlprefix = "http://www.winespectator.com/dailypicks/category/catid/1/page/"
 
#709
for page in xrange(1, 710):
    try:
        out = "-> On page {} of {}....      {}%"
        print out.format(page, "709", str(round(float(page)/709*100, 2)))
        response = urllib2.urlopen(urlprefix + str(page))
        html = response.read()
        dom = fromstring(html)
        sels = dom.xpath('//*[(@id = "searchResults")]//p')
        for review in sels:
            if review.text:
                print review.text.rstrip()
        sys.stdout.flush()
        time.sleep(2)
    except:
        continue

and grabbed/processed it with shell code like this:

# capture output of script
./get-reviews.py | tee prep1.txt

# remove all lines that indicate progress of script
cat grep -E -v '^-' prep1.txt > prep2.txt

# add the words "BEGIN NOW" to the beginning of each line
cat prep2.txt | sed 's/^/BEGIN NOW /' > prep3.txt

# add the word "END" to the end of each line
cat prep3.txt | sed 's/$/ END/' > wine-reviews.txt

This is a sample of what out text file looks like at this point:

BEGIN NOW A balanced red, with black currant, ... lowed by a spiced finish. END
BEGIN NOW Fresh and balanced, with a stony ... pear and spice. END

The "BEGIN NOW" tokens at the beginning of each line will serve as the initial state of our generative Markov process, and the "END" token will denote a stopping point.

Now comes the construction of the Markov chain which will be represented as a python dictionary. We can get away with not calculating the probabilities of the transitions by just storing the word that occurs after each bi-gram (two words) in a list that can be accessed using the bi-gram key to the chain dictionary. We will then 'pickle' (serialize) the dictionary for use in the script that generates the fake review. The code is very simple and reads thusly:

#!/usr/bin/env python -tt
 
import pickle
 
fh = open("wine-reviews.txt", "r")
 
chain = {}
 
def generate_trigram(words):
    if len(words) < 3:
        return
    for i in xrange(len(words) - 2):
        yield (words[i], words[i+1], words[i+2])
 
for line in fh.readlines():
    words = line.split()
    for word1, word2, word3 in generate_trigram(words):
        key = (word1, word2)
        if key in chain:
            chain[key].append(word3)
        else:
            chain[key] = [word3]
 
pickle.dump(chain, open("chain.p", "wb" ))

Finally, the python script to generate the review from the pickled Markov chain dictionary looks like this:

#!/usr/bin/env python -tt
 
import pickle
import random
 
chain = pickle.load(open("chain.p", "rb"))
 
new_review = []
sword1 = "BEGIN"
sword2 = "NOW"
 
while True:
    sword1, sword2 = sword2, random.choice(chain[(sword1, sword2)])
    if sword2 == "END":
        break
    new_review.append(sword2)
 
print ' '.join(new_review)

The random.choice() function allows us to skip the calculation of the transition probabilities because it will choose from the list of possible next states in accordance with the frequencies at which they occur.

The results
Obviously, some generated reviews come out better than others. After playing with the generator for a while, I compiled a list of "greatest hits" and "greatest misses".

Greatest hits

  • Quite rich, but stopping short of opulent, this white sports peach and apricot, yet a little in finesse.
  • Dense and tightly wound, with taut dark berry, black cherry and red licorice. A touch of toast.
  • Delicious red licorice, blood orange and ginger, with nicely rounded frame.
  • This stylish Australian Cabernet is dark, deep and complex, ending with a polished mouthful of spicy fruit and plenty of personality.

Greatest misses

  • From South Africa.
  • Tropical fruit notes of cream notes.
  • Here's a bright structure. Dry and austere on the finish.
  • This has good flesh.
  • Really enticing nose, with orange peel and chamomile for the vintage, this touts black currant, plum and meat notes. Flavors linger enticingly.
  • Blackberry, blueberry and blackberry fruit, with hints of cream. Crunchy and fresh fruit character to carry the finish.

Possibilities for improvement
The results are amazing, but the algorithm needs a little work before it will be able to fool a sommelier.

One major giveaway is the inclusion of contradictory descriptors in the same review. I don't know anything about wine (I drink Pepsi) but even I know that a wine should never be described as both "dry" and "sweet". One possible solution to this would be to use association mining to infer a list of complementary and discordant descriptors.

Another clue that these reviews are nonsense is the indiscriminate chaining of clauses that have nothing to do with each other. I'm not quite sure how to solve this, yet, but I have a few ideas.

An additional hiccup is that there are still grammatically incorrect sentences that creep through. One solution would be to identify and remove them. Unfortunately, this is much easier said than done. In the absence of a formal English grammar, we have to rely on less-than-perfect techniques like context-based identification and simple pattern-matching.

The last obvious problem is that some of the generated reviews are just too long. This increases the likelihood of containing contradictory descriptors and committing grammar errors, as with this review: (which also exemplifies all of the problems stated above)

Luscious, sleek and generous with its gorgeous blueberry, raspberry and blackberry flavors , with hints of herbs, cocoa and graphite. The long, briary edge lingering on the nose and palate. Medium-bodied, with a modest, lightly juicy and brambly flavors of milk chocolate. Full-bodied, with fine focus and its broad, intense and vivid, with a tangy, lip-smacking profile. Light-weight and intense, with a deft balance.

In future posts, I hope to explore some of these avenues of improvement. I also plan to use parts-of-speech tagging to automate an unusual games of wine review mad-libs.

Sooner, though, I'll explain the process I took to set-up the fake wine review twitter bot (@HorseWineReview) that I will use to experiment with different text-generation techniques.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

How dplyr replaced my most common R idioms

Having written a lot of R code over the last few years, I've developed a set of constructs for my most common tasks. Like an idiom in a natural language (e.g. "break a leg"), I automatically grasp their meaning without having to think about it. Because they allow me to become more and more productive in R, these idioms have become ingrained in my head (and muscle memory) and, in large part, inform how I approach problems.

It's no wonder, then, why I'm hesitant to embrace new packages that threaten to displace these idioms; switching is tantamount to learning a new dialect after investing a lot of time becoming fluent in another.

On occasion, though, a package comes along whose benefits are so compelling (here's looking at you, Hadley Wickham, Dirk Eddelbuettel, and Romain François) that it incites me to take the plunge and employ new patterns and learn new idioms. The most recent package to accomplish this is the dplyr package. This package (among other things) reimplements 5 of my most common R data manipulation idioms--often, in blazing fast parallel C++.

This post serves as both an advocation for dplyr (by comparing form and speed) but also as a rosetta stone--to serve as a personal reference for translating my old R idioms.

This uses a dataset documenting crimes in the US by state available here

library(dplyr)
crime.by.state <- read.csv("CrimeStatebyState.csv")

Filtering rows

# base R
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                crime.by.state$State=="New York", ]

# dplyr
crime.ny.2005 <- filter(crime.by.state, State=="New York", Year==2005)

There is a lot going on with my base R solution. It uses logical subsetting to extract choice rows from crime.by.state. Specifically, it creates a two boolean vectors: one that is true only when the "Year" column's value is 2005, and one that is true only when the "State" column's value is "New York". It then logical "AND"s these vectors, so that the resulting boolean vector is true only where the year was 2005 and the state was New York. This vector then is used to subset crime.by.state, and includes all columns. In contrast, the dplyr solution reads much more naturally, and in far fewer characters. According to my (crude) benchmarks the dplyr solution appears to be twice as fast.

A quick note before moving on, we could've drastically cut down on the number of characters in the base R solution by "attaching" the crime.ny.2005 dataset, eliminating the need to preface the "Year" and "State" names with "crime.by.state$", but there are two reasons why I don't do this. (1) I consider it to be bad form in a lot of circumstances (for example, it can become confusing when more than one dataset is loaded), and (2) RStudio will tab-auto-complete a column name after prefacing it with "name-of-dataframe$" and that drastically increases my coding speed. My only complaint(?) about dplyr is that it disallows this prefacing syntax and requires me to lookup the column names (and spell them correctly).

Arranging and ordering

# base R
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]

# dplyr
crime.ny.2005 <- arrange(crime.ny.2005, desc(Count))

The base R solution ranks each row by value of "Count" in decreasing order, and uses the rank vector to subset the "crime.ny.2005" data frame. The dplyr solution appears to be about 20% faster.

Selecting columns

# base R
crime.ny.2005 <- crime.ny.2005[, c("Type.of.Crime", "Count")]

# dplyr
crime.ny.2005 <- select(crime.ny.2005, Type.of.Crime, Count)

This example is relatively self-explanatory. Here the base R solution appears to be faster, by about 30%.

Creating new columns

# base R
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                            sum(crime.ny.2005$Count)

# dplyr
crime.ny.2005 <- mutate(crime.ny.2005, 
                        Proportion=Count/sum(Count))

Very often, I have to create a new column that is a function of one or more existing columns. Here, we are creating a new column, that represents the proportion that a particular crime claims from the total number of crimes, among all types. Incredibly, base R beats dplyr in this task--it is about 18 times faster.

If I had to guess, I think this is because of the nuances of R's vectorization. In the base R solution, a vector of crime counts is extracted. R recognizes that it is being divided by a scalar (the sum of the counts), and automatically creates a vector with this scalar repeated so that the length of the vectors match. Both of the vectors are stored contiguously and the resulting element-wise division is blindingly fast. In contrast, I think that in the dplyr solution, the sum of the counts column is actually evaluated for each element in the count vector, although I am not sure.

Aggregation and summarization

# base R
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=length)
summary.crime.ny.2005 <- merge(summary1, summary2,
                               by="Type.of.Crime")

# dplyr
by.type <- group_by(crime.ny.2005, Type.of.Crime)
summary.crime.ny.2005 <- summarise(by.type,
                                   num.types = n(),
                                   counts = sum(Count))

This is the arena in which dplyr really shines over base R. In the original dataset, crime was identified by specific names ("Burglary", "Aggravated assault") and by a broader category ("Property Crime" and "Violent Crime")

Before this point, the data frame we are working with looks like this:

 Type.of.Crime  Count
 Violent Crime    874
 Violent Crime   3636
 Violent Crime  35179
 Violent Crime  46150
Property Crime  68034
Property Crime 302220
Property Crime  35736

In this pedagogical example we want to aggregate by the type of crime and (a) get the number of specific crimes that fall into each category, and (b) get the sum of all crimes committed in those categories. Base R makes it very easy to do one of these aggregations, but to get two values, it requires that we make two calls to aggregate and then merge the results. Dplyr's solution, on the other hand, is relatively intuitive, and requires just two function calls.

All together now

We haven't showcased the best part of dplyr yet... it presents itself when combining all of these statements:

# base R
crime.by.state <- read.csv("CrimeStatebyState.csv")
crime.ny.2005 <- crime.by.state[crime.by.state$Year==2005 &
                                  crime.by.state$State=="New York", 
                                c("Type.of.Crime", "Count")]
crime.ny.2005 <- crime.ny.2005[order(crime.ny.2005$Count, 
                                     decreasing=TRUE), ]
crime.ny.2005$Proportion <- crime.ny.2005$Count /
                            sum(crime.ny.2005$Count)
summary1 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=sum)
summary2 <- aggregate(Count ~ Type.of.Crime,
                      data=crime.ny.2005,
                      FUN=length)
final <- merge(summary1, summary2,
               by="Type.of.Crime")


# dplyr
crime.by.state <- read.csv("CrimeStatebyState.csv")
final <- crime.by.state %>%
           filter(State=="New York", Year==2005) %>%
           arrange(desc(Count)) %>%
           select(Type.of.Crime, Count) %>%
           mutate(Proportion=Count/sum(Count)) %>%
           group_by(Type.of.Crime) %>%
           summarise(num.types = n(), counts = sum(Count))

When all combined, the base R solution took 60 seconds (over 10000 iterations) and the dplyr solution took 30 seconds. Perhaps more importantly, the dplyr code to uses many fewer lines and assignments and is more terse and probably more readable with its neat-o "%>%" operator.

I would be remiss if I didn't mention at least one of the other benefits of dplyr...

Dplyr's functions are generalized to handle more than just data.frames (like we were using here). As easily as dplyr handles the data frame, dplyr can also handle data.tables, remote (and out-of-memory) databases like MySQL; Postgres; Lite; and BigQuery by translating to the appropriate SQL on the fly.

There are still other neat features of dplyr but perhaps these are reason enough to give dplyr a shot. I know my own code may never look the same.

edit (9/17/14): I changed the pipe operator from the deprecated "%.%" to the preferred "%>%".

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail