Using Python decorators to be a lazy programmer: a case study

Decorators are considered one of the more advanced features of Python, and they are often the last topic covered in a Python class or introductory book. Unfortunately, they are also one of the features that trips up many beginning and even intermediate Python programmers. Those who stick it out and work through the topic, though, will be handsomely rewarded for their hard work.

Those in the know recognize decorators as tools for making your Python code more concise, beautiful, and elegant--but did you know you can also use decorators to be a lazy bum of a programmer?!

I just recently whipped up a script that I'm using to help me expand and organize my photo library of my favorite pieces of art. The script takes the URL of an art piece, downloads the image of the artwork, and renames the downloaded file to follow a filename template (given as a CLI arg) based on the artist's name, the title of the piece, and the date of completion. For example, if we wanted to download an image of Duchamp's Fountain and have the file automatically named Marcel Duchamp - Fountain - 1917.jpg, we could use the following command:

python "" 
                   "%a - %t - %d"

because %a is automatically replaced with the artist's name (which is extracted from the webpage), %t is automatically replaced with the title of the piece, and %d with the date of the piece.
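
The script's internals aren't shown in this post, but the substitution it performs amounts to something like the following sketch (render_filename and its arguments are illustrative names, not the actual code):

# a minimal sketch of the filename templating described above
# (hypothetical helper; the real script's code is not shown here)
def render_filename(template, artist, title, date):
    return (template.replace("%a", artist)
                    .replace("%t", title)
                    .replace("%d", date))

print(render_filename("%a - %t - %d", "Marcel Duchamp", "Fountain", "1917") + ".jpg")
#> Marcel Duchamp - Fountain - 1917.jpg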

In this post, I'll demonstrate the use of decorators to the end of doing the bare minimum, and show how I saved myself from having to write tedious (but important) error-checking code. But before that, you, dear reader, have to be clear on what a decorator is. The section that follows is perhaps the greatest intro to decorators ever.

(If you are already familiar with decorators, you can skip to the section called "The problem"--though you may want to at least skim this section.)

What are decorators?

Here's the skinny on Python decorators. Grokking decorators necessitates an intuitive understanding of three concepts:

  • People often speak of functions as being "first-class citizens" in Python. By this they mean that functions are values that can be assigned to variables, returned from (other) functions, and passed as an argument to (still other) functions.
  • When a function (we'll call it outer) returns another function (we'll call this inner), the inner function "closes over" (remembers) variables defined in the enclosing scope (outer's scope). If the returned function is stored in a variable and called at a later time, it still remembers the variable(s) from the enclosing scope—even if it is called long after the outer function has finished running and its variables would otherwise be lost. The inner function that is returned is known as a "closure".
  • A language that supports closures affords us the unique opportunity to easily add or modify the behavior of a function a by creating a function b that takes function a as an argument, and returns a function c which does something, and then calls function a. Function c can now be used in place of function a--it is essentially function a plus some extra functionality: it is a "decorated" version of function a.

In order to concretize these concepts, let's see an example of a decorator complete with an illustration of the motivation behind creating it and the cognitive steps taken toward its finished state. Unlike some decorator tutorials, this lesson will not patronize you, dear reader, by designing an overly-simple decorator with no practical worth (it's always been my thought that this pedagogical strategy most often backfires). Instead, we'll create, you and I, a decorator of actual utility.

Suppose we wanted to time the execution of a function. Wanting something with a little more precision than a stopwatch, we decide to use the time module:

import time

def sleep_for_a_second():
    time.sleep(1)

start_time = time.time()
sleep_for_a_second()
end_time = time.time()

print("It took {0:.2f} seconds".format(end_time-start_time))
#> It took 1.00 seconds

This is ok, but if we want to time the execution of many different functions, this will result in a lot of repeated code. Being champions of the DRY principle, we decide it would be better to put this in a function:

def time_a_function(func):
    start_time = time.time()
    func()
    end_time = time.time()
    print("It took {0:.2f} seconds".format(end_time-start_time))

def sleep_for_a_second():
    time.sleep(1)

def sleep_for_two_seconds():
    time.sleep(2)

time_a_function(sleep_for_a_second)
time_a_function(sleep_for_two_seconds)
#> It took 1.00 seconds
#> It took 2.00 seconds

time_a_function is a function that takes the function we want to time as an argument.

We just timed two functions. Notice how we can now time an arbitrary number of functions with no extra code.

But there's an issue with this approach. We've hitherto been timing functions that take no arguments. How would we time a function that takes one or more arguments?

def sleep_for_n_seconds(n):
    time.sleep(n)

time_a_function(sleep_for_n_seconds(5))
#> TypeError: 'NoneType' object is not callable

Nope. Before, we were passing the variable that holds the function to time_a_function, but the above incantation evaluates sleep_for_n_seconds(5), passes its None return value to time_a_function, and--because time_a_function can't call None (it's not a function)--we get an error. So how are we going to time sleep_for_n_seconds?

The solution is to make a function that takes a function and returns a new function that takes an argument (n), calls the original function with it, times the call, and is then used in place of the original (whew!).

In other words:

def timer_decoration(func):
    def new_fn(n):
        start_time = time.time()
        func(n)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
    return new_fn

def sleep_for_n_seconds(n):
    time.sleep(n)

sleep_for_n_seconds = timer_decoration(sleep_for_n_seconds)
sleep_for_n_seconds(3)
#> It took 3.01 seconds

Study this code carefully. You've just written a decorator.

If you are confused, it--as always with programming--helps to type the code out yourself (no copying-and-pasting!) and play around with it.

Though it's not terribly unwieldy otherwise, Python gives us a nice, elegant way to tag a particular function with a decorator so that the function is automatically decorated (i.e., it doesn't require us to replace the original function):

@timer_decoration
def sleep_for_n_seconds(n):
    time.sleep(n)

# now the reassignment of 'sleep_for_n_seconds' is unnecessary
sleep_for_n_seconds(3)
#> It took 3.00 seconds

But what happens if we try to decorate the original sleep_for_a_second function?

@timer_decoration
def sleep_for_a_second():
    time.sleep(1)

# sleep_for_a_second()
#> TypeError: new_fn() missing 1 required positional argument: 'n'

sleep_for_a_second is now expecting 1 argument :(. We can generalize our decorator to handle functions that take an arbitrary number of arguments with *args and **kargs...

def timer_decoration(func):
    def new_fn(*args, **kargs):
        start_time = time.time()
        func(*args, **kargs)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
    return new_fn

@timer_decoration
def sleep_for_n_seconds(n):
    time.sleep(n)

@timer_decoration
def sleep_for_a_second():
    time.sleep(1)

@timer_decoration
def sleep_for_k_seconds(k=1):
    time.sleep(k)

sleep_for_n_seconds(3)
sleep_for_a_second()
sleep_for_k_seconds(k=4)
#> It took 3.00 seconds
#> It took 1.00 seconds
#> It took 4.00 seconds


Finally, let's rewrite our decorator to support returning the return value of the decorated function.

def timer_decoration(func):
    def new_fn(*args, **kargs):
        start_time = time.time()
        ret_val = func(*args, **kargs)
        end_time = time.time()
        print("It took {0:.2f} seconds".format(end_time-start_time))
        return ret_val
    return new_fn

@timer_decoration
def sleep_for_a_second_p():
    time.sleep(1)
    return True

print(sleep_for_a_second_p())
#> It took 1.01 seconds
#> True

Note that our decorator is now generalized enough to be used with any function... no matter what its return type is... no matter what arguments it takes...

It doesn't matter what the function is, the function's behavior remains the same except now it is "decorated" with functionality that times it.

This is just one example of a decorator with obvious generalized utility. You can also use decorators to perform memoization, static-typing-like type enforcement of function signatures, automatically retry functions that failed, and simulate non-strict evaluation.
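
To give a taste of one of those uses, here is a minimal sketch of a memoizing decorator (the standard library's functools.lru_cache does the same thing more robustly):

def memoize(func):
    cache = {}
    def wrapper(*args):
        # compute the result only the first time we see these arguments
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n-1) + fib(n-2)

print(fib(100))   # returns immediately instead of recursing exponentially
#> 354224848179261915075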

The problem

To review, I wrote a script that takes the URL of an artwork, downloads the image, and then names the file in accordance with a user-supplied format string that uses info about the artwork. With the help of the requests, lxml, and wget modules, a script to do this can be coded relatively quickly. The problem, though--which is common for scripts that talk to the web and don't do error-checking--is that the script is brittle. Without error-checking, any malformed URL, network interruption, invalid output path, or weird edge case (like an artwork without a title) will result in an unsightly error message and lengthy stack trace. Besides being aesthetically objectionable, if anyone else is using your script, this makes you look like an incompetent software engineer. So you have to bite the bullet and error-check.

The problem with error-checking is

  • If all possible errors are checked for separately and individually and handled appropriately (this is good practice), it will result in code often many times longer than the original code. Only a small fraction of the code will be the actual interesting logic of the program--most of it will now be mindless conditionals.
  • It's difficult for someone without training (me) to anticipate every possible error.
  • It takes a lot of work and I'm lazy

So my usual M.O. is to wrap each component in a try/except block (with no specificity in the exception), print an error message, and terminate execution...

    try:
        <brittle code>
    except:
        sys.exit("<brittle code> broke")

Except I don't even do that. Instead of wrapping each component in a try/except with its own error message, I just wind up try/excepting main once. This cuts down typing and carpal tunnel is a real thing...

def main():
    try:
        <literally everything>
    except:
        sys.exit("whoopsie daisy")

Decorators to the rescue

Okkkaaaayyyyy... if we absolutely must do some modicum of error checking around each component (so the user has some kind of clue as to why the script failed), we can write a decorator to do this for us. The following is an excerpt of the script as of commit d6be4956543:

# the decorator
def cop_out(f):
    def inner(*args, **kargs):
        try:
            return f(*args, **kargs)
        except:
            sys.exit("\nThe function <{}> failed\n".format(f.__name__))
    return inner

@cop_out
def get_command_line_arguments():
    return sys.argv[1], sys.argv[2]

@cop_out
def download_webpage(url):
    r = requests.get(url)
    if r.status_code != 200:
        raise Exception
    return r

@cop_out
def parse_webpage(requests_object):
    return lxml.html.fromstring(requests_object.text)

Note how we use f.__name__ to get the name of the function that was decorated. This allows us to add the support for specialized (at the function level, at least) error messages for free!

Now, if the user calls the script with too few arguments, the program will print The function <get_command_line_arguments> failed. If you give it a real URL but not to an artwork, it'll say The function <extract_artist_name_from_webpage> failed. If you give it a made-up URL, it'll say The function <download_webpage> failed, etc...

Sure, beyond the function level, you don't know why it failed, but anything is better than nothing and your users shouldn't be so bossy and entitled.

But one more thing... if you looked at the code, you'll notice that my function names are descriptive... maybe too long and descriptive. The use of prose-like descriptive function names (certainly by the standards of Haskell programmers) was no accident. Although it may seem like an uncharacteristically diligent and conscientious decision on my part, it was actually to facilitate further laziness. Consider the following tweak to the decorator:

def cop_out(f):
    def inner(*args, **kargs):
        try:
            return f(*args, **kargs)
        except:
            message = f.__name__.replace("_", " ")
            sys.exit("\nFailed to {}\n".format(message))
    return inner

Consider how this generates error messages that appear to be individualized...

$ python ""
Failed to get command line arguments

$ python3.4 ""
                        "%a/%t - %d"
Failed to download webpage

$ python2.7 ""
                         "%a/%t - %d"
Failed to extract artist name from webpage

So there you have it! Decorators can be used for legitimate, elegant solutions but can also be employed--virtually for free--to give the illusion that you are a caring software engineer and meticulous with your error checking.


If you're a potential employer or client, I'm just kidding--I'm very diligent about error checking. This piece is satire. I promise.


Computational foreign language learning: a study in Spanish verbs usage

Abstract: I did some computer-y stuff to construct a personal Spanish text corpus and create a Spanish verb study guide specifically tailored to the linguistic variety of Spanish I intend to consume and produce. It worked fairly well. It also revealed a (in some small way) generalizable depiction of the relative frequencies of Spanish verb tenses and moods. This technique may prove to be extremely beneficial to Spanish-language pedagogy. If you're uninterested in my motivations or procedure, you can skip to the section labeled "results".

As regular readers of this blog may be aware, one of my favorite activities is marshaling the skills that I use as a computational scientist to study the humanities. For example, in a previous post, we saw how principles from phylogenetic systematics helped textual critics reconstruct the original manuscript for "The Canterbury Tales"; in another, we deployed techniques first used to study physics to the end of fooling vineyards into retweeting fake, computer-generated wine reviews.

For this post, I used both tools from computational linguistics and some good-old-fashioned data wrangling (web-scraping, parsing texts, etc...) to create a custom-fit Spanish verb study guide.

The problems

Problem #1

Although foreign language immersion is almost certainly the best learning path for most types of foreign language learners, no reasonable student without a lavish budget for traveling can expect to get by without doing some rote memorization. In the context of Spanish verbs, this means either unguided memorization of a dictionary or consultation of a list of the most commonly used Spanish verbs. But even if you could trust that the most-popular-verbs list was compiled in a principled manner, there are vast regional and sub-culture-specific variations in verb frequency. For example, the verb coger means "to take" in Spain but in Central America it's... it's a… pretty vulgar verb. It stands to reason that there are pretty enormous differences in this verb's popularity across regions, contexts, and registers. Depending on which region's dialect you prioritize familiarity with, and depending on how raggle-taggle the people you intend to roll with are—or the media you intend to consume—a one-size-fits-all verb list might let you down.

Problem #2

English isn't a very inflective language—the tense (or person, mood, aspect, etc...) is largely determined not through verb conjugation but via periphrasis, the use of personal pronouns, and other auxiliary words. This is in stark contrast to Spanish, a highly-inflective, relatively synthetic language where the verb's conjugation betrays its tense, person, mood, and aspect—all in one word! This linguistic elegance is a learning obstacle, since one verb might be written in a little under 60 different ways (6 persons * (4 tenses in the indicative mood + 3 tenses in the subjunctive mood + 1 imperative mood)).

This pedagogical nightmare is partially allayed by careful prioritization of some tenses and moods, over others—at least initially. For example, a Spanish-language learner almost always learns the commonly-used and versatile present indicative tense first. But beyond the next few obvious choices, the order in which these tenses should be prioritized is not clear and (probably) dependent on how and where you expect to use and consume the language. Further complicating things, there are entire persons (here's looking to you, vosotros) that are very uncommon in most Spanish-speaking countries.

The solution

The solution to this problem is to create a personal corpus of Spanish text containing examples of the types of text you expect to consume and produce. Then, the verbs need to be identified, have their mood, tense, and person recorded, and be converted into infinitive form (for frequency tabulation). The relative frequencies of the persons, moods, and tenses—as well as the frequencies of the verbs (in infinitive form)—will inform the creation of a Spanish verb study guide specifically catered to the type of linguistic variety the learner intends to employ. Whether the learner's primary interest in learning Spanish is to bond with a new family member over their love of Mexican telenovelas or to read and understand Don Quixote in its entirety, this approach will hasten the learner's sense of accomplishment relative to cookie-cutter verb study guides, increase learner satisfaction, and increase the likelihood of the learner actually achieving language mastery. I mean, as a learner myself, I would be discouraged if I felt like the main payoff of studying Spanish were to read and understand books that are very obviously juvenile or primarily meant for pedagogical purposes. I want to read Márquez and I want to read him now!

The corpus

For my particular corpus, I chose a whole mess of books (most of which I've read—and loved—in English) that I'm interested in reading in the original language. These include Rayuelas and Final De Juego by Julio Cortázar (my favorite short story writer), Cien Años De Soledad by Gabriel García Márquez (generally considered to be a masterpiece), Darios de Motocicleta by Che Guevara, Ficciones by Jorge Luis Borges, and La Cuidad De Las Bestias by Isabel Allende. These texts were obtained electronically—legitimately!—and I used various ad-hoc regexes to remove formatting and conversion-from-PDF-to-text artifacts.

My interest in Spanish isn't only for consuming literature, though; I wanted to include other sources of text, like movie scripts (I planned on Lo Que le Pasó a Santiago, generally considered to be one of the best Puerto Rican films), but I couldn't find the script online. I also wanted to include the lyrics to my favorite Spanish-language bands (Soda Stereo, El Ultimo Vecíno, Décima Víctima, Caifenes, Shakira, Millie Quezada, ...) but the tool I used to identify the verbs in the corpus often choked on these texts. Why, you ask?...

Parts-of-speech tagging

references are at the bottom of the post

Parts-of-speech tagging (hereafter, 'POS tagging') is when you go through a text and, for each word, identify which part of speech (verb, noun, adjective, etc...) the word functions as.

This is a non-trivial task because the same word can function as different parts of speech depending on the context. Take the following sentence, for example, which is an expanded and modified version of a sentence used as an example in this video:

Fruit flies like bananas

Taken individually, all the words in this sentence can function as multiple parts of speech. Take "like", for instance; it can be a noun ("my status got mad likes"), a verb ("I like your status"), a quotative ("I was like, 'I enjoyed your status'"), a conjunction ("I updated my status like the world depended on it"), or a preposition ("I wrote my status like Nathaniel Hawthorne"). Depending on how colloquial the text in question is, "like" can even be used as a discourse marker ("I'm, like, scared of ghosts, Scoob"). As a standalone word, "like" can serve the purpose of 6 different parts of speech.

But even looking at the entire sentence as a whole, the parts-of-speech for each word is ambiguous.

Concretely, the sentence can be interpreted as (a) "fruit flies (noun) like (verb) bananas (noun)", (b) "fruit (noun) flies (verb) like (preposition) bananas (noun) [do]", or even (c) "fruit (noun) flies (verb) like (conjunction(?)) bananas (adjective)"—using the colloquial meaning of the word bananas meaning "crazy".

Note that the POS tag for one word is conditional on the POS tags of other words: whether flies is a noun or a verb affects whether bananas is interpretable as an adjective.

Because this task isn't easy, this job used to be left to humans to perform. Now, various techniques allow for this to be done programmatically to a high degree of accuracy. We'll go through a few of them, ending with the sophisticated method employed by the POS tagger that we will be using, the Stanford Parts-of-speech tagger.

Unigram tagging

A training corpus with the POS tags for each word is read and, for each unique word, the number of times it is used as each of the various parts of speech is tallied. When a word is encountered in untagged text, the tagger chooses the part of speech that the word is most commonly used as in the training text. If the word was not in the training text at all, it defaults to a noun. Somehow, this context-free, elementary method can yield accuracies of 90%-94% (Brill & Wu, 1998). When Brill and Wu used this method on the famous Penn Treebank Wall Street Journal corpus with an 80%/20% training/testing split, it achieved 93.3% accuracy.
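
As a rough illustration (this is NLTK, not the Stanford tagger used later in this post, and it assumes you've downloaded the treebank sample via, a unigram tagger can be trained and evaluated in a few lines:

import nltk
from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()
cutoff = int(len(tagged_sents) * 0.8)
train_sents, test_sents = tagged_sents[:cutoff], tagged_sents[cutoff:]

# every word gets the tag it most often carried in training;
# words never seen in training back off to "NN" (noun)
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger("NN"))
print(unigram_tagger.evaluate(test_sents))   # accuracy roughly in the 0.9 ballpark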

n-gram tagging

Using an n-gram model, the tag of a particular word is assumed to be conditionally dependent on the tag of the preceding n-1 words. For example, in a bigram model, the tag of the current word is guessed from the current word, and the tag of the previous word. A trigram model uses tag information from the previous two words, in concert with the conditional probability of a particular tag given a certain word. The unigram tagger is a special case of the n-gram tagger where n is 1. It's not hard to see that n-gram tagging will offer an enormous accuracy improvement.
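
Continuing the NLTK sketch from above, an n-gram tagger is typically chained so that it backs off to the unigram tagger (and then to the default noun tag) whenever it encounters a word/previous-tag combination it never saw in training:

# falls back to the unigram tagger for unseen word/previous-tag combinations
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag("Fruit flies like bananas .".split()))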

If this reminds you of the Markov chains that we made use of in the previous post on computer-generating wine reviews, then you have a good eye. N-gram tagging is a type of Hidden Markov Model (HMM). What makes HMMs different than simple Markov models is that the states themselves (the POS tags) are not directly observable; the observable portion of each state are the actual words—and the words are only a probabilistic function of the state.

In addition to testing a unigram model, Brill and Wu also tested this technique's ability on the WSJ corpus. In particular, they used a trigram tagger—with a twist. Weischedel et al. (1993) noted that the suffix of a word (-ed, -s, -ing, -ion, -ly, etc...) strongly influences the probability that the word serves as a particular part of speech. When this information was wielded to help classify unknown words, it greatly improved accuracy outcomes. When Brill and Wu used this method with a trigram tagger against the WSJ corpus, the technique yielded a 96.4% accuracy rate.

Maximum Entropy models

Maximum Entropy models are a lot like—insofar as they are equivalent to—multinomial logistic regression models that attempt to model the probability of a given tag class given various predictor variables, or features. Maximum entropy models can use features such as the current word, the previous word, the previous word's tag, etc...—as an HMM would—but also features like whether the word contains a number, whether the word is capitalized, etc... An optimization algorithm called Generalized Iterative Scaling selects the feature weights that maximize the likelihood function.
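
To make the idea of "features" concrete, here is a sketch of the kind of feature dictionary such a model might extract for the word at position i (the feature names are illustrative, not the Stanford tagger's actual feature set):

def extract_features(words, previous_tags, i):
    word = words[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:],                        # -ed, -ing, -ion, ...
        "is_capitalized": word[0].isupper(),
        "contains_digit": any(c.isdigit() for c in word),
        "prev_word": words[i-1].lower() if i > 0 else "<START>",
        "prev_tag": previous_tags[i-1] if i > 0 else "<START>",
    }

Each feature gets a weight, and the model turns the weighted combination of features into a probability distribution over the possible tags for that word.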

Ratnaparkhi (1996) tested a straightforward maximum entropy model on the WSJ corpus and noted that it yielded an accuracy of 96.6%. Four years after that, Toutanova et al. (2000) published a paper in which they showed that by adding additional features, like whether the word is capitalized and in the middle of a sentence, and non-local features that look 8 words back for a modal verb (for disambiguating base-form verbs and non-3rd-person singular present verbs), they could achieve a WSJ accuracy of 96.8%. This is the benefit of the Maximum Entropy model approach—you can arbitrarily add features (within reason) without necessarily knowing how those features contribute to the probabilities of tag outputs.

Three years after that, Toutanova et al. (2003) achieved a 97.2% accuracy rate on the WSJ corpus by (a) adding features for the words following the word currently being tagged, and (b) using regularization to combat overfitting as a result of using many features—many of which probably only weakly contribute information of the probability of the current word's tag class. Their regularization technique involved placing a zero-centered Gaussian prior on the feature weights and is mathematically tantamount to the L2 regularization that we saw in this previous blog post. This state-of-the-art tagger is the one on which the Stanford tagger we use is based.

[There is another famous type of POS tagger called the Transformation-Based tagger. In contrast to all the others mentioned above, this is not a probabilistic/stochastic model; it is, instead, based on rules and knowledge. I won't describe it here because it's very different and this post is already too long, but I should mention that it can score 96.6% on the WSJ corpus (Brill & Wu, 1998).]

The procedure

These steps assume a POSIX-compliant system and some command-line proficiency.
You can find a repo with all the code here.

  • Downloaded full version of the Stanford Parts-of-speech tagger
  • Ran the tagger on the text, put each tag on a separate line, and filtered for verbs only. The parts-of-speech were identified using this tagset. As you can see, the verbs all start with the letter "v". This can be achieved by the following incantation:

    ./ models/spanish.tagger THE_BOOK.txt | perl -pe 's/ /\n/g' | grep '_v' > tmp

    If this causes you problems, you might want to try giving the tagger (which runs on multiple cores!) more memory; try adding -Xmx2048M as an argument to the java command in ./—this will give it 2GB to work with.

  • For each work, I ran a parsing script on it, which parsed the Stanford tags and put them in a nice tab-delimited format:

    ./ < tmp > ./output-verbs/THE_BOOK.txt

  • Catted all of them together into all.txt–a monstrous text file with 84,437 words that the tagger interpreted as verbs:

    cat rayuelas.txt final-de-juego.txt darios-de-motocicleta.txt cien-anos-de-soledad.txt ficciones.txt la-cuidad-de-las-bestias.txt > all.txt

Now we need to get the infinitives but, in order to prioritize which verbs we should get the infinitives for and to avoid repeating conjugated verbs, we first need to get the uniques...

  • So I ran

    cat all.txt | perl -pe 's/(.+?)\t.*/\1/g' > all-verbs.txt

    to get a list of only verbs (no mood or tense)

  • I wanted to get a list of unique verbs sorted by the number of occurrences; this would normally be a job for sort | uniq -c. Desafortunadamente, this command fails. It turns out that unicode can represent (for example) habría in at least two different ways. For this reason, we have to use a python script that uses the unicodedata module to normalize the verbs and then count them (a sketch of this step follows the command below).

    ./ | tee all-verbs-count.txt
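
Here's a small sketch of what that normalize-then-count step looks like (the real script's name and details were not shown, so this is only an approximation using collections.Counter):

import unicodedata
from collections import Counter

counts = Counter()
with open("all-verbs.txt", "rb") as f:
    for line in f:
        verb = line.decode("utf-8").strip()
        # NFC normalization makes the composed and decomposed forms of
        # accented words (e.g. 'habría') compare equal
        counts[unicodedata.normalize("NFC", verb)] += 1

for verb, count in counts.most_common():
    print("{}\t{}".format(verb, count))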

Ok, now we're ready to get infinitive forms for these verbs. We are going to do this by programmatically making requests to translate each word to the (excellent) website SpanishD!ct. What we want can be extracted from the returned HTML via CSS selectors.

  • A script goes through each line of all-verbs-count.txt and constructs the url to query the website with. It then uses the CSS selector ".mismatch" to get information about the verb. In the best-case scenario, it says something like " is the ____ form of _____ in the ____". Sometimes there's more than one possible person or tense, so it says "____ represents different conjugations of the verb _____". In either case, we get the infinitive. If it fails, we record the failure and move on. The script waits between 1 and 2 seconds between each verb. After every 20 verbs, it dumps the JSON so that if something bad happens I can just load the intermediate results and restart.
  • You can see that the SpanishDict infinitive conversion systematically failed for certain words. For example, it interpreted inflected verbs like he, dice, and era as English words to translate, not Spanish words to provide information for. In other cases, it interpreted a verb’s past participle (aburrir -> aburrido ("to bore")) as an adjective ("boring"). I manually filled in many of the ones that failed using equal parts regex and black magic. This went into finished-supplemented.json.
  • Finally, we need to inner join all.txt to the information in finished-supplemented.json. The script does this:

    ./ | tee tagged-plus-infinitives.txt 

The tab-delimited tagged-plus-infinitives.txt is now ready to be consumed for analysis.

Some numbers

  • Rayuelas - 203,197 words - 29,882 verbs
    Final de juego - 54,303 words - 8,160 verbs
    Darios de Motocicleta - 53,804 words - 6,557 verbs
    Cien Años de Soledad - 154,381 words - 20,987 verbs
    Ficciones - 48,845 words - 5,769 verbs
    La Cuidad De Las Bestias - 94,075 words - 13,082 verbs
  • There were 84,437 words that the tagger identified as verbs in all.
  • There were 13,972 unique conjugated verbs.
  • After the first try with SpanishDict, we had infinitives for only 6,852 verbs. This number increased greatly with the black magic alluded to in the previous section.
  • I went from 84,437 to 71,378 verbs when I inner joined with the verbs that I was able to find infinitives for.

The results

Figure 1: Proportion of Spanish verb moods and tenses in corpus

The results were rather fascinating. These were the 14 most common conjugated verbs:


(you can see the full spreadsheet here)

With this information alone, this whole endeavor was worth it. Sure, most of the verbs in this list aren’t that much of a surprise, but there are two pieces of information that could prove really helpful to me. The first is that 4 verbs in the top 15 are forms of the verb haber ("to have")—including the very first one, which accounts for 3.6% of all conjugated verbs in the corpus. This is a verb that I was, heretofore, relatively unfamiliar with.

In contrast to tener (which also means "to have"), haber is often used as an auxiliary verb, as it is in such English sentences as "I have to go to the dentist", "I had all but lost it" (past perfect tense), and "there is a freeze-up coming". Because of its ubiquitous usage as an auxiliary word (it is used, for example, in all sentences in the perfect mood), I should get more familiar with this verb and its conjugations if I ever hope to read these works of literature.

The second important piece of information for me was that a majority of the verbs in the top 14 were in the imperfect tense (a type of past tense). Now, I think I may have been concentrating too much on the preterite tense (another past tense) in comparison.

Next, these were the 14 most common verbs when put into infinitive form:


(you can see the full spreadsheet here)

To me, there wasn't really anything unexpected here except for maybe pasar (to happen) and parecer (to seem), which I was, up until this point, relatively unfamiliar with in spite of the fact that they are used in a number of frequently spoken expressions like ¿Qué pasó? ("What happened?") and ¿Qué te parece? (~"What do you think?").

Finally, figure 1 is a plot which depicts the proportions in which each mood and tense occur. The large vertical bars show the relative proportions of each mood (I count the Infinitive, Gerund, and Participle as moods) in descending order; they are Indicative (65%), Infinitive (20%), Subjunctive (4%), Participle (4%), Gerund (3%), and Imperative (1%). Each vertical bar is further broken down by the proportion of each tense within that mood (sorted, with the most frequently used on the bottom). For example, the present tense is the most common tense in the indicative mood and accounts for 26% of all mood/tense pairs. The Infinitive, Participle, and Imperative moods (to the extent that these are actually moods) have only one tense (to the extent that they can be said to have tenses).

These results were most surprising to me; for one, I was (again) reminded that I should probably treat nailing down the imperfect tense with as much or more importance as I do the preterite tense. Second, I was surprised that usage of the future tense was far eclipsed by the gerund, the participle, and both subjunctive tenses—in spite of the fact that I use it quite often in my texts to my friends and in my internal monologue. Of course, this—and other insights—may just be artifacts of the particular body of literature I chose for my corpus (see next section).

Although this was a wildly fun project that yielded interesting and extremely practical insights, there are a number of important caveats to be aware of when interpreting these results.

First is a generalizability issue: the results indicate the verb popularity and mood/tense breakdowns for just 6 pieces of Spanish literature. Because of this, the corpus is heavily dominated by the writing style of the included authors—at least some of whom have a very idiosyncratic writing style. Additionally, as with most literature, all of the non-short-stories in my corpus were told in the past tense (usually by a third-person omniscient narrator). This past-tense bias is very clearly non-representative of everyday spoken Spanish (of course, it was never meant to be representative of that). This problem could have been, at least partially, alleviated via the inclusion of more prosaic Spanish from movie scripts and blogs—if only they had POS-tagged correctly!

Speaking of tagging correctly, the second issue is one of the correctness of the POS tags. The best POS taggers (Stanford's is certainly one) can, at best, achieve an accuracy of 97%. Although this is an incredible feat of computational linguistics and the product of many, many years of research, it is important to put this in the proper perspective. Recall that (a) the rudimentary unigram tagger can achieve a 90%-94% accuracy rate and (b) the 97% accuracy rate decreases as the testing corpus diverges in style from the training corpus. Especially because of Cortázar—who (at least in English translations) employs highly unusual sentence structure and often straight-up grammatically-incorrect, non-human-parsable sentences—this fact must be kept in mind; unless the Spanish model that comes with Stanford was trained on Surrealist literature (it wasn't!), tag accuracy will suffer.


Brill, Eric, and Jun Wu. "Classifier combination for improved lexical disambiguation." Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 1998.

Ratnaparkhi, Adwait. "A maximum entropy model for part-of-speech tagging." Proceedings of the conference on empirical methods in natural language processing. Vol. 1. 1996.

Toutanova, Kristina, and Christopher D. Manning. "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger." Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13. Association for Computational Linguistics, 2000.

Toutanova, Kristina, et al. "Feature-rich part-of-speech tagging with a cyclic dependency network." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.

Weischedel, Ralph, et al. "Coping with ambiguity and unknown words through probabilistic models." Computational linguistics 19.2 (1993): 361-382.


What does Flatland have to do with Haskell?

Edwin Abbott's 1884 novella, Flatland, recounts the misadventures of a square that lives in a two-dimensional world called "Flatland". In this story, the square has a dream where he visits a one-dimensional world (Lineland) and unsuccessfully tries to educate the populace about Flatland's existence. Shortly thereafter, a sphere visits Flatland to introduce our protagonist to his own home, Spaceland. The sphere looks like a circle to the square, because the square can only see the part of the sphere that intersects Flatland's plane. The square can't fathom Spaceland until the sphere actually brings him into the third dimension.

Having his view of reality sufficiently rocked, the square postulates the existence of still higher dimensional lands. The sphere denies this possibility and returns the square to Flatland on bad terms.

The irony here is that, in spite of having witnessed the square's earlier insistence that Flatland was the only reality there is, the square's corresponding limitation of thought, and the square's subsequent enlightenment, the sphere arrogantly asserts that his own land represents the limits of dimensionality and that there can be no more dimensions.

I begin with this summary not as an impromptu book report, but because this is a perfect analogy for why I've decided to learn Haskell.
It isn't hard to imagine the inveterate C programmer being hesitant to embrace higher-level languages like Python and Perl. "Imagine the whole world that opens up to you when you don't have to worry about memory management and resolving pointers," requests the proselytizing Python programmer.

But the C programmer got on fine without knowing Python all this time. Besides, the (standard) Python interpreter is written in C.

"Why? So I can change types on a whim?" retorts the C programmer. " I can pass functions as arguments to other functions? Big deal! I can do that with function pointers in C."

While it is technically true that, in C, functions can be passed as arguments to other functions and can be returned by functions, this is only a small part of the functional programming toolkit that Python allows. But it is the only part that most programmers trained only in C can understand. This is like when the square thought he had seen the sphere because he saw the circle where the sphere intersected Flatland. There's a bigger picture (lambda functions, currying, closures) that only appears when the constraints imposed by a programming paradigm are pushed back.

Shifting tides in industry eventually compel the C programmer to give Python a shot. Once she is fluent in Python and its idioms (becomes a "pythonista", as they say) things begin to change. The OOP of Python allows the programmer to gain a new perspective on programming that had been, up to this point, unavailable to her simply because the concept doesn't exist in C.

Resolved to follow the road to higher abstraction, the (former) C programmer asks the Python programmer if there are other languages that will provoke a similar shift in thought relative to even Python. "Haskell, perhaps?" she offers.

The Python programmer scoffs, "Nah, Python is as good as it gets. Besides, purely functional programmers are a bunch of weirdos."


Haskell is a relatively obscure programming language (20th place or lower on most popularity indices), but accounts for a disproportionate number of the "this-language-will-change-the-way-you-look-at-programming" claims. By the accounts of Haskell aficionados, finally understanding Haskell is a transformative experience.

Up until recently, I found myself mirroring the thoughts and behavior of the Python programmer in this story; I kept hearing about how great Haskell is, and how it would facilitate an enormous shift in how I view computer programming, but I largely dismissed these claims as the rantings of eccentrics more concerned with mathematical elegance than practical concerns.

Is it any wonder that I didn't believe the Haskell programmers? I couldn't see the benefits of Haskell within the constraints of how I view computer programming—constraints subtly imposed by my language of choice. (Lisp programmer Paul Graham describes this as the "Blub Paradox" in his essay Beating The Averages.)

But, just as the sphere and the arrogant Python programmer should, I have to remain open to the idea that there's a whole world out there that I'm not availing myself of simply because I'm too obstinate to try it.

Any language that is Turing-complete will let you do anything; it's trivially true that there is nothing you can do in one language that you can't do in another. What sets languages apart (from most trivial to least) is the elegance of the syntax, its community and third-party packages, the ease with which you can perform certain tasks, and whether you'll achieve enlightenment as a result of using it. I don't know if Haskell will do this for me, but I think I ought to give it a shot.

If the ideas of relativity when applied to the domain of programming languages interests you, you might also be interested in the Sapir–Whorf hypothesis, which applies similar ideas mentioned in this post to the realm of natural languages.


Sending text messages at random times using python

Given my interest in applying statistics and analytics to most (if not all) quantifiable aspects of my life, when I learned about self-tracking and the associated 'Quantified Self' movement, it should come as no surprise to anyone who knows me that I wanted to get started right away.
Given my interest in making life harder than it needs to be, it makes sense that I would eschew existing self-tracking tools and build my own. A neat side-effect of this obstinance is getting to learn new things.

The basic idea is that, at random times during the day, I fill out a survey that I designed for myself, including questions such as: "How happy are you right now?", "How much energy would you say that you have right now?", and "Where are you right now?".

The most reliable and fastest way to get in touch with me is to send a text message. So, sending myself text messages at random times during the day is the best way to prompt me to fill out this self-tracking survey.

To make it easier (and, therefore, more likely that I'll fill it out) the content of the text message should be a link to the survey on the web. And in order to add flexibility to when I have to fill out the survey form but also preserve the randomness of the sampling, the timestamp of the time the text message was sent should be included as a url parameter so that it can be stored in the database along with the answers to the survey.

The service that sends these text messages runs on a Debian GNU/Linux EC2 instance that also hosts the form I fill out and the database that the answers are dumped to.

Before we get to the code, I should explain the modules that we will need for this task, and my rationale for choosing them.

Trying to debug a scheduled task or workflow is a living hell without proper and verbose logging. Since this must be run in the background (and not tied to a particular terminal emulator) simple print statements will not do. The more elegant, scalable, and extensible solution is to use Python's excellent 'logging' module.

While there are a few different ways to send text messages (SMS) using Python, the solution I settled on is to use the 'smtplib' standard library module to send an email to an SMS gateway. This gateway will convert the email into a text message sent to my phone. smtplib is needed to send the email message.

Although cron (or equivalently [?] Windows Scheduling Service) should be the tool of choice when scheduling commands to be run at specific times that never change, the fact that the text messages have to be sent at different times everyday requires another solution. Probably the most elegant and cross-platform solution is to use the advanced python scheduling library, apscheduler. The Python standard library comes with a similar module, sched, but apscheduler is more advanced with its scheduling capability and its ability to persistently store tasks in a database that survives process restart. (It supports storage in SQLite, PostgreSQL, MongoDB, Redis, MySQL, Oracle, MS-SQL, Firebird, and Sybase). But, unlike its standard library counterpart, it needs to be pip installed.

We will divide this task up into two python scripts: one that gets run once a day, computes n random times, schedules a text message to be sent at each of those times, and then sends the messages (we will call this; and one script that runs once and calls send_daily_texts at midnight every day (we will call this

#!/usr/bin/python -tt

import random
import sys
import logging
import smtplib
import email.utils
from email.mime.text import MIMEText
from datetime import datetime, timedelta, date
from apscheduler.schedulers.blocking import BlockingScheduler

# create logger
logger = logging.getLogger(__name__)
handler = logging.FileHandler('send_daily_texts.log')
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # assumption: the level must be lowered for the info() calls to be recorded"[{}] - send_daily_texts was run".format(

# the number of times to schedule and send text messages
# are provided as a command line argument
n = int(sys.argv[1])"[{}] - going to choose {} random times".format(, n))

# we need to parse today's state to properly
# schedule the text message sending
dadate =
year = dadate.year
month = dadate.month
day =

# the lower bound is 8 o' clock
lower_bound = datetime(year, month, day, 8, 0, 0)"[{}] - the lower bound is {}".format(, lower_bound))

# the upper bound is 11 o' clock PM
upper_bound = datetime(year, month, day, 23, 0, 0)"[{}] - the upper bound is {}".format(, upper_bound))

sched = BlockingScheduler()"[{}] - Created blocking scheduler".format(


def encode_timestamp(timestamp):
    return str(timestamp).replace(" ", "+").replace(":", "%3A")

def make_message(timestamp, wherefrom, whereto):
    slug = "http://THELINKURL/?timestamp={}".format(encode_timestamp(timestamp))
    msg = MIMEText(slug)
    msg['To'] = email.utils.formataddr(('Recipient', whereto))
    msg['From'] = email.utils.formataddr(('Author', wherefrom))
    msg['Subject'] = 'Time for the survey!'
    return msg

# NOTE: the definitions of 'wherefrom', 'whereto', and 'gmail_pw' were not part
# of this excerpt; the values below are placeholders (see the note after the code)
wherefrom = "YOUR_GMAIL_ADDRESS@gmail.com"
whereto = ""
gmail_pw = "YOUR_GMAIL_PASSWORD"

def send_text(should_exit=False):'[{}] - trigger triggered, going to send text'.format('[{}] - attempting to connect to gmail'.format(
    server = smtplib.SMTP("", 587)
    server.starttls()  # gmail requires TLS on port 587
    server.login(wherefrom, gmail_pw)'[{}] - successfully connected to gmail'.format(
    timestamp =
    damsg = make_message(timestamp, wherefrom, whereto)'[{}] - going to send message {} to {}'.format(,
                                                            damsg.as_string().replace('\n', '<br>'),
    ret = server.sendmail(wherefrom, [whereto], damsg.as_string())
    if should_exit:'[{}] - finished... going to exit'.format(

def random_time(start, end):
    sec_diff = int((end-start).total_seconds())
    secs_to_add = random.randint(0, sec_diff)
    return start + timedelta(seconds=secs_to_add)

def get_n_random_times(n, start, end):
    times = []
    for i in range(0, n):
        times.append(random_time(start, end))
    return times

times = get_n_random_times(n, lower_bound, upper_bound)"[{}] - Received {} times to schedule".format(, len(times)))

for ind, atime in enumerate(times):
    if ind == (n-1):
        # the last job tells send_text to exit once its message is sent
        # (passed via kwargs so that should_exit arrives as a keyword argument)
        sched.add_job(send_text, 'date', run_date=atime,
                      kwargs={"should_exit": True})"[{}] - added last task at {}".format(, atime))
    else:
        sched.add_job(send_text, 'date', run_date=atime)"[{}] - added task at {}".format(, atime))

sched.start()"[{}] - everything is done".format(

Before I describe "" there are a few things I should note about the snippet above.

When I originally wrote this script, the text messages wouldn't send, even though the logger indicated that they had. I assumed this was because gmail rejected the message for not looking enough like an email. In order to correct this, I needed to use the email.mime.text module to add the standard email headers to the message to be sent.

Since I am only interested in experience-sampling my waking life, I didn't want to fill out the survey during hours when I am normally asleep. I had to make sure that I set 8 o'clock and 23 o'clock (11pm) as my lower and upper bounds, respectively.

Third, if you decide to cannibalize this code, make sure you change the values for 'wherefrom', 'whereto', and 'gmail_pw'. The format of the SMS gateway address you should use depends on your mobile carrier. My particular SMS gateway address is my 10-digit phone number; yours will likely be different--consult this list.

#!/usr/bin/python -tt

import sys
import logging
from datetime import datetime
from subprocess import Popen, PIPE
from apscheduler.schedulers.blocking import BlockingScheduler

def run_daily_surveys(thelogger):"[{}] - Trigger triggered".format("[{}] - Going to run daily script".format(
    p = Popen('./ 3', shell=True, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if p.returncode:
        thelogger.error("[{}] - Failed to run daily script".format(
        sys.exit("Failed to run daily script")"[{}] - Ran daily script".format(
    if p.returncode:
        sys.exit("Command failed to run")

def main():
    logger = logging.getLogger(__name__)
    handler = logging.FileHandler('run_everyday.log')
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # assumption: the level must be lowered for the info() calls to be recorded"[{}] - was run".format(
    sched = BlockingScheduler()"[{}] - blocking scheduler was created".format(
    sched.add_job(run_daily_surveys, 'interval', days=1, args=[logger])"[{}] - everyday task added, going to start the scheduler".format(
    sched.start()
    return 0

if __name__ == '__main__':
    STATUS = main()
    sys.exit(STATUS)

I've been running these tasks for about a week now, and it's working great!

My next couple of blog posts will be about server-side code and architecture to support my self-tracking project.


Damn the torpedoes, full speed ahead: making the switch to Python 3

Python 3 has been out since 2008 (and realistically usable since 2009).
In spite of this four-year availability period, Python 3 has yet to see widespread adoption, particularly among groups in the scientific community. In the company of data scientists/statisticians, when someone says they've written their own Python code to perform some task, it's usually assumed that they are talking about Python 2; it is Python 3 that requires the version-number qualification.

There are members of the community (I used to include myself in this category) that are really happy with Python(2) and hope that if they ignore Python 3 it will just go away.

It won't though. The fact of the matter is that "Python 2.x is legacy, Python 3.x is the present and future of the language."

So how do we get people to adopt Python 3? In my opinion, there are three key strategies:

  • go softer on Python 3 denialists, perhaps with a Python 2.8 (Guido said this will not happen)
  • go harder on Python 3 denialists by discontinuing 2.7 maintenance
  • serve as an example to programmers (especially new ones) by switching your default python interpreter to python3.

As you, dear reader, can probably tell from my wording, I personally favor strategy 3.

Part of that solution involves vendors shipping Python 3 by default. We are making some progress in this regard (Arch GNU/Linux now has python sym-linked to python3, and Fedora and Ubuntu have stated that they will follow suit), but we still have a lot of work to do. A huge step forward would be if Apple shipped Macs with Python 3. Current Macs use 2.7, which wasn't released until 2010. This means that they could have used Python 3 instead. That would have really shaken things up, because a lot of my friends and colleagues in my field just use the Apple-supplied Python interpreter for analytics (via the SciPy Superpack). The extension of 2.7 support until 2020 will unfortunately afford Apple the opportunity to be lackadaisical about porting to Python 3, because they might only do so when upstream maintenance ends (and maybe not even then).

The other part of strategy 3 involves personally serving as an example by using Python 3 as your default interpreter.

But I can't rationally will that more people do this if I am unwilling to do it myself. While it's true that my big open-source Python project meant for widespread public consumption was very carefully made Python 3 compatible, I noticed that the code on this blog is often Python 3 incompatible. This is primarily because the Python code I quickly whip up is run through my default Python 2 interpreter. My reluctance to switch to Python 3 by default is helping to contribute to Python 3's slow adoption and implicitly serving notice that it's OK to still use Python 2.

But I no longer want to be party to this transition quagmire (and the ASCII-normative cultural hegemony). Because of this, I recently took the plunge and switched my default Python stack to Python 3. Damn the torpedoes, full speed ahead!

I was, perhaps, in a better position to do this than some because all of my most-used third-party Python packages have already been ported; this includes the SciPy ecosystem (NumPy, SciPy, pandas, scikit-learn), IPython, lxml, networkx, BeautifulSoup, and requests. It was really easy for me to ditch the Apple-supplied Python interpreter in favor of MacPorts' build of Python 3.4. I was even able to install most of my favorite third-party packages using MacPorts (the ones that weren't available I pip-installed as "user" so as not to muck up the MacPorts installation prefix). The only hard part about the switch was that almost all of my system Python code stopped working; everything from my own system utilities to the battery-life indicator in my tmux panes.

While fixing all of this wayward code, I took notice of the incompatibilities that caused me the most trouble:

  • changes to exception handling
  • changing xrange() to range()
  • changing raw_input() to input()
  • changing my print statements into functions
  • explicitly requesting relative module imports
  • wrapping map() function calls in "list()" because map() now returns an iterator (I actually just re-wrote the code to use list comprehensions.)

But by far the incompatibility that caused the most heartache was the set of I/O changes, and this is mostly due to the way Python 3 handles unicode.

There is far too much to say about Unicode in Python 3 to provide a detailed explanation in this post (if you're interested, I've put some great learning resources at the end of this post) but, essentially, Python no longer allows you to willy-nilly mix 8-bit string data and text objects. If you find yourself asking "Why am I having such trouble with my porting?" then the answer may be that you were playing fast-and-loose with mixing (what probably always should have been) incompatible types: unicode strings and 8-bit string data.

For this reason, Python 2 was easier to deal with for programmers/scientists who dealt largely with numbers or ASCII-only text, but this won't cut it anymore. It's unreasonable and culturally chauvinistic to require that non-English-speaking programmers misspell (or transliterate) their names and variables just to contribute to a non-internationalized codebase.

Before you learn any more hacks to get unicode working in Python 2.x, consider switching to Python 3. Basically, the new I/O/string paradigm in Python 3 comes down to the following (a short sketch follows the list):

  • understanding unicode and utf-8
  • decoding input as early as you can
  • encoding output as late as you can
  • only working with unicode strings within the program (no bytes!)
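
A minimal sketch of that decode-early/encode-late pattern (the file names here are made up):

with open("names.txt", "rb") as f:
    raw = f.read()                     # bytes from the outside world
text = raw.decode("utf-8")             # decode input as early as possible

greeting = "hola, " + text.strip()     # work only with str (unicode) internally

with open("greeting.txt", "wb") as f:
    f.write(greeting.encode("utf-8"))  # encode output as late as possible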

If you're reading this blog, you're probably a data scientist/statistician (or my parents) and you can almost assuredly make the switch to Python 3 since virtually all of our most prized packages have been ported (even nltk has a branch that works with Python 3). Just to make sure, you can use this tool that tracks the Python 3-readiness of some popular packages.

Final notes:
This post was in no way meant to shame programmers/scientists who still choose to use Python 2.x. I completely understand and sympathize with those that have very large Python 2 codebases to maintain or are locked into a particular Python version because of their company policy or their clients' needs.

If you are interested in using Python 3, though, but want to approach it cautiously, consider using the 2to3 conversion tool to see how your code would need to change. Another great strategy is to use the various __future__ imports to ease the transition. Something like:

from __future__ import division, absolute_import, print_function, unicode_literals

At the very least, you can call python with the "-3" flag to see possible problems and incompatibilities. You can do this by changing your shebang line to something like

#!/usr/bin/env python -3

or create the following shell alias

alias python="python -3"

Resources for learning Python 3 I/O and unicode:

Other notes from Python 3 apologists:


How to make an absurd twitter bot in python

In my last post, I outlined the steps I took to programmatically mimic the wine reviews of a dilettante sommelier. In this post, I'll explain the steps I took to create the twitter bot @HorseWineReview which combines a random wine with a random computer-generated review. I'll keep it short and sweet–the steps are as follows:

  • get a list of wines (from Freebase)
  • create a twitter account and application
  • write script to create and post the tweet
  • automate it with a cron job

Get a list of wines (from Freebase)
Freebase is a collaborative knowledge base that uses a graph database to store semantic information. Information can be retrieved by running MQL queries on their web interface or through a google-powered API. We'll test a query first on their query editor online, but move to accessing the result from the API once we can verify that our query is constructed properly.
The query is very simple; it looks like this:

[{
  "name": null,
  "type": "/wine/wine"
}]

This will return a list of names of all entities of type "/wine/wine" in JSON format. This is an excerpt:

  "result": [
      "type": "/wine/wine",
      "name": "1999 Domaine Romanee Conti La Tache"

Now that we’ve confirmed that this query works and the syntax is correct, let's get the results by using the Google Freebase API. If you don't have one already, you need to get a Google Developers account. From the Google Developers Console, you need to grab a Freebase API key. After that, we can write and run a python script to retrieve and dump out all the wine names. You need the python module "freebase" which can be installed via pip. The code goes thusly:

import freebase
import json
import urllib

api_key = "YOUR KEY HERE"
service_url = ''

freebase_query = [{'name': None,
                   'limit': 999999999999999,
                   "type": "/wine/wine"}]

params = {"query": json.dumps(freebase_query),
          "key": api_key}

url = service_url + "?" + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())

for item in response['result']:
    print item['name'].encode('utf-8')

Run this script and redirect its output to a file...

This stores a wine name on each line of the file.

Create a twitter account for your bot and register an application
After following the standard procedure of setting up a twitter account, head over to Twitter's developer site and set up a developer account on behalf of your bot. Then create a new application; this will be the conduit through which we programmatically update your twitter bot's status. Make sure you read the Terms of Service and don't violate them. Make sure you also allow your application read and write access. After this, if you navigate to the "API Keys" tab, you should record the following information: the API key, API secret, access token, and access token secret. We'll need these to authenticate from our auto-posting python script.

Write script to tweet
We'll use the Twython package to authenticate and serve as an interface to twitter.

A bare bones script to perform this task looks like this:


import subprocess
import random
import re
import os
from twython import Twython

twitter = Twython("YOUR API KEY",
                  "YOUR API SECRET",
                  "YOUR ACCESS TOKEN",
                  "YOUR ACCESS TOKEN SECRET")

def output_tweet(text):
    # post the tweet and stop the script
    twitter.update_status(status=text)
    os._exit(0)

lengthoftweet = 999
# string to build tweet
tweet = ""

while lengthoftweet > 140:
    tweet = ""
    all_names = [wine.rstrip() for wine
                  in open("ABSOLUTE PATH TO WINE LIST FILE").read().split("\n")]
    wine_name = random.choice(all_names)
    tweet += wine_name + ": "
    rev = subprocess.check_output("ABSOLUTE PATH TO REVIEW GENERATOR")
    # we don't want a review to imply that the wine is either
    # red or white
    if "red" in rev:
        continue
    if "white" in rev:
        continue
    if len(tweet + rev.rstrip()) < 141:
        output_tweet(tweet + rev.rstrip())
    # if it is too long, try to use the first
    # sentence only (using regex)
    rev = re.search(r"(.+?\.).*", rev).group(1)
    tweet += rev.rstrip()
    lengthoftweet = len(tweet)

# the shortened wine name + first sentence fits, so post it
output_tweet(tweet)

A lot of the code is to ensure that the tweet doesn't exceed 140 characters. You might want to add a logging feature to the script–it will help with debugging the cron job.
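If you do add logging, a minimal sketch of what could be dropped into the script above looks like this (the log path is a placeholder):

import logging

logging.basicConfig(filename="/ABSOLUTE/PATH/TO/tweetbot.log",
                    level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

logging.info("chose wine: %s", wine_name)
logging.info("attempting tweet (%d chars): %s", len(tweet), tweet)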

Automate it with a cron job
If you're on a Unix system, you can set up this script to run at specified times automatically using a cron job. Windows users can use Windows Task Scheduler, but I've never used it so you're on your own.

Cron jobs are notoriously hard to debug. The #1 problem encountered is not using absolute paths. When cron calls your script, the working directory is not where the script resides (unlike when you call the script on the command-line from the same directory). Because of this, any file IO in the script that uses relative paths will fail (quietly, if you don't add logging to the script).
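One common way to sidestep the relative-path problem is to build paths from the script's own location; a small sketch (the filename is hypothetical):

import os

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
wine_list_path = os.path.join(SCRIPT_DIR, "wine-list.txt")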

The #2 problem you may encounter with cron jobs is that, because the job runs as a detached process outside the login environment, the shell it executes from may not be the one you normally use. Furthermore, it may not have the directories in the PATH that you need, or any other environment variables that you depend on.

The third problem that often occurs in botched cron jobs is bad permissions. Make sure all the relevant permissions are correct (e.g., chmod +x your script).

One final problem that I encountered was not putting an empty line after the cron entry–learn from my mistakes!

To add a cron entry, execute

crontab -e

in the shell.

In the editor, I wrote the entry like this:

Your paths will obviously depend on what you are running and from where. Make sure you have an empty line after the entry!
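For illustration, an entry of the following shape (the interpreter and script paths here are placeholders) matches the description that follows:

00 9,14,21 * * * /usr/bin/python /path/to/tweet-wine-review.py >> /path/to/CRON.log 2>&1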

To briefly explain this entry…

The first two numbers (00) specify the minute (0-59) to run the job. I chose to run it on the first (zeroth) minute of the hour. The second section (9,14,21) specifies the hour and tells cron to run the job at 9:00 am, 2:00 pm, and 9:00 pm. The next three sections (the asterisks) specify "day of the month" (1-31), "month of the year" (1-12), and "day of the week" (0-7), respectively. The asterisks indicate that the job is to run every day of the month, every month of the year, and every day of the week.

The next two strings instruct cron to run the script (which was explained in my last post), and the rest of the line redirects any output from logging to a text file called CRON.log

First endnote: Why @HorseWineReviews?
The name pays homage to the infamous twitter bot @Horse_ebooks that (used to) post (unintentionally hilarious) context-free excerpts and Markov chain clumps from books about horses in a (successful) effort to avoid looking like a spam account whilst occasionally tweeting links to promote the sales of e-books.

Final endnote
While the task of tweeting fake wine reviews has already been taken, if you are looking for ideas for a twitter bot of your own, dear reader, you might want to explore the following ideas that I think would be a hit:

  • Train a Markov chain on equal parts famous philosophical works and vacuous and decidedly un-philosophical ramblings (a la KimKierkegaardashian and Kantye West). Philosophical corpora can be grabbed from Project Gutenberg. Vacuous babble can probably be obtained from choice subreddits or most of the trending hashtags on twitter.
  • Web-scrape a huge corpus of episode descriptions from various television shows, train a Markov chain on them and let it loose on Twitter. You can get episode descriptions from (example)
  • Train a Markov chain on the abstracts of academic papers.

You’re welcome :)


How to fake a sophisticated knowledge of wine with Markov Chains


To the untrained (like me), wine criticism may seem like an exercise in pretentiousness. It may seem like anybody following a set of basic rules and knowing the proper descriptors can feign sophistication (at least when it comes to wine).

In this post, we will be exploiting the formulaic nature of wine reviews to automatically generate our own reviews that appear (at least to the untrained) to be legitimate.

Markov Chains
A Markov chain is a system that transitions between states using a random, memoryless process. The transition from one state to another is determined by a single random sample from a (usually discrete) probability distribution. Additionally, the current state wanders aimlessly through the chain according to these random transitions with no regard to its previous states.

A roll-of-the-dice board game can be likened to a Markov chain; the dice determine how many squares you move, and the roll is in no way influenced by your previous rolls. Scores in basketball games appear to act in this way as well (in spite of the myth of the 'hot hand'), and a gambler's earnings almost certainly hold the Markov property (see the Monte Carlo Fallacy).

Many more phenomena can be appropriately modeled by a Markovian process... but language isn't one of them.

Markov Chains and Text Generation
The image below shows a Markov chain that is built from the following lyrics:

Markov chain of The Smiths' "How Soon Is Now?" lyrics

I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

Here, each word is a state, and the transitions are based on the number of times a word appears after another one. For example, "I" always precedes "am" in the text, so the transition from "I" to "am" occurs with certainty (p=1). Following the word "the", however, "son" occurs twice and "heir" occurs once, so the probabilities of those transitions are .66 and .33, respectively.
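As a quick sanity check, here is a minimal sketch that computes those word-to-word transition probabilities from the lyric above:

from collections import defaultdict, Counter

lyric = ("I am the son and the heir of a shyness that is criminally vulgar "
         "I am the son and heir of nothing in particular")
words = lyric.split()

counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    counts[current][following] += 1

# turn the raw counts into transition probabilities
transitions = {word: {nxt: n / float(sum(ctr.values()))
                      for nxt, n in ctr.items()}
               for word, ctr in counts.items()}

print(transitions["the"])   # {'son': 0.66..., 'heir': 0.33...}
print(transitions["I"])     # {'am': 1.0}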

Text can be generated using this Markov chain by changing state until an "absorbing state" is reached, where there are no longer any transitions possible.

Given "I" as an initial state, two possible text generations are

  • I am the heir of nothing in particular
  • I am the son and the son and the son and the son and the son of a shyness that is criminally vulgar.

Very often, the generated text violates basic rules of grammar; after all, the transitions are "dumb" stochastic processes without knowledge of grammar and semantics.

Instead of a memoryless chain, though, we can build a chain where the next state depends on the last n states. This can still satisfy the Markov property if we view each state as holding n words. When using these 'higher order' chains to generate text, something very interesting happens. Since the states are now made up of clauses and phrases (instead of words) the generated text seems to magically follow (some of) the rules of grammar, while still being devoid of semantic sense.

The higher the order of the chain, the more text needs to be fed into it to achieve the same level of 'arbitrariness'–but the more the generated text seems to conform to actual correct English. In order to fake our wine reviews, we are going to train an order-two Markov chain on a web-scraped corpus of almost 9,000 wine reviews.

The scraping
The corpus of wine reviews I chose to use was from a wine review site. If you go to the site, you'll see that there are 709 pages of reviews. I used SelectorGadget to determine the XPath selector for the content I wanted and wrote a few python scripts along these lines:

#!/usr/bin/env python -tt
import urllib2
from lxml.html import fromstring
import sys
import time
urlprefix = ""
for page in xrange(1, 710):
        out = "-> On page {} of {}....      {}%"
        print out.format(page, "709", str(round(float(page)/709*100, 2)))
        response = urllib2.urlopen(urlprefix + str(page))
        html = response.read()
        dom = fromstring(html)
        sels = dom.xpath('//*[(@id = "searchResults")]//p')
        for review in sels:
            if review.text:
                print review.text.rstrip()

and grabbed/processed it with shell code like this:

# capture output of script
./ | tee prep1.txt

# remove all lines that indicate progress of script
grep -E -v '^-' prep1.txt > prep2.txt

# add the words "BEGIN NOW" to the beginning of each line
cat prep2.txt | sed 's/^/BEGIN NOW /' > prep3.txt

# add the word "END" to the end of each line
cat prep3.txt | sed 's/$/ END/' > wine-reviews.txt

This is a sample of what our text file looks like at this point:

BEGIN NOW A balanced red, with black currant, ... lowed by a spiced finish. END
BEGIN NOW Fresh and balanced, with a stony ... pear and spice. END

The "BEGIN NOW" tokens at the beginning of each line will serve as the initial state of our generative Markov process, and the "END" token will denote a stopping point.

Now comes the construction of the Markov chain which will be represented as a python dictionary. We can get away with not calculating the probabilities of the transitions by just storing the word that occurs after each bi-gram (two words) in a list that can be accessed using the bi-gram key to the chain dictionary. We will then 'pickle' (serialize) the dictionary for use in the script that generates the fake review. The code is very simple and reads thusly:

#!/usr/bin/env python -tt
import pickle

fh = open("wine-reviews.txt", "r")
chain = {}

def generate_trigram(words):
    if len(words) < 3:
        return
    for i in xrange(len(words) - 2):
        yield (words[i], words[i+1], words[i+2])

for line in fh.readlines():
    words = line.split()
    for word1, word2, word3 in generate_trigram(words):
        key = (word1, word2)
        if key in chain:
            # we've seen this bi-gram before; remember another possible next word
            chain[key].append(word3)
        else:
            chain[key] = [word3]

pickle.dump(chain, open("chain.p", "wb"))

Finally, the python script to generate the review from the pickled Markov chain dictionary looks like this:

#!/usr/bin/env python -tt
import pickle
import random

chain = pickle.load(open("chain.p", "rb"))

new_review = []
sword1 = "BEGIN"
sword2 = "NOW"

while True:
    sword1, sword2 = sword2, random.choice(chain[(sword1, sword2)])
    if sword2 == "END":
        # reached the stopping token; the review is complete
        break
    new_review.append(sword2)

print ' '.join(new_review)

The random.choice() function allows us to skip the calculation of the transition probabilities because it will choose from the list of possible next states in accordance with the frequencies at which they occur.
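A quick illustration of that point, using a made-up list of next states:

import random
from collections import Counter

# "son" appears twice in the list, so it gets drawn about twice as
# often as "heir" without any explicit probabilities being stored
next_states = ["son", "son", "heir"]
print(Counter(random.choice(next_states) for _ in range(10000)))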

The results
Obviously, some generated reviews come out better than others. After playing with the generator for a while, I compiled a list of "greatest hits" and "greatest misses".

Greatest hits

  • Quite rich, but stopping short of opulent, this white sports peach and apricot, yet a little in finesse.
  • Dense and tightly wound, with taut dark berry, black cherry and red licorice. A touch of toast.
  • Delicious red licorice, blood orange and ginger, with nicely rounded frame.
  • This stylish Australian Cabernet is dark, deep and complex, ending with a polished mouthful of spicy fruit and plenty of personality.

Greatest misses

  • From South Africa.
  • Tropical fruit notes of cream notes.
  • Here's a bright structure. Dry and austere on the finish.
  • This has good flesh.
  • Really enticing nose, with orange peel and chamomile for the vintage, this touts black currant, plum and meat notes. Flavors linger enticingly.
  • Blackberry, blueberry and blackberry fruit, with hints of cream. Crunchy and fresh fruit character to carry the finish.

Possibilities for improvement
The results are amazing, but the algorithm needs a little work before it will be able to fool a sommelier.

One major giveaway is the inclusion of contradictory descriptors in the same review. I don't know anything about wine (I drink Pepsi) but even I know that a wine should never be described as both "dry" and "sweet". One possible solution to this would be to use association mining to infer a list of complementary and discordant descriptors.

Another clue that these reviews are nonsense is the indiscriminate chaining of clauses that have nothing to do with each other. I'm not quite sure how to solve this, yet, but I have a few ideas.

An additional hiccup is that there are still grammatically incorrect sentences that creep through. One solution would be to identify and remove them. Unfortunately, this is much easier said than done. In the absence of a formal English grammar, we have to rely on less-than-perfect techniques like context-based identification and simple pattern-matching.

The last obvious problem is that some of the generated reviews are just too long. This increases the likelihood of containing contradictory descriptors and committing grammar errors, as with this review (which also exemplifies all of the problems stated above):

Luscious, sleek and generous with its gorgeous blueberry, raspberry and blackberry flavors , with hints of herbs, cocoa and graphite. The long, briary edge lingering on the nose and palate. Medium-bodied, with a modest, lightly juicy and brambly flavors of milk chocolate. Full-bodied, with fine focus and its broad, intense and vivid, with a tangy, lip-smacking profile. Light-weight and intense, with a deft balance.

In future posts, I hope to explore some of these avenues of improvement. I also plan to use parts-of-speech tagging to automate an unusual game of wine-review mad-libs.

Sooner, though, I'll explain the process I took to set up the fake wine review twitter bot (@HorseWineReview) that I will use to experiment with different text-generation techniques.


Using last.fm to data mine my music listening history

I've (passively) been keeping meticulous records of almost every song I've listened to since January of 2008. Since I opened my account 6 years ago, they've accumulated a massive detailed dataset of the 107,222 songs I've listened to since then. The best thing is that they're willing to share this data with me!

I used the developer REST API to (over a very long period of time) retrieve my entire listening history, the date(s) that I've listened to each song, and the top three user-submitted "tags" for each song.

I want to glean every bit of insight that I can out of this data. For this post, I focused on:

  • total listening history over time
  • music "diversity" levels
  • trends in my musical genre listening habits

In future posts, I hope to explore other things like using PCA to determine "orthogonal" music genres, construct similarity matrices, predict trends, and perform acoustic analysis.

This has been one of my favorite pet-projects because it combines three things that I love:

  • data mining
  • music
  • navel-gazing

I used both R and Python in this analysis. Let’s get into it!

Obtaining data
Getting the data using the REST API was very straightforward; the only hiccups I encountered were the fault of Python2's unicode snafus. For the web requests I used the urllib2 module and to handle the XML responses I used the amazing lxml module. The code to get my whole listening history looked a little like this:

#!/usr/bin/env python -tt

import urllib2
import time
from lxml import etree
from StringIO import StringIO

# the request URL for my recent-tracks history (URL pieces and API key elided)
baseurl = ''.join(["",
                   ""])
def clean_xml(the_xml):
    return "\n".join(the_xml.split("\n")[3:-2])

# let's get the first page so we know how many pages there are
response = urllib2.urlopen(baseurl+"&page=1", timeout=200)
html = response.read()

# parse the XML tree
doc = etree.parse(StringIO(html))

# use Xpath to query the number of pages
num_pages = int(doc.xpath("/lfm/recenttracks")[0].get("totalPages"))

# file to dump results
fh = open("all_the_tracks.xml", "a")

for page in xrange(0, num_pages+1):
    # I'm nice so I don't want to hit the server
    # with a bunch of requests a second.
    # Let's wait ten seconds between each request
    time.sleep(10)
    progress = "On page {} of {}...........  {}%"
    print progress.format(str(page),
                          str(num_pages),
                          str(round(float(page)/num_pages*100, 1)))
    response = urllib2.urlopen(baseurl+"&page="+str(page))
    html = response.read()
    the_xml = clean_xml(html)
    fh.write(the_xml)

fh.close()

I decided to make the requests for the user-submitted tags in another python script. The script is a little too long to post here, but it basically iterated over all "track" nodes in the output of the last script, and parsed the results from a REST query of tags. Since I'm considerate, I put a long wait between each request for the over 100,000 songs. Even though I handled repeated tracks gracefully, it took days to finish. I used the pickle module to serialize the sum of data I got at regular intervals so a failure during the night of day 2 wouldn't have been catastrophic.

XML transformations and XPath
There is still a little bit of cleanup to do... I used various shell commands to remove all unnecessary elements from the XML documents and escape the characters that I forgot to escape. Then I had to organize the data by date so that I can do time series analysis. The script I used to accomplish this is as follows:

#!/usr/bin/env python -tt

from lxml import etree
import codecs

# read cleaned up track history XML
doc = etree.parse("escaped_processed.xml")

fh ="bydate.xml", "a", encoding="utf-8")

# get all the dates (previously restricted to just month and year)
udates = list(set([date.text for date in doc.xpath("//date")]))

# create a new DOM tree to hang the transformation upon
root = etree.Element("bydate")

for cdate in udates:
    # element tags can't start with a number
    # add a "d" to it
    this = etree.SubElement(root, 'd' + cdate)
    # get all tracks listened to on that date
    these_tracks = [node for node in
                    doc.xpath("/alltags/track[date=" + cdate + "]")]
    # add the tracks to the DOM
    for itrack in these_tracks:
        this.append(itrack)

fh.write(etree.tostring(root, pretty_print=True))

Finally, I whipped up a quick script to sum the number of listens on a particular tag for each time interval.

At this time we have a file "playnumbymonth.csv" with the dates and total tracks listened to for that month that looks like this...


and ("melted") file called "longformat.csv" that holds dates, tag names, and the number of tracks (played in that month) that contained the tag. It looks like this...

03-2008,folk rock,1
03-2008,spoken word,2
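A minimal sketch of the kind of aggregation that produces the long-format file (the month/tag pairs here are stand-ins for what actually gets pulled out of the XML):

import csv
from collections import Counter

# one (month, tag) pair for every tag on every track played that month
plays = [("03-2008", "folk rock"),
         ("03-2008", "spoken word"),
         ("03-2008", "spoken word")]

counts = Counter(plays)
with open("longformat.csv", "w") as fh:
    writer = csv.writer(fh)
    for (month, tag), n in sorted(counts.items()):
        writer.writerow([month, tag, n])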

R analytics and visualization
First, to visualize the number of songs I’ve listened to over time, I had to import the "playnumbymonth.csv" dataset, parse the date with the lubridate package, make a "zoo" time series object out of the dataframe, and plot it.


library(lubridate)
library(zoo)

plays <- read.csv("playnumbymonth.csv", stringsAsFactors=FALSE)

# parse dates
plays$date <- parse_date_time(plays$date, "my")

#make time series object
tsplays <- read.zoo(plays)

#plot it with a LOWESS smooth curve
loline <- lowess(tsplays, f=.5)
plot(tsplays, main="Plays per month since 2008", ylab="Number of plays", xlab="Date")
lines(index(tsplays), loline$y, col='red', lwd=2)

The resulting plot looks like this:
Plays per month

While I was working with this data set, I wanted to check if there was any periodicity to my listening history (perhaps I listen to more music in the winter than I do in the summer). I briefly attempted to use seasonal decomposition and autocorrelation to try to detect this. No dice.

For the musical "diversity" and genre listening trends, I read in "longformat.csv", used reshape to aggregate (pivot) by tags until I had a huge matrix where each row was a month between 2008 and 2014, and each column was a tag. Then I used the vegan (vegetation analysis) package to take the Shannon diversity index of each month with respect to richness and evenness of tags listened to:

library(reshape)
library(vegan)

long.tag.frame <- read.csv("longformat.csv", stringsAsFactors=FALSE)
long.tag.frame$date <- parse_date_time(long.tag.frame$date, "my")

wide.frame <- data.frame(cast(long.tag.frame, date~tag))
# convert all NAs to zero
wide.frame[is.na(wide.frame)] <- 0

new.frame <- data.frame(wide.frame[,1])
new.frame$diversity <- diversity(wide.frame[,-1])

After some cleanup and "zoo" object creation, and LOWESS curve creation, the plot of the listening data and diversity indices looked like this:
Number of plays and variety

Visualizing how my music tastes have (appeared to) change over time was the best part. I created a diagonal matrix from the multiplicative inverse of number of tracks that I listened to each month and matrix-multiplied this with the wide tag matrix. The result of this computation yielded the proportion of songs I listened to each month that contained each tag.

I took a few choice tags corresponding to some of my favorite musical genres, put it in a new data frame ("tag.interest") and used the lattice package to visualize the trends.

tag.interest <- data.frame(dates)
tag.interest$Post.Punk <- prop.plays[,2227]
tag.interest$Indie <- prop.plays[,1413]
tag.interest$Punk <- prop.plays[,2270]
tag.interest$Coldwave <- prop.plays[,654]
tag.interest$Darkwave <- prop.plays[,762]
tag.interest$Twee <- prop.plays[,3003]
tag.interest$Indie.Pop <- prop.plays[,1422]
tag.interest$Hip.Hop <- prop.plays[,1337]

> names(tag.interest)
[1] "dates"     "Post.Punk" "Indie"     "Punk"      "Coldwave"  "Darkwave"  "Twee"      "Indie.Pop" "Hip.Hop"  

xyplot(read.zoo(tag.interest), type=c("l", "g"),
       ylab="Proportion of songs containing tag",
       main="Trends in musical genre listening habits",
       panel = function(x, y, col, ...) {
         panel.xyplot(x, y, col = "blue", ...)
         panel.loess(x, y, col = "red", lwd=3)

This produced my favorite plot:
Genre listening trends

Looking at it, I remembered a period of time in 2009 that I listened to almost exclusively Hip-Hop music. I was also reminded that I got into the "coldwave" and "darkwave" genres rather recently and around the same time as each other in summer of 2011. Another neat result is that there is a fairly strong negative correlation between my "twee" music listening and my "darkwave" music listening history, as these genres are almost musical 'opposites'.

This has been a fun trip down memory lane for me. My only regret is that I didn't open my account sooner... as long as it was after a period in my childhood music-listening that I would be embarrassed to have on digital record.


The performance gains from switching R's linear algebra libraries

What is often forgotten in the so-called data analysis "language wars" is that, across most of these languages, many common computations are performed using outsourced, dynamically linked math libraries. For example, R, Python's NumPy, Julia, Matlab, and Mathematica all make heavy use of the BLAS linear algebra API. As a result, R can't be properly faulted (or praised) for how slowly (or rapidly) it performs Cholesky decomposition since (a) the R core team wasn't responsible for the algorithm's implementation, and (b) neither are other languages' contributors for theirs.

For code that deals predominantly with numerical computing and constructs from linear algebra, language battles become more of a petty and vacuous squabble over subjective preferences in syntax than substantive discourse that encourages true innovation and improvement. That being said, R is the best, and if you disagree you should feel bad.

Two posts ago, I asserted that, for speed-up purposes, recompilation of R was usually unnecessary and that other lower-hanging fruit should be tried before resorting to recompilation. We've already seen that for certain problems, parallelizing your code is a great and relatively easy-to-implement speed-up option. Another great option is the ability to swap out R's linear algebra libraries for faster ones. Since these libraries are linked to at run-time (as opposed to being included statically at compile-time), employing these alternative libraries does not require recompilation.

The topic of swapping BLAS implementations has already been covered well in these blog posts by Nathan VanHoudnos and Zachary Mayer, as well as in this paper by Dirk Eddelbuettel, but I thought I’d throw in my thoughts and results, too.

For my comparisons, I pitted OpenBLAS and Apple's Accelerate Framework's implementation of BLAS against each other and the BLAS that comes with R by default. I wanted to try others too, but I either had an extraordinarily difficult time compiling them, or was unwilling to shell out money for a proprietary library (here's looking at you, Intel). Of particular interest to me was trying out the ATLAS library.

I originally thought that testing ATLAS would be redundant because I was given to understand that Apple Accelerate’s "vecLib" was a hand-tuned version of ATLAS for Apple processors. After looking further into it, I discovered that this is no longer the case. Apple asserts in line 632 of cblas.h that "The Apple BLAS is no longer based on ATLAS". Unfortunately, nothing I tried would get ATLAS to compile. C'est la vie.

The benchmarking script I used to test these implementations can be furnished from
It records the time to completion of a large series of various computations including matrix creation, inversion, and multiplication. By default, it performs each test 3 times and takes the trimmed geometric mean of the time to completions.

These are the results of the total time elapsed (for all tests) among the three libraries.

Comparison of BLAS implementation performance

As you can see, both Accelerate and OpenBLAS blew R's default BLAS implementation out of the water, with Accelerate marginally outperforming OpenBLAS. The last line of the output from the UNIX "time" command gives us a clue as to why this might be the case:

R --slave 56.10 user 1.65s system 122% cpu 47.162 total

Judging by the above-100-percent CPU usage of both Accelerate and OpenBLAS (and how hot my laptop got), I'd wager that the primary source of the improvement is Accelerate's and OpenBLAS's ability to use more than one core. If you have more than one core available, this is something that you might want to look into.


Unsupervised correction of optical character misrecognition

For a good overview of what OCR is, check out this overview

I found myself cutting the spines off books, again. This time it was because I couldn’t find an e-book copy of ‘Animal Liberation’ anywhere on the net, and I’ve amassed quite a few physical copies--mostly from garage sales--that I could afford to experiment with. The pages were to be scanned, OCR-ed, and put on my Kindle.

I was a little disappointed in all the OCR tools I tried. The best one seemed to be Tesseract, an open source solution, but only after the scanned images were preprocessed with a series of ImageMagick manipulations.

The experience got me thinking about what it would take to write a good OCR post-processor. This is something that I've been thinking about (off and on) for a few months now, and I have a lot of ideas running around in my head. So this will be the first in a series of blog posts in which I test out some of those ideas. The first few, this included, will be about autocorrection of individual words (isolated term correction).

One of the first things that came to mind is to use traditional spelling correction procedures on the OCR output: finding approximate matches from a corpus of acceptable vocabulary (using sim-hash (more) or a k-gram index), and returning the n closest matches (minimizing Levenshtein distance). Although undoubtedly helpful and necessary for most applications, I thought it would be more interesting to explore whether there were any tenable solutions that didn’t rely on a priori information like a vocabulary list and a precomputed k-gram index. A solution like this should, in theory, be able to handle OCR output that contains tokens for which no dictionary exists, or is difficult to get. An example of this would be output from a medical report, a text in Esperanto (or Klingon), or ciphered text.

The problem:
A document contains many repeated words. The OCR output misrecognizes some of these repeated words, but not all. Given a set of tokens that are likely referring to one word, how can we choose the item with the correct spelling?

The hypothesis:
The token with the lowest average Levenshtein distance from all other tokens is very often the correct spelling.

An OCR reads a document that contains a word many times, and the OCR makes a few mistakes. Of the tokens in the OCR output, most of them will be the proper spelling. The tokens that are misspellings will be the result of (mostly single-edit) transformations (insertions, deletions, substitutions, etc..) of the original word. It stands to reason that, most of the time, these misspellings will be more similar to the original word than to other misspellings because the original word is where the mistake is derived from. For example, misrecognition of the word “cat” can yield “bat” and “cab”. Both “bat” and “cab” are single-edit transformations of “cat” (the Levenshtein distance is 1) but they are two transformations away from each other.
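To make the rule concrete, here is a sketch of the selection procedure; the edit-distance helper is a plain dynamic-programming implementation, and best_spelling() is just a name used for illustration:

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    previous = range(len(b) + 1)
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def best_spelling(tokens):
    # the token with the lowest mean distance to all of the tokens
    return min(tokens, key=lambda t: sum(levenshtein(t, other)
                                         for other in tokens) / float(len(tokens)))

print(best_spelling(["cat", "bat", "cab", "cat", "qat"]))   # prints: cat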

We’ll model incorrect OCR output on a hypothetical document that contains several repeats of a single word. We’ll find the word that satisfies the condition in the theory and compare it with the actual word. We will also compare this to a control, where a spelling is chosen at random from a list of tokens that are known to refer to the same word.

Modeling a crummy OCR:
If an OCR has an 80% character recognition accuracy, its processing of text can be modeled as a series of Bernoulli trials where the probability of success is 80%. When it is unsuccessful, the most common mistake is a mistaken letter, but occasionally it is the insertion of a noise character or the complete removal of a letter.

import random
import numpy as np

def process(word, bernoulli_p):
    new_word = ""
    for letter in word:
        failed_trial = np.random.geometric(p=bernoulli_p)-1
        if failed_trial:
            choice = random.randint(97, 132)
            if 128 <= choice <= 132:
                letter = ""
            elif 123 <= choice <= 127:
                letter = letter + chr(random.randint(97, 122))
            else:
                letter = chr(choice)
        new_word += letter
    return new_word

In this python snippet, for every character in the original word, a number is drawn from a geometric distribution with the probability given to the function. If this is 1, the letter is added to the new word unchanged. If it is greater than 1, an integer is chosen from 97 to 132 (inclusive). If it is somewhere between 97 and 122 (the ASCII decimal values for ‘a’-’z’), the corresponding character is added to the new word in place of the original letter. In the case that it is between 123 and 132, half the time it will skip a letter, and half the time it will read the correct letter but also add a random noise character from ‘a’-’z’.

The test:
For the two words, “morrissey” and “counterrevolutionaries” (I chose the latter word only because it has a lot of letters), hypothetical documents containing 5, 20, and 100 occurrences of the word are processed with accuracies of 99%, 98%, 95%, 90%, 85%, 80%, 75%, and 70%. The token with the lowest average Levenshtein distance is chosen and compared to the original word. Its success or failure is noted. As a comparison, a token is chosen from the OCR output randomly and its probability of being the correct spelling of the original word is noted (this is roughly the character accuracy probability to the nth power, where n is the number of characters in the original word). This is repeated 1,000 times and averaged (mean).

The ‘least Levenshtein distance’ method consistently outperforms chance. With the shorter word (“morrissey”), this method will choose the correct spelling over 95% of the time in a document that contains the word 20 times, even when the character-by-character accuracy of the model OCR is 80% (this means that the entire word will be OCR-ed correctly only about 13% of the time).

The number of times the word appears in the document has a great bearing on the accuracy of the results; the more often the word appears, the more consistently the correct word is identified.

In the 22-character word “counterrevolutionaries”, the results are less exciting. For example, in a document that contains the word 20 times, with an OCR character accuracy of 80%, the method will choose the correct word only 14% of the time. However, this is partially because, in this case, the word is only OCR-ed correctly 0.74% of the time and it is unlikely that the correct spelling of the word even appears in the OCR output. Remember that this method relies on at least one correct spelling in the OCR output to work.

In the interest of open and auditable science, the code, data, and analysis for the experiment are available in a git repository I set up for this and other experiments I plan to perform on OCR post-processing methods.

Final thoughts:
Even though I find these results exciting, this method of spelling correction is probably unsuitable in most circumstances where the document is not very large and no particular word occurs very often.

I have other ideas for this and related problems that I plan to test and write posts about in the coming weeks.
