Computational foreign language learning: a study in Spanish verb usage

Abstract: I did some computer-y stuff to construct a personal Spanish text corpus and create a Spanish verb study guide specifically tailored to the linguistic variety of Spanish I intend to consume and produce. It worked fairly well. It also revealed a (in some small way) generalizable depiction of the relative frequencies of Spanish verb tenses and moods. This technique may prove to be extremely beneficial to Spanish-language pedagogy. If you're uninterested in my motivations or procedure, you can skip to the section labeled "results".

As regular readers of this blog may be aware, one of my favorite activities is marshaling the skills that I use as a computational scientist to study the humanities. For example, in a previous post, we saw how principles from phylogenetic systematics helped textual critics reconstruct the original manuscript for "The Canterbury Tales"; in another, we deployed techniques first used to study physics to the end of fooling vineyards into retweeting fake, computer-generated wine reviews.

For this post, I used both tools from computational linguistics and some good-old-fashioned data wrangling (web-scraping, parsing texts, etc...) to create a custom-fit Spanish verb study guide.

The problems

Problem #1

Although foreign language immersion is almost certainly the best learning path for most types of foreign language learners, no reasonable student without a lavish budget for traveling can expect to get by without doing some rote memorization. In the context of Spanish verbs, this means either unguided memorization of a dictionary or consultation of a list of the most commonly used Spanish verbs. But even if you could trust that the most-popular-verbs list was compiled in a principled manner, there are vast regional and sub-culture-specific variations in verb frequency. For example, the verb coger means "to take" in Spain, but in Central America it's... a pretty vulgar verb. It stands to reason that there are pretty enormous differences in this verb's popularity across regions, contexts, and registers. Depending on which region's dialect you prioritize familiarity with, and depending on how raggle-taggle the people you intend to roll with are—or the media you intend to consume—a one-size-fits-all verb list might let you down.

Problem #2

English isn't a very inflected language—the tense (or person, mood, aspect, etc...) is largely determined not through verb conjugation but via periphrasis, the use of personal pronouns, and other auxiliary words. This is in stark contrast to Spanish, a highly inflected, relatively synthetic language where the verb's conjugation betrays its tense, person, mood, and aspect—all in one word! This linguistic elegance is a learning obstacle, since one verb might be written in nearly 50 different ways (6 persons * (4 tenses in the indicative mood + 3 tenses in the subjunctive mood + 1 imperative mood)).

This pedagogical nightmare is partially allayed by careful prioritization of some tenses and moods over others—at least initially. For example, a Spanish-language learner almost always learns the commonly used and versatile present indicative tense first. But beyond the next few obvious choices, the order in which these tenses should be prioritized is not clear and is (probably) dependent on how and where you expect to use and consume the language. Further complicating things, there are entire persons (here's looking at you, vosotros) that are very uncommon in most Spanish-speaking countries.

The solution

The solution to this problem is to create a personal corpus of Spanish text containing examples of the types of text you expect to consume and produce. Then the verbs need to be identified, have their mood, tense, and person recorded, and be converted into infinitive form (for frequency tabulation). The relative frequencies of the persons, moods, and tenses—as well as the frequencies of the verbs (in infinitive form)—will inform the creation of a Spanish verb study guide specifically catered to the linguistic variety the learner intends to employ. Whether the learner's primary interest in learning Spanish is to bond with a new family member over their love of Mexican telenovelas or to read and understand Don Quixote in its entirety, this approach will hasten the learner's sense of accomplishment relative to cookie-cutter verb study guides, increase learner satisfaction, and increase the likelihood of the learner actually achieving language mastery. I mean, as a learner myself, I would be discouraged if I felt like the main payoff of studying Spanish were to read and understand books that are very obviously juvenile or primarily meant for pedagogical purposes. I want to read Márquez and I want to read him now!

The corpus

For my particular corpus, I chose a whole mess of books (most of which I've read—and loved—in English) that I'm interested in reading in the original language. These include Rayuela and Final del juego by Julio Cortázar (my favorite short story writer), Cien Años de Soledad by Gabriel García Márquez (generally considered to be a masterpiece), Diarios de Motocicleta by Che Guevara, Ficciones by Jorge Luis Borges, and La Ciudad de las Bestias by Isabel Allende. These texts were obtained electronically—legitimately!—and I used various ad-hoc regexes to remove formatting and conversion-from-PDF-to-text artifacts.

My interest in Spanish isn't only for consuming literature, though; I wanted to include other sources of text, like movie scripts (I planned on Lo Que le Pasó a Santiago, generally considered to be one of the best Puerto Rican films), but I couldn't find the script online. I also wanted to include the lyrics of my favorite Spanish-language artists (Soda Stereo, El Último Vecino, Décima Víctima, Caifanes, Shakira, Milly Quezada, ...), but the tool I used to identify the verbs in the corpus often choked on these texts. Why, you ask?...

Parts-of-speech tagging

(References are at the bottom of the post.)

Parts-of-speech tagging (hereafter, 'POS tagging') is when you go through a text and, for each word, identify which part of speech (verb, noun, adjective, etc...) the word functions as.

This is a non-trivial task because the same word can function as different parts of speech depending on the context. Take the following sentence, for example, which is an expanded and modified version of a sentence used as an example in this video:

Fruit flies like bananas

So, taken individually, all words in this sentence can function as multiple parts of speech. Take "like", for instance; it can be a noun ("my status got mad likes"), a verb ("I like your status"), a quotative ("I was like, 'I enjoyed your status'"), a conjunction ("I updated my status like the world depended on it"), or a preposition ("I wrote my status like Nathaniel Hawthorne"). Depending on how colloquial the text in question is, "like" can even be used as a discourse marker ("I'm, like, scared of ghosts, Scoob"). As a standalone word, "like" can serve the purpose of 6 different parts of speech.

But even looking at the sentence as a whole, the part of speech of each word is ambiguous.

Concretely, the sentence can be interpreted as (a) "fruit flies (noun) like (verb) bananas (noun)", (b) "fruit (noun) flies (verb) like (preposition) bananas (noun) [do]", or even (c) "fruit (noun) flies (verb) like (conjunction(?)) bananas (adjective)"—where bananas colloquially means "crazy".

Note that the POS tag for one word is conditional on the POS tags of other words: whether flies is a noun or a verb affects whether bananas is interpretable as an adjective.
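For fun, you can see which reading an off-the-shelf tagger commits to. Here's a minimal sketch using NLTK (just a convenient stand-in here; the post itself uses the Stanford tagger, described below):

import nltk

# first run may require: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("Fruit flies like bananas")
print(nltk.pos_tag(tokens))
# the tagger commits to exactly one of the readings discussed above;
# which one depends on the model it was trained on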

Because this task isn't easy, this job used to be left to humans to perform. Now, various techniques allow for this to be done programmatically to a high degree of accuracy. We'll go through a few of them, ending with the sophisticated method employed by the POS tagger that we will be using, the Stanford Parts-of-speech tagger.

Unigram tagging

A training corpus with POS tags for each word is read and, for each unique word, the number of times it is used as each part of speech is tallied. When a word is encountered in untagged text, the tagger chooses the part of speech that the word was most commonly used as in the training text. If the word was not in the training text at all, it defaults to a noun. Somehow, this elementary, context-free method can yield accuracies of 90%-94% (Brill & Wu, 1998). When Brill and Wu used this method on the famous Penn Treebank Wall Street Journal corpus with an 80%/20% training/testing split, it achieved 93.3% accuracy.
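To make the idea concrete, here is a minimal, hand-rolled sketch of a unigram tagger (illustrative only; not the implementation Brill and Wu evaluated):

from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    # tally how often each word appears with each tag in the training corpus
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # for each word, remember only its most frequent tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(words, model, default="noun"):
    # unknown words default to "noun", as described above
    return [(word, model.get(word, default)) for word in words]

# toy training data: one sentence of (word, tag) pairs
model = train_unigram([[("fruit", "noun"), ("flies", "verb"), ("like", "prep"), ("bananas", "noun")]])
print(tag_unigram(["flies", "like", "kiwis"], model))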

n-gram tagging

Using an n-gram model, the tag of a particular word is assumed to be conditionally dependent on the tags of the preceding n-1 words. For example, in a bigram model, the tag of the current word is guessed from the current word and the tag of the previous word. A trigram model uses tag information from the previous two words, in concert with the conditional probability of a particular tag given a certain word. The unigram tagger is a special case of the n-gram tagger where n is 1. It's not hard to see why conditioning on neighboring tags should improve accuracy.
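And here is a sketch of the bigram case (greedy, with no smoothing or backoff, so nothing like a production tagger): the tag of each word is chosen by looking at the word itself together with the tag just assigned to the previous word.

from collections import Counter, defaultdict

def train_bigram(tagged_sentences):
    # tally tags conditioned on the (previous tag, current word) pair
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = "<START>"
        for word, tag in sentence:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return counts

def tag_bigram(words, model, default="noun"):
    tags, prev_tag = [], "<START>"
    for word in words:
        candidates = model.get((prev_tag, word))
        # fall back to the default tag when the (previous tag, word) pair is unseen
        tag = candidates.most_common(1)[0][0] if candidates else default
        tags.append(tag)
        prev_tag = tag
    return list(zip(words, tags))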

If this reminds you of the Markov chains that we made use of in the previous post on computer-generating wine reviews, then you have a good eye. N-gram tagging is a type of Hidden Markov Model (HMM). What makes HMMs different from simple Markov models is that the states themselves (the POS tags) are not directly observable; the observable portion of each state is the actual word—and the words are only a probabilistic function of the state.

In addition to testing a unigram model, Brill and Wu also tested this technique's performance on the WSJ corpus. In particular, they used a trigram tagger—with a twist. Weischedel et al. (1993) noted that the suffix of a word (-ed, -s, -ing, -ion, -ly, etc...) strongly influences the probability that the word serves as a particular part of speech. When this information was wielded to help classify unknown words, it greatly improved accuracy. When Brill and Wu used this method with a trigram tagger on the WSJ corpus, it yielded a 96.4% accuracy rate.

Maximum Entropy models

Maximum Entropy models are a lot like—insofar as they are equivalent to—multinomial logistic regression models that attempt to model the probability of a given tag class given various predictor variables, or features. Maximum entropy models can use features such as the current word, the previous word, the previous word's tag, etc...—as an HMM would—but also features like whether the word contains a number, whether the word is capitalized, and so on. An optimization algorithm called Generalized Iterative Scaling selects the feature weights that maximize the likelihood function.
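To make "features" concrete, here is a sketch of the kind of feature function a maximum entropy tagger might use (the feature names are made up for illustration), fit here with scikit-learn's multinomial logistic regression rather than Generalized Iterative Scaling:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(sentence, i, prev_tag):
    # one feature dictionary per word; anything you can compute can be a feature
    word = sentence[i]
    return {
        "word": word.lower(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "prev_tag": prev_tag,
        "is_capitalized": word[0].isupper(),
        "contains_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
    }

# toy training data standing in for a real tagged corpus
sentence, tags = ["Fruit", "flies", "like", "bananas"], ["noun", "verb", "prep", "noun"]
X = [features(sentence, i, tags[i - 1] if i > 0 else "<START>") for i in range(len(sentence))]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, tags)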

Ratnaparkhi (1996) tested a straightforward maximum entropy model on the WSJ corpus and noted that it yielded an accuracy of 96.6%. Four years later, Toutanova et al. (2000) published a paper in which they showed that by adding additional features (like whether the word is capitalized and in the middle of a sentence, and non-local features that look 8 words back for a modal verb, for disambiguating base-form verbs from non-3rd-person singular present verbs) they could achieve a WSJ accuracy of 96.8%. This is the benefit of the Maximum Entropy approach: you can arbitrarily add features (within reason) without necessarily knowing how those features contribute to the probabilities of the tag outputs.

Three years after that, Toutanova et al. (2003) achieved a 97.2% accuracy rate on the WSJ corpus by (a) adding features for the words following the word currently being tagged, and (b) using regularization to combat overfitting from using many features, many of which probably only weakly contribute information about the probability of the current word's tag class. Their regularization technique involved placing a zero-centered Gaussian prior on the feature weights and is mathematically tantamount to the L2 regularization that we saw in this previous blog post. This state-of-the-art tagger is the one on which the Stanford tagger we use is based.

[There is another famous type of POS tagger called the transformation-based tagger. In contrast to all of the others mentioned above, it is not a probabilistic/stochastic model and is, instead, based on rules and knowledge. I won't describe it here because it's very different and this post is already too long, but I should mention that it can score 96.6% on the WSJ corpus (Brill & Wu, 1998).]

The procedure

These steps assume a POSIX-compliant system and some command-line proficiency.
The filenames are links, and you can find a repo with all the code here.

  • Downloaded the full version of the Stanford Parts-of-Speech Tagger.
  • Ran the tagger on the text, put each tag on a separate line, and filtered for verbs only. The parts of speech were identified using this tagset. As you can see, the verb tags all start with the letter "v". This can be achieved with the following incantation:

    ./stanford-postagger.sh models/spanish.tagger THE_BOOK.txt | perl -pe 's/ /\n/g' | grep '_v' > tmp
    


    If this causes you problems, you might want to give the tagger (which runs multicore!) more memory; try adding -Xmx2048M as an argument to the java command in ./stanford-postagger.sh—this will give it 2GB to work with.

  • For each work, I ran stanford-output-to-nice-tsv.py on it, which parsed the Stanford tags and converted them into a nice tab-delimited format:

    ./stanford-output-to-nice-tsv.py < tmp > ./output-verbs/THE_BOOK.txt
    

  • Catted all of them together into all.txt–a monstrous text file with 84,437 words that the tagger interpreted as verbs:

    cat rayuelas.txt final-de-juego.txt darios-de-motocicleta.txt cien-anos-de-soledad.txt ficciones.txt la-cuidad-de-las-bestias.txt > all.txt
    

Now we need to get the infinitives. But in order to prioritize which verbs to look up, and to avoid looking up repeated conjugated forms, we first need to get the unique ones...

  • So I ran

    cat all.txt | perl -pe 's/(.+?)\t.*/\1/g' > all-verbs.txt
    


    to get a list of only verbs (no mood or tense)

  • I wanted to get a list of unique verbs sorted by the number of occurrences; this would normally be a job for sort | uniq -c. Desafortunadamente, this command fails. It turns out that Unicode can represent (for example) habría in at least two different ways. For this reason, we have to use the Python script process-all-verbs.py, which uses the unicodedata module to normalize the verbs and then count them.

    ./process-all-verbs.py | tee all-verbs-count.txt
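
The gist of process-all-verbs.py is something like the following sketch (not the exact script): normalize every conjugated form to the same Unicode representation before counting, so that the two encodings of habría collapse into a single key.

#!/usr/bin/env python
# sketch of the normalize-then-count step (illustrative, not the real script)
import io
import unicodedata
from collections import Counter

counts = Counter()
with io.open("all-verbs.txt", encoding="utf-8") as fh:
    for line in fh:
        # NFC composes "i" + combining accent into a single precomposed character,
        # so visually identical strings become byte-identical dictionary keys
        verb = unicodedata.normalize("NFC", line.strip())
        if verb:
            counts[verb] += 1

for verb, n in counts.most_common():
    print(u"{}\t{}".format(verb, n))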
    

Ok, now we're ready to get the infinitive forms of these verbs. We are going to do this by programmatically making requests for each verb to the (excellent) website SpanishDict.com. What we want can be extracted from the returned HTML via CSS selectors.

  • get-infinitives.py goes through each line of all-verbs-count.txt and constructs the URL to query the website with. It then uses the CSS selector ".mismatch" to pull information about the verb. In the best-case scenario, the page says something like " is the ____ form of _____ in the ____". Sometimes there's more than one possible person or tense, so it says "____ represents different conjugations of the verb _____". In either case, we get the infinitive. If the lookup fails, we record it and move on. The script waits between 1 and 2 seconds between requests. After every 20 verbs, it dumps the results to JSON so that, if something bad happens, I can just load the intermediate results and restart. (A minimal sketch of this request-and-parse loop appears at the end of this section.)
  • You can see that the SpanishDict infinitive conversion systematically failed for certain words. For example, it interpreted inflected verbs like he, dice, and era as English words to translate, not Spanish words to provide information for. In other cases, it interpreted a verb's past participle as an adjective (e.g., aburrido, the participle of aburrir, "to bore", came back as the adjective "boring"). I manually filled in many of the ones that failed using equal parts regex and black magic. This went into finished-supplemented.json.
  • Finally, we need to inner join all.txt to the information in finished-supplemented.json. The combine.py script does this:

    ./combine.py | tee tagged-plus-infinitives.txt 
    

The tab-delimited tagged-plus-infinitives.txt is now ready to be consumed for analysis.
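As promised above, here is a minimal sketch of the kind of request-and-parse loop get-infinitives.py performs. The URL pattern and the parsing details are simplified guesses for illustration; the real script also handles failures and does the periodic JSON dumps described earlier.

#!/usr/bin/env python
# sketch of the SpanishDict lookup (illustrative; the URL pattern below is a guess)
import time
import requests
from lxml.html import fromstring

def lookup(conjugated_verb):
    url = "http://www.spanishdict.com/translate/" + conjugated_verb  # hypothetical URL pattern
    dom = fromstring(requests.get(url).text)
    # ".mismatch" is the CSS selector mentioned above (requires the cssselect package)
    matches = dom.cssselect(".mismatch")
    # e.g. "dijo is the ... form of decir ..." -- the infinitive gets parsed out of this text
    return matches[0].text_content().strip() if matches else None

for verb in ["dijo", "estaba", "fue"]:
    print(lookup(verb))
    time.sleep(1.5)  # be polite: wait between requests, as the real script does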

Some numbers

  • Rayuela - 203,197 words - 29,882 verbs
    Final del juego - 54,303 words - 8,160 verbs
    Diarios de Motocicleta - 53,804 words - 6,557 verbs
    Cien Años de Soledad - 154,381 words - 20,987 verbs
    Ficciones - 48,845 words - 5,769 verbs
    La Ciudad de las Bestias - 94,075 words - 13,082 verbs
  • There were 84,437 words that the tagger identified as verbs in all.
  • There were 13,972 unique conjugated verbs.
  • After the first pass with SpanishDict, we had infinitives for only 6,852 verbs. This number greatly increased with the black magic alluded to in the previous section.
  • I went from 84,437 to 71,378 verbs when I inner joined with the verbs that I was able to find infinitives for.

The results

Figure 1: Proportion of Spanish verb moods and tenses in corpus

The results were rather fascinating. These were the 14 most common conjugated verbs:

conjugated verb    count    percent
había               2599     3.64
era                 2396     3.36
es                  2303     3.23
dijo                1763     2.47
estaba              1169     1.64
fue                  816     1.14
ser                  606     0.85
habían               517     0.72
hay                  512     0.72
tenía                467     0.65
ha                   447     0.63
eran                 431     0.60
podía                412     0.58
iba                  384     0.54


(you can see the full spreadsheet here)

With this information alone, this whole endeavor was worth it. Sure, most of the verbs in this list aren't much of a surprise, but there are two pieces of information that could prove really helpful to me. The first is that 4 verbs in the top 14 are forms of the verb haber ("to have")—including the very first one, which accounts for 3.6% of all conjugated verbs in the corpus. This is a verb that I was, heretofore, relatively unfamiliar with.

In contrast to tener (which also means "to have"), haber is often used as an auxiliary verb, as it would be in English sentences such as "I have to go to the dentist", "I had all but lost it" (past perfect tense), and "there is a freeze-up coming". Because of its ubiquitous usage as an auxiliary (for example, it appears in all of the perfect tenses), I should get more familiar with this verb and its conjugations if I ever hope to read these works of literature.

The second important piece of information for me was that a majority of the verbs in the top 14 were in the imperfect tense (a type of past tense). Now, I think I may have been concentrating too much on the preterite tense (another past tense) in comparison.

Next, these were the 14 most common verbs when put into infinitive form:

infinitive    count    percent
ser            8066    11.30
haber          5461     7.65
estar          2746     3.85
decir          2734     3.83
tener          1774     2.49
hacer          1757     2.46
ir             1721     2.41
poder          1614     2.26
ver            1336     1.87
dar            1210     1.70
saber           843     1.18
pasar           730     1.02
parecer         682     0.96
pensar          596     0.83

(you can see the full spreadsheet here)

To me, there wasn't really anything unexpected here, except for maybe pasar ("to happen") and parecer ("to seem"), which I was, up until this point, relatively unfamiliar with in spite of the fact that they are used in a number of frequently spoken expressions like ¿Qué pasó? ("What happened?") and ¿Qué te parece? (~"What do you think?").

Finally, figure 1 is a plot that depicts the proportions in which each mood and tense occurs. The large vertical bars show the relative proportions of each mood (I count the Infinitive, Gerund, and Participle as moods) in descending order; they are Indicative (65%), Infinitive (20%), Subjunctive (4%), Participle (4%), Gerund (3%), and Imperative (1%). Each vertical bar is further broken down by the proportion of each tense within that mood (sorted, with the most frequently used on the bottom). For example, the present tense is the most common tense in the indicative mood and accounts for 26% of all mood/tense pairs. The Infinitive, Participle, and Imperative moods (to the extent that they are actually moods) have only one tense (to the extent that they can be said to have tenses).

These results were the most surprising to me; for one, I was (again) reminded that I should probably treat nailing down the imperfect tense with as much or more importance as the preterite tense. Second, I was surprised that usage of the future tense was far eclipsed by the gerund, the participle, and both subjunctive tenses—in spite of the fact that I use it quite often in my texts to my friends and in my internal monologue. Of course, this—and other insights—may just be artifacts of the particular body of literature I chose for my corpus (see next section).

Limitations

Although this was a wildly fun project that yielded interesting and extremely practical insights, there are a number of important caveats to be aware of when interpreting these results.

First is a generalizability issue: the results indicate the verb popularity and mood/tense breakdowns for just 6 pieces of Spanish literature. Because of this, the corpus is heavily dominated by the writing styles of the included authors—at least some of whom have a very idiosyncratic writing style. Additionally, as with most literature, all of the non-short-stories in my corpus were told in the past tense (usually by a third person omniscient narrator). This past-tense bias is very clearly non-representative of everyday spoken Spanish (of course, it was never meant to be representative of that). This problem could have been, at least partially, alleviated by including more prosaic Spanish from movie scripts and blogs—if only they had POS-tagged correctly!

Speaking of tagging correctly, the second issue is one of the correctness of the POS tags. The best POS taggers (Stanford is certainly one) can, at best, achieve an accuracy of 97%. Although this is an incredible feat of computational linguistics and the product of many, many years of research, it is important to put it in the proper perspective: recall (a) that even the rudimentary unigram tagger can achieve a 90%-94% accuracy rate, and (b) that the 97% accuracy rate decreases as the testing corpus diverges in style from the training corpus. Especially because of Cortázar—who (at least in English translations) employs highly unusual sentence structure and often straight-up grammatically-incorrect, non-human-parsable sentences—this fact must be kept in mind; unless the Spanish model that comes with Stanford was trained on Surrealist literature (it wasn't!), tag accuracy will suffer.

References

Brill, Eric, and Jun Wu. "Classifier combination for improved lexical disambiguation." Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 1998.

Ratnaparkhi, Adwait. "A maximum entropy model for part-of-speech tagging." Proceedings of the conference on empirical methods in natural language processing. Vol. 1. 1996.

Toutanova, Kristina, and Christopher D. Manning. "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger." Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13. Association for Computational Linguistics, 2000.

Toutanova, Kristina, et al. "Feature-rich part-of-speech tagging with a cyclic dependency network." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.

Weischedel, Ralph, et al. "Coping with ambiguity and unknown words through probabilistic models." Computational linguistics 19.2 (1993): 361-382.


What does Flatland have to do with Haskell?

Edwin Abbott's 1884 novella, Flatland, recounts the misadventures of a square that lives in a two-dimensional world called "Flatland". In this story, the square has a dream where he visits a one-dimensional world (Lineland) and unsuccessfully tries to educate the populace about Flatland's existence. Shortly thereafter, a sphere visits Flatland to introduce our protagonist to his own home, Spaceland. The sphere looks like a circle to the square, because the square can only see the part of the sphere that intersects Flatland's plane. The square can't fathom Spaceland until the sphere actually brings him into the third dimension.

Having his view of reality sufficiently rocked, the square postulates the existence of still higher dimensional lands. The sphere denies this possibility and returns the square to Flatland on bad terms.

The irony here is that, despite being aware of the square's previous insistence that Flatland was the only reality, of the square's corresponding limitation of thought, and of the square's subsequent enlightenment, the sphere arrogantly asserts that his own land represents the limits of dimensionality and that there can be no higher dimensions.

I begin with this summary not as an impromptu book report, but because this is a perfect analogy for why I've decided to learn Haskell.
---
It isn't hard to imagine the inveterate C programmer being hesitant to embrace higher-level languages like Python and Perl. "Imagine the whole world that opens up to you when you don't have to worry about memory management and resolving pointers," requests the proselytizing Python programmer.

But the C programmer got on fine without knowing Python all this time. Besides, the (standard) Python interpreter is written in C.

"Why? So I can change types on a whim?" retorts the C programmer. "...so I can pass functions as arguments to other functions? Big deal! I can do that with function pointers in C."

While it is technically true that, in C, functions can be passed as arguments to other functions and can be returned by functions, this is only a small part of the functional programming repertoire that Python allows. But it is the only part that someone trained only in C can understand. This is like when the square thought that he saw the sphere because he saw a circle where the sphere intersected Flatland. There's a bigger picture (lambda functions, currying, closures) that only appears when the constraints imposed by a programming paradigm are pushed back.

Shifting tides in industry eventually compel the C programmer to give Python a shot. Once she is fluent in Python and its idioms (becomes a "pythonista", as they say), things begin to change. The OOP of Python allows the programmer to gain a new perspective on programming that had been, up to this point, unavailable to her simply because the concept doesn't exist in C.

Resolved to follow the road to higher abstraction, the (former) C programmer asks the Python programmer if there are other languages that will provoke a similar shift in thought relative to even Python. "Haskell, perhaps?" she offers.

The Python programmer scoffs, "Nah, Python is as good as it gets. Besides, purely functional programmers are a bunch of weirdos."

Lands

Haskell is a relatively obscure programming language (20th place or lower on most popularity indices), but accounts for a disproportionate number of the "this-language-will-change-the-way-you-look-at-programming" claims. By the accounts of Haskell aficionados, finally understanding Haskell is a transformative experience.

Up until recently, I found myself mirroring the thoughts and behavior of the Python programmer in this story; I kept hearing about how great Haskell is, and how it would facilitate an enormous shift in how I view computer programming, but I largely dismissed these claims as the rantings of eccentrics who are more concerned with mathematical elegance than practical concerns.

Is it any wonder that I didn't believe the Haskell programmers? I can't see the benefits of Haskell within the constraints of how I view computer programming—constraints subtly imposed by my language of choice. (Lisp programmer Paul Graham describes this as the "Blub Paradox" in his essay Beating the Averages.)

But just like the sphere and the arrogant Python programmer should, I have to remain open to the idea that there's a whole world out there that I'm not availing myself of just because I’m too obstinate to try it.

Any language that is Turing-complete will let you do anything; it's trivially true that there is nothing you can do in one language that you can't do in another. What sets languages apart (from most trivial to least) is the elegance of the syntax, its community and third-party packages, the ease with which you can perform certain tasks, and whether you'll achieve enlightenment as a result of using it. I don't know if Haskell will do this for me, but I think I ought to give it a shot.

---
If the idea of relativity applied to the domain of programming languages interests you, you might also be interested in the Sapir–Whorf hypothesis, which applies ideas similar to those mentioned in this post to the realm of natural languages.


How to fake a sophisticated knowledge of wine with Markov Chains


To the untrained (like me), wine criticism may seem like an exercise in pretentiousness. It may seem like anybody following a set of basic rules and knowing the proper descriptors can feign sophistication (at least when it comes to wine).

In this post, we will be exploiting the formulaic nature of wine reviews to automatically generate our own reviews that appear (at least to the untrained) to be legitimate.

Markov Chains
A Markov chain is a system that transitions between states using a random, memoryless process. The transition from one state to another is determined by a single random sample from a (usually discrete) probability distribution. Additionally, the current state wanders aimlessly through the chain according to these random transitions with no regard to its previous states.

A roll-of-the-dice board game can be likened to a Markov chain; the dice determine how many squares you move and are in no way influenced by your previous rolls. Scores in basketball games appear to act this way as well (in spite of the myth of the 'hot hand'), and a gambler's earnings almost certainly hold the Markov property (see the Monte Carlo fallacy).

Many more phenomena can be appropriately modeled by a Markovian process... but language isn't one of them.

Markov Chains and Text Generation
The image below shows a Markov chain that is built from the following lyrics:

Markov chain of The Smiths' "How Soon is Now" lyrics


I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

Here, each word is a state, and the transitions are based on the number of times a word appears after another one. For example, "I" always precedes "am" in the text, so the transition from "I" to "am" occurs with certainty (p=1). Following the word "the", however, "son" occurs twice and "heir" occurs once, so the probabilities of those transitions are 2/3 and 1/3, respectively.

Text can be generated using this Markov chain by changing state until an "absorbing state" is reached, where there are no longer any transitions possible.

Given "I" as an initial state, two possible text generations are

  • I am the heir of nothing in particular
  • I am the son and the son and the son and the son and the heir of a shyness that is criminally vulgar.

Very often, the generated text violates basic rules of grammar; after all, the transitions are "dumb" stochastic processes without knowledge of grammar and semantics.

Instead of a memoryless chain, though, we can build a chain where the next state depends on the last n states. This can still satisfy the Markov property if we view each state as holding n words. When using these 'higher order' chains to generate text, something very interesting happens. Since the states are now made up of clauses and phrases (instead of words) the generated text seems to magically follow (some of) the rules of grammar, while still being devoid of semantic sense.

The higher the order of the chain, the more text needs to be fed into it to achieve the same level of 'arbitrariness'—but the more the generated text seems to conform to actual, correct English. In order to fake our wine reviews, we are going to train an order-two Markov chain on a web-scraped corpus of almost 9,000 wine reviews.

The scraping
The corpus of wine reviews I chose to use was from www.winespectator.com. If you go to this site, you'll see that there are 709 pages of reviews. I used SelectorGadget to determine the XPath selector for the content I wanted and wrote a few Python scripts along these lines:

#!/usr/bin/env python -tt
 
import urllib2
from lxml.html import fromstring
import sys
import time
 
urlprefix = "http://www.winespectator.com/dailypicks/category/catid/1/page/"
 
#709
for page in xrange(1, 710):
    try:
        out = "-> On page {} of {}....      {}%"
        print out.format(page, "709", str(round(float(page)/709*100, 2)))
        response = urllib2.urlopen(urlprefix + str(page))
        html = response.read()
        dom = fromstring(html)
        sels = dom.xpath('//*[(@id = "searchResults")]//p')
        for review in sels:
            if review.text:
                print review.text.rstrip()
        sys.stdout.flush()
        time.sleep(2)
    except:
        continue

and grabbed/processed it with shell code like this:

# capture output of script
./get-reviews.py | tee prep1.txt

# remove all lines that indicate progress of script
grep -E -v '^-' prep1.txt > prep2.txt

# add the words "BEGIN NOW" to the beginning of each line
cat prep2.txt | sed 's/^/BEGIN NOW /' > prep3.txt

# add the word "END" to the end of each line
cat prep3.txt | sed 's/$/ END/' > wine-reviews.txt

This is a sample of what our text file looks like at this point:

BEGIN NOW A balanced red, with black currant, ... lowed by a spiced finish. END
BEGIN NOW Fresh and balanced, with a stony ... pear and spice. END

The "BEGIN NOW" tokens at the beginning of each line will serve as the initial state of our generative Markov process, and the "END" token will denote a stopping point.

Now comes the construction of the Markov chain, which will be represented as a Python dictionary. We can get away with not calculating the probabilities of the transitions by just storing the word that occurs after each bi-gram (pair of words) in a list that can be accessed using the bi-gram as the key into the chain dictionary. We will then 'pickle' (serialize) the dictionary for use in the script that generates the fake reviews. The code is very simple and reads thusly:

#!/usr/bin/env python -tt
 
import pickle
 
fh = open("wine-reviews.txt", "r")
 
chain = {}
 
def generate_trigram(words):
    # yield every consecutive (word, word, word) triple in the review
    if len(words) < 3:
        return
    for i in xrange(len(words) - 2):
        yield (words[i], words[i+1], words[i+2])
 
for line in fh.readlines():
    words = line.split()
    for word1, word2, word3 in generate_trigram(words):
        # the bi-gram is the key; the list of words observed after it is the value
        key = (word1, word2)
        if key in chain:
            chain[key].append(word3)
        else:
            chain[key] = [word3]
 
pickle.dump(chain, open("chain.p", "wb" ))

Finally, the python script to generate the review from the pickled Markov chain dictionary looks like this:

#!/usr/bin/env python -tt
 
import pickle
import random
 
chain = pickle.load(open("chain.p", "rb"))
 
new_review = []
sword1 = "BEGIN"
sword2 = "NOW"
 
while True:
    sword1, sword2 = sword2, random.choice(chain[(sword1, sword2)])
    if sword2 == "END":
        break
    new_review.append(sword2)
 
print ' '.join(new_review)

The random.choice() function allows us to skip the calculation of the transition probabilities because it will choose from the list of possible next states in accordance with the frequencies at which they occur.

The results
Obviously, some generated reviews come out better than others. After playing with the generator for a while, I compiled a list of "greatest hits" and "greatest misses".

Greatest hits

  • Quite rich, but stopping short of opulent, this white sports peach and apricot, yet a little in finesse.
  • Dense and tightly wound, with taut dark berry, black cherry and red licorice. A touch of toast.
  • Delicious red licorice, blood orange and ginger, with nicely rounded frame.
  • This stylish Australian Cabernet is dark, deep and complex, ending with a polished mouthful of spicy fruit and plenty of personality.

Greatest misses

  • From South Africa.
  • Tropical fruit notes of cream notes.
  • Here's a bright structure. Dry and austere on the finish.
  • This has good flesh.
  • Really enticing nose, with orange peel and chamomile for the vintage, this touts black currant, plum and meat notes. Flavors linger enticingly.
  • Blackberry, blueberry and blackberry fruit, with hints of cream. Crunchy and fresh fruit character to carry the finish.

Possibilities for improvement
The results are amazing, but the algorithm needs a little work before it will be able to fool a sommelier.

One major giveaway is the inclusion of contradictory descriptors in the same review. I don't know anything about wine (I drink Pepsi) but even I know that a wine should never be described as both "dry" and "sweet". One possible solution to this would be to use association mining to infer a list of complementary and discordant descriptors.

Another clue that these reviews are nonsense is the indiscriminate chaining of clauses that have nothing to do with each other. I'm not quite sure how to solve this, yet, but I have a few ideas.

An additional hiccup is that there are still grammatically incorrect sentences that creep through. One solution would be to identify and remove them. Unfortunately, this is much easier said than done. In the absence of a formal English grammar, we have to rely on less-than-perfect techniques like context-based identification and simple pattern-matching.

The last obvious problem is that some of the generated reviews are just too long. This increases the likelihood of containing contradictory descriptors and committing grammar errors, as with this review (which also exemplifies all of the problems stated above):

Luscious, sleek and generous with its gorgeous blueberry, raspberry and blackberry flavors , with hints of herbs, cocoa and graphite. The long, briary edge lingering on the nose and palate. Medium-bodied, with a modest, lightly juicy and brambly flavors of milk chocolate. Full-bodied, with fine focus and its broad, intense and vivid, with a tangy, lip-smacking profile. Light-weight and intense, with a deft balance.

In future posts, I hope to explore some of these avenues of improvement. I also plan to use parts-of-speech tagging to automate an unusual game of wine-review mad-libs.

Sooner, though, I'll explain the process I took to set up the fake wine review Twitter bot (@HorseWineReview) that I will use to experiment with different text-generation techniques.


Linguistics, meet Evolutionary Biology

One of the things that I love about my field is the indiscriminate adoption of techniques from other fields. Statistics, computer science, neuroscience, and linguistics are most commonly drawn upon, but no field, no matter how seemingly irrelevant, is off limits.

While working on and doing research for my pet project of making a robust unsupervised OCR post-processor, I had the idea to use something akin to phylogenetic trees to infer the correct spelling of a word from a collection of misspelled variants. I knew better than to think this was a novel idea and, as I found out, the discipline of textual criticism has already used cladistics, a technique from evolutionary biology, in an effort to better understand the relationships between different transcribed manuscripts. The specific undertaking I focused my attention on was the fascinating Canterbury Tales Project.

Before computers and the printing press, manuscripts had to be copied by hand by scribes who, quite understandably, made some mistakes along the way. Particularly with ancient manuscripts, many transcriptions of transcriptions were made. As the transcripts got further and further away from the original manuscript, they picked up errors and idiosyncrasies from their direct ancestors while adding a few of their own. This should sound familiar: this process is tantamount to evolution and "descent with modification".

In evolutionary and comparative biology, much work has been put into phylogenetic systematics, the study of the diversification of life and the evolutionary relationships between organisms. As a result, many sophisticated methods have been devised to taxonomize and group organisms based on shared characteristics, be it amino acid sequences in proteins or the presence or absence of limbs. These techniques are ripe for both admiration and stealing for use in other fields.

Evolutionary biology’s search for the Last Common Ancestor is not unlike trying to recreate the original manuscript archetype from a series of transcriptions, the end goal of stemmatics and one of the end goals of the Canterbury Tales Project.

The Canterbury Tales was likely left unfinished at the time of Geoffrey Chaucer's death in 1400, and the earliest known manuscript was not written by Chaucer himself. Currently, there are 83 known manuscripts available to scholars, of varying degrees of completion. The variations and the degree of variation between these manuscripts form the basis of the textual critic's cladograms, which, superficially, look exactly like phylogenetic trees (or the hierarchical cluster dendrograms I use on a semi-regular basis).

Phylogenetic tree I made, inferred from dissimilarity in the amino acid sequence of Cytochrome C Oxidase (subunit III)


Excerpt of a stemma diagram showing the "evolution" of a medieval Icelandic Chivalric saga through dozens of manuscripts, some dating back to the late 14th century [1]

The sheer size of the Tales precludes the possibility of effectively performing the lower criticism manually, but the members of the Canterbury Tales Project had the ingenuity to automate the process using techniques and software from phylogenetics.

The results of their analysis were used to identify the group of manuscripts that were most likely closest to Chaucer's original. It also serves as strong additional evidence that the Tales were unfinished.[2]

More generally, and outside the context of just textual criticism, I think a lot of techniques borrowed from phylogenetics and related fields—which, of course, have their grounding in math and statistics—lend themselves easily to use in linguistics and natural language processing. Certainly, memetics and evolutionary linguistics have used evolutionary models of information transfer, but these methods have applications far beyond just the obvious.

The intuition behind evolutionary biology and the ease-of-use of the software available to bioinformaticians can bridge the gap between heady (read: scary) technical math topics like distances and split decomposition[3] and the more humanistic fields.

While techniques from phylogenetics weren't so applicable to my OCR post-processor project—because the errors don't propagate over time—I plan to use some for other pet projects of mine, including using Twitter to study the etiology and evolution of slang and neologisms.

References:
[1] Hall, A. (2013). Making stemmas with small samples, and digital approaches to publishing them: testing the stemma of Konráðs saga keisarasonar. Digital Medievalist, 9. (link)
[2] Barbrook, A. C., Howe, C. J., Blake, N., & Robinson, P. (1998). The phylogeny of the Canterbury Tales. Nature, 394, 839. (link)
[3] Bandelt, H. J., & Dress, A. W. (1992). Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Molecular phylogenetics and evolution, 1(3), 242-252.
