Linguistics, meet Evolutionary Biology

One of the things that I love about my field is the indiscriminate adoption of techniques from other fields. Statistics, computer science, neuroscience, and linguistics are most commonly drawn upon, but no field, no matter how seemingly irrelevant, is off limits.

While researching my pet project, a robust unsupervised OCR post-processor, I had the idea to use something akin to phylogenetic trees to infer the correct spelling of a word from a collection of misspelled variants. I knew better than to think this was a novel idea and, as I found out, the discipline of textual criticism has already used cladistics, a technique from evolutionary biology, to better understand the relationships between different transcribed manuscripts. The specific undertaking I focused on was the fascinating Canterbury Tales Project.

Before computers and the printing press, manuscripts had to be copied by hand by scribes who, quite understandably, made some mistakes along the way. Particularly with ancient manuscripts, many transcriptions of transcriptions were made. As the copies got further and further from the original manuscript, they picked up errors and idiosyncrasies from their direct ancestors while adding a few of their own. This should sound familiar: the process is tantamount to evolution and “descent with modification”.

In evolutionary and comparative biology, much work has been put into phylogenetic systematics, the study of the diversification of life and the evolutionary relationships between organisms. As a result, many sophisticated methods have been devised to taxonomize and group organisms based on shared characteristics, whether amino acid sequences in proteins or the presence or absence of limbs. These techniques are ripe for both admiration and stealing for use in other fields.

Evolutionary biology’s search for the Last Common Ancestor is not unlike trying to recreate the original manuscript archetype from a series of transcriptions, the end goal of stemmatics and one of the end goals of the Canterbury Tales Project.

Chaucer likely left The Canterbury Tales unfinished at the time of his death in 1400, and the earliest known manuscript was not written by Chaucer himself. Currently, there are 83 known manuscripts available to scholars, in varying degrees of completion. The variations among these manuscripts, and the degree of that variation, form the basis of the textual critic’s cladograms, which superficially look exactly like phylogenetic trees (or the hierarchical cluster dendrograms I use on a semi-regular basis).

phylogenetic tree I made inferred from dissimilarity in the amino acid sequence of Cytochrome C Oxidase (subunit III)


Excerpt of a stemma diagram showing the “evolution” of a medieval Icelandic Chivalric saga through dozens of manuscripts, some dating back to the late 14th century [1]


The sheer size of the Tales precludes the possibility of effectively performing the lower criticism manually, but the members of the Canterbury Tales Project had the ingenuity to automate the process using techniques and software from phylogenetics.
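To give a flavor of what automating this can look like, here is a minimal sketch in Python. It is not the Project's actual method or software: it swaps in plain edit distance and average-linkage clustering as the simplest stand-ins I could think of. The witness names (W1 to W4) and their spellings are made up for illustration; only the underlying line is Chaucer's.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def average_linkage(texts: dict):
    """Repeatedly merge the two closest clusters; return a nested tuple of names."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in texts]

    def dist(c1, c2):
        # Mean pairwise edit distance between the members of two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(edit_distance(texts[x], texts[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical witnesses: invented spellings of one line, not real manuscript readings.
witnesses = {
    "W1": "whan that aprille with his shoures soote",
    "W2": "whan that aprill with his shoures soote",
    "W3": "whan that averyll with hys shoures sote",
    "W4": "whan that averyll with hys showres sote",
}

tree = average_linkage(witnesses)
print(tree)
```

The nested tuple groups W1 with W2 and W3 with W4, which is the toy equivalent of two manuscript families. Real stemmatic work has to deal with things this sketch ignores, such as scribes copying from more than one exemplar.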

The results of their analysis were used to identify the group of manuscripts that were most likely closest to Chaucer’s original. They also serve as strong additional evidence that the Tales were unfinished.[2]

More generally, and outside the context of just textual criticism, I think a lot of techniques borrowed from phylogenetics and related fields—which, of course, have their grounding in math and statistics—lend themselves easily to use in linguistics and natural language processing. Certainly, memetics and evolutionary linguistics have used evolutionary models of information transfer, but these methods have applications far beyond just the obvious.

The intuition behind evolutionary biology and the ease of use of the software available to bioinformaticians can bridge the gap between heady (read: scary) technical math topics like distances and split decomposition[3] and the more humanistic fields.
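As a small taste of the "distances" side, here is a sketch of the four-point condition, a standard test for whether a table of pairwise distances fits a tree exactly: for every four items, the two largest of the three pairwise sums must be equal. This is not an implementation of split decomposition, only the kind of check that sits underneath distance-based methods. The function name, the `pairs` helper, and the toy distances are all my own inventions for illustration.

```python
from itertools import combinations


def is_tree_like(d: dict, tol: float = 1e-9) -> bool:
    """Check the four-point condition on distances keyed by frozenset pairs."""
    names = sorted({x for pair in d for x in pair})

    def dist(x, y):
        return d[frozenset((x, y))]

    for a, b, c, e in combinations(names, 4):
        # The three ways of splitting four items into two pairs.
        sums = sorted([dist(a, b) + dist(c, e),
                       dist(a, c) + dist(b, e),
                       dist(a, e) + dist(b, c)])
        # Tree-like data: the two largest sums coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def pairs(table: dict) -> dict:
    """Hypothetical helper: turn {("A", "B"): 2, ...} into frozenset-keyed form."""
    return {frozenset(k): v for k, v in table.items()}


# Made-up distances that fit a tree grouping A with B and C with D.
tree_like = pairs({("A", "B"): 2, ("C", "D"): 2,
                   ("A", "C"): 4, ("A", "D"): 4,
                   ("B", "C"): 4, ("B", "D"): 4})

# Same table with one distance changed, so no single tree fits exactly.
conflicting = pairs({("A", "B"): 2, ("C", "D"): 2,
                     ("A", "C"): 4, ("A", "D"): 4,
                     ("B", "C"): 3, ("B", "D"): 4})

print(is_tree_like(tree_like), is_tree_like(conflicting))
```

Real data, whether protein sequences or manuscripts, rarely passes this test exactly, and that conflicting signal is the situation methods like the one in [3] are meant to help with.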

While techniques from phylogenetics weren’t so applicable to my OCR post-processor project (because the errors don’t propagate over time), I plan to use some for other pet projects of mine, including using Twitter to study the etiology and evolution of slang and neologisms.

[1] Hall, A. (2013). Making stemmas with small samples, and digital approaches to publishing them: testing the stemma of Konráðs saga keisarasonar. Digital Medievalist, 9.
[2] Barbrook, A. C., Howe, C. J., Blake, N., & Robinson, P. (1998). The phylogeny of the Canterbury Tales. Nature, 394, 839.
[3] Bandelt, H. J., & Dress, A. W. (1992). Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Molecular Phylogenetics and Evolution, 1(3), 242-252.
