Computational foreign language learning: a study in Spanish verbs usage

Abstract: I did some computer-y stuff to construct a personal Spanish text corpus and create a Spanish verb study guide specifically tailored to the linguistic variety of Spanish I intend to consume and produce. It worked fairly well. It also revealed a (in some small way) generalizable depiction of the relative frequencies of Spanish verb tenses and moods. This technique may prove to be extremely beneficial to Spanish-language pedagogy. If you're uninterested in my motivations or procedure, you can skip to the section labeled "results".

As regular readers of this blog may be aware, one of my favorite activities is marshaling the skills that I use as a computational scientist to study the humanities. For example, in a previous post, we saw how principles from phylogenetic systematics helped textual critics reconstruct the original manuscript for "The Canterbury Tales"; in another, we deployed techniques first used to study physics to the end of fooling vineyards into retweeting fake, computer-generated wine reviews.

For this post, I used both tools from computational linguistics and some good-old-fashioned data wrangling (web-scraping, parsing texts, etc...) to create a custom-fit Spanish verb study guide.

The problems

Problem #1

Although foreign language immersion is almost certainly the best learning path for most types of foreign language learners, no reasonable student without a lavish budget for traveling can expect to get by without doing some rote memorization. In the context of Spanish verbs, this either means unguided memorization of a dictionary or consultation of a list of the most commonly used Spanish verbs. But even if you could trust that the most-popular-verbs list was compiled in a principled manner, there are vast regional and sub-culture-specific variations in verb frequency. For example, the verb coger means "to take" in Spain but in Central America it's... it's a... pretty vulgar verb. It stands to reason that there are pretty enormous differences in this verb's popularity across regions, contexts, and registers. Depending on which region's dialect you prioritize familiarity with, and depending on how raggle-taggle the people you intend to roll with are—or the media you intend to consume—a one-size-fits-all verb list might let you down.

Problem #2

English isn't a very inflective language—the tense (or person, mood, aspect, etc...) is largely determined, not through verb conjugation, but via periphrasis, the use of personal pronouns, and other auxiliary words. This is in stark contrast to Spanish, a highly inflective, relatively synthetic language where the verb's conjugation betrays its tense, person, mood, and aspect—all in one word! This linguistic elegance is a learning obstacle, since one verb might be written in nearly 50 different ways (6 persons * (4 tenses in the indicative mood + 3 tenses in the subjunctive mood + 1 imperative mood)).
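To make that concrete, the single infinitive hablar ("to speak") surfaces as hablo, hablas, habló, hablaba, hablaría, hable, hablara, and dozens of other forms, each of which encodes person, tense, and mood at once.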

This pedagogical nightmare is partially allayed by careful prioritization of some tenses and moods over others—at least initially. For example, a Spanish-language learner almost always learns the commonly used and versatile present indicative tense first. But beyond the next few obvious choices, the order in which these tenses should be prioritized is not clear and is (probably) dependent on how and where you expect to use and consume the language. Further complicating things, there are entire persons (here's looking at you, vosotros) that are very uncommon in most Spanish-speaking countries.

The solution

The solution to this problem is to create a personal corpus of Spanish text containing examples of the types of text you expect to consume and produce. Then the verbs need to be identified, have their mood, tense, and person recorded, and be converted into infinitive form (for frequency tabulation). The relative frequencies of the persons, moods, and tenses—as well as the frequencies of the verbs (in infinitive form)—will inform the creation of a Spanish verb study guide specifically catered to the linguistic variety the learner intends to employ. Whether the learner's primary interest in learning Spanish is to be able to bond with a new family member over their love of Mexican telenovelas or to read and understand Don Quixote in its entirety, this approach will hasten the learner's sense of accomplishment relative to cookie-cutter verb study guides, increase learner satisfaction, and increase the likelihood of the learner actually achieving language mastery. I mean, as a learner myself, I would be discouraged if I felt like the main payoff of studying Spanish were to read and understand books that are very obviously juvenile or primarily meant for pedagogical purposes. I want to read Márquez and I want to read him now!

The corpus

For my particular corpus, I chose a whole mess of books (most of which I've read—and loved—in English) that I'm interested in reading in the original language. These include Rayuelas and Final De Juego by Julio Cortázar (my favorite short story writer), Cien Años De Soledad by Gabriel García Márquez (generally considered to be a masterpiece), Darios de Motocicleta by Che Guevara, Ficciones by Jorge Luis Borges, and La Cuidad De Las Bestias by Isabel Allende. These texts were obtained electronically—legitimately!—and I used various ad hoc regexes to remove formatting and conversion-from-PDF-to-text artifacts.

My interest in Spanish isn't only for consuming literature, though; I wanted to include other sources of text, like movie scripts (I planned on Lo Que le Pasó a Santiago, generally considered to be one of the best Puerto Rican films), but I couldn't find the script online. I also wanted to include the lyrics to my favorite Spanish-language bands (Soda Stereo, El Ultimo Vecíno, Décima Víctima, Caifenes, Shakira, Millie Quezada, ...) but the tool I used to identify the verbs in the corpus often choked on these texts. Why, you ask?...

Parts-of-speech tagging

(References are at the bottom of the post.)

Parts-of-speech tagging (hereafter, 'POS tagging') is when you go through a text and, for each word, identify which part of speech (verb, noun, adjective, etc...) the word functions as.

This is a non-trivial task because the same word can function as different parts of speech depending on the context. Take the following sentence, for example, which is an expanded and modified version of a sentence used as an example in this video:

Fruit flies like bananas

Taken individually, all of the words in this sentence can function as multiple parts of speech. Take "like", for instance; it can be a noun ("my status got mad likes"), a verb ("I like your status"), a quotative ("I was like, 'I enjoyed your status'"), a conjunction ("I updated my status like the world depended on it"), or a preposition ("I wrote my status like Nathaniel Hawthorne"). Depending on how colloquial the text in question is, "like" can even be used as a discourse marker ("I'm, like, scared of ghosts, Scoob"). As a standalone word, "like" can serve the purpose of 6 different parts of speech.

But even looking at the sentence as a whole, the part of speech of each word is ambiguous.

Concretely, the sentence can be interpreted as (a) "fruit flies (noun) like (verb) bananas (noun)", (b) "fruit (noun) flies (verb) like (preposition) bananas (noun) [do]", or even (c) "fruit (noun) flies (verb) like (conjunction(?)) bananas (adjective)"—using the colloquial sense of bananas meaning "crazy".

Note that the POS tag for one word is conditional on the POS tags of other words: whether flies is a noun or a verb affects whether bananas is interpretable as an adjective.

Because this task isn't easy, it used to be left to humans to perform. Now, various techniques allow it to be done programmatically to a high degree of accuracy. We'll go through a few of them, ending with the sophisticated method employed by the POS tagger that we will be using, the Stanford Parts-of-speech tagger.

Unigram tagging

A training corpus with the POS tags for each word is read and, for each unique word, the number of times it is used as each of the various parts of speech is tallied. When a word is encountered in untagged text, the tagger chooses the part of speech that the word was most commonly used as in the training text. If the word was not in the training text at all, it defaults to a noun. Somehow, this context-free, elementary method can yield accuracies of 90%-94% (Brill & Wu, 1998). When Brill and Wu used this method on the famous Penn Treebank Wall Street Journal corpus with an 80%/20% training/testing split, it achieved 93.3% accuracy.
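To make the counting logic concrete, here is a minimal sketch of a unigram tagger; this is just an illustration of the idea (not the Stanford implementation), and the tiny training set is invented:

from collections import Counter, defaultdict

# toy pre-tagged training data (invented for illustration)
training = [("fruit", "NOUN"), ("flies", "VERB"), ("like", "PREP"),
            ("bananas", "NOUN"), ("flies", "VERB"), ("flies", "NOUN"),
            ("like", "VERB")]

# tally how often each word appears with each tag
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def unigram_tag(word):
    """Return the most frequent tag for a known word; default unknown words to NOUN."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"

print(unigram_tag("flies"))   # VERB (seen twice as a verb, once as a noun)
print(unigram_tag("shark"))   # NOUN (never seen in training)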

n-gram tagging

Using an n-gram model, the tag of a particular word is assumed to be conditionally dependent on the tags of the preceding n-1 words. For example, in a bigram model, the tag of the current word is guessed from the current word and the tag of the previous word. A trigram model uses tag information from the previous two words, in concert with the conditional probability of a particular tag given a certain word. The unigram tagger is a special case of the n-gram tagger where n is 1. It's not hard to see that n-gram tagging should offer a substantial accuracy improvement.

If this reminds you of the Markov chains that we made use of in the previous post on computer-generating wine reviews, then you have a good eye. N-gram tagging is a type of Hidden Markov Model (HMM). What makes HMMs different from simple Markov models is that the states themselves (the POS tags) are not directly observable; the observable portion of each state is the actual word—and the words are only a probabilistic function of the states.
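For intuition about how an HMM tagger actually picks a tag sequence, here is a bare-bones Viterbi decoder; the probability tables are tiny, hand-made stand-ins rather than anything estimated from a real tagged corpus:

import math

tags = ["NOUN", "VERB", "PREP"]
start_p = {"NOUN": 0.6, "VERB": 0.2, "PREP": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.5, "PREP": 0.2},
           "VERB": {"NOUN": 0.4, "VERB": 0.1, "PREP": 0.5},
           "PREP": {"NOUN": 0.8, "VERB": 0.1, "PREP": 0.1}}
emit_p = {"NOUN": {"fruit": 0.4, "flies": 0.3, "bananas": 0.3},
          "VERB": {"flies": 0.6, "like": 0.4},
          "PREP": {"like": 1.0}}

def viterbi(words):
    """Most probable tag sequence under the toy HMM, computed in log space."""
    # each cell holds (best log-probability so far, best tag path ending in that tag)
    cells = {t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-12)), [t])
             for t in tags}
    for word in words[1:]:
        new_cells = {}
        for t in tags:
            prev = max(tags, key=lambda p: cells[p][0] + math.log(trans_p[p][t]))
            score = (cells[prev][0] + math.log(trans_p[prev][t])
                     + math.log(emit_p[t].get(word, 1e-12)))
            new_cells[t] = (score, cells[prev][1] + [t])
        cells = new_cells
    return max(cells.values(), key=lambda c: c[0])[1]

print(viterbi(["fruit", "flies", "like", "bananas"]))
# -> ['NOUN', 'VERB', 'PREP', 'NOUN'] with these made-up numbers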

In addition to testing a unigram model, Brill and Wu also tested this technique's ability on the WSJ corpus. In particular, they used a trigram tagger—with a twist. Weischedel et al. (1993) noted that the suffix of a word (-ed, -s, -ing, -ion, -ly, etc...) strongly influences the probability that the word serves as a particular part of speech. When this information was wielded to help classify unknown words, it greatly improved accuracy. When Brill and Wu used this method with a trigram tagger on the WSJ corpus, the technique yielded a 96.4% accuracy rate.

Maximum Entropy models

Maximum Entropy models are a lot like—insofar as they are equivalent to—multinomial logistic regression models that attempt to model the probability of a given tag class given various predictor variables, or features. Maximum Entropy models can use features such as the current word, the previous word, the previous word's tag, etc...—as an HMM would—but also features like whether the word contains a number, whether the word is capitalized, etc... An optimization algorithm called Generalized Iterative Scaling selects the feature weights that maximize the likelihood function.
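To illustrate the "arbitrary features" idea, here is a sketch that uses scikit-learn's multinomial logistic regression as a stand-in for a maximum entropy tagger. The training sentences and the feature choices are invented for the example, and scikit-learn uses different optimizers than Generalized Iterative Scaling, but the model family is the same:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(sentence, i, prev_tag):
    """Arbitrary, easily extensible features for the word at position i."""
    word = sentence[i]
    return {"word": word.lower(),
            "prev_tag": prev_tag,
            "prev_word": sentence[i-1].lower() if i > 0 else "<START>",
            "is_capitalized": word[0].isupper(),
            "has_digit": any(c.isdigit() for c in word),
            "suffix3": word[-3:]}

# a tiny invented training set of (sentence, tags) pairs
train = [(["Fruit", "flies", "like", "bananas"], ["NOUN", "VERB", "PREP", "NOUN"]),
         (["I", "like", "bananas"], ["PRON", "VERB", "NOUN"])]

X, y = [], []
for sentence, sent_tags in train:
    prev = "<START>"
    for i, tag in enumerate(sent_tags):
        X.append(features(sentence, i, prev))
        y.append(tag)
        prev = tag

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([features(["Fruit", "flies"], 1, "NOUN")]))  # tags "flies" in context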

Ratnaparkhi (1996) tested a straightforward maximum entropy model on the WSJ corpus and noted that it yielded an accuracy of 96.6%. Four years later, Toutanova and Manning (2000) published a paper showing that by adding features like whether a capitalized word sits in the middle of a sentence, as well as non-local features that look up to 8 words back for a modal verb (for disambiguating base-form verbs and non-3rd-person singular present verbs), they could achieve a WSJ accuracy of 96.8%. This is the benefit of the Maximum Entropy approach—you can arbitrarily add features (within reason) without necessarily knowing how those features contribute to the probabilities of the tag outputs.

Three years after that, Toutanova et al. (2003) achieved a 97.2% accuracy rate on the WSJ corpus by (a) adding features for the words following the word currently being tagged, and (b) using regularization to combat overfitting as a result of using many features—many of which probably only weakly contribute information about the probability of the current word's tag class. Their regularization technique involved placing a zero-centered Gaussian prior on the feature weights and is mathematically tantamount to the L2 regularization that we saw in this previous blog post. This state-of-the-art tagger is the one on which the Stanford tagger we use is based.
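Schematically (my notation, not the paper's), a zero-centered Gaussian prior with variance \sigma^2 on each feature weight \lambda_j means maximizing an L2-penalized log-likelihood along the lines of

\[ \sum_{i} \log p(t_i \mid c_i, \lambda) \; - \; \sum_{j} \frac{\lambda_j^2}{2\sigma^2} \]

where t_i is the tag of the i-th token and c_i is its feature context; the second term is exactly an L2 penalty on the weights.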

[There is another famous type of POS tagger called the Transformation-Based tagger. In contrast to all the others mentioned above, this is not a probabilistic/stochastic model and is, instead, based on rules and knowledge. I won't describe it here because it's very different and this post is already too long, but I should mention that it can score a 96.6% on the WSJ corpus (Brill & Wu, 1998).]

The procedure

These steps assume a POSIX-compliant system and some command-line proficiency
The filenames are links and you can find a repo with all the code here

  • Downloaded full version of the Stanford Parts-of-speech tagger
  • Ran the tagger on the text, put each tag on a separate line, and filtered for verbs only. The parts of speech were identified using this tagset. As you can see, the verb tags all start with the letter "v". This can be achieved with the following incantation:

    ./stanford-postagger.sh models/spanish.tagger THE_BOOK.txt | perl -pe 's/ /\n/g' | grep '_v' > tmp
    


    If this causes you problems, you might want to try giving the tagger (which runs on multiple cores!) more memory; try adding -Xmx2048M as an argument to the java command in ./stanford-postagger.sh—this will give it 2 GB to work with.

  • For each work, I ran this.py on it, which parsed the Stanford tags and converted them into a nice tab-delimited format:

    ./stanford-output-to-nice-tsv.py < tmp > ./output-verbs/THE_BOOK.txt
    

  • Catted all of them together into all.txt–a monstrous text file with 84,437 words that the tagger interpreted as verbs:

    cat rayuelas.txt final-de-juego.txt darios-de-motocicleta.txt cien-anos-de-soledad.txt ficciones.txt la-cuidad-de-las-bestias.txt > all.txt
    

Now we need to get the infinitives but, in order to prioritize which ones to look up and to avoid repeating conjugated forms, we first need to get the unique verbs...

  • So I ran

    cat all.txt | perl -pe 's/(.+?)\t.*/\1/g' > all-verbs.txt
    


    to get a list of only verbs (no mood or tense)

  • I wanted to get a list of unique verbs sorted by the number of occurrences; this would normally be a job for sort | uniq -c. Desafortunadamente, that command fails here. It turns out that Unicode can represent (for example) habría in at least two different ways. For this reason, we have to use the Python script process-all-verbs.py, which uses the unicodedata module to normalize the verbs before counting them (a minimal illustration of the issue follows below).

    ./process-all-verbs.py | tee all-verbs-count.txt
    

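Here is a minimal illustration of the Unicode problem and the normalization fix; this is just the idea behind process-all-verbs.py, not its actual contents:

    # -*- coding: utf-8 -*-
    import unicodedata
    from collections import Counter

    # "habría" can be written with a precomposed í or with an i plus a combining accent
    precomposed = u"habr\u00eda"
    decomposed = u"habri\u0301a"
    print(precomposed == decomposed)   # False: they render identically but compare unequal

    def normalize(s):
        return unicodedata.normalize("NFC", s)

    print(normalize(precomposed) == normalize(decomposed))   # True after normalization

    # counting without normalizing would split one verb across two keys
    print(Counter(normalize(v) for v in [precomposed, decomposed, u"habr\u00eda"]))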
Ok, now we're ready to get infinitive forms for these verbs. We are going to do this by programmatically making requests to the (excellent) website Span¡shD!ct.com to translate each word. What we want can then be extracted from the returned HTML via CSS selectors.

  • get-infinitives.py goes through each line of all-verbs-count.txt and constructs the URL to query the website with. It then uses the CSS selector ".mismatch" to find information about the verb. In the best-case scenario, the result says something like "____ is the ____ form of ____ in the ____". Sometimes there's more than one possible person or tense, so it says "____ represents different conjugations of the verb ____". In either case, we get the infinitive. If it fails, we record the failure and move on. The script waits between 1 and 2 seconds between each verb. After every 20 verbs, it dumps the JSON so that, if something bad happens, I can just load the intermediate results and restart. (A rough sketch of this request-and-scrape step appears after this list.)
  • You can see that the SpanishDict infinitive conversion systematically failed for certain words. For example, it interpreted inflected verbs like he, dice, and era as English words to translate, not Spanish words to provide information for. In other cases, it interpreted a verb’s past participle (aburrir -> aburrido ("to bore")) as an adjective ("boring"). I manually filled in many of the ones that failed using equal parts regex and black magic. This went into finished-supplemented.json.
  • Finally, we need to inner join all.txt to the information in finished-supplemented.json. The combine.py script does this:

    ./combine.py | tee tagged-plus-infinitives.txt 
    

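As an aside, the request-and-scrape logic that get-infinitives.py performs looks roughly like the sketch below. The URL pattern, the behavior of the ".mismatch" selector, and the wording of the responses are assumptions based on the description above; the real script (and the site's markup) may well differ:

    import time
    import requests
    from bs4 import BeautifulSoup

    def lookup_infinitive(conjugated):
        # hypothetical URL pattern; the real script constructs its own query URLs
        url = "http://www.spanishdict.com/translate/" + conjugated
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        hits = soup.select(".mismatch")      # the CSS selector mentioned above
        if not hits:
            return None                      # record the failure and move on
        # e.g. "... is the ___ form of <infinitive> ..."; actual parsing left vague here
        return hits[0].get_text()

    for verb in ["hablaba", "dijo"]:
        print(lookup_infinitive(verb))
        time.sleep(1.5)                      # be polite: wait between requests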
The tab-delimited tagged-plus-infinitives.txt is now ready to be consumed for analysis.
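As a teaser for that analysis, here is a minimal sketch of the kind of tabulation involved. The column positions below are hypothetical (the actual layout is whatever combine.py emits), so treat the indexing as a placeholder:

    from collections import Counter

    inf_counts, mood_tense_counts = Counter(), Counter()
    with open("tagged-plus-infinitives.txt") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # hypothetical column positions -- adjust to whatever combine.py actually emits
            mood, tense, infinitive = fields[1], fields[2], fields[-1]
            inf_counts[infinitive] += 1
            mood_tense_counts[(mood, tense)] += 1

    total = sum(inf_counts.values())
    for infinitive, n in inf_counts.most_common(14):
        print("{}\t{}\t{:.2f}%".format(infinitive, n, 100.0 * n / total))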

Some numbers

  • Rayuelas - 203,197 words - 29,882 verbs
    Final de juego - 54,303 words - 8,160 verbs
    Darios de Motocicleta - 53,804 words - 6,557 verbs
    Cien Años de Soledad - 154,381 words - 20,987 verbs
    Ficciones - 48,845 words - 5,769 verbs
    La Cuidad De Las Bestias - 94,075 words - 13,082 verbs
  • There were 84,437 words that the tagger identified as verbs in all.
  • There were 13,972 unique conjugated verbs.
  • After the first pass with SpanishDict, we had infinitives for only 6,852 verbs. This number greatly increased with the black magic alluded to in the previous section.
  • I went from 84,437 to 71,378 verbs when I inner joined with the verbs that I was able to find infinitives for.

The results

[Figure 1: Proportion of Spanish verb moods and tenses in corpus]

The results were rather fascinating. These were the 14 most common conjugated verbs:

conjugated verb    count    percent
había               2599     3.64
era                 2396     3.36
es                  2303     3.23
dijo                1763     2.47
estaba              1169     1.64
fue                  816     1.14
ser                  606     0.85
habían               517     0.72
hay                  512     0.72
tenía                467     0.65
ha                   447     0.63
eran                 431     0.60
podía                412     0.58
iba                  384     0.54


(you can see the full spreadsheet here)

With this information alone, this whole endeavor was worth it. Sure, most of the verbs in this list aren't much of a surprise, but there are two pieces of information that could prove really helpful to me. The first is that 4 verbs in the top 14 are forms of the verb haber ("to have")—including the very first one, which accounts for 3.6% of all conjugated verbs in the corpus. This is a verb that I was, heretofore, relatively unfamiliar with.

In contrast to tener (which also means "to have"), haber is often used as an auxiliary verb, as it would be in such English sentences as "I have to go to the dentist", "I had all but lost it" (past perfect tense), and "there is a freeze-up coming". Because of its ubiquitous usage as an auxiliary word (like its use in all sentences in the perfect tenses), I should get more familiar with this verb and its conjugations if I ever hope to read these works of literature.

The second important piece of information for me was that a majority of the verbs in the top 14 were in the imperfect tense (a type of past tense). Now, I think I may have been concentrating too much on the preterite tense (another past tense) in comparison.

Next, these were the 14 most common verbs when put into infinitive form:

infinitive    count    percent
ser            8066    11.30
haber          5461     7.65
estar          2746     3.85
decir          2734     3.83
tener          1774     2.49
hacer          1757     2.46
ir             1721     2.41
poder          1614     2.26
ver            1336     1.87
dar            1210     1.70
saber           843     1.18
pasar           730     1.02
parecer         682     0.96
pensar          596     0.83

(you can see the full spreadsheet here)

To me, there wasn't really anything unexpected here, except for maybe pasar (to happen) and parecer (to seem), which I was, up until this point, relatively unfamiliar with—in spite of the fact that they are used in a number of frequently spoken expressions like ¿Qué pasó? ("What happened?") and ¿Qué te parece? (~"What do you think?").

Finally, figure 1 is a plot that depicts the proportions in which each mood and tense occur. The large vertical bars show the relative proportions of each mood (I count the Infinitive, Gerund, and Participle as moods) in descending order; they are Indicative (65%), Infinitive (20%), Subjunctive (4%), Participle (4%), Gerund (3%), and Imperative (1%). Each vertical bar is further broken down by the proportion of each tense within that mood (sorted, with the most frequently used on the bottom). For example, the present tense is the most common tense in the indicative mood and accounts for 26% of all mood/tense pairs. The Infinitive, Participle, and Imperative moods (to the extent that they are actually moods) have only one tense (to the extent that they can be said to have tenses).

These results were the most surprising to me; for one, I was (again) reminded that I should probably treat nailing down the imperfect tense with as much or more importance as the preterite tense. Second, I was surprised that usage of the future tense was far eclipsed by the gerund, the participle, and both subjunctive tenses—in spite of the fact that I use it quite often in my texts to my friends and in my internal monologue. Of course, this—and other insights—may just be artifacts of the particular body of literature I chose for my corpus (see next section).

Limitations
Although this was a wildly fun project that yielded interesting and extremely practical insights, there are a number of important caveats to be aware of when interpreting these results.

First is a generalizability issue; the results indicate the verb popularity and mood/tense breakdowns for just 6 pieces of Spanish literature. Because of this, the corpus is heavily dominated by the writing style of the included authors—at least some of whom have a very idiosyncratic writing style. Additionally, as with most literature, all of the non-short-stories in my corpus were told in the past tense (usually by a third-person omniscient narrator). This past-tense bias is very clearly non-representative of everyday spoken Spanish (of course, it was never meant to be representative of that). This problem could have been at least partially alleviated via the inclusion of more prosaic Spanish from movie scripts and blogs—if only they had POS-tagged correctly!

Speaking of tagging correctly, the second issue is the correctness of the POS tags themselves. The best POS taggers (Stanford is certainly one) can, at best, achieve an accuracy of 97%. Although this is an incredible feat of computational linguistics and the product of many, many years of research, it is important to put it in the proper perspective. Recall that (a) even the rudimentary unigram tagger can achieve a 90%-94% accuracy rate and (b) the 97% accuracy rate decreases as the testing corpus diverges in style from the training corpus. Especially because of Cortázar—who (at least in English translations) employs highly unusual sentence structure and often straight-up grammatically incorrect, non-human-parsable sentences—this fact must be kept in mind; unless the Spanish model that comes with Stanford was trained on Surrealist literature (it wasn't!), tag accuracy will suffer.

References

Brill, Eric, and Jun Wu. "Classifier combination for improved lexical disambiguation." Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 1998.

Ratnaparkhi, Adwait. "A maximum entropy model for part-of-speech tagging." Proceedings of the conference on empirical methods in natural language processing. Vol. 1. 1996.

Toutanova, Kristina, and Christopher D. Manning. "Enriching the knowledge sources used in a maximum entropy part-of-speech tagger." Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13. Association for Computational Linguistics, 2000.

Toutanova, Kristina, et al. "Feature-rich part-of-speech tagging with a cyclic dependency network." Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003.

Weischedel, Ralph, et al. "Coping with ambiguity and unknown words through probabilistic models." Computational linguistics 19.2 (1993): 361-382.


Playing around with #rstats twitter data

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag "#rstats", which denotes that a tweeter is tweeting about R.

As a warning, I don't know much about how these data were collected, whether it was collected at random times during the day or whether it was biased toward particular times and, therefore, locations. I wouldn't really read too much into this.

Most common co-occurring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occurring hashtags, I used the following perl script:

#!/usr/bin/perl

while(<>){
    chomp;
    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#\w+)/;
    print join "\n" => @matches if @matches;
}

which uses the regular expression "(#\w+)" to search for hashtags after removing "#rstats" from every tweet.

On the unix command-line, I put these other hashtags into a file and sorted via these commands:

cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt

sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a count-sorted list of co-occurring hashtags in descending order. The top 10 co-occurring hashtags were as follows (you can see the rest here):

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of "#python" and "#ggplot2" in the top 10 made me wonder what the top 10 programming language and R package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss  (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2  (plus 298 for ggplot)  (plus 6 for #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis   (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here.

I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about--at least as far as hashtags-applied-to-a-package go.

#strangehashtags
Before moving on to the next section, I wanted to share my favorite co-occurring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters
One of the first things I did with these data is a simple aggregation and sort to find the tweeters that used the hashtag most often:

library(dplyr)
THE_DATA %>%
  group_by(User) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> prolific.rstats.tweeters

Here are the top 10 (you can see the rest here):

@Rbloggers	1081
@hadleywickham	498
@timelyportfolio	427
@recology_	419
@revodavid	210
@chlalanne	209
@adolfoalvarez	199
@RLangTip	175
@jmgomez	160

Nothing terribly surprising here.

Normalizing by total tweets
In a twitter discussion about these data, a twitter friend, Tim Hopper, posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would come out ahead if you control for total tweet volume. I wondered how this sorting would look.

Answering this question gave me an excuse to use Hadley Wickham's new package, rvest (I literally just got why the package is named as much while typing this out) which makes web scraping easier--in part by leveraging the expressive power of the magrittr package.

To get the total number of tweets for a particular tweeter, I wrote the following function:

library(rvest)
library(magrittr)
get.num.tweets <- function(handle){
  tryCatch({
    unraw <- function(raw_str){
      raw_str <- sub(",", "", raw_str)    # remove commas if any
      if(grepl("K", raw_str)){
        return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
      }
      return(as.numeric(raw_str))
    }
    html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text() %>%
      unraw
    },
    error=function(cond){return(NA)})
}

The real logic (and beauty) of which is contained only in the last few lines:

    html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text()

The CSS element that houses the number of total tweets from a useR's twitter page was found easily using SelectorGadget.

After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I'm considerate) I divided number of #rstats tweets by the total number of tweets to come up with a normalized value.

The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio 
1     @medzihorsky     9            28 0.3214286 
2        @statworx     5            16 0.3125000 
3    @LearnRinaDay   114           404 0.2821782 
4  @RforExcelUsers     4            15 0.2666667 
5     @showmeshiny    27           102 0.2647059 
6           @tcrug     6            25 0.2400000 
7   @DailyRpackage   155           666 0.2327327 
8   @R_Programming    49           250 0.1960000 
9        @hexadata     8            41 0.1951220 
10     @Deep_RHelp    11            58 0.1896552 

In case you were wondering, Trey Causey still "won" by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]   
Source: local data frame [1 x 4]                 
                                                 
       User count num.of.tweets        ratio     
1 @tdhopper     8         26700 0.0002996255     
> tweeters[which(tweeters$User=="@treycausey"),] 
Source: local data frame [1 x 4]                 
                                                 
         User count num.of.tweets      ratio     
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R "thought leaders" (whatever that is). Most tweets about R don't use the #rstats hashtag, anyway.

Again, I wouldn't read too much into this :)


Sending text messages at random times using python

Given my interest in applying statistics and analytics to most (if not all) of the quantifiable aspects of my life, when I learned about self-tracking and the associated 'Quantified Self' movement, it should come as no surprise to anyone that knows me that I wanted to get started right away.
And...
Given my interest in making life harder than it needs to be, it makes sense that I would eschew existing self-tracking tools and build my own. A neat side effect of this obstinacy is getting to learn new things.

The basic idea is that, at random times during the day, I fill out a survey that I designed for myself, including questions such as: "How happy are you right now?", "How much energy would you say that you have right now?", and "Where are you right now?".

The most reliable and fastest way to get in touch with me is to send a text message. So, sending myself text messages at random times during the day is the best way to prompt me to fill out this self-tracking survey.

To make it easier (and, therefore, more likely that I'll fill it out), the content of the text message should be a link to the survey on the web. And, in order to add flexibility to when I have to fill out the survey form while preserving the randomness of the sampling, the timestamp of when the text message was sent should be included as a URL parameter so that it can be stored in the database along with the answers to the survey.

The service that sends these text messages runs on a Debian GNU/Linux EC2 instance that also hosts the form I fill out and the database that the answers are dumped to.

Before we get to the code, I should explain the modules that we will need for this task, and my rationale for choosing them.

logging
Trying to debug a scheduled task or workflow is a living hell without proper and verbose logging. Since this must be run in the background (and not tied to a particular terminal emulator) simple print statements will not do. The more elegant, scalable, and extensible solution is to use Python's excellent 'logging' module.

smtplib
While there are a few different ways to send text messages (SMS) using Python, the solution I settled on is to use the 'smtplib' standard library module to send an email to an SMS gateway. This gateway will convert the email into a text message sent to my phone. smtplib is needed to send the email message.

apscheduler
Although cron (or, equivalently [?], Windows Scheduling Service) should be the tool of choice when scheduling commands to be run at specific times that never change, the fact that the text messages have to be sent at different times every day requires another solution. Probably the most elegant and cross-platform solution is to use the advanced Python scheduling library, apscheduler. The Python standard library comes with a similar module, sched, but apscheduler is more advanced in its scheduling capability and its ability to persistently store tasks in a database that survives process restarts. (It supports storage in SQLite, PostgreSQL, MongoDB, Redis, MySQL, Oracle, MS-SQL, Firebird, and Sybase.) But, unlike its standard library counterpart, it needs to be pip installed.

We will divide this task into two Python scripts: one that gets run once a day, computes n random times, schedules a text message to be sent at each of those times, and then sends them (we will call this send_daily_texts.py); and one that runs once and kicks off send_daily_texts.py every day at midnight (we will call this run_everyday.py).

send_daily_texts.py

#!/usr/bin/python -tt

import random
import sys
import logging
import smtplib
import email.utils
from email.mime.text import MIMEText
from datetime import datetime, timedelta, date
from apscheduler.schedulers.blocking import BlockingScheduler

# create logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('send_daily_texts.log')
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.info("[{}] - send_daily_texts was run".format(datetime.now()))

# the number of times to schedule and send text messages
# are provided as a command line argument
n = int(sys.argv[1])

logger.info("[{}] - going to choose {} random times".format(datetime.now(), n))

# we need to parse today's date to properly
# schedule the text message sending
dadate = datetime.now()
year = dadate.year
month = dadate.month
day = dadate.day

# the lower bound is 8 o' clock
lower_bound = datetime(year, month, day, 8, 0, 0)
logger.info("[{}] - the lower bound is {}".format(datetime.now(), lower_bound))

# the upper bound is 11 o' clock PM
upper_bound = datetime(year, month, day, 23, 0, 0)
logger.info("[{}] - the upper bound is {}".format(datetime.now(), upper_bound))

sched = BlockingScheduler()
logger.info("[{}] - Created blocking scheduler".format(datetime.now()))

wherefrom = 'YOUEMAILACCOUNTYOCREATE AT gmail DOT com'
whereto = 'YOURPHONENUMBER AT YOURSMSGATEWAY DOT com'
gmail_pw = 'YOURGMAILPASSWORD'

def encode_timestamp(timestamp):
    return str(timestamp).replace(" ", "+").replace(":", "%3A")
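    # (aside, not from the original post: urllib.quote_plus(str(timestamp)) -- after an
    # "import urllib" -- would produce the same "+" and "%3A" encoding as these replaces;
    # on Python 3 the equivalent is urllib.parse.quote_plus)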

def make_message(timestamp, wherefrom, whereto):
    slug = "http://THELINKURL/?timestamp={}".format(encode_timestamp(timestamp))
    msg = MIMEText(slug)
    msg['To'] = email.utils.formataddr(('Recipient', whereto))
    msg['From'] = email.utils.formataddr(('Author', wherefrom))
    msg['Subject'] = 'Time for the survey!'
    return msg

def send_text(should_exit=False):
    logger.info('[{}] - trigger triggered, going to send text'.format(datetime.now()))
    logger.info('[{}] - attempting to connect to gmail'.format(datetime.now()))
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login(wherefrom, gmail_pw)
    logger.info('[{}] - successfully connected to gmail'.format(datetime.now()))
    timestamp = datetime.now()
    msg = make_message(timestamp, wherefrom, whereto)
    logger.info('[{}] - going to send message {} to {}'.format(datetime.now(),
                                                               msg.as_string().replace('\n', '<br>'),
                                                               whereto))
    ret = server.sendmail(wherefrom, [whereto], msg.as_string())
    server.quit()
    if should_exit:
        logger.info('[{}] - finished... going to exit'.format(datetime.now()))
        sched.shutdown(wait=False)

def random_time(start, end):
    sec_diff = int((end-start).total_seconds())
    secs_to_add = random.randint(0, sec_diff)
    return start + timedelta(seconds=secs_to_add)

def get_n_random_times(n, start, end):
    times = []
    for i in range(0, n):
        times.append(random_time(start, end))
    times.sort()
    return times

times = get_n_random_times(n, lower_bound, upper_bound)
logger.info("[{}] - Received {} times to schedule".format(datetime.now(),
                                                         len(times)))

for ind, atime in enumerate(times):
    if ind == (n-1):
        sched.add_job(send_text, 'date', run_date=atime,
                      kwargs={"should_exit": True})
        logger.info("[{}] - added last task at {}".format(datetime.now(),
                                                         atime))
    else:
        sched.add_job(send_text, 'date', run_date=atime)
        logger.info("[{}] - added task at {}".format(datetime.now(),
                                                     atime))

sched.start()
logger.info("[{}] - everything is done".format(datetime.now()))

Before I describe "run_everyday.py" there are a few things I should note about the snippet above.

When I originally wrote this script, the text messages wouldn't send, even though the logger indicated that they had. I assumed this was because gmail rejected the message for not looking enough like an email. In order to correct this, I needed to use the email.mime.text module to add the standard email headers to the message to be sent.

Since I am only interested in experience-sampling my waking life, I didn't want to fill out the survey during hours when I am normally asleep. I had to make sure that I set 8 o'clock and 23 o'clock (11 PM) as my lower and upper bounds, respectively.

Third, if you decide to cannibalize this code, make sure you change the values for 'wherefrom', 'whereto', and 'gmail_pw'. The format of the SMS gateway address you should use depends on your mobile carrier. My particular SMS gateway is my 10-digit phone number @vtext.com. Yours will likely be different; consult this list.

run_everyday.py

#!/usr/bin/python -tt

import sys
import logging
from datetime import datetime
from subprocess import Popen, PIPE
from apscheduler.schedulers.blocking import BlockingScheduler

def run_daily_surveys(thelogger):
    thelogger.info("[{}] - Trigger triggered".format(datetime.now()))
    thelogger.info("[{}] - Going to run daily script".format(datetime.now()))
    p = Popen('./send_daily_texts.py 3', shell=True, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if p.returncode:
        thelogger.error("[{}] - Failed to run daily script".format(datetime.now()))
        sys.exit("Failed to run daily script")
    thelogger.info("[{}] - Ran daily script".format(datetime.now()))

def main():
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler('run_everyday.log')
    handler.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.info("[{}] - run_everyday.py was run".format(datetime.now()))
    
    sched = BlockingScheduler()
    logger.info("[{}] - blocking scheduler was created".format(datetime.now()))
    sched.add_job(run_daily_surveys, 'interval', days=1, args=[logger])
    logger.info("[{}] - everyday task added, going to start the scheduler".format(datetime.now()))
    sched.start()
    return 0

if __name__ == '__main__':
    STATUS = main()
    sys.exit(STATUS)

I've been running these tasks for about a week now, and it's working great!

My next couple of blog posts will be about server-side code and architecture to support my self-tracking project.


Why is my OS X Yosemite install taking so long?: an analysis

Why?
Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang, it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process and wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

Cleaning
The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl

use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with blank messages
    next if $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}

We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./clean-log.pl >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.

Analysis

Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s "parse_date_time()" function

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
  mutate(lub.time=parse_date_time(nice.date, 
                                  "%b %d! %Y! %H!:%M!:%S!", 
                                  tz="EST"))

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]

head(yos.log)


##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                    Message
## 1                   opendirectoryd (build 382.0) launched - installer mode
## 2                                  Logging level limit changed to 'notice'
## 3                                               Initialize trigger support
## 4 created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5                                set default handler for RPC 'reset_cache'
## 6                           set default handler for RPC 'reset_statistics'
##              nice.date            lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))
## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting each row's time by the start time

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log and what services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentange of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
## 
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and to the total time of the installation. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process' as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service",
                                       "percent.n",
                                       "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position = "dodge") +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))

[Figure: Breakdown of services during installation by writes to log]

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%
  head(n=5)


##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                    System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                       kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                                   Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the new system directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc..). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")

[Figure: Density plot of number of writes to log file during installation]

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
  if(time > start && time < end)
    return(TRUE)
  return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in,
                         the.start,
                         the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function (x){cat("\n");cat(x);cat("\n")})

## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## WARNING : ensureParentPathExists: Created  `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log file writes by process. This might give us a sense of what steps go on as the installation progresses by showing us which processes are most active.

# reduce number of service to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd",
                        "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes( y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")

[Figure: Faceted density of log file writes by process (scaled)]

This shows that no one process runs consistently throughout the entire installation; rather, the processes run in spurts.

The answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed in the '/usr/local/' and '/Library/' directories that wasn't Apple-supplied needed to be moved and moved back.

/usr/local/
/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom-compiled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time sink, since it comprised over 123,000 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

/Library/
Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl use. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules into a local (non-system) library, or use alternate versions of these languages that you download and install yourself or install via MacPorts.
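For example (a minimal check, assuming the stock Apple Python 2.7), you can confirm where per-user packages land, which is under your home directory rather than the system /Library:

#!/usr/bin/python
# the per-user site-packages directory, which is where
# "pip install --user <package>" puts things (no sudo, no /Library)
import site
print(site.getusersitepackages())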

---

You can find all the code and logs that I used for this analysis in this git repository.

This post is also available as an RMarkdown report here.


How to make an absurd twitter bot in python

In my last post, I outlined the steps I took to programmatically mimic the wine reviews of a dilettante sommelier. In this post, I'll explain the steps I took to create the twitter bot @HorseWineReview which combines a random wine with a random computer-generated review. I'll keep it short and sweet–the steps are as follows:

  • get a list of wines (from Freebase)
  • create a twitter account and application
  • write script to create and post the tweet
  • automate it with a cron job

Get a list of wines (from Freebase)
Freebase is a collaborative knowledge base that uses a graph database to store semantic information. Information can be retrieved by running MQL queries on their web interface or through a google-powered API. We'll test a query first on their query editor online, but move to accessing the result from the API once we can verify that our query is constructed properly.
The query is very simple, it looks like this:

[{
  "name": null,
  "type": "/wine/wine"
}]

This will return a list of names of all entities of type "/wine/wine" in JSON format. This is an excerpt:

{
  "result": [
    {
      "type": "/wine/wine",
      "name": "1999 Domaine Romanee Conti La Tache"
    },
     .............

Now that we've confirmed that this query works and the syntax is correct, let's get the results by using the Google Freebase API. If you don't have one already, you need to get a Google Developers account. From the Google Developers Console, you need to grab a Freebase API key. After that, we can write and run a python script to retrieve and dump out all the wine names. The script only needs the standard library's json and urllib modules. The code goes thusly:

#!/usr/bin/python
import json
import urllib

api_key = "YOUR KEY HERE"
service_url = 'https://www.googleapis.com/freebase/v1/mqlread'

freebase_query = [{'name': None,
                   'limit': 999999999999999,
                   "type": "/wine/wine"}]

params = {"query": json.dumps(freebase_query),
          "key": api_key}

url = service_url + "?" + urllib.urlencode(params)
response = json.loads(urllib.urlopen(url).read())

for item in response['result']:
    print item['name'].encode('utf-8')

Run this script and redirect its output to a file (for example, ./fetch_wines.py > wines.txt, or whatever you named your script and output file).

This stores one wine name on each line of the file.

Create a twitter account for your bot and register an application
After following the standard procedure of setting up a twitter account, head over to dev.twitter.com and set up a developer account on behalf of your bot. Then head over to apps.twitter.com to create a new application; this will be the conduit through which we programmatically update your twitter bot's status. Make sure you read the Terms of Service and don't violate them. Also make sure you give your application read and write access. After this, if you navigate to the "API Keys" tab, you should record the following information: the API key, API secret, access token, and access token secret. We'll need these to authenticate from our auto-posting python script.

Write script to tweet
We'll use the Twython package to authenticate and serve as an interface to twitter.

A bare bones script to perform this task looks like this:

#!/usr/bin/python

import subprocess
import random
import re
import os
from twython import Twython

twitter = Twython("YOUR API KEY",
                  "YOUR API SECRET",
                  "YOUR ACCESS TOKEN",
                  "YOUR ACCESS TOKEN SECRET")

def output_tweet(text):
    twitter.update_status(status=text)
    os._exit(0)

lengthoftweet = 999
# string to build tweet
tweet = ""

while lengthoftweet > 140:
    tweet = ""
    all_names = [wine.rstrip() for wine
                  in open("ABSOLUTE PATH TO WINE LIST FILE").read().split("\n")]
    wine_name = random.choice(all_names)
    tweet += wine_name + ": "
    rev = subprocess.check_output("ABSOLUTE PATH TO REVIEW GENERATOR")
    # we don't want a review to imply that the wine is either
    # red or white
    if "red" in rev:
        continue
    if "white" in rev:
        continue
    if len(tweet + rev.rstrip()) < 141:
        output_tweet(tweet + rev.rstrip())
    # if it is too long, try to use the first
    # sentence only (using regex)
    rev = re.search("(.+?\.).*", rev).group(1)
    tweet += rev.rstrip()
    lengthoftweet = len(tweet)
    
output_tweet(tweet)
os._exit(0)

A lot of the code is to ensure that the tweet doesn't exceed 140 characters. You might want to add a logging feature to the script–it will help with debugging the cron job.

Automate it with a cron job
If you're on a Unix system, you can set up this script to run at specified times automatically using a cron job. Windows users can use Windows Task Scheduler, but I've never used it so you're on your own.

Cron jobs are notoriously hard to debug. The #1 problem you'll encounter is not using absolute paths. When cron calls your script, the working directory is not where the script resides (unlike when you call the script on the command-line from the same directory). Because of this, any file I/O in the script that uses relative paths will fail (quietly, if you don't add logging to the script).
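One way to sidestep this (a sketch, with a hypothetical data file name) is to resolve paths relative to the script's own location instead of relying on the working directory:

#!/usr/bin/python
# resolve data files relative to the script itself so that the cron job's
# working directory doesn't matter
import os

HERE = os.path.dirname(os.path.abspath(__file__))
wine_list = os.path.join(HERE, "wines.txt")  # hypothetical file name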

The #2 problem you may encounter with cron jobs is that, because the job runs as a detached process outside your login environment, the shell it executes from may not be the one you normally use. Furthermore, it may not have the directories you need in its PATH, or any other environment variables that you depend on.
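A quick way to see what environment your cron job actually gets (a sketch; the log path is arbitrary) is to have the script dump it somewhere readable:

#!/usr/bin/python
# write out the environment the cron job runs with, for debugging
import os

with open("/tmp/cron_env.log", "w") as f:
    for key, value in sorted(os.environ.items()):
        f.write("{0}={1}\n".format(key, value))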

The third problem that often occurs in botched cron jobs is bad permissions. Make sure you have all the correct permissions (e.g. chmod +x YOUR_SCRIPT.py).

One final problem that I encountered was not putting an empty line after the cron entry–learn from my mistakes!

To add a cron entry, execute

crontab -e

in the shell.

In the editor, I wrote the entry like this:
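An entry of roughly this shape (the paths are placeholders) does the job:

00 9,14,21 * * * python /PATH/TO/TWEET_SCRIPT.py >> /PATH/TO/CRON.log 2>&1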

Your paths will obviously depend on what you are running and from where. Make sure you have an empty line after the entry!

To briefly explain this entry…

The first two numbers (00) specify the minute (0-59) on which to run the job. I chose to run it on the first (zeroth) minute of the hour. The second section (9,14,21) specifies the hours and tells cron to run the job at 9:00 am, 2:00 pm, and 9:00 pm. The next three sections (the asterisks) specify "day of the month" (1-31), "month of the year" (1-12), and "day of the week" (0-7), respectively. The asterisks indicate that this is to run every day of the month, every month of the year, and every day of the week.

The next two strings instruct cron to run the script (which was explained in my last post), and the rest of the line redirects any output from logging to a text file called CRON.log

First endnote: Why @HorseWineReviews?
The name pays homage to the infamous twitter bot @Horse_ebooks that (used to) post (unintentionally hilarious) context-free excerpts and Markov chain clumps from books about horses in a (successful) effort to avoid looking like a spam account whilst occasionally tweeting links to promote the sales of e-books.

Final endnote
While the task of tweeting fake wine reviews has already been taken, if you are looking for ideas for a twitter bot of your own, dear reader, you might want to explore the following ideas that I think would be a hit:

  • Train a Markov chain on equal parts famous philosophical works and vacuous and decidedly un-philosophical ramblings (a la KimKierkegaardashian and Kantye West); there's a toy Markov chain sketch after this list. Philosophical corpora can be grabbed from Project Gutenberg. Vacuous babble can probably be obtained from choice subreddits or most of the trending hashtags on twitter.
  • Web-scrape a huge corpus of episode descriptions from various television shows, train a Markov chain on them and let it loose on Twitter. You can get episode descriptions from tvrage.com (example)
  • Train a Markov chain on the abstracts of academic papers.
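If you want to tinker with the Markov chain idea itself, here is a minimal word-level sketch (a toy, not the review generator from the last post) to get you started:

#!/usr/bin/python
# a toy word-level Markov chain: train on any text, then generate babble
import random
from collections import defaultdict

def train(text, order=2):
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=30):
    state = random.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(state):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# usage (hypothetical corpus file): print(babble(train(open("corpus.txt").read())))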

You’re welcome :)


Visualizing data analysis pipelines using NetworkX

In complicated data analysis pipelines and scientific workflows, it's often difficult to keep track of which tasks have to be performed before others. Even with informal forms of documentation (my personal favorite is 'notes.txt'), as the size of a project grows, and more dependencies are introduced, a formal documentation process has to be put in place, or else the project will become unsustainable.

I'm writing an automated system for statisticians and scientists for carrying out large multistep analytics processes. I'll discuss this more in later posts, but the details pertinent to this post are that each step of an analytics pipeline is detailed in a YAML document called a "Sakefile" (a not-so-clever play on Makefile) with sections explicitly defining dependencies and resulting output files.

Given that dependency resolution uses concepts from graph theory (topological sorting), I thought it would be easy and neat to write a tool to visualize the components and dependencies that go into an analytics workflow as a directed graph.

I've rustled up a simple example examining correlates of DUI arrests with various adolescent-related data by state. I chose these data sets because they’re very small and freely available on the net.

The "Sakefile" looks like this:

---
format dui stats:
    help: format raw (copy and pasted) dui/state data using perl
    dependencies:
        - rawdata.txt
    formula: >
        perl -pe 's/^(\D+)\s+([\d,]+)\s+([\d,]+)\s*/\1\t\2\t\3\n/'
        rawdata.txt | sed 's/,//g' > duistats.tsv;
    output:
        - duistats.tsv

fetch teen stats:
    help: fetches various teen statistics from the web
    # no dependencies
    formula: >
        curl -o teenstats.xls http://mathforum.org/workshops/sum96/data.collections/datalibrary/US_TeenStats.XL.zip.xls;
    output:
        - teenstats.xls

convert teen stats to csv:
    help: uses gnumeric's ssconvert to convert ugly xls to csv and cleans it
    dependencies:
        - teenstats.xls
    formula: >
        ssconvert teenstats.xls messyteenstats.csv;
        cat <(echo -n "state") <(< messyteenstats.csv sed '55,$d' |
        sed '1,2d') | sed 's/,,/,/g' > teenstats.csv;
        rm messyteenstats.csv;
    output:
        - teenstats.csv

find correlates:
    help: calls R script that finds correlates of DUI arrests in various teen statistics
    dependencies:
        - duistats.tsv
        - teenstats.csv
    formula: >
        ./dui-correlates.R
    output:
        - corrogram.png
        - table.csv

all:
    - format dui stats
    - fetch teen stats
    - convert teen stats to csv
    - find correlates
...

A short description of each of the steps appears in the "help" field of each entry. Basically, there are two source data files: one exists as raw text copied and pasted from a website, and the other is fetched from the web using curl. The former is cleaned and formatted using perl and sed; the latter has to go through a process that converts the downloaded Excel file into a CSV and strips useless lines. Both of these source data files then get read by an R script which, ultimately, outputs a corrogram graphic and a summarization table.

Below is the small python program that parses the "Sakefile" and creates the visualization. It uses the great NetworkX module to build the graph and render it as an image.

#!/usr/bin/env python -tt

import matplotlib.pyplot as plt
import networkx as nx
import yaml

sakefile = yaml.load(open("Sakefile.yaml").read())

G = nx.DiGraph()

def check_for_dep_in_outputs(dep):
    print "checking dep {}".format(dep)
    ret_list = []
    for node in G.nodes(data=True):
        if "output" not in node[1]:
            continue
        if dep in node[1]['output']:
            ret_list.append(node[0])
    return ret_list

# make graph nodes for each target
for target in sakefile:
    if target == "all":
        # we don't want this node
        continue
    G.add_node(target, sakefile[target])


for node in G.nodes(data=True):
    print "checking node {} for dependencies".format(node[0])
    if "dependencies" not in node[1]:
        continue
    print "it has dependencies"
    connects = []
    for dep in node[1]['dependencies']:
        matches = check_for_dep_in_outputs(dep)
        if not matches:
            continue
        for match in matches:
            connects.append(match)
    if connects:
        for connect in connects:
            G.add_edge(connect, node[0])


nx.draw(G, node_color="pink", node_size=10000)
plt.savefig("dependency-visualization.png")

The resulting visualization looks like this:

[Figure: dependency visualization of the Sakefile targets]

Sure, the arrows look weird and this is a really simple example, but it's easy to see that, even for the most byzantine of pipelines, a visualization like this can really help you get a sense of all the actions involved in a workflow.
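As a small aside, the same graph hands you a valid execution order for the pipeline; appending a couple of lines like these to the script above (reusing the G built there) prints one:

# a topological sort gives an order in which every target's
# dependencies are built before the target itself
order = nx.topological_sort(G)
print "a valid run order: " + " -> ".join(order)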

I'll go over the actual running and results of this example in a later post, when I get the "sake" system working properly. :)


Compiling R from source and why you shouldn't do it

I've always thought that it's silly, in most cases, to compile from source software that's already available in binary form. To the end of making more binary packages available to Mac users, I just started contributing to a project that is creating a repository of 64-bit builds of the over 12,000 packages in pkgsrc (NetBSD's portable package manager). This means having to get my hands dirty compiling packages myself. After contributing Vim, the next logical thing for me is to provide an R build.

Compiling R from source (again and again) has been tremendously enlightening for me. Not only do I feel like I understand a lot more about R's internals, but I've also come to the conclusion that, if CRAN provides a binary build for your system, you should never really compile R yourself. This most definitely includes Mac users.

Before I go into how to build it, let’s explore some of the reasons someone might want to build R themselves and why, in most cases, this is unnecessary.

  • I want a faster R.
  • It’s sometimes assumed that if you build something from source yourself, it’s customized to your particular system and, therefore, runs faster. In practice this requires a lot of intervention (and heartache) at the configuration step of the compilation process. In the case of R on OS X, no amount of compiler optimization and configuration (using the stock linear algebra libraries) I’ve attempted was able to outperform R from CRAN. You don’t know R better than the R Core Team, and they know what’s good for you. Just use theirs.

  • I can compile against other linear algebra libraries and get a speedup that way.
  • You don’t need to compile R against these other libraries in order to use them. I’ll go into how you can use them from your current R installation in another post.

  • I’m on a system for which there is no binary available.
  • Yikes! You're probably used to heartache. You don't have a choice but to build R yourself. Have a ball!

  • I just want to.
  • As I’ve discovered, it is a great way to learn more about R’s internals. If you fancy yourself an R ‘guru’ and want to build R yourself, I can’t really blame you—so long as you don’t use your likely botched build in a production environment.

  • I’m a gentoo user.
  • I’m so sorry.

  • I’m a Windows user and a masochist.
  • Compiling R is an excellent choice. The safe word is “GNU”.

  • I’m helping to build a repo of 64 bit binaries for pkgsrc or I’m writing a blog post about compiling R.
  • You’re exempt from criticism or ridicule.

If at this point, you’re still interested in compiling R, in spite of my attesting to it being, for most cases, completely unnecessary, please read on. I also strongly recommend that you read the following guide from CRAN.

Dependencies
Users of most GNU/Linux systems can build the dependencies necessary by running:

sudo apt-get build-dep r-base-dev

or the equivalent command for your system.

On OS X you need

  • Xcode and Xcode command-line tools: Xcode is available from the App Store. The command-line tools have to be downloaded separately from the ‘Preferences’ menu.
  • gfortran: or another compliant Fortran compiler. You need this chiefly to compile the linear algebra libraries.
  • Java: You can grab the Java for OS X developers package from the Apple Developers page or grab another JDK. You need this for the JNI headers.
  • XQuartz: This includes the X11 headers and cairo.
  • MacTeX: This isn't strictly necessary, but you will need it to generate R's PDF documentation. If you don't want to download this over-2-GB package, there are other recourses available. If you want this package, you have to add "/usr/texbin" to your PATH environment variable. Yay, now you have LaTeX!

Other dependencies are unnecessary because the R source ships with fallback versions of them. These include pcre, zlib, xdr, and a few others. Still other dependencies will be present on any POSIX-compliant system.

Configuration and build
After downloading the source here, you have a few decisions to make. The first is where you want to install R. You don't have to install R anywhere per se, because it can be run straight from the build directory; you can just place the R script (which has the prefix hardcoded), found in the bin subdirectory, anywhere on your PATH. If you do not specify the prefix, it will default to the build directory.

It’s customary to set your prefix for user compiled software to /usr/local, so that’s what we’ll do here.

The other decisions that have to be made are very platform/system specific. You can see all the configuration options by running

./configure --help

The auto-configuration is very good at setting sane defaults for most of these options. For example, if you're building on OS X, it will by default build R as a framework and shared library, which you would need if you want to use R.app (R.app itself is a separate install).

On OS X, I ran my pre-configuration and configuration thusly:

export CC="clang"
export CXX="clang"
export F77="gfortran-4.2 -arch x86_64"
export FC=$F77
export OBJC="clang"
./configure -prefix=/usr/local

Assuming everything goes well, you can now start building with

make

If it successfully builds, you can install R to the prefix with

make install

Now you have R.

If you're on a Mac, you may have noticed that you have a crippled R install. This is for a few reasons.

  • The binary from CRAN comes with R.app. If you want that, you have to build that yourself.
  • You can no longer download binary builds of your favorite R packages. They have to be built from source now.

As an R user on a Mac, you then realize how good you’ve had it. The binary build from CRAN comes with R.app, a fast R framework, and it installs binary R packages by default. Now you no longer have those options.

Additionally, dear Mac-user, you also have the benefit of using RStudio’s new Cocoa interface. Count your lucky stars, install CRAN’s binary build, and read my next post about how to switch out the linear algebra libraries that R uses for a few other faster alternatives.


qstats - quick and dirty statistics tool for the Unix pipeline

Back when 200MB hard drives were the size of washing machines and programs had no choice but to be as efficient as possible, Unix was born. In a serendipitous twist of fate, the same programs that were born of this era of 4MB RAM and 16-bit processors are useful to data analysts with 2,000 times the RAM and 64-bit multicore processors, processing data files several gigabytes large.

Like all good things, Unix was started at Bell Labs in the late 60s. It has been honed over 40 years and now runs, if not on your computer, on the vast majority of the web servers you visit, a lot of phones, embedded devices you use, and a toaster near you.

Unices?

Since nearly everything in Unix is a text file, it grew up to be… very good at processing text. This is why Unix tools are a great addition to the data analyst's toolbox. There are a few great posts on how to get started using these tools in your workflow (here, here, here, and here) which you should read. By the way, when I talk about tools here, I'm talking about pipeline-able tools that take raw text from standard input, like sed, awk, and grep, not perl, tcl, or python.

There are tools to select columns, filter text for regular expressions, join files on a key, and reshape arrays, but I felt like there was one that was missing. After chaining tool after tool together and finally cajoling the data into a format and subset that I want to process and explore, I'd have to redirect the stream to a text file and read it from R. Clearly, if I’m to perform some complicated machine learning algorithm with this data, this is the best way to go. But if I just want to take a peek at the spread of the data, or quickly compare means, this is overkill.

Introducing qstats

Inspired by this gap in the Unix toolchain, I wrote a tool, qstats, that computes simple summary statistics from the command-line. It also includes data-binning and simple bar chart functionality. I designed it, in C, specifically to be as fast as possible, and bare-bones enough to work on any POSIX-compliant system without having to deal with outside dependencies. Let’s see it in action…

Functionality
By default, qstats will print R-like summary statistics on the given data. This includes the minimum value, the 1st quartile, the median, the mean, the third quartile, the maximum value, the range, and the standard deviation. You can use the -m flag to just get the mean. This will be faster because the data does not have to be sorted.

In addition to these statistics, qstats can also produce a frequency tabulation with an arbitrary number of "bins". Calling qstats with the -f10 flag will create 10 equal intervals and -f20 will create 20. Just calling it with -f will use Sturges' rule (k = ⌈log2(n)⌉ + 1 bins for n data points), which comes up with a reasonable number of bins in most cases.
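For the curious, Sturges' rule is simple enough to compute yourself (a sketch of the rule, not qstats' actual source):

#!/usr/bin/python
# Sturges' rule: k = ceil(log2(n)) + 1 bins for n observations
import math

def sturges_bins(n):
    return int(math.ceil(math.log(n, 2))) + 1

print(sturges_bins(1000000))  # 21 bins for a million data points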

Finally, with the -b flag, qstats will output a histogram-like horizontal bar-chart. Much like with the -f flag, you can supply the number of intervals to create. We will see an example of the bar-chart at work in the next section.

Rudimentary spread visualization

To view the spread with a bar-chart, let's output samplings from two distributions, the normal and the chi-square...

# one million values sampled from a normal distribution with a mean of 100 and a standard deviation of 10
millnorm <- rnorm(1000000, mean=100, sd=10)
write.table(millnorm, "millnorm.dat", col.names=FALSE, row.names=FALSE)

# one million values sampled from the chi-square distribution with two degrees of freedom
millchi <- rchisq(1000000, df=2)
write.table(millchi, "millchi.dat", col.names=FALSE, row.names=FALSE)

Normal distribution

[Figure: visualization of the normal distribution sample]

Chi-square distribution

[Figure: visualization of the chi-square distribution sample (two degrees of freedom)]

Speed comparisons
Let’s create a file of 100,000,000 floating point numbers to test speeds with R…

# sample from normal distribution with a mean of 100 and a standard deviation of 10
one.h.m <- rnorm(100000000, mean=100, sd=10)
write.table(one.h.m, "one_hundred_million.dat", row.names=FALSE, col.names=FALSE)

The resulting file is 1.7 GB large.

  • R
    The R script that we’ll time will look like this…

    #!/usr/bin/Rscript --vanilla
    frame <- scan("one_hundred_million.dat")
    summary(frame)
    

    and the timing...

    $ time ./rtest.R
    Read 100000000 items
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      44.95   93.26  100.00  100.00  106.70  157.00 
    ./rtest.R  210.66s user 3.57s system 99% cpu 3:35.08 total
    

    3.5 minutes

  • Awk

    $ time awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat
    100.001
    awk '{ x+=$1; next } END { print x/NR }' one_hundred_milion.dat  128.34s user 0.56s system 99% cpu 2:09.01 total
    

    2 minutes.

    Note that this only computes the mean, not any of the other summary statistics. Some of these require sorting, which takes more time.

  • sort command

    $ time sort -n one_hundred_milion.dat > /dev/null
    sort -n one_hundred_milion.dat > /dev/null  151.89s user 3.46s system 99% cpu 2:35.72 total
    

    2.5 minutes

  • qstats

    $ time qstats one_hundred_milion.dat
    Min.     44.947
    1st Qu.  93.2553
    Median   100.001
    Mean     100.001
    3rd Qu.  106.747
    Max.     156.997
    Range    112.05
    Std Dev. 10.0002
    Length   100000000
    qstats one_hundred_milion.dat  53.62s user 1.04s system 99% cpu 54.722 total
    

    a little less than a minute

I show these comparisons here not to compare this small program with these great, veteran tools. Instead, I just want to underscore the point that smaller, very-few-trick-pony, specialized programs can afford to be faster than their more capable and robust counterparts. When these small tools will do the trick, they can not only be faster and simpler to use, but they also comport more with the Rule of Parsimony from the Unix philosophy.

Final words
The source code for this project is on Github with installation instructions. You can also download and install from this tarball. I've tested it on OS X, Debian, and NetBSD, but it should compile without any issue on any POSIX system with a reasonably recent C compiler. Please let me know if there are any installation issues for your system.

Please fork me and feel free to send a pull request or add an issue to the repo. I hope you like it!


The state of package management on Mac OS X

It's that time again; I suspect that Mavericks will be released in the next few weeks, so I get the once-every-year-(or-so) chance to experiment and modify the hell out of my OS X installation because I'll just do a fresh install soon anyway. This time around I'm experimenting with package managers.

I've actually tried really hard to avoid ever having to use them. I started using Slackware in high school, and after some brief experimentation (in college) with Ubuntu, I took up OS X as my main OS. But, since building from source code is somewhat of a nightmare on a Mac--at least compared to what I was used to--I started to look into package management solutions.

The terrain was difficult to navigate. It seemed like people had some really strong opinions on which one was the best and which ones were on their way out. Since I didn't know who to believe, I just stuck to manual building. But, since I'm going to get a tabula rasa in a few weeks, I thought I'd take this opportunity to document my exploration of the terrain and present my findings in the most impartial manner that I'm capable of.

Before I start, I want to make a few things clear. (1) There is some disagreement on what actually constitutes a package manager. Here, I'm referring broadly to any centralized software installation framework that tracks or resolves dependencies, whether it builds from source or not. (2) I haven't had the time to become an expert on all of the managers I audited, so keep that in mind. (3) Not only are all of these package managers open source, but many of them have robust configuration options, so I'll be talking mostly about default behavior from the perspective of a new user.

If my old editions of O'Reilly books discussing Mac software are any indication, MacPorts and Fink were the two best options available. Then Homebrew came on the scene and a lot of people seem to be raving about it. I started off with the intention of only trying out these three but in the course of my research, I learned about two others that I wanted to give a chance.

To see a table summary of my findings, you can just scroll down to the end of this post.

Rudix
Rudix is a binary-only package manager that attempts a "hassle-free" way of getting Unix programs on a Mac. It doesn't have many packages available yet, but it has no trouble at all installing and uninstalling the ones that it does offer. For example, their 'Go' installation was the most painless installation of a language that I've ever experienced. My complaints are that (a) the binaries go directly to /usr/bin, so they are not sandboxed, and (b) the man files for these tools were not installed with the binaries.

MacPorts
MacPorts was one of the most recommended package management solutions that I came across in my research. It also probably attracted the most flak. It was built with the likeness of FreeBSD's Ports system, so it's a source building manager. What I liked about MacPorts was the fact that the installation was painless (it updated my PATH for me!), the compiled binaries were sandboxed in /opt/local, and the wealth of packages available was hard not to love.

An interesting thing about MacPorts is that it eschews Apple-supplied libraries and links sources against its own. A benefit of this is that it can ensure a consistent experience across OS X versions and whatever whimsical decisions Apple may choose to make in the future. The drawback to this approach is that building what appears, prima facie, to be a small package may require an extraordinarily large amount of huge programs and libraries to be built as dependencies.

Fink
Fink is modeled after Debian's dpkg and apt-get. Having used Debian-based distros in the past, I was excited to see what Fink had to offer. Like apt-get, Fink can install binaries or build from source. What wasn't like apt-get was that a completely different command ("fink") is used to build from source than to install the binaries. This was somewhat confusing. Furthermore, there is no binary installer for 10.6 to 10.8, so installation was a bit harrowing. Once it was installed, though, and I got used to the separate commands and their differences from apt-get, I was pleased that my PATH was automatically updated and that the installed binaries were appropriately sandboxed.

Homebrew
Like I mentioned above, a lot of people are really excited about Homebrew. It is being developed with the intention of correcting (what it perceives to be) MacPorts' shortcomings. From what I can tell, it tries really hard to work with OS X's existing frameworks and libraries. For this reason, Homebrew is probably a good choice for someone who is using it to install the occasional tool on a single-user system.

A neat thing about Homebrew is that it is written very simply in ruby. Its "recipes" to install packages are easy-to-read ruby scripts. They are also very easy to modify and the community encourages upstream development.

Something not-so-neat about Homebrew is that it is publicly antagonistic towards MacPorts. This is probably something that only I care about, though.

pkgsrc/pkgin
Again, I started with the intention of only auditing Fink, Homebrew and MacPorts. When I learned about pkgsrc, I thought that it was too obscure to be a serious contender and I was considering not looking into it further. I am so glad that, for completeness' sake, I decided to try it out because I virtually have only good things to say about it.

pkgsrc started as NetBSD's package management solution. Given NetBSD's dedication to portability, it is perhaps not a surprise that their package manager would attempt to follow suit. It has now been adapted for use on over a dozen different operating systems. Among these are AIX, Solaris, HP-UX, GNU/Linux, Windows (via Cygwin and Interix) and, of course, OS X. It is the default manager on DragonflyBSD and was even the default manager on a now-discontinued GNU/Linux distro, Bluewall Linux. It is similar to (and, indeed, was forked from) FreeBSD's ports system.

I don't think many Mac power-users know that this is an option for them which is a shame because it turned out to be my favorite. After following some fairly simple steps, a mature and sophisticated package manager with over 8,000 packages is at your disposal.

Probably the best thing about pkgsrc from the perspective of Mac users is a tool called pkgin. It's an apt-like tool for installing binaries from pkgsrc. Installing strange Unix tools on OS X *cannot* be easier.

The only caveat I should mention is that I haven't tested installing Python with it, because I'm still too far away from Mavericks to risk botching my environment that badly. I suspect that it would cause issues because pkgsrc, being a NetBSD project, can't be as aware of OS X framework idiosyncrasies as a Mac-specific package manager can.

I'd like to write more on this topic, but this post is getting unwieldy. I plan to talk more about pkgsrc and OS X in another post but, for this one, I'll conclude with the "too-long-didn't-read" version of my journey through package-manager-land.

Category-by-category comparison of Rudix, MacPorts, Fink, Homebrew, and pkgsrc/pkgin:

Homepage
  Rudix: rudix.org
  MacPorts: MacPorts.org
  Fink: fink.thetis.ig42.org
  Homebrew: brew.sh
  pkgsrc/pkgin: pkgsrc.org and pkgin.net

Twitter
  Rudix: @rudix4mac (updates often)
  MacPorts: @macports (last tweet in July)
  Fink: @finkmac (hasn't had an update since 2010)
  Homebrew: @machomebrew (very active)
  pkgsrc/pkgin: @pkgsrc (last tweet in September)

Year project started
  Rudix: 2005
  MacPorts: 2002
  Fink: 2001
  Homebrew: 2009
  pkgsrc/pkgin: Support for Darwin added in 2001

Number of packages
  Rudix: 488 (but `rudix available | wc -l` says 351)
  MacPorts: 17,680 (but `port list | wc -l` says 17,686)
  Fink: 7,951 (but `apt-cache search . | wc -l` says 209 stable binary .debs)
  Homebrew: 2,498 (`brew search | wc -l` says 2,591; not counting various extra "taps")
  pkgsrc/pkgin: 8,884 binaries for OS X (according to `pkgin available | wc -l`)

Source/binary/both?
  Rudix: Binary only
  MacPorts: Traditionally only source
  Fink: Option for both
  Homebrew: Source, but also binaries through "bottles"
  pkgsrc/pkgin: Both; traditional pkgsrc will do both, but using only pkgin will grab the binaries

Language written in
  Rudix: Python
  MacPorts: Tcl
  Fink: Perl (front-end)
  Homebrew: Ruby
  pkgsrc/pkgin: C

License
  Rudix: BSD
  MacPorts: BSD
  Fink: GPL :(
  Homebrew: BSD
  pkgsrc/pkgin: BSD

GUI options
  Rudix: Not really... but there's an internet package browsing option
  MacPorts: Currently three
  Fink: Two: Fink Commander and Phynchronicity
  Homebrew: Nope, but online package browser at Braumeister.org
  pkgsrc/pkgin: Online package browser at pkgsrc.se but none others that I can find

Default prefix
  Rudix: Directly to /usr/local
  MacPorts: /opt/local
  Fink: /sw
  Homebrew: /usr/local/Cellar (programs symlink to /usr/local/bin)
  pkgsrc/pkgin: /usr/pkg

PowerPC support
  Rudix: Not anymore
  MacPorts: Yes, because it is built from source
  Fink: Yes
  Homebrew: Not traditionally, but there are forks available that might provide this functionality
  pkgsrc/pkgin: Not unless you build from source

Latest GCC available
  Rudix: Not available
  MacPorts: 4.8.1
  Fink: 4.8
  Homebrew: 4.9
  pkgsrc/pkgin: No binary available, but pkgsrc has 4.8

Python stuff
  Rudix: Not available
  MacPorts: Py27 and 33 and a lot of great packages
  Fink: Py23 and 33 and a lot of great packages
  Homebrew: Py27 and 33; I couldn't find any packages, but the python installs pip and easy_install
  pkgsrc/pkgin: Py27 and 33 and a lot of great packages (see warning above)

Installation of package manager
  Rudix: Very easy and fast
  MacPorts: Very easy and fast
  Fink: Nightmarish (no binary installer for 10.6 - 10.8)
  Homebrew: Easy as pie
  pkgsrc/pkgin: Very easy and fast with these instructions

Uninstallation of package manager
  Rudix: Easy and painless
  MacPorts: Hell-ish
  Fink: Very easy and fast
  Homebrew: Relatively easy if you follow this gist: https://gist.github.com/mxcl/1173223
  pkgsrc/pkgin: Not sure, probably just rm -rf-ing the /usr/pkg and /usr/pkgsrc directories

Installation of packages
  Rudix: Extremely easy
  MacPorts: Slow, since it builds from source
  Fink: The source builds are understandably slow, but the binaries are (obviously) quick
  Homebrew: Source compilation is obviously slow; I've had some linking issues sometimes
  pkgsrc/pkgin: Trivially easy

Uninstallation of packages
  Rudix: Easy and painless
  MacPorts: Easy
  Fink: Easy and fast
  Homebrew: Very easy
  pkgsrc/pkgin: Trivially easy

Community support
  Rudix: Not very much is required
  MacPorts: Great
  Fink: Not so great
  Homebrew: Very, very good
  pkgsrc/pkgin: A few websites have some great documentation, but OS X-specific info is sometimes hard to find

Development
  Rudix: Git; primarily led by one person; 5 contributors
  MacPorts: Subversion; very happening; many, many developers
  Fink: Git; 14 GitHub contributors; commits are infrequent
  Homebrew: Git; most vibrant, with over 3,000 contributors; "recipes" for compilation are easily modified and you are encouraged to submit pull requests; this project is very easy to contribute to
  pkgsrc/pkgin: pkgsrc is CVS, pkgin is Git; pkgsrc is well backed by the NetBSD Foundation