Kickin' it with elastic net regression

With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the L2 norm (Euclidean length) of the coefficient vector, which results in "shrinking" the beta coefficients. The aggressiveness of the penalty is controlled by a parameter $\lambda$.

Lasso regression is a related regularization method. Instead of using the L2 norm, though, it penalizes the L1 norm (Manhattan length) of the coefficient vector. Because it uses the L1 norm, some of the coefficients will shrink to exactly zero as lambda increases. A similar effect would be achieved in Bayesian linear regression by using a Laplace prior (strongly peaked at zero) on each of the beta coefficients.

Because some of the coefficients shrink to zero, the lasso doubles as a crackerjack feature selection technique in addition to a solid shrinkage method. This property gives it a leg up on ridge regression. On the other hand, the lasso will occasionally achieve poor results when there's a high degree of collinearity in the features, and ridge regression will perform better. Further, the lasso solution is underdetermined when the number of predictors exceeds the number of observations, while ridge regression can handle this case.

Elastic net regression is a hybrid approach that blends both penalization of the L2 and L1 norms. Specifically, elastic net regression minimizes the following...

$\lVert y - X\beta \rVert^2 + \lambda[(1-\alpha)\lVert \beta \rVert_2^2 + \alpha\lVert \beta \rVert_1]$

The $\alpha$ hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used (0 is pure ridge, 1 is pure lasso).

The usual approach to optimizing the lambda hyper-parameter is through cross-validation—by minimizing the cross-validated mean squared prediction error—but in elastic net regression, the optimal lambda is also heavily dependent on the alpha hyper-parameter (hyper-hyper-parameter?).

This blog post takes a cross-validated approach that uses grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter, for three different data sets. I also compare the performance against stepwise regression and showcase some of the dangers of using stepwise feature selection.
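Concretely, the grid search looks something like the following sketch (the `x` predictor matrix and `y` response vector here are placeholders; the actual code is in the GitHub Gist linked at the end of this post):

library(glmnet)

# try a grid of alphas; for each one, let cv.glmnet find the best lambda
# and record the cross-validated MSE at that lambda
alphas <- seq(0, 1, by = 0.05)
cv.error <- sapply(alphas, function(a){
    cvfit <- cv.glmnet(x, y, alpha = a, nfolds = nrow(x))   # nfolds = n gives LOOCV
    min(cvfit$cvm)                                          # CV MSE at the best lambda
})
best.alpha <- alphas[which.min(cv.error)]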

mtcars

In this example, I try to predict “miles per gallon” from the other available attributes. The design matrix has 32 observations and 10 predictors and there is a high degree of collinearity (as measured by the variance inflation factors).

The left panel above shows the leave-one-out cross-validation (LOOCV) mean squared error of the model with the optimal lambda (as determined, again, by LOOCV) for each alpha parameter from 0 to 1. This panel indicates that if our objective is purely to minimize MSE (with no regard for model complexity), then pure ridge regression outperforms any blended elastic-net model. This is probably because of the substantial collinearity. Interestingly, the lasso outperforms blended elastic net models that weight the lasso heavily.

The right panel puts things in perspective by plotting the LOOCV MSEs along with the MSE of the "kitchen sink" regression (the blue line) that includes all features in the model. As you can see, any degree of regularization offers a substantial improvement in model generalizability.

The right panel also plots two estimates of the MSE for models that blindly use the coefficients from automated bi-directional stepwise regression. The first estimate uses the features selected by performing the stepwise procedure on the whole dataset and then assesses the model's performance (the red line). The second estimate runs the step procedure, and uses the resulting features, on only the training set for each fold of the cross-validation. This is the estimate without the subtle but treacherous "knowledge leaking" eloquently described in this blog post, and it should be considered the more correct assessment of the model. As you can see, if we weren't careful about interpreting the stepwise regression, we would have gotten an incredibly inflated and inaccurate view of the model's performance.
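To make the distinction concrete, the leak-free estimate re-runs the selection inside every fold; a minimal sketch (with a hypothetical data frame `dat` whose response column is `mpg`—not the exact code from the Gist) looks like this:

loocv.mse <- sapply(1:nrow(dat), function(i){
    train <- dat[-i, ]
    test  <- dat[i, , drop = FALSE]
    # stepwise selection is performed on the training fold only, so no
    # information about the held-out row leaks into the feature selection
    fit <- step(lm(mpg ~ ., data = train), direction = "both", trace = 0)
    (predict(fit, newdata = test) - test$mpg)^2
})
mean(loocv.mse)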

Forest Fires

The second example uses a very-difficult-to-model dataset from the University of California, Irvine Machine Learning Repository. The task is to predict the burned area of a forest fire given 11 predictors. The dataset has 517 observations and, further, a relatively low degree of collinearity between predictors.

Again, the highest performing model is the pure ridge regression. This time, the performance asymptotes as the alpha hyper-parameter increases. The variability in the MSE estimates is due to the fact that I didn't use LOOCV here and used 400-fold CV instead, because I'm impatient.

As with the last example, the properly measured stepwise regression performance isn't so great, and the kitchen sink model outperforms it. However, in contrast to the previous example, there was a lot less variability in the selected features across folds—this is probably because of the significantly larger number of observations.

"QuickStartExample"

This dataset is a contrived one that is included with the excellent glmnet package (the one I'm using for the elastic net regression). This dataset has a relatively low degree of collinearity, has 20 features and 100 observations. I have no idea how the package authors created this dataset.

Finally, an example where the lasso outperforms ridge regression! I think this is because the dataset was specifically manufactured to have a small number of genuine predictors with large effects (as opposed to many weak predictors).

Interestingly, stepwise regression far outperforms both—probably for the very same reason. From fold to fold, there was virtually no variation in the features that the stepwise method automatically chose.

Conclusion

So, there you have it. Elastic net regression is awesome because it can perform, at worst, as well as the lasso or ridge and—though it didn't in these examples—can sometimes substantially outperform both.

Also, be careful with step-wise feature selection!

PS: If, for some reason, you are interested in the R code I used to run these simulations, you can find it on this GitHub Gist.

Lessons learned in high-performance R

On this blog, I've had a long-running investigation/demonstration of how to make an "embarrassingly parallel" but computationally intractable (on commodity hardware, at least) R problem more performant by using parallel computation and Rcpp.

The example problem is to find the mean distance between every pair of airports in the United States. This silly example was chosen because its running time grows quadratically with the number of airports and, thus, it quickly becomes intractable without sampling. It is also easy to parallelize.

The first post used the (now-deprecated in favor of 'parallel') multicore package to achieve a substantial speedup. The second post used Rcpp to achieve a statistically significant but, functionally, trivial speedup by replacing the inner loop (the distance calculation using the Haversine formula) with a version written in C++ using Rcpp. Though I was disappointed in the results, it should be noted that porting the function to C++ took virtually no extra work.

By necessity, I've learned a lot more about high-performance R since writing those two posts (part of this is by trying to make my own R package as performant as possible). In particular, I did the Rcpp version all wrong, and I'd like to rectify that in this post. I also compare the running times of approaches that use both parallelism and Rcpp.

Lesson 1: use Rcpp correctly
The biggest lesson I learned is that it isn't sufficient to just replace inner loops with C++ code; the repeated transfer of data between R and C++ comes with a lot of overhead. By actually coding the whole loop in C++, the speedups to be had are often astounding.

In this example, the pure R version, which takes a matrix of longitude/latitude pairs and computes the mean distance between all combinations, looks like this...

just.R <- function(dframe){
    numrows <- nrow(dframe)
    combns <- combn(1:nrow(dframe), 2)
    numcombs <- ncol(combns)
    combns %>%
        {mapply(function(x, y){
            haversine(dframe[x,1], dframe[x,2],
                      dframe[y,1], dframe[y,2]) },
            .[1,], .[2,])} %>%
        sum %>%
        (function(x) x/(numrows*(numrows-1)/2))
}
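(The `haversine` function called above isn't shown in this excerpt; a minimal R version—assuming coordinates in decimal degrees and distances in kilometers—might look like the following.)

haversine <- function(long1, lat1, long2, lat2, radius = 6371){
    to.rad <- function(deg) deg * pi / 180          # degrees -> radians
    dlat <- to.rad(lat2 - lat1)
    dlon <- to.rad(long2 - long1)
    a <- sin(dlat/2)^2 + cos(to.rad(lat1)) * cos(to.rad(lat2)) * sin(dlon/2)^2
    2 * radius * asin(sqrt(a))                      # great-circle distance in km
}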


The naive usage of Rcpp (and the one I used in the second blog post on this topic) simply replaces the call to "haversine" with a call to "haversine_cpp", which is written in C++. Again, a small speedup was obtained, but it was functionally trivial.

The better solution is to completely replace the combinations/"mapply" construct with a C++ version. Mine looks like this...

double all_cpp(Rcpp::NumericMatrix& mat){
    int nrow = mat.nrow();
    int numcomps = nrow*(nrow-1)/2;
    double running_sum = 0;
    for( int i = 0; i < nrow; i++ ){
        for( int j = i+1; j < nrow; j++ ){
            running_sum += haversine_cpp(mat(i,0), mat(i,1),
                                         mat(j,0), mat(j,1));
        }
    }
    return running_sum / numcomps;
}


The difference is incredible…

res <- benchmark(R.calling.cpp.naive(air.locs[,-1]),
                 just.R(air.locs[,-1]),
                 all_cpp(as.matrix(air.locs[,-1])),
                 columns = c("test", "replications", "elapsed", "relative"),
                 order="relative", replications=10)
res
#                                   test replications elapsed relative
# 3  all_cpp(as.matrix(air.locs[, -1]))           10   0.021    1.000
# 1 R.calling.cpp.naive(air.locs[, -1])           10  14.419  686.619
# 2              just.R(air.locs[, -1])           10  15.068  717.524


The properly written solution in Rcpp is 718 times faster than the native R version and 687 times faster than the naive Rcpp solution (using 200 airports).

Lesson 2: Use mclapply/mcmapply
In the first blog post, I used a messy solution that explicitly called two parallel processes. I've since learned that using mclapply/mcmapply is a lot cleaner and easier to integrate into idiomatic/functional R routines. In order to parallelize the native R version above, all I had to do was replace the call to "mapply" with "mcmapply" and set the number of cores (now I have a 4-core machine!).
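The change is small enough to sketch here (assuming the same `haversine` helper and data frame as before; this isn't the exact benchmarked code):

library(parallel)

just.R.parallel <- function(dframe){
    numrows <- nrow(dframe)
    combns  <- combn(1:numrows, 2)
    # the only real change from just.R: mcmapply instead of mapply
    dists <- mcmapply(function(x, y){
                 haversine(dframe[x,1], dframe[x,2],
                           dframe[y,1], dframe[y,2]) },
             combns[1,], combns[2,], mc.cores = 4)
    sum(dists) / (numrows*(numrows-1)/2)
}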

Here are the benchmarks:

                                           test replications elapsed relative
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  10.433    1.000
4              just.R.parallel(air.locs[, -1])           10  11.809    1.132
1          R.calling.cpp.naive(air.locs[, -1])           10  15.855    1.520
3                       just.R(air.locs[, -1])           10  17.221    1.651


Lesson 3: Smelly combinations of Rcpp and parallelism are sometimes counterproductive

Because of the nature of the problem and the way I chose to solve it, the solution that uses Rcpp correctly is not easily parallelizable. I wrote some *extremely* smelly code that used explicit parallelism to run the proper Rcpp solution in a parallel fashion; the results were interesting:

                                          test replications elapsed relative
5           all_cpp(as.matrix(air.locs[, -1]))           10   0.023    1.000
4              just.R.parallel(air.locs[, -1])           10  11.515  500.652
6             all.cpp.parallel(air.locs[, -1])           10  14.027  609.870
2 R.calling.cpp.naive.parallel(air.locs[, -1])           10  17.580  764.348
1          R.calling.cpp.naive(air.locs[, -1])           10  21.215  922.391
3                       just.R(air.locs[, -1])           10  32.907 1430.739


The parallelized proper Rcpp solution (all.cpp.parallel) was outperformed by the parallelized native R version. Further, the parallelized native R version was much easier to write and was idiomatic R.

How does it scale?

Two quick things...

• The "all_cpp" solution doesn't appear to exhibit polynomial growth; it does, it's just so much faster than the rest that it looks completely flat
• It's hard to tell, but that's "just.R.parallel" that is tied with "R.calling.cpp.naive.parallel"

If you know C++, try using Rcpp (but correctly). If you don't, try multicore versions of lapply and mapply, if applicable, for great good. If it’s fast enough, leave well enough alone.

PS: I way overstated how "intractable" this problem is. According to my curve fitting, the vanilla R solution would take somewhere between 2.5 and 3.5 hours. The fastest version of these methods, the non-parallelized proper Rcpp one, took 9 seconds to run. In case you were wondering, the answer is 1,869.7 km (1,161 miles). The geometric mean might have been more meaningful in this case, though.

The hardest thing about teaching statistics

(Note: this post should probably be titled "Quantitative Methods of Curricula Planning" but I thought the current title would draw more interest–though they would both lose out to "These Weird Approaches To Lesson Planning Will Leave You Speechless")

Suppose you were tasked with teaching a course about a field of study. There would be, of course, several topics that you are expected to cover by the course end date; how would you decide the order in which to teach them?

Most people would say that the topics should build on one another, with monotonically increasing levels of difficulty. Further, no topic should be brought up that requires comprehension of another topic yet unlearned.

Planning the syllabus under these constraints would, perhaps, come naturally to skilled and empathetic lecturers. But,

• not all lecturers are skilled and empathetic
• even satisfying all of these constraints, there are objectively superior and inferior lesson plans
• there are some subjects for which these constraints cannot be satisfied (statistics)

For these reasons, having a suite of quantitative methods for choosing the best order of topics in teaching a field of study would be valuable to pedagogy (not to mention providing challenging problems for me to focus on instead of writing).

--

I started thinking about this topic as I began to plan writing my book about learning introductory statistics with R. There are, of course, myriad other very good books on this very topic, so I figured that one way I can stand out is to organize the topics in a way that best facilitates mastering the material. I thought that this would be especially appreciated with a field of study that is notoriously scary and difficult to the uninitiated (like statistics is.)

Anyone, anywhere, teaching introductory statistics will be expected to touch on the common topics: measures of central tendency, measures of dispersion, probability, the central limit theorem, sampling theory, etc… I know how everyone else has arranged the topics, but what's the best way?

It might seem strange, but answering that question was probably the hardest thing about putting together this book and in all of my (admittedly limited) experience designing statistics curricula.

Let's speak of graph theory

To explore optimal paths through the topics, we can represent the subject of statistics as a big graph, or network. Each topic would be a node and there would be directed edges indicating when knowledge of a particular topic is a prerequisite to understanding another. Specifically, if there is an edge connecting topic "a" to topic "b", topic "b" requires an understanding of "a"–like long division requires knowledge of subtraction.

This is what a topic network of an excerpt of introductory stats topics might look like.

In graph theory, this is known as a directed acyclic graph (DAG). DAGs have the property that there exists at least one ordering of nodes such that no node in the ordering is connected to ("pointing to") a node earlier in the ordering. This is called a topological sort. For most DAGs, there are a number of different orderings that satisfy the ‘dependency’ constraints.

To get a list of all of them, I wrote a small library and set of algorithms in Haskell. You can view it here but the "meat" of the algorithm is in the following snippet that recursively adds all nodes with no children (topics that have no topics that depend on them) to a list of possible alternatives and removes the childless nodes. This is repeated until there are no nodes left to remove. A potential snag is that the function only takes one path but each function call may generate multiple alternate paths. However, if we view the output of the "gatherAllChildless" function as a non-deterministic computation, we can exploit the fact that the path of nodes is a monad and have the function recursively call itself inside of a monadic bind.
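For readers who don't know Haskell, here is a rough R sketch of the same childless-node-removal idea; it produces just one valid ordering rather than all of them, and the two-column "prerequisite, dependent" edge matrix is my own representation, not the one used in TopoSort.hs:

topo.order <- function(nodes, edges){
    # edges is a two-column character matrix: edges[k,1] is a prerequisite of edges[k,2]
    ordering <- character(0)
    while(length(nodes) > 0){
        # "childless" topics: no remaining topic depends on them
        childless <- setdiff(nodes, edges[, 1])
        if(length(childless) == 0) stop("cycle detected; no topological ordering exists")
        # topics removed in later rounds belong earlier, so prepend each new batch
        ordering <- c(childless, ordering)
        nodes    <- setdiff(nodes, childless)
        edges    <- edges[!(edges[, 2] %in% childless), , drop = FALSE]
    }
    ordering
}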

The Haskell implementation has a sub-quadratic time complexity (< O(n^2))… not too bad. There are 26 possible orderings of the topics that satisfy these "knowledge dependencies", including:

probability -> central tendency -> measures of dispersion -> sampling theory -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST

central tendency -> probability -> measures of dispersion -> probability distributions -> sampling theory -> sampling distributions -> central limit theorem -> statistical inference -> NHST


There are a few of these orderings that intuitively seem like poor choices. Taking the first one, for example: it might be strange to start a book on statistics with probability when readers may want to get started with univariate analysis right away. Looking at the second one, it seems strange to stick "probability" in between "central tendency" and "measures of dispersion", even though it can technically be done, because most people expect highly related topics to be positioned next to each other.

One way of cutting down on the list is to label each topic node with a difficulty level, and choose the ordering which causes the fewest backwards jumps in difficulty level. This should represent the path that has the most gentle level-of-difficulty slope.
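That scoring idea takes only a couple of lines of R to sketch (assuming `ordering` is a character vector of topics and `difficulty` is a named numeric vector like the mapping below):

backward.jumps <- function(ordering, difficulty){
    # count how many adjacent pairs in the ordering drop in difficulty level
    sum(diff(difficulty[ordering]) < 0)
}

The ordering with the fewest backward jumps is taken to have the gentlest slope.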

Given the algorithms from lines 67 to 78 of TopoSort.hs and the following (subjective) difficulty mapping:

"central tendency": "1"
"measures of dispersion": "2"
"sampling theory": "3"
"sampling distributions": "3"
"central limit theorem": "5"
"probability": "4"
"probability distributions": "3"
"statistical inference": "5"
"NHST": "5"


the “optimal” ordering is:

central tendency -> measures of dispersion -> sampling theory -> probability -> sampling distributions -> probability distributions -> central limit theorem -> statistical inference -> NHST


Yay! This is pretty close to the ordering I chose.

--

The most truly difficult thing about sorting this out is that the statistics topic network diagram is not a DAG. This means that there is no ordering possible that doesn’t appeal to topics yet unlearned. For example, explaining why sample standard deviation divides by n-1 instead of n requires appealing to sampling theory, which requires a good foundation in measures of dispersion to understand. There are a few more of these cyclical relationships in the field.

All of these instances require some hand-waving on the part of the writer or lecturer ("don't worry about why we divide by 'n-1', we’ll get to that later") and adds to the learner's perceived difficulty of grasping the field.

The best way to reconcile these circular knowledge dependencies is to introduce weight to the edges that represent the extent to which a topic requires knowledge of another. Then, a cycle detection algorithm can be run on the graph. Once all the cycles are detected, the edges in the cycles with the lowest weight can be systematically removed until there are no more cycles and the graph is a DAG. At that point, the specialized topo sort from above may be used. I plan on implementing this when I have more time :)

--

It's my hope that these and other quantitative methods for planning curricula can be applied to other legendarily confusing fields of study. These methods could even be applied to entire undergraduate course catalogues and major requirements to guide students over 4+ years of undergraduate study.

I'm all about that bootstrap ('bout that bootstrap)

As some of my regular readers may know, I'm in the middle of writing a book on introductory data analysis with R. I'm at the point in the writing of the book now where I have to make some hard choices about how I'm going to broach the topic of statistical inference and hypothesis testing.

Given the current climate against NHST (the journal Basic and Applied Social Psychology banned it) and my own personal preferences, I wasn't sure just how much to focus on classical hypothesis testing.

I didn't want to burden my readers with spending weeks trying to learn the intricacies of NHST just to have them being told to forget everything they know about it and not be able to use it without people making fun of them.

So I posed a question to twitter: "Is it too outlandish to not include the topic of parametric HTs in an intro book about data analysis. Asking for a friend.. named Tony…. You know, in favor of bootstrapped CIs, permutation tests, etc…"

To which my friend Zach Jones (@JonesZM) replied: "they could at least be better integrated with monte-carlo methods. i think they'd make it easier to understand". I agreed, which is why I'm proceeding with my original plan to introduce classical tests after and within the context of Monte Carlo bootstrapping (as opposed to exhaustive bootstrapping).

Even though I'm a huge fan of the bootstrap, I want to be careful not to further any misconceptions about it—chiefly, that bootstrapping is a cure-all for having a small sample size. To be able to show how this isn’t the case, I wrote an R script to take 1,000 samples from a population, calculate 95% confidence intervals using various methods and record the proportion of times the population mean was within the CIs.

The four ways I created the CIs were (a rough sketch of computing all four for a single sample follows this list):

• the z interval method: which assumes that the sampling distribution is normal around the sample mean (1.96 * the standard error)
• the t interval method: which assumes that the population is normally distributed and the sampling distribution is normally distributed around the sample mean (t-distribution quantile at .975 [with appropriate degrees of freedom] * standard error)
• basic bootstrap CI estimation (with boot() and boot.ci() from the boot R package)
• adjusted percentile CI estimation (with boot() and boot.ci() from the boot R package)
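Here is that sketch (assuming `a.sample` is a numeric vector already drawn from the population; the real script loops this over 1,000 samples and several sample sizes):

library(boot)

n      <- length(a.sample)
xbar   <- mean(a.sample)
stderr <- sd(a.sample) / sqrt(n)

z.interval <- xbar + c(-1, 1) * qnorm(.975) * stderr           # z interval
t.interval <- xbar + c(-1, 1) * qt(.975, df = n - 1) * stderr  # t interval

boot.out <- boot(a.sample, function(x, i){ mean(x[i]) }, R = 10000)
boot.cis <- boot.ci(boot.out, type = c("basic", "bca"))
basic.interval <- boot.cis$basic[4:5]   # basic bootstrap CI
bca.interval   <- boot.cis$bca[4:5]     # adjusted (BCa) percentile CI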

I did this for various sample sizes and two different distributions, the normal and the very non-normal beta distribution (alpha=0.5, beta=0.5). Below is a plot depicting all of this information.

So, clearly the normal (basic) boot doesn’t make up for small sample sizes.

It's no surprise that the t interval method blows everything else out of the water when sampling from a normal distribution. It even performs reasonably well with the beta distribution, although the adjusted bootstrap wins out for most sample sizes.

In addition to recording the proportion of the times the population mean was within the confidence intervals, I also kept track of the range of these intervals. All things being equal, narrower intervals are far preferable to wide ones. Check out this plot depicting the mean ranges of the estimated CIs:

The t interval method always produces huge ranges.

The adjusted bootstrap produces ranges that are more or less on par with the other three methods BUT it outperforms the t interval method for non-normal populations. This suggests that the adjustments to the percentiles of the bootstrap distribution do a really good job at correcting for bias. It also shows that, if we are dealing with a non-normal population (common!), we should use adjusted percentile bootstrapped CIs.

Some final thoughts:

• The bootstrap is not a panacea for small sample sizes
• The bootstrap is cool because it doesn’t assume anything about the population distribution, unlike the z and t interval methods
• Basic bootstrap intervals are whack. They’re pathologically narrow for small sample sizes.

Also, if you're not using Windows, you can parallelize your bootstrap calculations really easily in R; below is the way I bootstrapped the mean for this project:

library(boot)
dasboot <- boot(a.sample, function(x, i){ mean(x[i]) }, 10000,
                parallel="multicore", ncpus=4)


which uses 4 cores to perform the bootstrap in almost one fourth the time.

In a later post, I plan to further demonstrate the value of the bootstrap by testing differences in means and show why permutation tests comparing means between two samples are always better than t-testing.

Playing around with #rstats twitter data

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag "#rstats" that denotes that a tweeter is tweeting about R.

As a warning, I don't know much about how these data were collected—whether they were collected at random times during the day or whether the collection was biased toward particular times and, therefore, locations. I wouldn't really read too much into this.

Most common co-occurring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occurring hashtags, I used the following perl script:

#!/usr/bin/perl

while(<>){
    chomp;
    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#\w+)/;
    print join "\n" => @matches if @matches;
}

which uses the regular expression "(#\w+)" to search for hashtags after removing "#rstats" from every tweet. On the unix command line, I put these other hashtags into a file and sorted them via these commands:

cat data/R-hashtag-data.txt | ./PERL_SCRIPT_ABOVE.pl | tee other-hashtags.txt
sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a numbered list of co-occurring hashtags, sorted in descending order. The top 10 co-occurring hashtags were as follows (you can see the rest here):

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of "#python" and "#ggplot2" in the top 10 made me wonder what the top 10 programming-language and R-package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2 (plus 298 for #ggplot) (plus 6 for #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for #rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here. I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about--at least as far as hashtags-applied-to-a-package go.

#strangehashtags

Before moving on to the next section, I wanted to share my favorite co-occurring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters

One of the first things I did with these data was a simple aggregation and sort to find the tweeters that used the hashtag most often:

library(dplyr)
THE_DATA %>%
    group_by(User) %>%
    summarise(count = n()) %>%
    arrange(desc(count)) -> prolific.rstats.tweeters

Here is the top 10 (you can see the rest here.)

@Rbloggers       1081
@hadleywickham    498
@timelyportfolio  427
@recology_        419
@revodavid        210
@chlalanne        209
@adolfoalvarez    199
@RLangTip         175
@jmgomez          160

Nothing terribly surprising here.

Normalizing by total tweets

In a twitter discussion about these data, a twitter friend, Tim Hopper, posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would have a higher number of #rstats tweets if you control for total tweet volume. I wondered how this sorting would look. Answering this question gave me an excuse to use Hadley Wickham's new package, rvest (I literally just got why the package is named as much while typing this out), which makes web scraping easier--in part by leveraging the expressive power of the magrittr package.
To get the total number of tweets for a particular tweeter, I wrote the following function:

library(rvest)
library(magrittr)

get.num.tweets <- function(handle){
    tryCatch({
        unraw <- function(raw_str){
            raw_str <- sub(",", "", raw_str)       # remove commas if any
            if(grepl("K", raw_str)){
                return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
            }
            return(as.numeric(raw_str))
        }
        html(paste0("http://twitter.com/", sub("@", "", handle))) %>%
            html_nodes(".is-active .ProfileNav-value") %>%
            html_text() %>%
            unraw
    },
    error=function(cond){return(NA)})
}

The real logic (and beauty) of which is contained only in the last few lines:

html(paste0("http://twitter.com/", sub("@", "", TWITTER_HANDLE))) %>%
    html_nodes(".is-active .ProfileNav-value") %>%
    html_text()

The CSS element that houses the number of total tweets on a useR's twitter page was found easily using SelectorGadget. After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I'm considerate), I divided the number of #rstats tweets by the total number of tweets to come up with a normalized value. The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio
1     @medzihorsky     9            28 0.3214286
2        @statworx     5            16 0.3125000
3    @LearnRinaDay   114           404 0.2821782
4  @RforExcelUsers     4            15 0.2666667
5     @showmeshiny    27           102 0.2647059
6           @tcrug     6            25 0.2400000
7   @DailyRpackage   155           666 0.2327327
8   @R_Programming    49           250 0.1960000
9        @hexadata     8            41 0.1951220
10     @Deep_RHelp    11            58 0.1896552

In case you were wondering, Trey Causey still "won" by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]
Source: local data frame [1 x 4]

User count num.of.tweets        ratio
1 @tdhopper     8         26700 0.0002996255
> tweeters[which(tweeters$User=="@treycausey"),]
Source: local data frame [1 x 4]

         User count num.of.tweets      ratio
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using the number of #rstats tweets as a proxy for who likes R the most or who the biggest R "thought leaders" are (whatever that is). Most tweets about R don't use the #rstats hashtag, anyway. Again, I wouldn't read too much into this :)

Assertive R programming in dplyr/magrittr pipelines

A lot of my job–and side projects, for that matter–involves running R scripts on updates of open government data. While I'm infinitely grateful to have access to any interesting open datasets in the first place, I can't ignore that dealing with open data is often a messy affair. In fact, this seems to be characteristic of most data sets I work with, open access or otherwise.

So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses that return components that are components of still other analyses. Then there's an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happens:

• In the best case scenario, everything works because there were no errors in the data.
• In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it's not clear what datum caused the error.
• In the worst case scenario, nothing breaks and the error is only caught when the results–or part of them–are nonsensical.
• In the worst worst case scenario, the results or some of the results are wrong but it looks ok and it goes undetected.

In an effort to help solve this common problem–and inspired by the elegance of dplyr/magrittr pipelines–I created an R package called assertr.

assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs over the truth assurance functions already in R (like stopifnot) is that they needn't interrupt the flow of the pipeline.

Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R's Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.

mtcars %>%
    verify(am %in% c(0,1)) %>%
    group_by(am) %>%
    summarise(mean.mpg=mean(mpg))

#   am mean.mpg
#    0 17.14737
#    1 24.39231

Let's say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read obsessive/compulsive) tendencies than yours truly.
If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check that the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon is a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:

mtcars %>%
    assert(in_set(4, 6, 8), cyl) %>%
    assert(within_bounds(10, 40), mpg) %>%
    group_by(cyl) %>%
    summarise(mean.mpg=mean(mpg))

#   cyl mean.mpg
#     4 26.66364
#     6 19.74286
#     8 15.10000

Perhaps one day there will be cars that have more than 8 cylinders or fewer than 2. We might want to only check that there is an even number of cylinders (since it has to be even, I think); we can change the first assert line to:

assert(function(x) x%%2==0, cyl) %>%

assertr subscribes to the general idea that it is better to fail fast in order to spot data errors early. The benefit of assertr's particular approach is that it's friendly to the pipeline paradigm used by magrittr and dplyr. The best thing about assertr's approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.

To learn more about assertr and the kinds of assertions that you can make with it, visit its page on github. You can also read the vignette here.

What does Flatland have to do with Haskell?

Edwin Abbott's 1884 novella, Flatland, recounts the misadventures of a square that lives in a two-dimensional world called "Flatland". In this story, the square has a dream where he visits a one-dimensional world (Lineland) and unsuccessfully tries to educate the populace about Flatland's existence. Shortly thereafter, a sphere visits Flatland to introduce our protagonist to his own home, Spaceland. The sphere looks like a circle to the square, because the square can only see the part of the sphere that intersects Flatland's plane. The square can't fathom Spaceland until the sphere actually brings him into the third dimension.

Having his view of reality sufficiently rocked, the square postulates the existence of still higher dimensional lands. The sphere denies this possibility and returns the square to Flatland on bad terms. The irony here is that, in spite of being aware of the square's previous insistence that Flatland is the only reality there is, the square's corresponding limitation of thought, and the square's subsequent enlightenment, the sphere arrogantly asserts that his own land represents the limits of dimensionality and that there can be no more dimensions.

I begin with this summary not as an impromptu book report, but because this is a perfect analogy for why I've decided to learn Haskell.

-

It isn't hard to imagine the inveterate C programmer being hesitant to embrace higher-level languages like Python and Perl. "Imagine the whole world that opens up to you when you don't have to worry about memory management and resolving pointers," requests the proselytizing Python programmer. But the C programmer got on fine without knowing Python all this time. Besides, the (standard) Python interpreter is written in C.

"Why? So I can change types on a whim?" retorts the C programmer. "...so I can pass functions as arguments to other functions? Big deal! I can do that with function pointers in C."
While it is technically true that, in C, functions can be passed as arguments to other functions, and can be returned by functions, this is only a small part of the functional programming techniques that Python allows. But it is the only part that most programmers trained only in C can understand. This is like when the square thought that he saw the sphere because he saw a circle where the sphere intersected Flatland. There's a bigger picture (lambda functions, currying, closures) that only appears when the constraints imposed by a programming paradigm are pushed back.

Shifting tides in industry eventually compel the C programmer to give Python a shot. Once she is fluent in Python and its idioms (becomes a "pythonista", as they say), things begin to change. The OOP of Python allows the programmer to gain a new perspective on programming that had been, up to this point, unavailable to her simply because the concept doesn't exist in C.

Resolved to follow the road to higher abstraction, the (former) C programmer asks the Python programmer if there are other languages that will provoke a similar shift in thought relative to even Python. "Haskell, perhaps?" she offers. The Python programmer scoffs, "Nah, Python is as good as it gets. Besides, purely functional programmers are a bunch of weirdos."

Haskell is a relatively obscure programming language (20th place or lower on most popularity indices), but it accounts for a disproportionate number of the "this-language-will-change-the-way-you-look-at-programming" claims. By the accounts of Haskell aficionados, finally understanding Haskell is a transformative experience.

Up until recently, I found myself mirroring the thoughts and behavior of the Python programmer in this story; I kept hearing about how great Haskell is, and how it'll facilitate an enormous shift in how I'll view computer programming, but I largely dismissed these claims as the rantings of eccentrics more concerned with mathematical elegance than practical matters. Is it any wonder that I didn't believe the Haskell programmers? I can't see the benefits of Haskell within the constraints of how I view computer programming—constraints subtly imposed by my language of choice. (Lisp programmer Paul Graham describes this as the "Blub Paradox" in his essay Beating The Averages.)

But just like the sphere and the arrogant Python programmer should, I have to remain open to the idea that there's a whole world out there that I'm not availing myself of just because I'm too obstinate to try it.

Any language that is Turing-complete will let you do anything. It's trivially true that there is nothing that you can do in one language that you can't do in another. What sets languages apart (from most trivial to least) is the elegance of the syntax, its community and third-party packages, the ease with which you can perform certain tasks, and whether you'll achieve enlightenment as a result of using it. I don't know if Haskell will do this for me, but I think I ought to give it a shot.

---

If the idea of relativity applied to the domain of programming languages interests you, you might also be interested in the Sapir–Whorf hypothesis, which applies similar ideas to the realm of natural languages.
Sending text messages at random times using python

Given my interest in applying statistics and analytics to most (if not all of the) quantifiable aspects of my life, when I learned about self-tracking, and the associated 'Quantified Self' movement, it should come as no surprise to anyone that knows me that I wanted to get started right away.

And... given my interest in making life harder than it needs to be, it makes sense that I would eschew existing self-tracking tools and build my own. A neat side-effect of this obstinance is getting to learn new things.

The basic idea is that at random times during the day, I fill out a survey that I designed for myself, including questions such as: "How happy are you right now?", "How much energy would you say that you have right now?", and "Where are you right now?".

The most reliable and fastest way to get in touch with me is to send a text message. So, sending myself text messages at random times during the day is the best way to prompt me to fill out this self-tracking survey. To make it easier (and, therefore, more likely that I'll fill it out), the content of the text message should be a link to the survey on the web. And in order to add flexibility to when I have to fill out the survey form but also preserve the randomness of the sampling, the timestamp of the time the text message was sent should be included as a url parameter so that it can be stored in the database along with the answers to the survey.

The service that sends these text messages runs on a Debian GNU/Linux EC2 instance that also hosts the form I fill out and the database that the answers are dumped to.

Before we get to the code, I should explain the modules that we will need for this task, and my rationale for choosing them.

logging

Trying to debug a scheduled task or workflow is a living hell without proper and verbose logging. Since this must be run in the background (and not tied to a particular terminal emulator), simple print statements will not do. The more elegant, scalable, and extensible solution is to use Python's excellent 'logging' module.

smtplib

While there are a few different ways to send text messages (SMS) using Python, the solution I settled on is to use the 'smtplib' standard library module to send an email to an SMS gateway. This gateway will convert the email into a text message sent to my phone. smtplib is needed to send the email message.

apscheduler

Although cron (or, equivalently [?], Windows Scheduling Service) should be the tool of choice when scheduling commands to be run at specific times that never change, the fact that the text messages have to be sent at different times every day requires another solution. Probably the most elegant and cross-platform solution is to use the advanced python scheduling library, apscheduler. The Python standard library comes with a similar module, sched, but apscheduler is more advanced in its scheduling capability and its ability to persistently store tasks in a database that survives process restarts. (It supports storage in SQLite, PostgreSQL, MongoDB, Redis, MySQL, Oracle, MS-SQL, Firebird, and Sybase.) But, unlike its standard library counterpart, it needs to be pip installed.
We will divide this task up into two python scripts: one that gets run once a day, computes n random times, schedules the text messages for those times, and then sends them (we will call this send_daily_texts.py), and one script that runs once and calls send_daily_texts at midnight every day (we will call this run_everyday.py).

send_daily_texts.py

#!/usr/bin/python -tt

import random
import sys
import logging
import smtplib
import email.utils
from email.mime.text import MIMEText
from datetime import datetime, timedelta, date
from apscheduler.schedulers.blocking import BlockingScheduler

# create logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler('send_daily_texts.log')
handler.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.info("[{}] - send_daily_texts was run".format(datetime.now()))

# the number of times to schedule and send text messages
# is provided as a command line argument
n = int(sys.argv[1])
logger.info("[{}] - going to choose {} random times".format(datetime.now(), n))

# we need to parse today's date to properly
# schedule the text message sending
dadate = datetime.now()
year = dadate.year
month = dadate.month
day = dadate.day

# the lower bound is 8 o'clock
lower_bound = datetime(year, month, day, 8, 0, 0)
logger.info("[{}] - the lower bound is {}".format(datetime.now(), lower_bound))

# the upper bound is 11 o'clock PM
upper_bound = datetime(year, month, day, 23, 0, 0)
logger.info("[{}] - the upper bound is {}".format(datetime.now(), upper_bound))

sched = BlockingScheduler()
logger.info("[{}] - Created blocking scheduler".format(datetime.now()))

wherefrom = 'YOUEMAILACCOUNTYOCREATE AT gmail DOT com'
whereto = 'YOURPHONENUMBER AT YOURSMSGATEWAY DOT com'
gmail_pw = 'YOURGMAILPASSWORD'

def encode_timestamp(timestamp):
    return str(timestamp).replace(" ", "+").replace(":", "%3A")

def make_message(timestamp, wherefrom, whereto):
    slug = "http://THELINKURL/?timestamp={}".format(encode_timestamp(timestamp))
    msg = MIMEText(slug)
    msg['To'] = email.utils.formataddr(('Recipient', whereto))
    msg['From'] = email.utils.formataddr(('Author', wherefrom))
    msg['Subject'] = 'Time for the survey!'
    return msg

def send_text(should_exit=False):
    logger.info('[{}] - trigger triggered, going to send text'.format(datetime.now()))
    logger.info('[{}] - attempting to connect to gmail'.format(datetime.now()))
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login(wherefrom, gmail_pw)
    logger.info('[{}] - successfully connected to gmail'.format(datetime.now()))
    timestamp = datetime.now()
    msg = make_message(timestamp, wherefrom, whereto)
    damsg = msg.as_string()   # the full message (headers included) as a string
    logger.info('[{}] - going to send message {} to {}'.format(datetime.now(),
                                                               damsg.replace('\n', '<br>'),
                                                               whereto))
    ret = server.sendmail(wherefrom, [whereto], damsg)
    server.quit()
    if should_exit:
        logger.info('[{}] - finished... going to exit'.format(datetime.now()))
        sched.shutdown(wait=False)

def random_time(start, end):
    sec_diff = int((end-start).total_seconds())
    secs_to_add = random.randint(0, sec_diff)
    return start + timedelta(seconds=secs_to_add)

def get_n_random_times(n, start, end):
    times = []
    for i in range(0, n):
        times.append(random_time(start, end))
    times.sort()
    return times

times = get_n_random_times(n, lower_bound, upper_bound)
logger.info("[{}] - Received {} times to schedule".format(datetime.now(), len(times)))

for ind, atime in enumerate(times):
    if ind == (n-1):
        sched.add_job(send_text, 'date', run_date=atime, kwargs={"should_exit": True})
        logger.info("[{}] - added last task at {}".format(datetime.now(), atime))
    else:
        sched.add_job(send_text, 'date', run_date=atime)
        logger.info("[{}] - added task at {}".format(datetime.now(), atime))

sched.start()
logger.info("[{}] - everything is done".format(datetime.now()))

Before I describe "run_everyday.py", there are a few things I should note about the snippet above.

First, when I originally wrote this script, the text messages wouldn't send even though the logger indicated that they had. I assumed this was because gmail rejected the message because it didn't look enough like an email message. In order to correct this, I needed to use the email.mime.text module to add the standard email headers to the message to be sent.

Second, since I am only interested in experience sampling my waking life, I didn't want to fill out the survey during hours that I am normally asleep. I had to make sure that I set 8 o'clock and 23 o'clock (11pm) as my lower and upper bound, respectively.

Third, if you decide to cannibalize this code, make sure you change the values for 'wherefrom', 'whereto', and 'gmail_pw'. The SMS gateway you should use depends upon your mobile carrier. My particular SMS gateway is my 10 digit phone number @vtext.com. Yours will likely be different–consult this list.

run_everyday.py

#!/usr/bin/python -tt

import sys
import logging
from datetime import datetime
from subprocess import Popen, PIPE
from apscheduler.schedulers.blocking import BlockingScheduler

def run_daily_surveys(thelogger):
    thelogger.info("[{}] - Trigger triggered".format(datetime.now()))
    thelogger.info("[{}] - Going to run daily script".format(datetime.now()))
    p = Popen('./send_daily_texts.py 3', shell=True, stdout=PIPE, stderr=PIPE)
    out, err = p.communicate()
    if p.returncode:
        thelogger.error("[{}] - Failed to run daily script".format(datetime.now()))
        sys.exit("Failed to run daily script")
    thelogger.info("[{}] - Ran daily script".format(datetime.now()))
    if p.returncode:
        sys.exit("Command failed to run")

def main():
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler('run_everyday.log')
    handler.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.info("[{}] - run_everyday.py was run".format(datetime.now()))

    sched = BlockingScheduler()
    logger.info("[{}] - blocking scheduler was created".format(datetime.now()))

    sched.add_job(run_daily_surveys, 'interval', days=1, args=[logger])
    logger.info("[{}] - everyday task added, going to start the scheduler".format(datetime.now()))

    sched.start()
    return 0

if __name__ == '__main__':
    STATUS = main()
    sys.exit(STATUS)

I've been running these tasks for about a week now, and it's working great! My next couple of blog posts will be about the server-side code and architecture that support my self-tracking project.

Why is my OS X Yosemite install taking so long?: an analysis

Why?
Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang, it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute. Not knowing very much about OS X's installation process, wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

Cleaning

The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively, so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl. This script will read the log file, line by line, from standard input (for easy shell piping) and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl

use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;

    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with blank messages
    next if $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}


We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./clean-log.pl >> cleaned.log


This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.

Analysis

Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)


Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s "parse_date_time()" function

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
    mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
    mutate(lub.time=parse_date_time(nice.date,
                                    "%b %d! %Y! %H!:%M!:%S!",
                                    tz="EST"))


And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]
head(yos.log)

##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                     Message
## 1                    opendirectoryd (build 382.0) launched - installer mode
## 2                                    Logging level limit changed to 'notice'
## 3                                                 Initialize trigger support
## 4  created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5                                  set default handler for RPC 'reset_cache'
## 6                             set default handler for RPC 'reset_statistics'
##              nice.date            lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took.

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))

## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting the start time from each row's time.

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)


In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log and which services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentage of the whole for easy plotting.

counts <- yos.log %>%
    group_by(Service) %>%
    summarise(n=n(), totalTime=sum(elapsed)) %>%
    arrange(desc(n)) %>%
    top_n(8, n) %>%
    mutate(percent.n = n/sum(n)) %>%
    mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
##
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and for the majority of the total installation time. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process', as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service", "percent.n", "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
    geom_bar(width=.8, stat="identity", position="dodge") +
    ggtitle("Breakdown of services during installation by writes to log") +
    ylab("percent") + xlab("service") +
    scale_fill_discrete(name="Percent of",
                        breaks=c("percent.n", "percent.totalTime"),
                        labels=c("writes to logfile", "time elapsed"))

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time of the installation process.

What were the 5 most lengthy processes?

yos.log %>%
    arrange(desc(elapsed)) %>%
    select(Service, Message, elapsed) %>%
    head(n=5)

##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                   System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                        kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                          Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

• Unpacking and moving the contents of "Essentials.pkg" into what is to become the new system directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc.). (almost three minutes)
• Archiving the old system logs (two and a half minutes)
• Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
    geom_density(adjust=3, fill="#0072B2") +
    ggtitle("Density plot of number of writes to log file during installation") +
    xlab("time") + ylab("")

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
    if(time > start && time < end) return(TRUE)
    return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in,
the.start,
the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
select(Message) %>%
sample_n(7) %>%
apply(1, function (x){cat("\n");cat(x);cat("\n")})


##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## WARNING : ensureParentPathExists: Created  /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)


Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log file writes by process. This might give us a sense of what steps go on as the installation progresses by showing us which processes are most active at each point.

# reduce the number of services to a select few of the most active
smaller <- yos.log %>%
filter(Service %in% c("OSInstaller", "opendirectoryd",
"Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
geom_density(aes( y = ..scaled..)) +
ggtitle("Faceted density of log file writes by process (scaled)") +
xlab("time") + ylab("")


This shows that no one process runs consistently throughout the entire installation, but rather that the processes run in spurts.

The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed in the '/usr/local' and '/Library' directories that wasn't Apple-supplied needed to be moved and then moved back.

/usr/local/
/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom-compiled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink, since it had over 123,491 files.
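If you're curious how much of this sort of thing lives on your own machine, here's a rough sketch of how you could check from R; the paths are just examples and should be adjusted for your own setup.

# what non-Apple software lives under /usr/local?
list.dirs("/usr/local", recursive=FALSE)

# roughly how many files would the installer have to shuffle for texlive alone?
length(list.files("/usr/local/texlive", recursive=TRUE))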

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

/Library/
Among other things, this directory holds (subdirectories that hold) the modules and packages that the Apple-supplied Python, Ruby, and Perl use. If you use these Apple-supplied versions of these languages and install your own packages/modules using super-user privileges, the new packages will go into this directory and appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you download and install yourself or install via MacPorts.

---

You can find all the code and logs that I used for this analysis in this git repository.

This post is also available as an RMarkdown report here.

Fun with .Rprofile and customizing R startup

Over the years, I've meticulously compiled (and version controlled) extensive configuration files for virtually all of my most-used utilities, most notably vim, tmux, and zsh.

In fact, one of the only configurable utilities for which I had no special configuration schema was R. This is extremely surprising, given that I use R every day.

One reason, I think, is that I came to R from general-purpose programming languages, which offer no provision for configuring the language in a way that would actually change results or program output.

I only vaguely knew that .Rprofile was a configuration file that some people used, and that others warned against using, but it never occurred to me to actually use it for myself.

Because I never used it, I developed odd habits and rituals in my interactive R programming, including adding "stringsAsFactors=FALSE" to all of my "read.csv" function calls and making frequent calls to the "options()" function.
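To make that concrete, this is roughly what those rituals looked like at the start of every interactive session (the file names here are purely hypothetical):

# re-typed, session after session, before .Rprofile entered my life
options(max.print=100)
surveys <- read.csv("surveys.csv", stringsAsFactors=FALSE)   # hypothetical file
scores  <- read.csv("scores.csv",  stringsAsFactors=FALSE)   # hypothetical file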

Since I actually began to use and expand my R configuration, though, I've realized how much I've been missing. I pre-set all my preferred options (saving time) and I've even made provisions for some cool tricks and hacks.

That being said, there's a certain danger in using a custom R profile but we'll talk about how to thwart that later.

The R Startup Process

In the absence of any command-line flags being used, when R starts up, it will "source" (run) the site-wide R startup configuration file/script if it exists. In a fresh install of R, this will rarely exist, but if it does, it will usually be in '/Library/Frameworks/R.framework/Resources/etc/' on OS X, 'C:\Program Files\R\R-***\etc\' on Windows, or '/etc/R/' on Debian. Next, it will check for a .Rprofile hidden file in the current working directory (the directory where R is started on the command-line) to source. Failing that, it will check your home directory for the .Rprofile hidden file.
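As an aside, R also honors a pair of environment variables (documented in help(Startup)) that override where it looks for these files; a quick way to see whether either is set on your system:

# if set, these point R at alternate startup scripts
Sys.getenv("R_PROFILE")        # overrides the site-wide Rprofile.site location
Sys.getenv("R_PROFILE_USER")   # overrides the user-level .Rprofile location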

You can check if you have a site-wide R configuration script by running

R.home(component = "home")


in the R console and then checking for the presence of Rprofile.site in that directory's 'etc' subdirectory. The presence of a user-defined .Rprofile can be checked for in the home directory that the path

path.expand("~")


indicates.
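If you'd rather script both checks, here's a minimal sketch using base R; it assumes the standard layout in which Rprofile.site lives under R_HOME/etc.

# is there a site-wide startup script?
file.exists(file.path(R.home("etc"), "Rprofile.site"))

# is there a user-level .Rprofile in the home directory?
file.exists(file.path(path.expand("~"), ".Rprofile"))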

More information on the R startup process can be found here and here.

The site-wide R configuration script
For most R installations on primarily single-user systems, using the site-wide R configuration script should be given up in favor of the user-specific configuration. That being said, a look at the boilerplate site-wide R profile that Debian furnishes (but comments out by default) provides some good insight into what might be a good idea to include in this file, if you choose to use it.

##                      Emacs please make this -*- R -*-
## empty Rprofile.site for R on Debian
##
## Copyright (C) 2008 Dirk Eddelbuettel and GPL'ed
##
## see help(Startup) for documentation on ~/.Rprofile and Rprofile.site

# ## Example of .Rprofile
# options(width=65, digits=5)
# options(show.signif.stars=FALSE)
# setHook(packageEvent("grDevices", "onLoad"),
#         function(...) grDevices::ps.options(horizontal=FALSE))
# set.seed(1234)
# .First <- function() cat("\n   Welcome to R!\n\n")
# .Last <- function()  cat("\n   Goodbye!\n\n")

# ## Example of Rprofile.site
# local({
#  # add MASS to the default packages, set a CRAN mirror
#  old <- getOption("defaultPackages"); r <- getOption("repos")
#  r["CRAN"] <- "http://my.local.cran"
#  options(defaultPackages = c(old, "MASS"), repos = r)
#})


Two things you might want to do in a site-wide R configuration file are to add other packages to the default packages list and to set a preferred CRAN mirror. Other things the above snippet indicates you can do are setting various width and number-display options, setting a random-number seed (making random number generation deterministic for reproducible analysis), and hiding the stars that R shows for different significance levels (ostensibly because of their connection to the much-maligned NHST paradigm).

The user-specific .Rprofile
In contrast to the site-wide config (that will be used for all users on the system), the user-specific R configuration file is a place to put more personal preferences, shortcuts, aliases, and hacks. Immediately below is my .Rprofile.

local({r <- getOption("repos")
r["CRAN"] <- "http://cran.revolutionanalytics.com"
options(repos=r)})

options(stringsAsFactors=FALSE)

options(max.print=100)

options(scipen=10)

options(editor="vim")

# options(show.signif.stars=FALSE)

options(menu.graphics=FALSE)

options(prompt="> ")
options(continue="... ")

options(width = 80)

q <- function (save="no", ...) {
quit(save=save, ...)
}

utils::rc.settings(ipck=TRUE)

.First <- function(){
if(interactive()){
library(utils)
timestamp(,prefix=paste("##------ [",getwd(),"] ",sep=""))

}
}

.Last <- function(){
if(interactive()){
hist_file <- Sys.getenv("R_HISTFILE")
if(hist_file=="") hist_file <- "~/.RHistory"
savehistory(hist_file)
}
}

if(Sys.getenv("TERM") == "xterm-256color")
library("colorout")

sshhh <- function(a.package){
suppressWarnings(suppressPackageStartupMessages(
library(a.package, character.only=TRUE)))
}

auto.loads <- c("dplyr", "ggplot2")

if(interactive()){
invisible(sapply(auto.loads, sshhh))
}

.env <- new.env()

.env$unrowname <- function(x) {
rownames(x) <- NULL
x
}

.env$unfactor <- function(df){
id <- sapply(df, is.factor)
df[id] <- lapply(df[id], as.character)
df
}

attach(.env)



[Lines 1-3]: First, because I don't have a site-wide R configuration script, I set my local CRAN mirror here. My particular choice of mirror is largely arbitrary.

[Line 5]: If stringsAsFactors hasn't bitten you yet, it will.

[Line 9]: Setting 'scipen=10' effectively forces R to never use scientific notation to express very small or large numbers.
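A quick illustration of the effect (expected output shown as comments):

print(1000000); print(0.0001)
# with R's defaults:           1e+06    1e-04
# after options(scipen=10):  1000000   0.0001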

[Line 13]: I included the snippet to turn off significance stars because it is a popular choice, but I have it commented out because ever since 1st grade I've used number of stars as a proxy for my worth as a person.

[Line 15]: I don't have time for Tk to load; I'd prefer to use the console, if possible.

[Lines 17-18]: Often, when working in the interactive console I forget a closing brace or paren. When I start a new line, R changes the prompt to "+" to indicate that it is expecting a continuation. Because "+" and ">" are the same width, though, I often don't notice and really screw things up. These two lines make the R REPL mimic the Python REPL by changing the continuation prompt to the wider "...".

[Lines 22-24]: Change the default behavior of "q()" to quit immediately and not save workspace.

[Line 26]: This snippet allows you to tab-complete package names for use in "library()" or "require()" calls. Credit for this one goes to @mikelove.

[Lines 28-34]: There are three main reasons I like to have R save every command I run in the console into a history file.

• Occasionally I come up with a clever way to solve a problem in an interactive session that I may want to remember for later use; instead of it getting lost in the ether, if I save it to a history file, I can look it up later.
• Sometimes I need an alibi for what I was doing at a particular time
• I ran a bunch of commands in the console to perform an analysis not realizing that I would have to repeat this analysis later. I can retrieve the commands from a history file and put it into a script where it belongs.

These lines instruct R to, before anything else, echo a timestamp to the console and to my R history file. The timestamp greatly improves my ability to search through my history for relevant commands.

[Lines 36-42]: These lines instruct R, right before exiting, to write all commands I used in that session to my R command history file. I usually have this set in an environment variable called "R_HISTFILE" on most of my systems, but in case I don't have this defined, I write it to a file in my home directory called .Rhistory.

[Line 44]: Enables the colorized output from R (provided by the colorout package) on appropriate consoles.

[Lines 47-50]: This defines a function that loads a library into the namespace without any warning or startup messages clobbering my console.

[Line 52]: I often want to autoload the 'dplyr' and 'ggplot2' packages (particularly 'dplyr' as it is now an integral part of my R experience).

[Lines 54-56]: This loads the packages in my "auto.loads" vector if the R session is interactive.

[Lines 58-59]: This creates a new hidden namespace that we can store some functions in. We need to do this in order for these functions to survive a call to "rm(list=ls())" which will remove everything in the current namespace. This is described wonderfully in this blog post.
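A quick illustration of why this matters (a sketch; run after the .Rprofile above has been sourced):

rm(list=ls())          # wipes everything in the global environment...
exists("unfactor")     # ...but returns TRUE: unfactor() lives in the attached .env
unrowname              # still visible via the search path, so it survives the wipe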

[Lines 61-64]: This defines a simple function to remove any row names a data.frame might have. This was stolen from Stephen Turner (which was in turn stolen from plyr).

[Lines 66-70]: This defines a function to sanely undo a "factor()" call. This was stolen from Dason Kurkiewicz.

There are some compelling reasons to abstain from using an R configuration file at all. The most persuasive argument against using it is the portability issue: As you begin to rely more and more on shortcuts and options you define in your .Rprofile, your R scripts will depend on them more and more. If a script is then transferred to a friend or colleague, often it won't work; in the worst case scenario, it will run without error but produce wrong results.
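Here's a minimal sketch of how that silently-wrong scenario can play out, using the stringsAsFactors option from my .Rprofile (relevant on R versions before 4.0.0, where the default was TRUE):

# numbers that arrive as text, e.g. from a messy CSV
d <- data.frame(amount = c("20", "100", "3"))

as.numeric(d$amount)
# with my .Rprofile (stringsAsFactors=FALSE): amount is character -> 20 100 3
# with a vanilla startup (pre-4.0.0 default): amount is a factor  ->  2   1 3
#                                   (the factor codes, not the actual values)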

There are several ways this pitfall can be avoided, though:

• For R sessions/scripts that might be shared or used on systems without your .Rprofile, make sure to start the R interpreter with the --vanilla option, or add/change your shebang lines to "#!/usr/bin/Rscript --vanilla". The "--vanilla" option will tell R to ignore any configuration files. Writing scripts that conform to a vanilla R startup environment is a great thing to do for portability (see the sketch after this list).
• Use your .Rprofile everywhere! This is a bit of an untenable solution because you can't put your .Rprofile everywhere. However, if you put your .Rprofile on GitHub, you can easily clone it on any system that needs it. You can find mine here.
• Save your .Rprofile to another file name and, at the start of every R session where you want to use your custom configuration, manually source the file. This will behave just as it would if it were automatically sourced by R but removes the potential for the .Rprofile to be sourced when it is unwanted.
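As a sketch of the first approach, a script header like this forces a clean startup, which keeps the script honest about its dependencies (the file name and its contents are purely illustrative):

#!/usr/bin/Rscript --vanilla
# portable-analysis.R -- runs with no Rprofile.site, no ~/.Rprofile, no saved workspace

# anything your .Rprofile would normally do must be spelled out here:
options(stringsAsFactors=FALSE)
args <- commandArgs(trailingOnly=TRUE)
dat  <- read.csv(args[1])
print(summary(dat))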

A variation on the manual-sourcing approach is to create a shell alias that starts R with your special configuration. For example, adding a shell alias like this:

alias aR="R_PROFILE_USER=~/.myR/aR.profile R"

will make it so that when "R" is run, it runs as before (without special configuration). In order to have R start up and auto-source your configuration, you now have to run "aR". When 'aR' is run, the shell temporarily sets an environment variable (R_PROFILE_USER) that tells R which configuration script to source. In this case, it will source a config file called aR.profile in a hidden .myR subdirectory of my home folder. This path can be changed to anything, though.

This is the solution I have settled on because it is very easy to live with and it neatly sidesteps the portability concerns.