Kickin' it with elastic net regression

With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the L2 norm (the Euclidean length) of the coefficient vector, which results in "shrinking" the beta coefficients. The aggressiveness of the penalty is controlled by a parameter \lambda.

Lasso regression is a related regularization method. Instead of the L2 norm, though, it penalizes the L1 norm (the Manhattan length) of the coefficient vector. Because it uses the L1 norm, some of the coefficients will shrink all the way to zero as \lambda increases. A similar effect would be achieved in Bayesian linear regression by placing a Laplacian prior (strongly peaked at zero) on each of the beta coefficients.

Because some of the coefficients shrink to exactly zero, the lasso doubles as a crackerjack feature selection technique in addition to a solid shrinkage method. This property gives it a leg up on ridge regression. On the other hand, the lasso will occasionally achieve poor results when there's a high degree of collinearity in the features, and ridge regression will perform better. Further, when the number of predictors exceeds the number of observations, the lasso can select at most as many predictors as there are observations and its solution may not be unique, whereas ridge regression handles this case without complaint.
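To make the shrinkage-versus-selection contrast concrete, here's a quick illustration on simulated data (a toy example of my own, not part of the analyses below) using the glmnet package, which I use for all the elastic net models in this post:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)                 # 10 predictors, only 3 matter
y <- as.vector(x[, 1:3] %*% c(3, -2, 1) + rnorm(100))

ridge <- glmnet(x, y, alpha=0)                        # pure L2 penalty
lasso <- glmnet(x, y, alpha=1)                        # pure L1 penalty

# at a moderately aggressive lambda, count coefficients that are exactly zero
sum(coef(ridge, s=0.5)[-1] == 0)    # typically 0 -- ridge shrinks but never zeroes out
sum(coef(lasso, s=0.5)[-1] == 0)    # typically several -- the lasso drops features entirely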

Elastic net regression is a hybrid approach that blends both penalization of the L2 and L1 norms. Specifically, elastic net regression minimizes the following...

\lVert y - X\beta \rVert^2 + \lambda[(1-\alpha)\lVert \beta \rVert_2^2 + \alpha\lVert \beta \rVert_1]

The \alpha hyper-parameter is between 0 and 1 and controls how much L2 or L1 penalization is used (0 is ridge, 1 is lasso).

The usual approach to optimizing the lambda hyper-parameter is through cross-validation—by minimizing the cross-validated mean squared prediction error—but in elastic net regression, the optimal lambda hyper-parameter is also heavily dependent on the alpha hyper-parameter (hyper-hyper-parameter?).

This blog post takes a cross-validated approach that uses grid search to find the optimal alpha hyper-parameter while also optimizing the lambda hyper-parameter, for three different data sets. I also compare the performance against stepwise regression and showcase some of the dangers of using stepwise feature selection.
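In glmnet terms, the approach looks roughly like this (a minimal sketch rather than the exact code from my simulations; x and y stand in for whatever design matrix and response you're modeling):

library(glmnet)

alphas <- seq(0, 1, by=0.1)   # 0 = ridge, 1 = lasso

# for each alpha, let cross-validation pick the best lambda and record
# the cross-validated MSE at that lambda
cv.mses <- sapply(alphas, function(a){
  cvfit <- cv.glmnet(x, y, alpha=a, nfolds=nrow(x))   # nfolds = n gives LOOCV
  min(cvfit$cvm)
})

best.alpha <- alphas[which.min(cv.mses)]

With LOOCV, every alpha sees exactly the same folds; if you use k-fold CV instead, it's worth passing a fixed foldid to cv.glmnet() so the comparison across alphas isn't muddied by fold-to-fold noise.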

mtcars

In this example, I try to predict “miles per gallon” from the other available attributes. The design matrix has 32 observations and 10 predictors and there is a high degree of collinearity (as measured by the variance inflation factors).

mtcars and elastic net regression

The left panel above shows the leave-one-out cross validation (LOOCV) mean squared error of the model with the optimal lambda (as determined again by LOOCV) for each alpha parameter from 0 to 1. This panel indicates that if our objective is purely to minimize MSE (with no regard for model complexity), then pure ridge regression outperforms any blended elastic-net model. This is probably because of the substantial collinearity. Interestingly, the lasso outperforms blended elastic net models that weight the lasso heavily.

The right panel puts things in perspective by plotting the LOOCV MSEs along with the MSE of the "kitchen sink" regression (the blue line) that includes all features in the model. As you can see, any degree of regularization offers a substantial improvement in model generalizability.

Also plotted are two estimates of the MSE for models that blindly use the coefficients from automated bi-directional stepwise regression. The first (the red line) uses the features selected by performing the stepwise procedure on the whole dataset and then assesses the model's performance. The second estimate runs the step procedure, and selects its features, on only the training set within each fold of the cross-validation. This is the estimate without the subtle but treacherous "knowledge leaking" eloquently described in this blog post, and it should be considered the more correct assessment of the model. As you can see, if we weren't careful about how we evaluated the stepwise regression, we would have gotten an incredibly inflated and inaccurate view of its performance.
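To make the difference between the two estimates concrete, here's a rough sketch of how they can be computed (illustrative code of my own rather than the script I actually ran; caret's createFolds() is just a convenient fold-maker):

library(caret)   # only for createFolds(); any way of building folds works

cv.stepwise.mse <- function(data, response, k=10, leaky=FALSE){
  folds <- createFolds(data[[response]], k=k)        # list of test-set indices
  full.formula <- as.formula(paste(response, "~ ."))

  # the "leaky" estimate selects features once, on ALL of the data
  if(leaky)
    leaked.model <- step(lm(full.formula, data=data), direction="both", trace=0)

  errs <- sapply(folds, function(test.idx){
    train <- data[-test.idx, ]
    test  <- data[test.idx, ]
    fit <- if(leaky){
             # refit the already-chosen features on the training fold
             lm(formula(leaked.model), data=train)
           } else {
             # honest: redo the feature selection inside each training fold
             step(lm(full.formula, data=train), direction="both", trace=0)
           }
    mean((test[[response]] - predict(fit, newdata=test))^2)
  })
  mean(errs)
}

# e.g., cv.stepwise.mse(mtcars, "mpg", leaky=TRUE) will typically look far
# rosier than the honest cv.stepwise.mse(mtcars, "mpg")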

Forest Fires

The second example uses a very-difficult-to-model dataset from the University of California, Irvine machine learning repository. The task is to predict the burnt area of a forest fire from 11 predictors, and there are 517 observations. Further, there is a relatively low degree of collinearity among the predictors.

fireplot

Again, the highest performing model is the pure ridge regression. This time, the performance asymptotes as the alpha hyper-parameter increases. The variability in the MSE estimates is due to the fact that I didn't use LOOCV and used 400-fold CV instead because I'm impatient.

As with the last example, the properly measured stepwise regression performance isn't so great, and the kitchen sink model outperforms it. However, in contrast to the previous example, there was a lot less variability in the selected features across folds—this is probably because of the significantly larger number of observations.

"QuickStartExample"

This dataset is a contrived one that is included with the excellent glmnet package (the one I'm using for the elastic net regression). It has a relatively low degree of collinearity, 20 features, and 100 observations. I have no idea how the package authors created this dataset.

quickstartplot

Finally, an example where the lasso outperforms ridge regression! I think this is because the dataset was specifically manufactured to have a small number of genuine predictors with large effects (as opposed to many weak predictors).

Interestingly, stepwise regression far outperforms both—probably for the very same reason. From fold to fold, there was virtually no variation in the features that the stepwise method automatically chose.

Conclusion

So, there you have it. Elastic net regression is awesome because it can perform, at worst, as well as the lasso or ridge and—though it didn't on these examples—can sometimes substantially outperform both.

Also, be careful with step-wise feature selection!

PS: If, for some reason, you are interested in the R code I used to run these simulations, you can find it on this GitHub Gist.


I'm all about that bootstrap ('bout that bootstrap)

As some of my regular readers may know, I'm in the middle of writing a book on introductory data analysis with R. I'm now at the point in the writing where I have to make some hard choices about how I'm going to broach the topic of statistical inference and hypothesis testing.

Given the current climate against NHST (the journal Basic and Applied Social Psychology banned it) and my own personal preferences, I wasn't sure just how much to focus on classical hypothesis testing.

I didn't want to burden my readers with spending weeks trying to learn the intricacies of NHST, only to be told to forget everything they know about it and that they can't use it without people making fun of them.

So I posed a question to twitter: "Is it too outlandish to not include the topic of parametric HTs in an intro book about data analysis. Asking for a friend.. named Tony…. You know, in favor of bootstrapped CIs, permutation tests, etc…"

To which my friend Zach Jones (@JonesZM) replied: "they could at least be better integrated with monte-carlo methods. i think they'd make it easier to understand". I agreed, which is why I'm proceeding with my original plan to introduce classical tests after and within the context of Monte Carlo bootstrapping (as opposed to exhaustive bootstrapping).

Even though I'm a huge fan of the bootstrap, I want to be careful not to further any misconceptions about it—chiefly, that bootstrapping is a cure-all for having a small sample size. To show how this isn't the case, I wrote an R script to take 1,000 samples from a population, calculate 95% confidence intervals using various methods, and record the proportion of times the population mean was within the CIs (a stripped-down sketch of the script appears after the list below).

The four ways I created the CIs were:

  • the z interval method: which assumes that the sampling distribution is normal around the sample mean (1.96 * the standard error)
  • the t interval method: which assumes that the population is normally distributed and the sampling distribution is normally distributed around the sample mean (t-distribution quantile at .975 [with appropriate degrees of freedom] * standard error)
  • basic bootstrap CI estimation (with boot() and boot.CI() from the boot R package)
  • adjusted percentile CI estimation (with boot() and boot.CI() from the boot R package)
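Putting those four methods together, the simulation looked roughly like this (a stripped-down sketch rather than the exact script I ran; the helper names are my own):

library(boot)

# estimate coverage: draw `n.samples` samples of size `n` from a population,
# build a 95% CI with each method, and record how often the true mean is caught
coverage <- function(n, rpop, pop.mean, n.samples=1000, R=2000){
  hits <- c(z=0, t=0, basic=0, bca=0)
  for(i in seq_len(n.samples)){
    a.sample <- rpop(n)
    xbar <- mean(a.sample)
    se   <- sd(a.sample) / sqrt(n)

    z.ci <- xbar + c(-1, 1) * qnorm(.975) * se           # z interval
    t.ci <- xbar + c(-1, 1) * qt(.975, df=n-1) * se      # t interval

    b   <- boot(a.sample, function(x, i){mean(x[i])}, R=R)
    cis <- boot.ci(b, type=c("basic", "bca"))            # basic + adjusted percentile
    basic.ci <- cis$basic[4:5]
    bca.ci   <- cis$bca[4:5]

    caught <- function(ci) pop.mean >= ci[1] && pop.mean <= ci[2]
    hits <- hits + c(caught(z.ci), caught(t.ci), caught(basic.ci), caught(bca.ci))
  }
  hits / n.samples
}

# e.g., coverage of samples of size 10 from the very non-normal beta(0.5, 0.5)
coverage(10, function(n) rbeta(n, 0.5, 0.5), pop.mean=0.5)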

I did this for various sample sizes and two different distributions, the normal and the very non-normal beta distribution (alpha=0.5, beta=0.5). Below is a plot depicting all of this information.

Accuracy of different CIs

So, clearly the normal (basic) boot doesn’t make up for small sample sizes.

It's no surprise that the t interval method blows everything else out of the water when sampling from a normal distribution. It even performs reasonably well with the beta distribution, although the adjusted bootstrap wins out for most sample sizes.

In addition to recording the proportion of the times the population mean was within the confidence intervals, I also kept track of the range of these intervals. All things being equal, narrower intervals are far preferable to wide ones. Check out this plot depicting the mean ranges of the estimated CIs:

Mean ranges for different CIs

The t interval method always produces huge ranges.

The adjusted bootstrap produces ranges that are more or less on par with the other three methods BUT it outperforms the t interval method for non-normal populations. This suggests that the adjustments to the percentiles of the bootstrap distribution do a really good job of correcting for bias. It also shows that, if we are dealing with a non-normal population (common!), we should use adjusted percentile bootstrapped CIs.

Some final thoughts:

  • The bootstrap is not a panacea for small sample sizes
  • The bootstrap is cool because it doesn’t assume anything about the population distribution, unlike the z and t interval methods
  • Basic bootstrap intervals are whack. They’re pathologically narrow for small sample sizes.
  • Adjusted percentile intervals are great! You should always use them instead. Thanks Bradley Efron!

Also, if you're not using Windows, you can parallelize your bootstrap calculations really easily in R; below is the way I bootstrapped the mean for this project:

library(boot)

# bootstrap the sample mean with 10,000 resamples, splitting the work across
# 4 cores (the "multicore" backend isn't available on Windows)
dasboot <- boot(a.sample, function(x, i){mean(x[i])}, R=10000,
                parallel="multicore", ncpus=4)


which uses 4 cores to perform the bootstrap in almost one fourth the time.

In a later post, I plan to further demonstrate the value of the bootstrap by testing differences in means, and to show why permutation tests comparing the means of two samples are always better than t-testing.


Why is my OS X Yosemite install taking so long?: an analysis

Why?
Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, complaints have been springing up online about the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang; it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process, and wanting to learn why the installation was taking longer than the progress bar predicted, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

Cleaning
The log file from the Yosemite installer wasn't in a format that R (or any other program) could handle natively, so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: Perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl

use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with no message or a blank message
    next if !defined($message) || $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}

We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./clean-log.pl >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.

Analysis

Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate's "parse_date_time()" function:

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
  mutate(lub.time=parse_date_time(nice.date, 
                                  "%b %d! %Y! %H!:%M!:%S!", 
                                  tz="EST"))

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]

head(yos.log)


##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                    Message
## 1                   opendirectoryd (build 382.0) launched - installer mode
## 2                                  Logging level limit changed to 'notice'
## 3                                               Initialize trigger support
## 4 created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5                                set default handler for RPC 'reset_cache'
## 6                           set default handler for RPC 'reset_statistics'
##              nice.date            lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))
## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting the start time from each row's time:

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log, and which services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for the percentage of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
## 
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and to the total time of the installation. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process' as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service",
                                       "percent.n",
                                       "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position="dodge") +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))

Breakdown

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%
  head(n=5)


##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                    System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                       kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                                   Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the new system directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc.). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")

density

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
  if(time > start && time < end)
    return(TRUE)
  return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in,
                         the.start,
                         the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function (x){cat("\n");cat(x);cat("\n")})

## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## 
## WARNING : ensureParentPathExists: Created  `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
## 
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log file writes by process. This might give us a sense of what steps go on as the installation progresses by showing us which processes are most active.

# reduce the number of services to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd",
                        "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes( y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")

facet

This shows that no one process runs consistently throughout the entire installation; rather, the processes run in spurts.

The answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed in the '/usr/local/' and '/Library/' directories that wasn't Apple-supplied needed to be moved out and then moved back.

/usr/local/
/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom compiled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink since it had over 123,491 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

/Library/
Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl use. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you either download and install yourself, or use MacPorts to install.

---

You can find all the code and logs that I used for this analysis in this git repository.

This post is also available as an RMarkdown report here.


Take a look, it's in a book: distribution of kindle e-book highlights

If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage of books that are discontinued, but the rise of e-reading (and the resulting erosion of privacy) affords us the opportunity to answer related questions.

Kindle e-reading devices allow readers to highlight salient passages of books and optionally share them with Amazon. Amazon then curates these highlights and displays them to readers who opt in. These are called "popular highlights".

After reading a few books on the Kindle, it's hard not to notice a pattern with popular highlights: they become sparser the further you get into a book. Given my penchant for answering mildly interesting questions with statistics, I couldn't help but analyze and visualize the distribution of these popular highlights.

I organized the locations of the 10 most popular highlights of 64 books (21 fiction and 43 non-fiction), along with the location of the end of each book (which doesn't include the index, notes, and references of non-fiction books), and loaded it all into R:

library(dplyr)

ebook.frame <- read.csv("./ebooks.csv",
                        stringsAsFactors=FALSE)

ebook.frame <- ebook.frame %>%
  mutate(normalized=location/end.location)

In order to meaningfully compare locations across books, I needed to express each location as a percentage of the total length of the book. Let's use ggplot2 to visualize the distribution of where the popular highlights appear across all books:

library(ggplot2)
ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, fill="#0072B2", alpha=.8) +
  labs(title="Distribution of e-book highlights\n") +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlights.png")

Distribution Of E-Book Highlights

Before we go on, it's important to express a few words of warning...
These books are not a proper sample of all kindle e-books; since these books came from my personal collection, books on science and philosophy are oversampled, books about vampires are woefully underrepresented, and there is far more Janet Evanovich than chance would dictate. Because of this, any insights gleaned from these data (to the extent that these data offer any) are only applicable to the reading habits of a certain type of e-reader, namely, boring ones that don't like to have fun.

The spreadsheet I loaded also contained a logical field representing whether the book was fiction. We can take a look at the differences in the highlight locations between fiction and non-fiction books thusly:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, aes(fill=factor(fiction)), 
                             alpha=.5) +
  labs(title="Distribution of e-book highlights\n") +
  scale_fill_discrete(labels=c("non-fiction",
                               "fiction")) +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlightsFictionDistinction.png")

Distribution Of E-Book Highlights Fiction Distinction

It would appear as if non-fiction books have a less uniform distribution of popular highlights. There are likely many causes for this, but one explanation could be that the reader is less likely to make it to the end of a non-fiction book.

In order to make some quantifiable claims, let's look at the empirical cumulative distribution function:

ggplot(ebook.frame, aes(normalized, colour=factor(fiction))) +
  stat_ecdf() +
  labs(title="Cumulative distribution of e-book highlights\n") +
  scale_colour_discrete(labels=c("non-fiction", "fiction")) +
  xlab("location in book (percent)") +
  ylab("cumulative percentage of highlights") +
  theme(legend.title=element_blank())
ggsave(file="CumulativeDistributionofEbookHighlights.png")

Cumulative Distribution of E-book Highlights

Interestingly, for non-fiction books, a full 75% of the highlights are contained in the first 25% of the book (not quite Pareto, but close).

Before we come to any conclusions regarding the proportion of readers that make it through a book, let's check our assumptions:

  • e-readers that highlight passages (and choose to share them with Amazon) behave just like e-readers that don't
  • salient passages are uniformly distributed throughout a book and, thus, the distribution of highlights is uniform across the entire length of the read portion of a book
  • the fact that a passage was already highlighted by many e-readers has no bearing on the reader’s decision to highlight the same passage

These assumptions don't hold up to critical scrutiny. Nevertheless, these results serve as strong evidence that at least some e-books go unfinished. As for the percentage of books that go unfinished, perhaps Amazon is in a better position to answer that question.
