Playing around with #rstats twitter data

As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets that contain the hashtag "#rstats" that denotes that a tweeter is tweeting about R.

As a warning, I don't know much about how these data were collected, whether it was collected and random times during the day or whether it was biased toward particular times and, therefore, locations. I wouldn't really read too much into this.

Most common co-occuring hashtags
When a tweet uses a hashtag at all, it very often uses more than one. To extract the co-occuring hashtags, I used the following perl script:


    $_ = lc($_);
    $_ =~ s/#rstats//g;
    my @matches;
    push @matches, /(#\w+)/;
    print join "\n" => @matches if @matches;

which uses the regular expression "(#\w+)" to search for hashtags after removing "#rstats" from every tweet.

On the unix command-line, I put these other hashtags into a file and sorted via these commands:

cat data/R-hashtag-data.txt | ./ | tee other-hashtags.txt

sort other-hashtags.txt | uniq -c | sort -n -r > sorted-other-hashtags.txt

After running these commands, I get a numbered list of co-occuring hashtags, sorted in descending order. The top 10 co-occuring hashtags were as follows (you can see the rest here :

5258 #datascience
1665 #python
1625 #bigdata
1542 #r
1451 #dataviz
1360 #ggplot2
 852 #statistics
 783 #dplyr
 749 #machinelearning
 743 #analytics

Neat-o. The presence of "#python" and "#ggplot2" in the top 10 made me wonder what the top 10 programming language and R package related hashtags were. Here they are, respectively:

1665 #python
 423 #d3js (plus 72 for #d3) (plus 2 for #js)
 343 #sas
 312 #julialang (plus 43 for #julia)
 240 #fsharp
 140 #spss  (plus 7 for #ibmspss)
 102 #stata
  75 #matlab
  55 #sql
  38 #java

1360 #ggplot2  (plus 298 for ggplot)  (plus for 6 #gglot2) (plus 4 for #ggpot)
 783 #dplyr
 663 #shiny
 557 #rcpp (plus 22 for rcpp11)
 251 #knitr
 156 #magrittr
 105 #lme4
  93 #ggvis   (plus 11 for #ggivs)
  65 #datatable
  46 #rneo4j

You can view the full list here and here.

I was happy to see my favorite languages (python, perl, clojure, lisp, haskell, c) besides R being represented in the first list. Additionally, most of my favorite packages were fairly well tweeted about--at least as far as hashtags-applied-to-a-package go.

Before moving on to the next section, I wanted to share my favorite co-occuring hashtags that I found while sifting through the data: #rcatladies, #rdogfella, #bayesianbootycall, #dontbeaplyrhater, #overlyhonestmethods, #rickshaw (??), #statafail, and #monkeysinfrontoftypewriters.

Most prolific #rstats tweeters
One of the first things I did with these data is a simple aggregation and sort to find the tweeters that used the hashtag most often:

  group_by(User) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) -> prolific.rstats.tweeters

Here is the top 10 (you can see the rest here.)

@Rbloggers	1081
@hadleywickham	498
@timelyportfolio	427
@recology_	419
@revodavid	210
@chlalanne	209
@adolfoalvarez	199
@RLangTip	175
@jmgomez	160

Nothing terribly surprising here.

Normalizing by total tweets
In a twitter discussion about these data, a twitter friend Tim Hopper posited that though he had fewer #rstats tweets than another mutual friend, Trey Causey, he would have a higher number of #rstats tweets if you control for total tweet volume. I wondered how this sorting would look.

Answering this question gave me an excuse to use Hadley Wickham's new package, rvest (I literally just got why the package is named as much while typing this out) which makes web scraping easier--in part by leveraging the expressive power of the magrittr package.

To get the total number of tweets for a particular tweeter, I wrote the following function:

get.num.tweets <- function(handle){
    unraw <- function(raw_str){
      raw_str <- sub(",", "", raw_str)    # remove commas if any
      if(grepl("K", raw_str)){
        return(as.numeric(sub("K", "", raw_str))*1000)   # in thousands
    html(paste0("", sub("@", "", handle))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%
      html_text() %>%

The real logic (and beauty) of which is contained only in the last few lines:

    html(paste0("", sub("@", "", TWITTER_HANDLE))) %>%
      html_nodes(".is-active .ProfileNav-value") %>%

The CSS element that houses the number of total tweets from a useR's twitter page was found easily using SelectorGadget.

After scraping the number of tweets for almost 10,000 #rstats tweeters (waiting a few seconds between each request because I'm considerate) I divided number of #rstats tweets by the total number of tweets to come up with a normalized value.

The top 10 tweeteRs were as follows:

              User count num.of.tweets     ratio 
1     @medzihorsky     9            28 0.3214286 
2        @statworx     5            16 0.3125000 
3    @LearnRinaDay   114           404 0.2821782 
4  @RforExcelUsers     4            15 0.2666667 
5     @showmeshiny    27           102 0.2647059 
6           @tcrug     6            25 0.2400000 
7   @DailyRpackage   155           666 0.2327327 
8   @R_Programming    49           250 0.1960000 
9        @hexadata     8            41 0.1951220 
10     @Deep_RHelp    11            58 0.1896552 

In case you were wondering, Trey Causey still "won" by a long shot:

> tweeters[which(tweeters$User=="@tdhopper"),]   
Source: local data frame [1 x 4]                 
       User count num.of.tweets        ratio     
1 @tdhopper     8         26700 0.0002996255     
> tweeters[which(tweeters$User=="@treycausey"),] 
Source: local data frame [1 x 4]                 
         User count num.of.tweets      ratio     
1 @treycausey    50         28700 0.00174216

Before ending this post, I feel compelled to issue an almost certainly unnecessary but customary warning against using number of #rstats tweets as a proxy for who likes R the most or who are the biggest R "thought leaders" (whatever that is). Most tweets about R don't use the #rstats hashtag, anyway.

Again, I would't read too much into this :)

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Assertive R programming in dplyr/magrittr pipelines

A lot of my job–and side projects, for that matter–involve running R scripts on updates of open government data. While I’m infinitely grateful to have access to any interesting open datasets in the first place, I can’t ignore that dealing with open data is often a messy affair. In fact this seems to be characteristic of most data sets I work with, open access or otherwise.

So... let's say I have a labyrinthine analysis workflow that uses a wide array of government sources to answer an interesting question. The workflow is full of analyses that return components that are components of still other analyses.

Then there’s an update of the data! Whoopee! I rerun the scripts/workflow on updated (or partially updated) data. Then one of four things happen:

  • In the best case scenario, everything works because there were no errors in the data.
  • In the likely scenario, something very late in this labyrinthine analysis workflow breaks and it’s not clear what datum caused this error.
  • In the worst case scenario, nothing breaks and the error is only caught when the results–or part of them–are nonsensical.
  • In the worst worst case scenario, the results or some of the results are wrong but it looks ok and it goes undetected.

In an effort to help solve this common problem–and inspired by the elegance of dplyr/magrittr pipelines–I created a R package called assertr.

assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs, over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.

Take, for example, the following contrived snippet making sure that there are only 0s and 1s (automatic and manual, respectively) in R’s Motor Trend Car Road Test built-in dataset before calculating the average miles per gallon per category.

mtcars %>%
  verify(am %in% c(0,1)) %>%
  group_by(cyl) %>%

#   am     mean.mpg
#   0      17.14737
#   1      24.39231

Let’s say this dataset was much bigger, not built in to R, and curated and disseminated by someone with less perfectionistic (read obsessive/compulsive) tendencies than yours truly. If we wanted to find the average miles per gallon aggregated by number of engine cylinders, we might first want to check if the number of cylinders is reasonable (either 4, 6, or 8) and that the miles per gallon was a reasonable number (between 10 and 40 mpg) and not a data entry error that would greatly throw off our non-robust estimator:

mtcars %>%
  assert(in_set(4, 6, 8), cyl) %>%
  assert(within_bounds(10, 40), mpg) %>%
  group_by(cyl) %>%

#  cyl   mean.mpg
#   4     26.66364
#   6     19.74286
#   8     15.10000

Perhaps one day there will be cars that have more than 8 cylinders or less than 2. We might want to only check if there are an even number of cylinders (since it has to be even, I think); we can change the first assert line to:

assert(function(x) x%%2==0, cyl) %>%

assertr subscribes to the general idea that it is better to fail fast to spot data errors early. The benefit of assertr’s particular approach is that it’s friendly to the pipeline paradigm used by magrittr and dplyr.

The best thing about assertr’s approach, though, is that it forces you to state your assumptions up front. When your assumptions are stated clearly and verified, errors from messy data tend to disappear.

To learn more about assertr and the kinds of assertions that you can make with it, visit its page on github.

You can also read the vignette here.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Why is my OS X Yosemite install taking so long?: an analysis

Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang, it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process and wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.


use strict;
use warnings;

# read from stdin
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;

    # skip lines with blank messages
    next if $message =~ m/^\s*$/;

    my ($month, $day, $time, $machine, $service) = split " ", $not_message;

    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";

We can output the cleaned log file with this shell command:

echo "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']:  ' | grep -v ': }' |  ./ >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.


Let's set some options and load the libraries we are going to use:

# options

# libraries

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s "parse_date_time()" function

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(, Day, "2014", Time)) %>%
                                  "%b %d! %Y! %H!:%M!:%S!", 

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!$lub.time),]


##   Month Day     Time   Machine        Service
## 1   Oct  18 11:28:23 localhost opendirectoryd
## 2   Oct  18 11:28:23 localhost opendirectoryd
## 3   Oct  18 11:28:23 localhost opendirectoryd
## 4   Oct  18 11:28:23 localhost opendirectoryd
## 5   Oct  18 11:28:23 localhost opendirectoryd
## 6   Oct  18 11:28:23 localhost opendirectoryd
##                                                                    Message
## 1                   opendirectoryd (build 382.0) launched - installer mode
## 2                                  Logging level limit changed to 'notice'
## 3                                               Initialize trigger support
## 4 created endpoint for mach service ''
## 5                                set default handler for RPC 'reset_cache'
## 6                           set default handler for RPC 'reset_statistics'
##                lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting each row's time by the start time

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log and what services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentange of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))

## Source: local data frame [8 x 5]
##           Service     n totalTime percent.n percent.totalTime
## 1     OSInstaller 42400 1586 secs 0.9197197          0.867615
## 2  opendirectoryd  3263   43 secs 0.0707794          0.023523
## 3         Unknown   236  157 secs 0.0051192          0.085886
## 4  _mdnsresponder    52   17 secs 0.0011280          0.009300
## 5              OS    49    1 secs 0.0010629          0.000547
## 6 diskmanagementd    47    7 secs 0.0010195          0.003829
## 7     storagekitd    29    2 secs 0.0006291          0.001094
## 8         configd    25   15 secs 0.0005423          0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and to the total time of the installation. "opendirectoryd" was the next most verbose process, but its processes were relatively quick compared to the "Unknown" process' as evidenced by "Unknown" taking almost 4 times longer, in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt([,c("Service",

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position = "dodge",) +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))


As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%

##       Service
## 1 OSInstaller
## 2 OSInstaller
## 3     Unknown
## 4 OSInstaller
## 5 OSInstaller
##                                                                                                                                            Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2                                    System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3                       kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4                                                                   Folder Manager is being asked to create a folder (down) while running as uid 0
## 5                                                                                                                      Checking catalog hierarchy.
##    elapsed
## 1 169 secs
## 2 149 secs
## 3  70 secs
## 4  46 secs
## 5  44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the newsystem directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc..). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")


This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE <- function(time, start, end){
  if(time > start && time < end)

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval <- sapply(yos.log$lub.time,,

# extract only these rows
in.interval <- yos.log[, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function (x){cat("\n");cat(x);cat("\n")})

## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/ -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
## WARNING : ensureParentPathExists: Created  `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log files writes by process. This might give us a sense of what steps go on as the installation progresses by showing us with processes are most active.

# reduce number of service to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd",
                        "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes( y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")


This shows that no one process runs consistently throughout the entire installation process, but rather that the process run in spurts.

the answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed '/usr/local' and '/Library' directories that wasn't Apple-supplied needed to be moved and moved back.

/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom compliled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink since it had over 123,491 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl uses. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and it will appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you either download and install yourself, or use MacPorts to install.


You can find all the code and logs that I used for this analysis in this git repository

This post is also available as a RMarkdown report here

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

Take a look, it's in a book: distribution of kindle e-book highlights

If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage books that are discontinued, but the rise of e-reading (and resulting circumvention of privacy) affords us the opportunity to answer related questions.

The kindle e-reading devices allow readers to highlight salient passages of books and optionally share them with Amazon. Amazon then curates these highlights and displays them to readers who opt-in. These are called "popular highlights".

After reading a few books on the Kindle, it's hard not to notice a pattern with popular highlights: they become sparser the further you get into a book. Given my penchant for answering mildly interesting questions with statistics, I couldn't help but analyze and visualize the distribution of these popular highlights.

I organized the location of the 10 most popular highlights of 64 books (21 fiction and 43 non-fiction) along with the location of the end of the book (this doesn't include the index, notes, and references of non-fiction books) and loaded it into R:


ebook.frame <- read.csv("./ebooks.csv",

ebook.frame <- ebook.frame %.%

In order to meaningfully compare locations across books, I needed to express each location as a percentage of the total length of the book. Let's use ggplot2 to visualize the distribution of where the popular highlights appear across all books:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, fill="#0072B2", alpha=.8) +
  labs(title="Distribution of e-book highlights\n") +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +

Distribution Of E-Book Highlights

Distribution Of E-Book Highlights

Before we go on, it's important to express a few words of warning...
These books are not a proper sample of all kindle e-books; since these books came from my personal collection, books on science and philosophy are oversampled, books about vampires are woefully underrepresented, and there is far more Janet Evanovich than chance would dictate. Because of this, any insights gleaned from these data (to the extent that these data offer any) are only applicable to the reading habits of a certain type of e-reader, namely, boring ones that don't like to have fun.

The spreadsheet I loaded also contained a logical field representing whether the book was fiction. We can take a look at the differences in the highlight locations between fiction and non-fiction books thusly:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, aes(fill=factor(fiction)), 
                             alpha=.5) +
  labs(title="Distribution of e-book highlights\n") +
                               "fiction")) +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +

Distribution Of E-Book Highlights Fiction Distinction

Distribution Of E-Book Highlights Fiction Distinction

It would appear as if non-fiction books have a less uniform distribution of popular highlights. There are likely many causes for this, but one explanation could be that the reader is less likely to make it to the end of a non-fiction book.

In order to make some quantifiable claims, let's look at the empirical cumulative distribution function:

ggplot(ebook.frame, aes(normalized, colour=factor(fiction))) +
  stat_ecdf() +
  labs(title="Cumulative distribution of e-book highlights\n") +
  scale_colour_discrete(labels=c("non-fiction", "fiction")) +
  xlab("location in book (percent)") +
  ylab("cumulative percentage of highlights") +

Cumulative Distribution of E-book Highlights

Cumulative Distribution of E-book Highlights

Interestingly, for non-fiction books, a full 75% of the highlights are contained in the first 25% percent of the book; not quite pareto, but close).

Before we come to any conclusions regarding the proportion of readers that make it through a book, let's check our assumptions:

  • e-readers that highlight passages (and choose to share them with Amazon) behave just like e-readers that don't
  • salient passages are uniformly distributed throughout a book and, thus, the distribution of highlights is uniform across the entire length of the read portion of a book
  • the fact that a passage was already highlighted by many e-readers has no bearing on the reader’s decision to highlight the same passage

These assumptions don't hold up to critical scrutiny. Nevertheless, these results serve as strong evidence that at least some e-books go unfinished. As for the percentage of books that go unfinished, perhaps Amazon is in a better position to answer that question.

share this: Facebooktwittergoogle_plusredditpinterestlinkedintumblrmail