Take a look, it's in a book: distribution of kindle e-book highlights

If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage books that are discontinued, but the rise of e-reading (and resulting circumvention of privacy) affords us the opportunity to answer related questions.

The kindle e-reading devices allow readers to highlight salient passages of books and optionally share them with Amazon. Amazon then curates these highlights and displays them to readers who opt-in. These are called "popular highlights".

After reading a few books on the Kindle, it's hard not to notice a pattern with popular highlights: they become sparser the further you get into a book. Given my penchant for answering mildly interesting questions with statistics, I couldn't help but analyze and visualize the distribution of these popular highlights.

I organized the location of the 10 most popular highlights of 64 books (21 fiction and 43 non-fiction) along with the location of the end of the book (this doesn't include the index, notes, and references of non-fiction books) and loaded it into R:

library(dplyr)

ebook.frame <- read.csv("./ebooks.csv",
                        stringsAsFactors=FALSE)

ebook.frame <- ebook.frame %.%
  mutate(normalized=location/end.location)

In order to meaningfully compare locations across books, I needed to express each location as a percentage of the total length of the book. Let's use ggplot2 to visualize the distribution of where the popular highlights appear across all books:

library(ggplot2)
ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, fill="#0072B2", alpha=.8) +
  labs(title="Distribution of e-book highlights\n") +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlights.png")

Distribution Of E-Book Highlights

Distribution Of E-Book Highlights

Before we go on, it's important to express a few words of warning...
These books are not a proper sample of all kindle e-books; since these books came from my personal collection, books on science and philosophy are oversampled, books about vampires are woefully underrepresented, and there is far more Janet Evanovich than chance would dictate. Because of this, any insights gleaned from these data (to the extent that these data offer any) are only applicable to the reading habits of a certain type of e-reader, namely, boring ones that don't like to have fun.

The spreadsheet I loaded also contained a logical field representing whether the book was fiction. We can take a look at the differences in the highlight locations between fiction and non-fiction books thusly:

ggplot(ebook.frame, aes(x=normalized)) +
  geom_density(adjust=2, aes(fill=factor(fiction)), 
                             alpha=.5) +
  labs(title="Distribution of e-book highlights\n") +
  scale_fill_discrete(labels=c(“non-fiction",
                               "fiction")) +
  xlab("location in book (percent)") +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
  guides(fill=guide_legend(title=NULL))
ggsave(file="DistributionOfEBookHighlightsFictionDistinction.png")

Distribution Of E-Book Highlights Fiction Distinction

Distribution Of E-Book Highlights Fiction Distinction

It would appear as if non-fiction books have a less uniform distribution of popular highlights. There are likely many causes for this, but one explanation could be that the reader is less likely to make it to the end of a non-fiction book.

In order to make some quantifiable claims, let's look at the empirical cumulative distribution function:

ggplot(ebook.frame, aes(normalized, colour=factor(fiction))) +
  stat_ecdf() +
  labs(title="Cumulative distribution of e-book highlights\n") +
  scale_colour_discrete(labels=c("non-fiction", "fiction")) +
  xlab("location in book (percent)") +
  ylab("cumulative percentage of highlights") +
  theme(legend.title=element_blank())
ggsave(file="CumulativeDistributionofEbookHighlights.png")

Cumulative Distribution of E-book Highlights

Cumulative Distribution of E-book Highlights

Interestingly, for non-fiction books, a full 75% of the highlights are contained in the first 25% percent of the book; not quite pareto, but close).

Before we come to any conclusions regarding the proportion of readers that make it through a book, let's check our assumptions:

  • e-readers that highlight passages (and choose to share them with Amazon) behave just like e-readers that don't
  • salient passages are uniformly distributed throughout a book and, thus, the distribution of highlights is uniform across the entire length of the read portion of a book
  • the fact that a passage was already highlighted by many e-readers has no bearing on the reader’s decision to highlight the same passage

These assumptions don't hold up to critical scrutiny. Nevertheless, these results serve as strong evidence that at least some e-books go unfinished. As for the percentage of books that go unfinished, perhaps Amazon is in a better position to answer that question.

share this: facebooktwittergoogle_plusredditpinterestlinkedintumblrmail

4 Comments

  1. Alex April 11, 2014 4:07 pm Reply

    There's also the implicit assumption that e-readers are equally likely to highlight salient passages regardless of their position in the book. I suspect people are less likely to take the time to highlight passages the further along they get in the book (especially for non-fiction).

    http://thousandfold.net/cz

  2. [email protected] April 12, 2014 12:14 am Reply

    That's a really good point. Is that much different than the assumption that the distribution of highlights is uniform across the entire length of the read portion of a book, though?

  3. Ioanides April 15, 2014 9:32 am Reply

    I think this last proposed assumption might be a bit bold as well if the books are anything like movies (or video-games) and put more energy into hooking you in the first part of the experience while the mid and last part might be less ambitious in maintaining the interest very high.
    Actually I've read quite some books that started really well, with a tall idea, and kept on without much gusto as the initial idea wasn't supposed to develop much.

    • [email protected] April 16, 2014 11:59 pm Reply

      I've read a couple of books like that, too. It is a pretty bold assumption, isn't it?
      Thanks for making that point!

Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">