## I'm all about that bootstrap ('bout that bootstrap)

As some of my regular readers may know, I'm in the middle of writing a book on introductory data analysis with R. I'm now at the point in the writing where I have to make some hard choices about how to broach the topic of statistical inference and hypothesis testing.

Given the current climate against NHST (the journal Basic and Applied Social Psychology banned it) and my own personal preferences, I wasn't sure just how much to focus on classical hypothesis testing.

I didn't want to burden my readers with spending weeks learning the intricacies of NHST only to be told to forget everything they know about it, and to find they can't use it without people making fun of them.

So I posed a question to twitter: "Is it too outlandish to not include the topic of parametric HTs in an intro book about data analysis. Asking for a friend.. named Tony…. You know, in favor of bootstrapped CIs, permutation tests, etc…"

To which my friend Zach Jones (@JonesZM) replied: "they could at least be better integrated with monte-carlo methods. i think they'd make it easier to understand". I agreed, which is why I'm proceeding with my original plan to introduce classical tests after and within the context of Monte Carlo bootstrapping (as opposed to exhaustive bootstrapping).

Even though I'm a huge fan of the bootstrap, I want to be careful not to further any misconceptions about it—chiefly, that bootstrapping is a cure-all for having a small sample size. To be able to show how this isn’t the case, I wrote an R script to take 1,000 samples from a population, calculate 95% confidence intervals using various methods and record the proportion of times the population mean was within the CIs.

The four ways I created the CIs were:

• the z interval method: which assumes that the sampling distribution is normal around the sample mean (1.96 * the standard error)
• the t interval method: which assumes that the population is normally distributed and the sampling distribution is normally distributed around the sample mean (t-distribution quantile at .975 [with appropriate degrees of freedom] * standard error)
• basic bootstrap CI estimation (with boot() and boot.CI() from the boot R package)
• adjusted percentile CI estimation (with boot() and boot.CI() from the boot R package)
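For concreteness, here's a minimal sketch of how the first two interval types are computed in base R (the sample here is made up for illustration):

```r
# a made-up sample, just to show the mechanics
set.seed(1)
a.sample <- rnorm(10, mean = 0, sd = 1)

n    <- length(a.sample)
xbar <- mean(a.sample)
se   <- sd(a.sample) / sqrt(n)          # standard error of the mean

# z interval: normal quantile at .975 times the standard error
z.interval <- xbar + c(-1, 1) * qnorm(.975) * se

# t interval: t quantile at .975 (with n - 1 degrees of freedom) times the SE
t.interval <- xbar + c(-1, 1) * qt(.975, df = n - 1) * se
```

Since `qt(.975, n - 1)` is always larger than `qnorm(.975)` (≈1.96), the t interval is always the wider of the two for any given sample.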

I did this for various sample sizes and two different distributions, the normal and the very non-normal beta distribution (alpha=0.5, beta=0.5). Below is a plot depicting all of this information.
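A condensed, hypothetical version of that experiment looks something like this (scaled down to 200 trials and 999 bootstrap replicates so it runs quickly; the actual script used 1,000 samples):

```r
library(boot)

set.seed(1)
n.trials <- 200    # samples drawn from the population (post used 1,000)
n        <- 10     # size of each sample
pop.mean <- 0.5    # true mean of a beta(0.5, 0.5) population

hits <- c(z = 0, t = 0, basic = 0, bca = 0)
for (i in 1:n.trials) {
  a.sample <- rbeta(n, 0.5, 0.5)
  se <- sd(a.sample) / sqrt(n)
  z.ci <- mean(a.sample) + c(-1, 1) * qnorm(.975) * se
  t.ci <- mean(a.sample) + c(-1, 1) * qt(.975, n - 1) * se
  b    <- boot(a.sample, function(x, i) mean(x[i]), R = 999)
  cis  <- boot.ci(b, type = c("basic", "bca"))
  hits["z"]     <- hits["z"]     + (z.ci[1] < pop.mean && pop.mean < z.ci[2])
  hits["t"]     <- hits["t"]     + (t.ci[1] < pop.mean && pop.mean < t.ci[2])
  hits["basic"] <- hits["basic"] + (cis$basic[4] < pop.mean && pop.mean < cis$basic[5])
  hits["bca"]   <- hits["bca"]   + (cis$bca[4]   < pop.mean && pop.mean < cis$bca[5])
}
hits / n.trials   # proportion of CIs that contained the population mean
```

A well-calibrated 95% interval should contain the population mean about 95% of the time; the gap between that target and each method's actual coverage is exactly what the plot below shows.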

So, clearly the normal (basic) boot doesn’t make up for small sample sizes.

It's no surprise that the t interval method blows everything else out of the water when sampling from a normal distribution. It even performs reasonably well with the beta distribution, although the adjusted bootstrap wins out for most sample sizes.

In addition to recording the proportion of the times the population mean was within the confidence intervals, I also kept track of the range of these intervals. All things being equal, narrower intervals are far preferable to wide ones. Check out this plot depicting the mean ranges of the estimated CIs:

The t interval method always produces huge ranges.

The adjusted bootstrap produces ranges that are more or less on par with the other three methods, but it outperforms the t interval method for non-normal populations. This suggests that the adjustments to the percentiles of the bootstrap distribution do a really good job of correcting for bias. It also shows that, if we are dealing with a non-normal population (common!), we should use adjusted percentile bootstrapped CIs.

Some final thoughts:

• The bootstrap is not a panacea for small sample sizes
• The bootstrap is cool because it doesn’t assume anything about the population distribution, unlike the z and t interval methods
• Basic bootstrap intervals are whack. They’re pathologically narrow for small sample sizes.
• Adjusted percentile intervals are great! You should always use them instead. Thanks Bradley Efron!

Also, if you're not using Windows, you can parallelize your bootstrap calculations really easily in R; below is the way I bootstrapped the mean for this project:

```r
library(boot)

# bootstrap the sample mean with 10,000 resamples, spread across 4 cores
dasboot <- boot(a.sample, function(x, i) mean(x[i]), R = 10000,
                parallel = "multicore", ncpus = 4)
```

which uses 4 cores to perform the bootstrap in almost one fourth the time.

In a later post, I plan to further demonstrate the value of the bootstrap by testing differences in means, and to show why permutation tests comparing means between two samples are always better than t-tests.
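As a preview, a two-sample permutation test is only a few lines of R. This is a hypothetical sketch with made-up data: pool the two groups, repeatedly shuffle the group labels, and see how often a shuffled difference in means is at least as extreme as the observed one:

```r
set.seed(1)
group.a <- rnorm(20, mean = 0)   # made-up data for illustration
group.b <- rnorm(20, mean = 1)
observed <- mean(group.b) - mean(group.a)

pooled  <- c(group.a, group.b)
n.perms <- 10000
perm.diffs <- replicate(n.perms, {
  shuffled <- sample(pooled)                    # shuffle group labels
  mean(shuffled[21:40]) - mean(shuffled[1:20])  # difference under the null
})

# two-sided p-value: proportion of shuffled differences at least as extreme
p.value <- mean(abs(perm.diffs) >= abs(observed))
```

Note that nothing here assumes normality; the null distribution is built entirely from the data.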

## On the misinterpretation of p-values

First, let me start off by saying I'm a classical statistics and p-value apologist--I think it's the cat's pajamas. It was my (and many others') first introduction to statistics. So, in spite of my being a card-carrying member of The Bayesian Conspiracy, there will always be a place in my heart (grinch-sized though it is) for classical statistics.

That being said, I think that, often, the classical statistics' approach to hypothesis testing lends itself to misinterpretation and encourages academic dishonesty. There has already been much written about the controversial p-value, but I thought I'd weigh in with my ideas and experience.

One of the problems I see that encourages misinterpretation is how statisticians communicate with their superiors. Since Bayesian inference is almost certainly how people actually reason, the non-intuitive linguistic gymnastics that frequentist hypothesis testing forces us to use to make a true statement ("we fail to reject the null hypothesis") meets with confusion, or worse, from non-statisticians. When some people ask me about the results of a test they are interested in, if the result was statistically significant at p<.05, I might carefully say something to the effect of "an unlikely event occurred or we can reject the null hypothesis". After what I assume is a mental eye-roll, I get asked "Is there an effect, or not?" At this point I have two options: (a) I can pedantically assert that I can't answer that question, or (b) I can say "yep".

When we first learned about frequentist hypothesis testing at school, many of my classmates remarked that the .05 significance "cutoff" was too demanding (or, occasionally, "too liberal"). It is then usually explained that raising the significance level will necessarily result in more Type I errors, which are frequently (but not always) more damaging than the opposite error. Still, something doesn't seem right to the neophyte (perhaps it is the artificial dichotomy between "significance" and "non-significance" that has no basis in reality).
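The significance level's meaning is easy to demonstrate by simulation. In this sketch, both samples come from the same distribution, so the null hypothesis is true by construction, and the proportion of "significant" results is just the Type I error rate:

```r
set.seed(1)
n.tests <- 10000

# run many t-tests where the null hypothesis is true by construction
p.values <- replicate(n.tests, t.test(rnorm(30), rnorm(30))$p.value)

# the proportion of "significant" results hovers around the cutoff itself
mean(p.values < .05)
```

Raise the cutoff to .10 and the false positive rate doubles; there is nothing special about .05 other than convention.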

That raises an interesting point: newcomers to statistics have an important and untainted perspective, because they can see the arbitrariness of significance cutoffs more readily than those who have had it beaten into their heads by research advisors that they have to "publish or perish" and that non-significant results won't attract attention.

[edit] Just after I wrote this article, I read this germane blog post entitled "Use The Wrong P-value, Go To Jail: Not A Joke".

[another edit] I keep finding great articles and blog posts on this subject. Here's one from John Myles White: Criticism 1 of Null Hypothesis Significance Testing