Over the past several lessons, we have been learning about estimates, the standard error of estimates, and the confidence interval of estimates.
We have been using the sample to estimate the true value of the parameter. What we estimate from the sample gives us, in some sense, the closest answer for the true, unknown population parameter.
For example, the mean $\bar{x}$, variance $s^{2}$, or proportion $\hat{p}$ computed from the sample data are good guesses (estimates or estimators) of the mean $\mu$, variance $\sigma^{2}$, and proportion $p$ of the population.
We also know that when we think of an estimate, we think of an interval or a probability distribution, instead of a point value. The truth may be in this interval if we have good representative samples, i.e., if the sample distribution is similar to the population distribution.
Assumptions or Approximations
In this inferential journey, to compute the standard error or to derive the confidence interval of the estimates, we have been making some assumptions and approximations that are deemed reasonable.
For example, it is reasonable to assume a normal distribution for the sample mean $\bar{x}$.
The sample mean is an unbiased estimate of the true mean, so the expected value of the sample mean is equal to the truth: $E[\bar{x}] = \mu$.
The standard deviation of the sample mean, or the standard error of the estimate, is $\frac{\sigma}{\sqrt{n}}$.
This visual should be handy.
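Before we revisit these assumptions, here is a minimal R sketch that checks the two facts above by simulation. The population ($\mu = 27.5$, $\sigma = 1$), the sample size, and the number of repetitions are hypothetical choices for illustration.

```r
# Sketch: check E[x-bar] = mu and sd(x-bar) = sigma/sqrt(n) by simulation.
# The population (normal, mu = 27.5, sigma = 1) is a hypothetical choice.
set.seed(1)
mu <- 27.5; sigma <- 1; n <- 25

# 10,000 repeated samples of size n; record the mean of each.
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)   # close to mu = 27.5
sd(xbar)     # close to sigma / sqrt(n) = 0.2
hist(xbar, breaks = 50, main = "Sampling distribution of the sample mean")
```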
To derive the confidence interval of the variance and the standard deviation, we assumed that $\frac{(n-1)s^{2}}{\sigma^{2}}$ follows a Chi-square distribution with $(n-1)$ degrees of freedom.
Depending on the degrees of freedom, the distribution of $\frac{(n-1)s^{2}}{\sigma^{2}}$ looks like this.
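As a quick illustration of this assumption at work, here is a minimal R sketch of the chi-square-based confidence interval for the variance. The data vector is the seven-point Ozone sample that appears later in this lesson.

```r
# Sketch: 95% confidence interval for sigma^2, assuming
# (n-1)s^2 / sigma^2 ~ Chi-square(n-1).
x <- c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)
n <- length(x)
s2 <- var(x)

lower <- (n - 1) * s2 / qchisq(0.975, df = n - 1)   # divide by upper tail point
upper <- (n - 1) * s2 / qchisq(0.025, df = n - 1)   # divide by lower tail point
c(lower, upper)        # interval for sigma^2
sqrt(c(lower, upper))  # interval for sigma
```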
Most recently, we assumed that the estimate for the proportion, $\hat{p}$, can be approximated by a normal distribution.
We derived the confidence interval of the population proportion as $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, based on this assumption.
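As a minimal R sketch of this formula (the counts here, 42 successes out of $n = 100$, are hypothetical):

```r
# Sketch: normal-approximation confidence interval for a proportion.
n <- 100
s_n <- 42                      # hypothetical number of successes
p_hat <- s_n / n
z <- qnorm(0.975)              # z_{alpha/2} for a 95% interval
p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
```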
Let’s examine this assumption once again.
In a sample of size $n$, the proportion can be estimated as $\hat{p} = \frac{S_{n}}{n}$, where $S_{n}$ is the number of favorable instances for the thing we are measuring. $\hat{p}$ can be approximated by a normal distribution since $S_{n}$ can be approximated by a normal distribution.
If we take Bernoulli random variables (0,1) for $X_{1}, X_{2}, \ldots, X_{n}$, then $S_{n} = X_{1} + X_{2} + \cdots + X_{n}$, the number of successes, follows a Binomial distribution $Bin(n, p)$.
For a large enough sample size $n$, the distribution function of $S_{n}$ can be well-approximated by the normal distribution $N(np,\, np(1-p))$.
Let’s do some experiments and see if this is reasonable.
Look at this animation. I am showing the Binomial probability distribution function for p = 0.5 while n increases from 10 to 100.
It looks like an approximation to a normal distribution is very reasonable.
Now, look at these two animations that show the Binomial probability function for p = 0.1 and p = 0.95, i.e., when p is near the boundaries.
Clearly, the distributions are skewed and not symmetric. Approximating them with a normal distribution, even for large values of $n$, i.e., a big sample, is not appropriate.
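You can reproduce the gist of these animations with a minimal R sketch that overlays the normal approximation $N(np,\, np(1-p))$ on the Binomial distribution for a centered $p$ and a boundary $p$:

```r
# Sketch: Binomial distribution vs its normal approximation.
n <- 100
par(mfrow = c(1, 2))
for (p in c(0.5, 0.95)) {
  k <- 0:n
  plot(k, dbinom(k, size = n, prob = p), type = "h",
       xlab = "number of successes", ylab = "probability",
       main = paste0("Binomial(", n, ", ", p, ")"))
  # Overlay the normal density N(np, np(1-p)).
  curve(dnorm(x, mean = n * p, sd = sqrt(n * p * (1 - p))),
        add = TRUE, col = "red")
}
```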
How then can we be confident about the standard error and the confidence intervals?
For that matter, how can we derive the standard error or the intervals of a parameter whose limiting form is not known or is mathematically very complicated?
Enter the Bootstrap
Bradley Efron invented a computer-based method, the bootstrap, for estimating the standard error and the confidence intervals of parameters. There is no need for any theoretical calculations or underlying assumptions regarding the mathematical structure or the type of the distribution of the parameter. Instead, bootstrap samples from the data are used.
What is a bootstrap sample?
Suppose we have a data sample $x_{1}, x_{2}, \ldots, x_{n}$. A bootstrap sample is a random sample of size $n$ drawn with replacement from these $n$ data points.
Imagine we have the following data: 28.4, 28.6, 27.5, 28.7, 26.7, 26.3, and 27.7, the concentrations of Ozone measured at seven locations in New York City.
Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these seven data points is 1/7, we can randomly draw seven numbers from these seven values.
Think of it as playing the game of Bingo, with these seven numbers as the chips in your rolling machine. The only difference is that each time you draw a number, you record it and put it back in the roller, until you have drawn seven numbers. Sample with replacement.
Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (28.4, 28.6, 27.5, 28.7, 26.7, 26.3, and 27.7); some may appear more than once, and some may not appear at all in a given random sample.
I played this using the roller. Here is a bootstrap sample from the original numbers.
As you can see, 28.4 and 26.7 each appeared one time, 28.6, 27.5, and 28.7 did not appear, 26.3 appeared two times, and 27.7 appeared three times.
Here are two more bootstrap samples like that.
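In R, the Bingo roller is a one-liner: sample with replacement. (The seed is an arbitrary choice, so your draw will differ from mine.)

```r
# Sketch: one bootstrap sample from the Ozone data.
ozone <- c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)
set.seed(7)                                   # arbitrary seed
sample(ozone, size = length(ozone), replace = TRUE)
```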
Basis
The basis for bootstrap samples is that the sample data can be used to approximate the probability distribution function of the population. As you saw before, by putting a probability of $1/n$ on each data point, we use the discrete empirical distribution $\hat{f}$ as an approximation of the population distribution $f$.
Take a very simple example: rolling two dice in the game of Monopoly. The true probability distribution of the count (dice 1 + dice 2) is based on the fact that there are 11 possible outcomes (2 through 12), and the likelihood of each outcome is the ratio of the number of ways we can get that count to 36. An outcome of 2 can only be achieved with (1,1); hence the probability of getting 2 is 1/36.
Suppose we roll the dice a hundred times and record the total count each time. We can use the observed frequencies of the outcomes from this sample data to approximate the actual probability distribution.
Look at these 100 counts as outcomes of rolling two dice 100 times.
The frequency plot shown in black lines closely approximates the true frequency shown in red.
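Here is a minimal R sketch of the same experiment, in case you want to roll your own: simulate 100 rolls of two dice and compare the observed frequencies with the true probabilities.

```r
# Sketch: empirical vs true distribution of the sum of two dice.
set.seed(11)                                  # arbitrary seed
rolls <- sample(1:6, 100, replace = TRUE) + sample(1:6, 100, replace = TRUE)

empirical <- table(factor(rolls, levels = 2:12)) / 100   # observed proportions
true_prob <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) / 36     # ways out of 36
rbind(empirical, true_prob)
```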
The empirical distribution $\hat{f}$ is the proportion of times each value in the data sample occurs. The observed frequency is a sufficient statistic for the true distribution $f$, with the assumption that the data have been generated by random sampling from the true distribution $f$. All the information of the true distribution $f$ is contained in the empirical distribution $\hat{f}$.
An unknown population distribution $f$ has produced the observed data $x$. We can use the observed data to approximate $f$ by its empirical distribution $\hat{f}$ and then use the empirical distribution to generate bootstrap replicates of the data. Since $f$ generated $x$, $\hat{f}$ can be used to generate the bootstrap samples.
$f$ has given $x$; $x$ can be used to estimate $\hat{f}$; $\hat{f}$ will be used to generate a bootstrap sample $x^{*}$.
This is the basis.
Bootstrap Replicates
Once we generate enough bootstrap samples, we can apply the estimators (the formulas that estimate the parameter) to these samples. For example, if we want to represent the true population mean $\mu$, we can apply the equation for the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$, to each of these bootstrap samples to generate bootstrap replicates of the mean.
If we want to represent the true population variance $\sigma^{2}$ using an interval, we can apply $s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}$ to these bootstrap samples to generate replicates of the variance.
Likewise, if we want an interval for the true proportion $p$, we apply $\hat{p} = \frac{S_{n}}{n}$ to the bootstrap samples to get replicates of the proportion.
Each bootstrap sample will produce a replicate of the parameter. Efron prescribes anywhere between 25 and 200 bootstrap replications for a good approximation of the limiting distribution of the estimate. As the number of bootstrap replicates approaches infinity, the standard error, measured as the standard deviation of these replicates, approaches the true standard error.
Let’s look at the bootstrap replicates of the sample mean and the sample standard deviation for the Ozone data for which we used the Bingo machine to generate the bootstrap samples. In a later coding lesson, we will learn how to do it using simple functions in RStudio.
For bootstrap sample 1, the sample mean is 27.26 and the sample standard deviation is 0.82.
For bootstrap sample 2, the sample mean is 27.61 and the sample standard deviation is 0.708.
I do this 200 times. Here is what the distribution of the sample mean ($\bar{x}$) obtained from 200 bootstrap replicates looks like.
Here is the distribution of the sample standard deviation.
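We will do this properly in the later coding lesson, but as a preview, here is a minimal R sketch that generates 200 bootstrap replicates of the mean and the standard deviation of the Ozone data. The seed is arbitrary, so your histograms will not match mine exactly.

```r
# Sketch: 200 bootstrap replicates of the mean and standard deviation.
ozone <- c(28.4, 28.6, 27.5, 28.7, 26.7, 26.3, 27.7)
set.seed(42)                                  # arbitrary seed
B <- 200
boot_mean <- numeric(B)
boot_sd   <- numeric(B)
for (b in 1:B) {
  xb <- sample(ozone, replace = TRUE)         # one bootstrap sample
  boot_mean[b] <- mean(xb)
  boot_sd[b]   <- sd(xb)
}

hist(boot_mean, main = "Bootstrap replicates of the mean")
hist(boot_sd,   main = "Bootstrap replicates of the standard deviation")
sd(boot_mean)   # bootstrap estimate of the standard error of the mean
```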
In this way, we can develop intervals for any type of parameter by applying the relevant estimator to the bootstrap samples.
Bootstrap Confidence Interval
Finally, we can use the percentiles of the bootstrap replicates as the confidence limits of the parameter.
Take a 90% confidence interval, for instance. From the bootstrap replicates, we can say that there is a 90% probability that the true mean will be between $\bar{x}^{*}_{[0.05]}$ and $\bar{x}^{*}_{[0.95]}$, the 5th and the 95th percentiles of the bootstrap replicates.
For the Ozone example, the 90% confidence interval of the true mean is [27.114, 28.286] and the 90% confidence interval of the true standard deviation is [0.531, 1.087].
Look at these plots.
In general, we can define a $100(1-\alpha)\%$ bootstrap confidence interval for the true mean as $\left[\bar{x}^{*}_{[\alpha/2]},\ \bar{x}^{*}_{[1-\alpha/2]}\right]$, where $\bar{x}^{*}_{[q]}$ is the $q$th quantile of the bootstrap replicates.
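In R, these percentile limits come straight from quantile(), applied to the replicates generated in the sketch above:

```r
# Sketch: 90% percentile intervals from the bootstrap replicates
# (boot_mean and boot_sd come from the earlier sketch).
quantile(boot_mean, probs = c(0.05, 0.95))    # interval for the true mean
quantile(boot_sd,   probs = c(0.05, 0.95))    # interval for the true sd
```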
There are many more uses of the bootstrap, and we will have much fun with it in due course.
But for today, let’s end with Efron’s statement on what other names were suggested to him for his method.
“I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.””
If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.