Two-Sample Hypothesis Tests – Part VII
These days, a peek out of the window is greeted by chilling rain or wet snow. On days when it is not raining or snowing, there is biting cold. So we gaze at our bicycles, waiting for that pleasant month of April when we can joyfully bike — to work, or for pleasure.
Speaking of bikes, since I have nothing much to do today except watch the snow, I decided to explore some data from our favorite “Open Data for All New Yorkers” page.
Interestingly, I found data on the bicycle counts for East River Bridges. New York City DOT keeps track of the daily total of bike counts on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.
I could find the data for April to October during 2016 and 2017. Here is how the data for April 2017 looks.
Being a frequent biker on the Manhattan Bridge, my curiosity got kindled. I wanted to verify how different the total bike counts on the Manhattan Bridge are from the Williamsburg Bridge.
At the same time, I also wanted to share the benefits of the bootstrap method for two-sample hypothesis tests.
To keep it simple and easy for you to follow the bootstrap method’s logical development, I will test how different the total bike counts data on Manhattan Bridge are from that of the Williamsburg Bridge during all the non-holiday weekdays with no precipitation.
Here is the data of the total bike counts on Manhattan Bridge during all the non-holiday weekdays with no precipitation in April of 2017 — essentially, the data from the yellow-highlighted rows in the table for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774
And the data of the total bike counts on Williamsburg Bridge during all the non-holiday weekdays with no precipitation in April of 2017.
5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196
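To make the numbers concrete, here is a minimal Python sketch (the variable names are mine, not from the DOT dataset) that loads the two samples listed above and computes their sample means.

```python
# Total bike counts on non-holiday weekdays with no precipitation,
# April 2017, as listed above.
manhattan = [5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774]
williamsburg = [5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196]

def mean(sample):
    """Arithmetic mean of a list of numbers."""
    return sum(sample) / len(sample)

print(mean(manhattan))     # 6095.5
print(mean(williamsburg))  # 6735.75
```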
Their distributions look like this.
I want answers to the following questions.
Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
Is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?
What do we know so far?
We know how to test the difference in means using the t-Test under the assumption that the population variances are equal (Lesson 94), or using Welch’s t-Test when we cannot assume equality of population variances (Lesson 95). We also know how to do this using Wilcoxon’s Rank-sum Test, which uses the ranking method to approximate the significance of the difference in means (Lesson 96).
We know how to test the equality of variances using F-distribution (Lesson 97).
We know how to test the difference in proportions using either Fisher’s Exact test (Lesson 92) or using the normal distribution as the null distribution under the large-sample approximation (Lesson 93).
In all these tests, we made critical assumptions on the limiting distributions of the test-statistics.
- What is the limiting distribution of the test-statistic that computes the difference in medians?
- What is the limiting distribution of the test-statistic that compares interquartile ranges of two populations?
- What if we do not want to make any assumptions on data distributions or the limiting forms of the test-statistics?
Enter the Bootstrap
I would urge you to go back to Lesson 79 to get a quick refresher on the bootstrap, and Lesson 90 to recollect how we used it for the one-sample hypothesis tests.
The idea of the bootstrap is that we can generate replicates of the original sample to approximate the probability distribution function of the population. Assuming that each data value in the sample is equally likely with a probability of 1/n, we can randomly draw n values with replacement. By putting a probability of 1/n on each data point, we use the discrete empirical distribution as an approximation of the population distribution f.
Take the data for Manhattan Bridge.
5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774
Assuming that each data value is equally likely, i.e., the probability of occurrence of any of these eight data points is 1/8, we can randomly draw eight numbers from these eight values — with replacement.
Since each value is equally likely, the bootstrap sample will consist of numbers from the original data (5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774); some may appear more than once, and some may not show up at all in a given random sample.
Here is one such bootstrap replicate.
6359, 6359, 6359, 6052, 6774, 6359, 5276, 6359
The value 6359 appeared five times. Some values like 7247, 5054, 6691, and 5311 did not appear at all in this replicate.
Here is another replicate.
6359, 5276, 5276, 5276, 7247, 5311, 6052, 5311
Such bootstrap replicates are representations of the empirical distribution f̂, i.e., the proportion of times each value in the data sample occurs. We can generate all the information contained in the true distribution f by creating f̂, the empirical distribution.
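A bootstrap replicate like the ones above can be drawn with Python’s standard library; `random.choices` samples with replacement, giving each of the eight values a probability of 1/8 on every draw (the seed is arbitrary, chosen only for reproducibility).

```python
import random

manhattan = [5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774]

random.seed(1)  # arbitrary seed, for reproducibility only
# Draw n = 8 values with replacement from the original sample;
# each value is equally likely (probability 1/8 per draw).
replicate = random.choices(manhattan, k=len(manhattan))
print(replicate)
```

As in the replicates shown above, some values repeat and others drop out entirely.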
Using the Bootstrap for Two-Sample Hypothesis Tests
Since each bootstrap replicate is a possible representation of the population, we can compute the relevant test-statistics from this bootstrap sample. By repeating this, we can have many simulated values of the test-statistics that form the null distribution to test the hypothesis. There is no need to make any assumptions on the distributional nature of the data or the limiting distribution for the test-statistic. As long as we can compute a test-statistic from the bootstrap sample, we can test the hypothesis on any statistic — mean, median, variance, interquartile range, proportion, etc.
Let’s now use the bootstrap method for two-sample hypothesis tests. Suppose there are two random variables, X and Y, and the statistics computed from them are θ̂_x and θ̂_y. We may have a sample of n1 values representing X and a sample of n2 values representing Y.
θ̂ can be the mean, median, variance, proportion, etc. Any computable statistic from the original data is of the form θ̂ = t(x), a function of the sample.
The null hypothesis is that there is no difference between the statistic of X and that of Y.
H_0: θ̂_x = θ̂_y
The alternate hypothesis is
H_1: θ̂_x > θ̂_y
or
H_1: θ̂_x < θ̂_y
or
H_1: θ̂_x ≠ θ̂_y
We first create a bootstrap replicate of X and Y by randomly drawing, with replacement, n1 values from X and n2 values from Y.
For each bootstrap replicate i from X and Y, we compute the statistics θ̂_xi and θ̂_yi and check whether θ̂_xi ≥ θ̂_yi. If yes, we register S_i = 1. If not, we register S_i = 0.
For example, one bootstrap replicate for X (Manhattan Bridge) and Y (Williamsburg Bridge) may look like this:
xboot: 6691 5311 6774 5311 6359 5311 5311 6052
yboot: 6775 6881 7341 7196 6775 7341 6775 7196
The means of this bootstrap replicate for X and Y are 5890 and 7035. Since θ̂_xi < θ̂_yi, we register S_i = 0.
Another bootstrap replicate for X and Y may look like this:
xboot: 6774 6359 6359 6359 6052 5054 6052 6691
yboot: 6775 7196 7196 6026 6881 7341 6881 5711
The means of this bootstrap replicate for X and Y are 6212.5 and 6750.875. Since θ̂_xi < θ̂_yi, we register S_i = 0.
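The two example replicates can be checked with a few lines of Python (a sketch; the replicate values are copied from above):

```python
# The two bootstrap replicates shown above
xboot1 = [6691, 5311, 6774, 5311, 6359, 5311, 5311, 6052]
yboot1 = [6775, 6881, 7341, 7196, 6775, 7341, 6775, 7196]
xboot2 = [6774, 6359, 6359, 6359, 6052, 5054, 6052, 6691]
yboot2 = [6775, 7196, 7196, 6026, 6881, 7341, 6881, 5711]

def mean(sample):
    return sum(sample) / len(sample)

for xb, yb in [(xboot1, yboot1), (xboot2, yboot2)]:
    s = 1 if mean(xb) >= mean(yb) else 0  # the indicator S_i
    print(mean(xb), mean(yb), s)
```

Both replicates register 0, since the Manhattan replicate mean falls below the Williamsburg replicate mean in each case.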
We repeat this process of creating bootstrap replicates of X and Y, computing the statistics θ̂_xi and θ̂_yi, verifying whether θ̂_xi ≥ θ̂_yi, and registering S_i, a large number of times, say N = 10,000.
The proportion of times S_i = 1 in a set of N bootstrap-replicated statistics is the p-value: p-value = (1/N) Σ S_i.
Based on the p-value, we can use the rule of rejection at the selected rate of rejection α.
For a one-sided alternate hypothesis, we reject the null hypothesis if p-value ≤ α (left-sided) or p-value ≥ 1 − α (right-sided).
For a two-sided alternate hypothesis, we reject the null hypothesis if p-value ≤ α/2 or p-value ≥ 1 − α/2.
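The whole procedure for the mean fits in a short function. This is my own minimal implementation of the steps above, not code from the lesson; with a different seed, the p-value will wobble slightly around the value reported below.

```python
import random

manhattan = [5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774]
williamsburg = [5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196]

def mean(sample):
    return sum(sample) / len(sample)

def bootstrap_pvalue(x, y, stat, n_boot=10000, seed=42):
    """Two-sample bootstrap test.

    Returns the proportion of replicates with stat(xboot) >= stat(yboot),
    i.e., p-value = (1/N) * sum(S_i).
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_boot):
        xboot = rng.choices(x, k=len(x))  # resample X with replacement
        yboot = rng.choices(y, k=len(y))  # resample Y with replacement
        if stat(xboot) >= stat(yboot):
            count += 1
    return count / n_boot

p = bootstrap_pvalue(manhattan, williamsburg, mean)
print(p)  # in the neighborhood of 0.05; the exact value depends on the seed
```

Swapping `mean` for any other computable statistic (median, variance, and so on) tests that statistic with no other changes.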
Manhattan Bridge vs. Williamsburg Bridge
Is the mean of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
Let’s take a two-sided alternate hypothesis.
Here is the null distribution of the test-statistic for N = 10,000.
The p-value is 0.0466: 466 out of the 10,000 bootstrap replicates had θ̂_x ≥ θ̂_y. For a 10% rate of error (α = 0.10), we reject the null hypothesis since the p-value is less than α/2 = 0.05. More than 95% of the time, the mean of the total bike counts on the Manhattan Bridge is less than that of the Williamsburg Bridge, so there is sufficient evidence that they are not equal. We reject the null hypothesis.
Can we reject the null hypothesis if we select a 5% rate of error?
Is the median of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
The p-value is 0.1549: 1549 out of the 10,000 bootstrap replicates had θ̂_x ≥ θ̂_y. For a 10% rate of error (α = 0.10), we cannot reject the null hypothesis since the p-value is greater than α/2 = 0.05. The evidence (84.51% of the time) that the median of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.
Is the variance of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
The p-value is 0.4839: 4839 out of the 10,000 bootstrap replicates had θ̂_x ≥ θ̂_y. For a 10% rate of error (α = 0.10), we cannot reject the null hypothesis since the p-value is greater than α/2 = 0.05. The evidence (51.61% of the time) that the variance of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.
Is the interquartile range of the total bike counts on Manhattan Bridge different than that on Williamsburg Bridge?
The p-value is 0.5453: 5453 out of the 10,000 bootstrap replicates had θ̂_x ≥ θ̂_y. For a 10% rate of error (α = 0.10), we cannot reject the null hypothesis since the p-value is greater than α/2 = 0.05. The evidence (45.47% of the time) that the interquartile range of the total bike counts on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.
Finally, is the proportion of the total bike counts on Manhattan Bridge less than 6352, different from that on Williamsburg Bridge?
The p-value is 0.5991: 5991 out of the 10,000 bootstrap replicates had θ̂_x ≥ θ̂_y. For a 10% rate of error (α = 0.10), we cannot reject the null hypothesis since the p-value is greater than α/2 = 0.05. The evidence (40.09% of the time) that the proportion of the total bike counts less than 6352 on Manhattan Bridge is less than that of the Williamsburg Bridge is not sufficient to reject equality.
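For the proportion question, the per-sample statistic is simply the fraction of daily counts below the threshold. A quick sketch (the threshold 6352 is from the question above; the function name is mine):

```python
manhattan = [5276, 6359, 7247, 6052, 5054, 6691, 5311, 6774]
williamsburg = [5711, 6881, 8079, 6775, 5877, 7341, 6026, 7196]

def prop_below(sample, threshold=6352):
    """Proportion of daily counts below the threshold."""
    return sum(1 for v in sample if v < threshold) / len(sample)

print(prop_below(manhattan))     # 4 of 8 days -> 0.5
print(prop_below(williamsburg))  # 3 of 8 days -> 0.375
```

Feeding this statistic into the same bootstrap loop, in place of the mean, gives the p-value for the proportion test.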
Can you see the bootstrap concept’s flexibility and how widely we can apply it for hypothesis testing? Just remember that the underlying assumption is that the data are independent.
To summarize,
- Repeatedly sample with replacement from the original samples of X and Y, N times.
- Each time, draw a sample of size n1 from X and a sample of size n2 from Y.
- Compute the desired statistic (mean, median, skew, etc.) from each bootstrap sample.
The null hypothesis can now be tested as follows:
- S_i = 1 if θ̂_xi ≥ θ̂_yi; else, S_i = 0
- p-value = (1/N) Σ S_i (average over all N bootstrap-replicated statistics)
- If p-value ≤ α/2 or p-value ≥ 1 − α/2, reject the null hypothesis for a two-sided hypothesis test at a selected rejection rate of α.
- If p-value ≤ α, reject the null hypothesis for a left-sided hypothesis test at a selected rejection rate of α.
- If p-value ≥ 1 − α, reject the null hypothesis for a right-sided hypothesis test at a selected rejection rate of α.
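These decision rules can be collected into one small helper (my sketch; the alternative labels are my own naming):

```python
def reject_null(p_value, alpha=0.10, alternative="two-sided"):
    """Apply the bootstrap rejection rules summarized above."""
    if alternative == "two-sided":
        return p_value <= alpha / 2 or p_value >= 1 - alpha / 2
    if alternative == "left-sided":   # H1: theta_x < theta_y
        return p_value <= alpha
    if alternative == "right-sided":  # H1: theta_x > theta_y
        return p_value >= 1 - alpha
    raise ValueError("unknown alternative: " + alternative)

print(reject_null(0.0466))  # mean test -> True (reject)
print(reject_null(0.1549))  # median test -> False (cannot reject)
```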
After seven lessons, we are now equipped with all the theory of the two-sample hypothesis tests. It is time to put them to practice. Dust off your programming machines and get set.
If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.