Part V of the Two-Sample Hypothesis Tests
United Airlines is the best flight service out there. They always arrived on time or earlier when I traveled with them.
How can you say that? Did you compare it to any other airline? I can claim that Jet Blue is the best flight service. I traveled with them five times; the flight arrived twice ahead of schedule, and the most lengthy delay was only 15 minutes.
I seem to be a clear winner here. My delays are -4 and 0 minutes.
My delays are -1, 3, -2, 15 and 8 minutes. Can you still claim you are the winner?
You seem to be bathing in hypothesis tests, and you know the idea of “beyond statistical doubt.”
Can you prove that United flights tend to arrive earlier than Jet Blue flights? Or rather, can you disprove that there is no significant difference between the delays of United flights and Jet Blue flights?
I can use the standard version of the two-sample t-Test or the version proposed by Welch to verify whether the difference in mean delays is significant or not.
Okay. But do you think it is reasonable to compute the sample statistics of mean and variance on such a small sample? At least I have a sample of five. You only have two.
We could employ the rank-sum test developed by Frank Wilcoxon in 1945.
Go on…
Wilcoxon, in his paper titled “Individual Comparisons by Ranking Methods,” indicated the possibility of using ranking methods to approximate the significance of the differences in means. By ranking methods, he meant to replace the actual numerical data with their ranks.
How exactly is it done?
Let’s employ our small data samples and understand the logic and the procedure of Wilcoxon’s rank-sum test.
But we need to start with defining the null and the alternate hypotheses. Take the delays of United flights to be $X$ and Jet Blue’s delays to be $Y$. As the null hypothesis, we assume that $X$ and $Y$ are samples from the same distribution.
Then $P(X > Y) = 0.5$, since they will be intermingled. In other words, if $X$ and $Y$ are samples from the same distribution, there will be an equal likelihood of one exceeding the other.
Let me reason out the alternate hypothesis. If the null hypothesis is not true, then we could have $P(X > Y) > 0.5$ if the delays of United flights are generally greater than Jet Blue’s. More observations of United will be towards the right of the observations from Jet Blue if $X$ is from a distribution that is generally greater than $Y$.
We could have $P(X > Y) < 0.5$ if the delays of United flights are generally smaller than Jet Blue’s. More observations of United will be towards the left of (or below) the observations of Jet Blue if $X$ is from a distribution that is generally lower than $Y$.
Or, we could say $P(X > Y) \neq 0.5$ to establish the fact that United delays are different from Jet Blue delays.
Great. Let me summarize them.
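In symbols, the summary could be written as follows. This is a sketch that assumes $X$ stands for United's delays and $Y$ for Jet Blue's delays (labels chosen here for illustration), and that "equal likelihood of one exceeding the other" is read as $P(X > Y) = 0.5$.

```latex
\begin{aligned}
H_0 &: P(X > Y) = 0.5 \\
H_A &: P(X > Y) > 0.5 \quad \text{(United delays are generally greater)} \\
H_A &: P(X > Y) < 0.5 \quad \text{(United delays are generally smaller)} \\
H_A &: P(X > Y) \neq 0.5 \quad \text{(United delays are different)}
\end{aligned}
```

Only one of the three alternates is used in a given test. The claim that United tends to arrive earlier corresponds to the second one.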
Now, let’s pool all the data into a single set of seven values and order them from the smallest to the largest. I will write United’s delay times in red so that we know they come from a different sample.
United: -4, 0
Jet Blue: -1, 3, -2, 15, 8
Pooled and ordered: -4, -2, -1, 0, 3, 8, 15
Can you look at the pooled and ordered data and tell me what the rank of -4 is?
That would be rank 1. The smallest value.
Right. And the rank of 15, the largest value in the pooled set, is 7. When two or more values are the same, then we assign the mean rank to all those values.
Would you agree that if either most of the smallest ranks or most of the largest ranks are coming from United flights, i.e., sample $X$, there would be more substantial evidence to reject the null hypothesis?
That is sensible.
So we can think of the sum of the ranks that correspond to values from $X$ (the United sample) as the test-statistic. In our hypothesis test, the sum of the ranks of the values from United flights is 1 + 4 = 5. United flights have delays of -4 and 0 minutes. These values, in the pooled and ordered data, have ranks of 1 and 4. Their rank-sum is 5.
Test-statistic W = sum of the ranks associated with $X$, the variable with a smaller sample size, in the pooled and ordered data.
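The ranking step can be sketched in a few lines of Python. This is only an illustration using the delays from our conversation. The helper name `mean_ranks` is made up for this sketch, and it gives tied values their mean rank as described above.

```python
# Delay data from the conversation (minutes); negative means an early arrival.
united = [-4, 0]
jet_blue = [-1, 3, -2, 15, 8]

def mean_ranks(values):
    """Return the 1-based rank of each value; tied values share their mean rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # rank of the first occurrence
        count = ordered.count(v)              # number of tied copies
        rank_of[v] = first + (count - 1) / 2  # mean of ranks first..first+count-1
    return [rank_of[v] for v in values]

pooled = united + jet_blue
ranks = mean_ranks(pooled)
W = sum(ranks[:len(united)])   # the United values sit at the front of `pooled`

print(sorted(pooled))  # [-4, -2, -1, 0, 3, 8, 15]
print(W)               # 5.0
```

The ranks come out as floats because a tie can produce a half rank such as 1.5. With no ties, as in our data, they are whole numbers.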
I get it. We should now check how likely it is to see a rank-sum value of 5 in the null distribution of the rank-sums under the assumption that $H_0$ is true.
Yes, that is correct. Let’s outline all possible values of the rank-sums.
We have 2 values from $X$ (United) and 5 values from $Y$ (Jet Blue). If the null hypothesis is true, i.e., if $X$ and $Y$ are samples from the same distribution, the values will be intermingled and the values in $X$ can take the following ranks.
Nice table. I can see that the smallest possible rank-sum is 3 when the two values in $X$ take ranks of 1 and 2. If they are intermingled, it is also equally likely that they take on ranks of 1 and 7, in which case, the rank-sum will be 8.
Yup. Like this, I have written down all possible rank-combinations — from 1-2 that gives a rank-sum of 3 to 6-7 that gives a rank-sum of 13.
There are 21 such combinations.
We can also get the total combinations as $\binom{7}{2} = 21$. Seven values are chosen two at a time.
Right. More generally, if there are $n_1$ values in $X$ and $n_2$ values in $Y$, the total rank-combinations are $\binom{n_1+n_2}{n_1}$.
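As a quick check, the rank-combinations can be listed by brute force and compared with the counting formula. This is a minimal sketch. The names `n1` and `n2` are labels assumed here for the sizes of the smaller and larger samples.

```python
from itertools import combinations
from math import comb

n1, n2 = 2, 5   # two United values, five Jet Blue values

# Every way the n1 values could occupy ranks 1..(n1 + n2).
rank_pairs = list(combinations(range(1, n1 + n2 + 1), n1))

print(len(rank_pairs))                      # 21
print(comb(n1 + n2, n1))                    # 21
print(rank_pairs[0], sum(rank_pairs[0]))    # (1, 2) 3
print(rank_pairs[-1], sum(rank_pairs[-1]))  # (6, 7) 13
```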
From here, let me deduce the null distribution of the rank-sums. From the table, I can see that the possible values of rank-sums are 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13.
The rank-sum values of 3 and 4 occur only one time.
The rank-sum value of 5 occurs two times.
The rank-sum value of 6 occurs two times.
The rank-sum value of 7 occurs three times.
The rank-sum value of 8 occurs three times.
The rank-sum value of 9 occurs three times.
The rank-sum value of 10 occurs two times.
The rank-sum value of 11 occurs two times.
The rank-sum values of 12 and 13 occur one time.
Here, I also prepared a summary table to show the frequency, probability, and cumulative probability.
The null distribution looks like this.
Very nicely done! You are showing the possible values of the test-statistic (W) on the x-axis and their relative frequency or the probability on the y-axis. W ranges from 3 to 13.
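The frequencies, probabilities, and cumulative probabilities can be tabulated with a short script. This is a sketch that assumes there are no ties, so that each of the 21 rank-combinations is equally likely under the null hypothesis. The variable names are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

n1, n2 = 2, 5
all_ranks = range(1, n1 + n2 + 1)

# Frequency of each rank-sum over all equally likely rank-combinations.
counts = Counter(sum(c) for c in combinations(all_ranks, n1))
n_combos = sum(counts.values())

table = []
cumulative = Fraction(0)
print(" W  freq  prob    cum.prob")
for w in sorted(counts):
    prob = Fraction(counts[w], n_combos)
    cumulative += prob
    table.append((w, counts[w], float(prob), float(cumulative)))
    print(f"{w:>2}  {counts[w]:>4}  {float(prob):.4f}  {float(cumulative):.4f}")
```

Exact fractions are used for the running total so that the last cumulative probability is exactly 1 instead of a rounded float.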
More generally, with $n_1$ values in $X$ and $n_2$ values in $Y$, the possible values of W range from $\frac{n_1(n_1+1)}{2}$ (the smallest rank-sum) to $\frac{n_1(n_1+2n_2+1)}{2}$ (the largest rank-sum).
How did you get that?
The smallest possible rank-sum will occur when all values of $X$ are to the left of $Y$. It is the sum of the first $n_1$ ranks: $1 + 2 + \dots + n_1 = \frac{n_1(n_1+1)}{2}$.
In the same way, the largest possible rank-sum will occur when all values of $X$ are to the right of $Y$. To get the rank-sum of the last $n_1$ values, we should first compute the rank-sum of the whole set of $n_1+n_2$ values and subtract the rank-sum of the first $n_2$ values from it.
In our example, the largest rank-sum is 6+7=13, which is the sum of the last two ranks. We can get the sum of 13 by first getting the full rank-sum 1+2+3+4+5+6+7 = 28 and then subtracting the rank-sum of the first five values: 1+2+3+4+5 = 15.
The full rank-sum is the sum of the integers $1, 2, \dots, n_1+n_2$, which is $\frac{(n_1+n_2)(n_1+n_2+1)}{2}$.
The rank-sum of the first $n_2$ values is $\frac{n_2(n_2+1)}{2}$. This means that the $n_1$ values of $X$ are on the right side (in the pooled and ordered values), and we are only computing the rank-sum of the remaining $n_2$ values.
When we subtract this rank-sum from the full rank-sum, we get $\frac{(n_1+n_2)(n_1+n_2+1)}{2} - \frac{n_2(n_2+1)}{2} = \frac{n_1(n_1+2n_2+1)}{2}$.
Ah! That is clever.
Can you also see, in your null distribution table, a symmetry around the value of 8?
Yes.
Can you get a generalized version of this mid-point rank-sum?
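It is the mid-point of the smallest rank-sum, $\frac{n_1(n_1+1)}{2}$, and the largest rank-sum, $\frac{n_1(n_1+2n_2+1)}{2}$, which works out to $\frac{n_1(n_1+n_2+1)}{2}$. For our samples of $n_1 = 2$ and $n_2 = 5$ values, that is $\frac{2 \times 8}{2} = 8$.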
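The smallest, largest, and mid-point rank-sums can be checked against brute-force enumeration. The closed forms below follow from the sum of the first k integers being k(k+1)/2. The function names, and the labels `n1` and `n2` for the two sample sizes, are assumptions made for this sketch.

```python
from itertools import combinations

def rank_sum_range(n1, n2):
    """Closed-form smallest, largest, and mid-point rank-sum for the sample of size n1."""
    smallest = n1 * (n1 + 1) // 2
    largest = n1 * (n1 + 2 * n2 + 1) // 2
    midpoint = n1 * (n1 + n2 + 1) / 2
    return smallest, largest, midpoint

def brute_force_range(n1, n2):
    """Same three numbers, found by listing every rank-combination."""
    sums = [sum(c) for c in combinations(range(1, n1 + n2 + 1), n1)]
    return min(sums), max(sums), (min(sums) + max(sums)) / 2

print(rank_sum_range(2, 5))  # (3, 13, 8.0)

# The formulas agree with enumeration for a few small sample sizes.
for n1, n2 in [(2, 5), (3, 4), (4, 6)]:
    assert rank_sum_range(n1, n2) == brute_force_range(n1, n2)
```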
Nice! The generalization of the frequency for each rank-sum can also be derived like this. Wilcoxon showed it in his paper.
At any rate, the observed test-statistic for our experiment is 5. So we compute the probability of finding a value as small as 5 or smaller, $P(W \le 5)$, in the null distribution. It turns out to be 4/21 = 0.1905.
If we opt for a 5% rate of rejection, we cannot reject the null hypothesis that $X$ (United's delays) and $Y$ (Jet Blue's delays) are samples from the same distribution. It is the case even for a 10% rate of rejection. The evidence against the null hypothesis is not convincing.
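The probability of 0.1905 can be reproduced by counting the rank-combinations whose rank-sum is 5 or less. This is a sketch of the one-sided (lower-tail) calculation only. It assumes no ties and uses variable names chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations

n1, n2, w_observed = 2, 5, 5

rank_sums = [sum(c) for c in combinations(range(1, n1 + n2 + 1), n1)]

# Lower-tail p-value: share of rank-combinations with a rank-sum <= observed W.
p_value = Fraction(sum(s <= w_observed for s in rank_sums), len(rank_sums))
print(p_value, round(float(p_value), 4))   # 4/21 0.1905

for alpha in (0.05, 0.10):
    decision = "reject" if p_value <= alpha else "cannot reject"
    print(f"rejection rate {alpha}: {decision} the null hypothesis")
```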
I will concede.
But I like this method. It seems to have no assumptions on the limiting distributions, is a data-driven approach, and works very well for small sample sizes. The technique is a little tedious for larger sample sizes, isn’t it?
Yes. It is rather tedious to come up with the rank-sum table each time. But there are standardized lookup tables for it. There is also a large-sample approximation to a normal distribution. As you noticed, the null distribution is symmetric. It tends to a normal distribution for larger sample sizes, and we can derive the appropriate formula for the test-statistic. Most statistical programs use this approximation. We will look at how it is done in RStudio at a later time.
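For the curious, here is a rough sketch of that approximation. It is not taken from this lesson. It uses the textbook mean $n_1(n_1+n_2+1)/2$ and variance $n_1 n_2 (n_1+n_2+1)/12$ of the rank-sum under the null hypothesis, assumes no ties, and applies an optional continuity correction. The function names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def rank_sum_normal_p(w, n1, n2, continuity=True):
    """Approximate lower-tail P(W <= w) for the rank-sum of the sample of size n1."""
    mean = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    shift = 0.5 if continuity else 0.0   # continuity correction for a discrete W
    return normal_cdf((w + shift - mean) / sd)

# Our samples are far too small for this to be accurate; the exact value is 4/21.
print(rank_sum_normal_p(5, 2, 5))
```

With only 2 and 5 values, the approximation differs noticeably from the exact probability, which is why the exact table is used above.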
Did you notice that your rank-sum should have been 3 for you to reject the null hypothesis at the 5% rejection rate?
Ah. That is correct. It means that United Airlines’ delays should have always been lower than Jet Blue’s delays to disprove that there is no significant difference between the delays of United flights and Jet Blue flights.
When we apply this test to small samples, it seldom results in a highly compelling case for rejecting the null hypothesis $H_0$. If we add more data, the null distribution smoothens out, and it would be possible to attain lower p-values, which provide a stronger case against $H_0$.
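One way to see this is to compute the smallest one-sided p-value that the test can ever produce. It occurs when one sample takes all the lowest ranks, which is a single rank-combination out of the total. The sketch below assumes no ties. The function name and the sample sizes other than 2 and 5 are made up for illustration.

```python
from math import comb

def smallest_one_sided_p(n1, n2):
    """Smallest attainable lower-tail p-value: one extreme arrangement out of all combinations."""
    return 1 / comb(n1 + n2, n1)

for n1, n2 in [(2, 5), (3, 5), (5, 5), (10, 10)]:
    print(n1, n2, smallest_one_sided_p(n1, n2))
```

For our samples of 2 and 5, the floor is 1/21, or about 0.048. It sits just under 5%, which is why only a rank-sum of 3 would have led to a rejection.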
By the way, it looks like we are competing for the last prize. Based on the Department of Transportation’s on-time ranking, United and Jet Blue are close to last. You have some brownie points in claiming that United is better than Jet Blue.
You can also follow me on Twitter @realDevineni for updates on new lessons.