normal approximation – dataanalysisclassroom

On the Difference in Proportions

$H_{0}: p_{1}-p_{2} = 0$

$H_{A}: p_{1}-p_{2} > 0$

$H_{A}: p_{1}-p_{2} < 0$

$H_{A}: p_{1}-p_{2} \neq 0$

Joe and Mumble want to know whether people favor a speed limit higher than 55 mph for New York State.

Joe spoke to **ten** of his rural friends, of which **seven** supported the idea of increasing the speed limit to 65 mph. Mumble spoke to **eighteen** of his urban friends, of which **five** favored a speed limit of 65 mph over the current limit of 55 mph.

Can we say that the sentiment for increasing the speed limit is stronger among rural than among urban residents?

We can use a hypothesis testing framework to address this question.

Last week, we learned how Fisher’s Exact test could be used to verify the difference in proportions. The test-statistic for the two-sample hypothesis test follows a hypergeometric distribution when $H_{0}$ is true.

We also learned that, in more general cases where the number of successes is not known a priori, we could condition on the total number of successes $t=x_{1}+x_{2}$ and, for that fixed value of $t$ , reject $H_{0}:p_{1}=p_{2}$ in favor of the alternate hypothesis $H_{A}:p_{1}>p_{2}$ if there are more successes in $X_{1}$ than in $X_{2}$ .

In short, the p-value can be derived under the assumption that the number of successes $X=k$ in the first sample $X_{1}$ has a hypergeometric distribution when $H_{0}$ is true and conditional on a total number of t successes that can come from any of the two random variables $X_{1}$ and $X_{2}$ .

$P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}$

Let’s apply this principle to the two samples that Joe and Mumble collected.

Let $X_{1}$ be the random variable that denotes Joe’s rural sample. He surveyed a total of $n_{1}=10$ people and $x_{1}=7$ favored an increase in the speed limit. So the proportion $p_{1}$ based on the number of successes is 0.7.

Let $X_{2}$ be the random variable that denotes Mumble’s urban sample. He surveyed a total of $n_{2}=18$ people. $x_{2}=5$ out of the 18 favored an increase in the speed limit. So the proportion $p_{2}$ based on the number of successes is 0.2778.

Let the total number of successes in both the samples be $t=x_{1}+x_{2}=7+5=12$ .

Let’s also establish the null and alternate hypotheses.

$H_{0}: p_{1}-p_{2}=0$

$H_{A}: p_{1}-p_{2}>0$

The alternate hypothesis says that the sentiment for increasing the speed limit is stronger among rural ( $p_{1}$ ) than among urban residents ( $p_{2}$ ).

Larger values of $x_{1}$ and smaller values of $x_{2}$ support the alternate hypothesis $H_{A}$ that $p_{1}>p_{2}$ when t is fixed.

For a fixed value of t, we reject $H_{0}$ if there are more successes in $X_{1}$ than in $X_{2}$ .

Conditional on a total number of t successes from any of the two random variables, the number of successes $X=k$ in the first sample has a hypergeometric distribution when $H_{0}$ is true.

In the rural sample that Joe surveyed, seven favored an increase in the speed limit. So we can compute the p-value as the probability of obtaining seven or more successes in a rural sample of 10 when the total number of successes t from the urban and rural samples combined is twelve.

$p-value=P(X \ge k) = P(X \ge 7)$

$P(X=k) = \frac{\binom{t}{k}*\binom{n_{1}+n_{2}-t}{n_{1}-k}}{\binom{n_{1}+n_{2}}{n_{1}}}$

$P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338$

A total of 12 successes exist, out of which the number of ways of choosing 7 is $\binom{12}{7}$ .

A total of 28 – 12 = 16 non-successes exist, out of which the number of ways of choosing 10 – 7 = 3 non-successes is $\binom{16}{3}$ .

A total sample of 10 + 18 = 28 exists, out of which the number of ways of choosing ten samples is $\binom{28}{10}$ .

When we put them together, we can derive the probability $P(X=7)$ for the hypergeometric distribution when $H_{0}$ is true.

$P(X=7) = \frac{\binom{12}{7}\binom{10+18-12}{10-7}}{\binom{10+18}{10}} =\frac{\binom{12}{7}\binom{16}{3}}{\binom{28}{10}} = 0.0338$

Applying the same logic for k = 8, 9, and 10, we can derive their respective probabilities.

$P(X=8) = \frac{\binom{12}{8}\binom{10+18-12}{10-8}}{\binom{10+18}{10}} =\frac{\binom{12}{8}\binom{16}{2}}{\binom{28}{10}} = 0.0045$

$P(X=9) = \frac{\binom{12}{9}\binom{10+18-12}{10-9}}{\binom{10+18}{10}} =\frac{\binom{12}{9}\binom{16}{1}}{\binom{28}{10}} = 0.0003$

$P(X=10) = \frac{\binom{12}{10}\binom{10+18-12}{10-10}}{\binom{10+18}{10}} =\frac{\binom{12}{10}\binom{16}{0}}{\binom{28}{10}} = 5.029296*10^{-6}$

The p-value can be computed as the sum of these probabilities.

$p-value=P(X \ge k) = P(X = 7)+P(X = 8)+P(X = 9)+P(X = 10)=0.0386$
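The arithmetic above can be checked with a few lines of Python. This is only a sketch using the standard library; the function names `hypergeom_pmf` and `fisher_upper_p_value` are made up for illustration and are not from the lesson.

```python
from math import comb

def hypergeom_pmf(k, n1, n2, t):
    """P(X = k): k successes in sample 1, given t total successes."""
    return comb(t, k) * comb(n1 + n2 - t, n1 - k) / comb(n1 + n2, n1)

def fisher_upper_p_value(x1, n1, x2, n2):
    """One-sided p-value P(X >= x1) for the alternate hypothesis p1 > p2."""
    t = x1 + x2
    k_max = min(n1, t)  # k cannot exceed the sample size or the total successes
    return sum(hypergeom_pmf(k, n1, n2, t) for k in range(x1, k_max + 1))

# Joe: 7 of 10 rural friends; Mumble: 5 of 18 urban friends
p_value = fisher_upper_p_value(x1=7, n1=10, x2=5, n2=18)

print(round(hypergeom_pmf(7, 10, 18, 12), 4))  # 0.0338
print(round(p_value, 4))                       # 0.0386
```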

Visually, the null distribution will look like this.

The x-axis shows the number of possible successes in $X_{1}$ . They range from k = 0 to k = 10. The vertical bars are showing $P(X=k)$ as derived from the hypergeometric distribution. The area highlighted in red is the *p-value*, the probability of finding $\ge$ seven successes in a rural sample of 10 people.

The p-value is the probability of obtaining a test statistic at least as extreme as the one computed, assuming the null hypothesis is true.

The smaller the p-value, the less likely the observed statistic is under the null hypothesis, and the stronger the evidence for rejecting the null.

Suppose we select a rate of error $\alpha$ of 5%.

Since the p-value (0.0386) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural ( $p_{1}$ ) than among urban residents ( $p_{2}$ ).

Let me remind you that this decision is based on the assumption that the null hypothesis is correct. Under this assumption, since we selected $\alpha=0.05$ , we will reject the true null hypothesis 5% of the time. At the same time, we will fail to reject the null hypothesis 95% of the time. In other words, 95% of the time, our decision to not reject the null hypothesis will be correct.

What if Joe and Mumble surveyed many more people?

You must be thinking that Joe and Mumble surveyed just a few people, which is not enough to draw any decent conclusion for a question like this. Perhaps they just called up their friends!

Let’s do a thought experiment. What would the null distribution look like if Joe and Mumble had double the sample size and the successes also increased in the same proportion? Would the p-value change?

Say Joe had surveyed 20 people, and 14 had favored an increase in the speed limit. $n_{1} = 20; x_{1} = 14; p_{1} = 0.7$ .

Say Mumble had surveyed 36 people, and 10 had favored an increase in the speed limit. $n_{2} = 36; x_{2} = 10; p_{2} = 0.2778$ .

p-value will then be $P(X \ge 14)$ when there are 24 total successes.

The null distribution will look like this.

Notice that the null distribution is much more symmetric and looks like a bell curve (normal distribution) with an increase in the sample size. The p-value is 0.0026. More substantial evidence for rejecting the null hypothesis.
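As a rough check on this thought experiment, the sketch below computes the hypergeometric upper-tail probability for both the original and the doubled samples. The helper name `upper_tail` is an assumption made for this example.

```python
from math import comb

def upper_tail(x1, n1, x2, n2):
    """P(X >= x1) under the hypergeometric null, with t = x1 + x2 fixed."""
    t = x1 + x2
    ways = sum(comb(t, k) * comb(n1 + n2 - t, n1 - k)
               for k in range(x1, min(n1, t) + 1))
    return ways / comb(n1 + n2, n1)

p_small = upper_tail(7, 10, 5, 18)     # original samples
p_double = upper_tail(14, 20, 10, 36)  # doubled samples, same proportions

print(p_small)   # about 0.0386
print(p_double)  # about 0.0026, the value quoted in the text
```

The sample proportions are identical in both cases (0.7 and 0.2778), yet the p-value drops by more than a factor of ten because the larger samples make the same observed difference less likely under the null.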

Is there a limiting distribution for the difference in proportion? If there is one, can we use it as the null distribution for the hypothesis test on the difference in proportion when the sample sizes are large?

While we embark on this derivation, let’s ask Joe and Mumble to survey many more people. When they are back, we will use new data to test the hypothesis.

But first, what is the limiting distribution for the difference in proportion?

We have two samples $X_{1}$ and $X_{2}$ of sizes $n_{1}$ and $n_{2}$ .

We might observe $x_{1}$ and $x_{2}$ successes in each of these samples. Hence, the proportions $p_{1}, p_{2}$ can be estimated using $\hat{p_{1}} = \frac{x_{1}}{n_{1}}$ and $\hat{p_{2}} = \frac{x_{2}}{n_{2}}$ .

See, we are using $\hat{p_{1}}, \hat{p_{2}}$ as the estimates of the true proportions $p_{1}, p_{2}$ .

Take $X_{1}$ . If the probability of success (proportion) is $p_{1}$ , in a sample of $n_{1}$ , we could observe $x_{1}=0, 1, 2, 3, \cdots, n_{1}$ successes with a probability $P(X=x_{1})$ that is governed by a binomial distribution. In other words,

$x_{1} \sim Bin(n_{1},p_{1})$

Same logic applies to $X_{2}$ .

$x_{2} \sim Bin(n_{2},p_{2})$

A binomial distribution tends to a normal distribution for large sample sizes; it can be estimated very accurately using the normal density function. We learned this in Lesson 48.
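To get a feel for how quickly the approximation improves, here is a small sketch that compares the binomial probabilities with the normal density that has the same mean $np$ and variance $np(1-p)$ . The success probability 0.3 and the sample sizes are arbitrary choices for illustration, and `max_gap` is a name invented for this example.

```python
from math import comb, exp, pi, sqrt

def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / sqrt(2 * pi * sigma**2)

def max_gap(n, p):
    """Largest absolute gap between the binomial pmf and the matching normal density."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return max(abs(binom_pmf(x, n, p) - normal_pdf(x, mu, sigma))
               for x in range(n + 1))

gaps = [max_gap(n, 0.3) for n in (10, 50, 200)]
for n, gap in zip((10, 50, 200), gaps):
    print(n, gap)
```

The printed gap should shrink as $n$ grows, which is the behavior the large-sample argument relies on.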

If you are curious as to how a binomial distribution function $f(x)=\frac{n!}{(n-x)!x!}p^{x}(1-p)^{n-x}$ can be approximated by a normal density function $f(x)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\frac{-1}{2}(\frac{x-\mu}{\sigma})^{2}}$ , look at this link.

But what is the limiting distribution for $\hat{p_{1}}$ and $\hat{p_{2}}$ ?

$x_{1}$ is the sum of $n_{1}$ independent Bernoulli random variables (yes or no responses from the people). For a large enough sample size $n_{1}$ , the distribution function of $x_{1}$ , which is a binomial distribution, can be well-approximated by the normal distribution. Since $\hat{p_{1}}$ is a linear function of $x_{1}$ , the random variable $\hat{p_{1}}$ can also be assumed to be normally distributed.

When both $\hat{p_{1}}$ and $\hat{p_{2}}$ are normally distributed, and when they are independent of each other, their sum or difference will also be normally distributed. We can derive it using the convolution of $\hat{p_{1}}$ and $\hat{p_{2}}$ .

Let $Y = \hat{p_{1}}-\hat{p_{2}}$

$Y \sim N(E[Y], V[Y])$ since both $\hat{p_{1}}, \hat{p_{2}} \sim N()$

If $Y \sim N(E[Y], V[Y])$ , we can standardize it to a standard normal variable as

$Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1)$

We should now derive the expected value $E[Y]$ and the variance $V[Y]$ of Y.

$Y = \hat{p_{1}}-\hat{p_{2}}$

$E[Y] = E[\hat{p_{1}}-\hat{p_{2}}] = E[\hat{p_{1}}] - E[\hat{p_{2}}]$

$V[Y] = V[\hat{p_{1}}-\hat{p_{2}}] = V[\hat{p_{1}}] + V[\hat{p_{2}}]$

Since they are independent, the co-variability term which carries the negative sign is zero.

We know that $E[\hat{p_{1}}] = p_{1}$ and $V[\hat{p_{1}}]=\frac{p_{1}(1-p_{1})}{n_{1}}$ . Recall Lesson 76.

When we put them together,

$E[Y] = p_{1} - p_{2}$

$V[Y] = \frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}$

and finally since $Z = \frac{Y-E[Y]}{\sqrt{V[Y]}} \sim N(0, 1)$ ,

$Z = \frac{\hat{p_{1}} - \hat{p_{2}} - (p_{1} - p_{2})}{\sqrt{\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}}} \sim N(0, 1)$
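The expected value and variance derived above can be sanity-checked by simulation. The sketch below is only an illustration: the true proportions 0.4 and 0.2, the sample sizes, the seed, and the number of repetitions are all assumed values, not data from the lesson.

```python
import random
from statistics import fmean, pvariance

def sample_proportion(n, p, rng):
    """Proportion of successes in n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(42)   # fixed seed so the run is repeatable
p1, n1 = 0.4, 190         # hypothetical true proportion and sample size
p2, n2 = 0.2, 310         # hypothetical true proportion and sample size
reps = 2000

# Y = p1_hat - p2_hat, simulated many times
diffs = [sample_proportion(n1, p1, rng) - sample_proportion(n2, p2, rng)
         for _ in range(reps)]

theory_mean = p1 - p2
theory_var = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2

print(fmean(diffs), theory_mean)    # simulated vs. derived E[Y]
print(pvariance(diffs), theory_var) # simulated vs. derived V[Y]
```

The simulated mean and variance should land close to $p_{1}-p_{2}$ and $\frac{p_{1}(1-p_{1})}{n_{1}} + \frac{p_{2}(1-p_{2})}{n_{2}}$ , up to simulation noise.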

A few more steps and we are done. Joe and Mumble must be waiting for us.

The null hypothesis is $H_{0}: p_{1}-p_{2}=0$ . Or, $p_{1}=p_{2}$ .

We need the distribution under the null hypothesis — the null distribution.

Under the null hypothesis, let’s assume that $p_{1}=p_{2}$ is $p$ , a common value for the two population proportions.

Then, the expected value of Y, $E[Y]=p_{1}-p_{2}=p-p = 0$ and the variance $V[Y] = \frac{p(1-p)}{n_{1}} + \frac{p(1-p)}{n_{2}}$

$V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})$

This shared value $p$ for the two population proportions can be estimated by pooling the samples together into one sample of size $n_{1}+n_{2}$ where there are $x_{1}+x_{2}$ total successes.

$p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}$

Look at this estimate carefully. Can you see that the pooled estimate $p$ is a weighted average of the two sample proportions ( $\hat{p_{1}}$ and $\hat{p_{2}}$ )?

.
.
.
Okay, tell me what $x_{1}$ and $x_{2}$ are? Aren’t they $n_{1}\hat{p_{1}}$ and $n_{2}\hat{p_{2}}$ for the given two samples?

So $p = \frac{n_{1}\hat{p_{1}}+n_{2}\hat{p_{2}}}{n_{1}+n_{2}}=\frac{n_{1}}{n_{1}+n_{2}}\hat{p_{1}}+ \frac{n_{2}}{n_{1}+n_{2}}\hat{p_{2}}$

or, $p = w_{1}\hat{p_{1}}+ w_{2}\hat{p_{2}}$
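A quick numerical check of this identity, using the small samples from the beginning of the lesson (7 of 10 and 5 of 18). The variable names are chosen for this sketch only.

```python
x1, n1 = 7, 10   # Joe's rural sample
x2, n2 = 5, 18   # Mumble's urban sample

p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled estimate: all successes over all respondents
pooled = (x1 + x2) / (n1 + n2)

# The same quantity written as a weighted average of the sample proportions
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
weighted = w1 * p1_hat + w2 * p2_hat

print(pooled, weighted)
```

Since the weights are the sample-size shares, the pooled estimate always lies between the two sample proportions and sits closer to the one from the larger sample.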

At any rate,

$E[Y]= 0$

$V[Y] = p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})$

$p=\frac{x_{1}+x_{2}}{n_{1}+n_{2}}$

To summarize, when the null hypothesis is

$H_{0}:p_{1}-p_{2}=0$

for large sample sizes, the test-statistic $z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}} \sim N(0,1)$

If the alternate hypothesis $H_{A}$ is $p_{1}-p_{2}>0$ , we reject the null hypothesis when the p-value $P(Z \ge z)$ is less than the rate of rejection $\alpha$ . We can also say that when $z > z_{\alpha}$ , we reject the null hypothesis.

If the alternate hypothesis $H_{A}$ is $p_{1}-p_{2}<0$ , we reject the null hypothesis when the p-value $P(Z \le z)$ is less than the rate of rejection $\alpha$ . Or when $z < -z_{\alpha}$ , we reject the null hypothesis.

If the alternate hypothesis $H_{A}$ is $p_{1}-p_{2} \neq 0$ , we reject the null hypothesis when the p-value $P(Z \le z)$ or $P(Z \ge z)$ is less than the rate of rejection $\frac{\alpha}{2}$ . Or when $z < -z_{\frac{\alpha}{2}}$ or $z > z_{\frac{\alpha}{2}}$ , we reject the null hypothesis.
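The three rejection rules can be sketched as a small function. This is one possible way to write them, assuming Python's standard-library `statistics.NormalDist` for the standard normal quantiles; the function names and the labels "greater", "less" and "two-sided" are choices made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def critical_value(alpha):
    """z_alpha: the point with upper-tail probability alpha."""
    return STD_NORMAL.inv_cdf(1 - alpha)

def reject_null(z, alpha=0.05, alternative="greater"):
    """Apply the rejection rule for the chosen alternate hypothesis."""
    if alternative == "greater":    # H_A: p1 - p2 > 0
        return z > critical_value(alpha)
    if alternative == "less":       # H_A: p1 - p2 < 0
        return z < -critical_value(alpha)
    if alternative == "two-sided":  # H_A: p1 - p2 != 0
        return abs(z) > critical_value(alpha / 2)
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(round(critical_value(0.05), 3))   # 1.645
print(round(critical_value(0.025), 3))  # 1.96

# A hypothetical z of 1.8 is rejected one-sided but not two-sided
print(reject_null(1.8, alternative="greater"))    # True
print(reject_null(1.8, alternative="two-sided"))  # False
```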

Okay, we are done. Let’s see what Joe and Mumble have.

The rural sample $X_{1}$ has $n_{1}=190$ and $x_{1}=70$ .

The urban sample $X_{2}$ has $n_{2}=310$ and $x_{2}=65$ .

Let’s first compute the estimates for the respective proportions — $p_{1}$ and $p_{2}$ .

$\hat{p_{1}}=\frac{x_{1}}{n_{1}}=\frac{70}{190} = 0.3684$

$\hat{p_{2}}=\frac{x_{2}}{n_{2}}=\frac{65}{310} = 0.2097$

Then, let’s compute the pooled estimate $p$ for the population proportions.

$p = \frac{x_{1}+x_{2}}{n_{1}+n_{2}}=\frac{70+65}{190+310}=\frac{135}{500}=0.27$

Next, let’s compute the test-statistic under the large-sample assumption. 190 and 310 are pretty large samples.

$z = \frac{\hat{p_{1}}-\hat{p_{2}}}{\sqrt{p(1-p)*(\frac{1}{n_{1}}+\frac{1}{n_{2}})}}$

$z = \frac{0.3684-0.2097}{\sqrt{0.27(0.73)*(\frac{1}{190}+\frac{1}{310})}}=3.8798$

Since our alternate hypothesis $H_{A}$ is $p_{1}-p_{2}>0$ , we compute the p-value as,
$p-value=P(Z \ge 3.8798) = 5.227119*10^{-5} \approx 0$
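The same calculation can be reproduced in code. This is a sketch with the standard library; the function name `two_proportion_z` is an assumption. Because it carries full precision instead of the four-decimal proportions used above, the resulting $z$ may differ from 3.8798 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 - p2 = 0."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error

# Rural: 70 of 190; urban: 65 of 310
z = two_proportion_z(70, 190, 65, 310)

# Upper-tail p-value for the alternate hypothesis p1 - p2 > 0
p_value = 1 - NormalDist().cdf(z)

print(z)        # close to 3.88
print(p_value)  # on the order of 5e-05
```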

Since the p-value (~0) is smaller than our selected rate of error (0.05), we reject the null hypothesis for the alternate view that the sentiment for increasing the speed limit is stronger among rural ( $p_{1}$ ) than among urban residents ( $p_{2}$ ).

Remember that the test-statistic is computed for the null hypothesis that $p_{1}-p_{2}=0$ . What if the null hypothesis is not that the difference in proportions is zero but is equal to some value? $p_{1}-p_{2}=0.25$


Lesson 93 – The Two-Sample Hypothesis Test – Part II
