Lesson 76 – What is your confidence in polls?

Looking at these two recent polls, one wonders which one is truer and what confidence we can have in polls.

Of course, that is a rhetorical question. The real question is this: if these polls measure the proportion (or probability) of people agreeing with something, what is the confidence interval of the true proportion or probability?

For example, the first poll tells us that 16% of Americans overall said that they would like to permanently move to another country if they could. The second poll tells us that 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults. If so, what would be the confidence interval for the true proportion of all Americans who would answer these questions?

In lessons 71 to 75, we learned how to derive the confidence interval of the true mean, true variance, and true standard deviation, i.e., the population mean and population variance and standard deviation.

There are many applications, such as these example polls, that require estimating proportions or the probability of occurrence of events. We already know that the maximum likelihood estimator of p, the probability of an event, is $\hat{p}=\frac{r}{n}$, where r is the number of successes (times the event happened) in a sample of size n. In other words, the probability can be estimated as the proportion of occurrences in a Bernoulli sequence.

If $\hat{p}$ is the estimate of the proportion, with a few assumptions, we can derive the confidence interval of the true proportion p.

Let’s learn how to do this using a simple polling exercise.

Assume that we want to estimate, through a poll, the proportion of people who want to move out of the U.S. It is not possible to ask everyone whether or not they would move out. However, we can take a sample, i.e., select a subset of the population and ask them for their preference.

In a sample of size n, the preference of the people can be represented by Bernoulli random variables $X_{1}, X_{2}, X_{3}, \ldots, X_{n}$, where $X_{i} = 1$ if a person wants to move out and 0 otherwise. If $S_{n} = X_{1} + X_{2} + X_{3} + \cdots + X_{n}$, the proportion of people who wish to move out can be estimated as $\hat{p} = \frac{S_{n}}{n}$.
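To make the setup concrete, here is a minimal sketch that simulates one such poll. The true proportion of 0.16, the seed, and the function name `estimate_proportion` are assumptions chosen purely for illustration; in a real poll the true p is unknown.

```python
import random

def estimate_proportion(true_p, n, seed=None):
    """Simulate n Bernoulli responses and return (S_n, p_hat)."""
    rng = random.Random(seed)
    # X_i = 1 if the respondent wants to move out, 0 otherwise
    responses = [1 if rng.random() < true_p else 0 for _ in range(n)]
    s_n = sum(responses)   # S_n = X_1 + X_2 + ... + X_n
    return s_n, s_n / n    # p_hat = S_n / n

# Assumed true proportion of 0.16 and a sample of 1000 people
s_n, p_hat = estimate_proportion(true_p=0.16, n=1000, seed=76)
print(f"S_n = {s_n}, p_hat = {p_hat:.3f}")
```

Changing the seed gives a different sample and therefore a different estimate, which is the sense in which $\hat{p}$ is a random variable.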

By now, you must be familiar with the idea that $\hat{p}$ is a random variable, since the estimate changes when the sample changes. What assumption can we make about the distribution function of this random variable?

Since $S_{n}$ is the sum of n independent and identically distributed random variables, the central limit theorem tells us that, for a large enough sample size n, the distribution function of $S_{n}$ can be well-approximated by the normal distribution. Further, since $\hat{p}$ is a linear function of $S_{n}$, the random variable $\hat{p}$ can also be assumed to be normally distributed.

$\hat{p} \sim N(E[\hat{p}], V[\hat{p}])$

We can standardize $\hat{p}$ and relate it to the standard normal distribution Z.

$Z = \frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}} \sim N(0,1)$

Before we proceed to derive the confidence interval, we should first derive the expected value and the variance of $\hat{p}$ .

Expected Value $E[\hat{p}]$

$\hat{p} = \frac{1}{n}\sum_{i=1}^{n}X_{i}$

$E[\hat{p}] = E[\frac{1}{n}\sum_{i=1}^{n}X_{i}] = \frac{1}{n}\sum_{i=1}^{n}E[X_{i}]$

Since $X_{i}$ is a Bernoulli random variable, $E[X_{i}]=1(p) + 0(1-p) = p$

$E[\hat{p}] = \frac{1}{n}\sum_{i=1}^{n}p$

$E[\hat{p}] = \frac{1}{n}np = p$

Variance $V[\hat{p}]$

$V[\hat{p}] = V[\frac{1}{n}\sum_{i=1}^{n}X_{i}]$

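Because the $X_{i}$ are independent, the variance of the sum is the sum of the variances, and the constant $\frac{1}{n}$ comes out squared.

$V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}V[X_{i}]$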

$V[X_{i}] = E[X_{i}^{2}] - (E[X_{i}])^{2}$

$E[X_{i}^{2}] = 1^{2}(p) + 0^{2}(1-p) = p$

$V[X_{i}] = p - p^{2} = p(1-p)$

So,

$V[\hat{p}] = \frac{1}{n^{2}}\sum_{i=1}^{n}p(1-p)$

$V[\hat{p}] = \frac{1}{n^{2}}np(1-p)$

$V[\hat{p}] = \frac{p(1-p)}{n}$
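A quick simulation can sanity-check these two results. The sketch below draws many samples, computes $\hat{p}$ for each, and compares the average and the variance of those estimates with p and $\frac{p(1-p)}{n}$. The true proportion of 0.16, the number of repeated samples, and the helper name `simulate_p_hats` are assumptions made for the illustration.

```python
import random
import statistics

def simulate_p_hats(true_p, n, n_samples, seed=0):
    """Return a list of p_hat values, one per simulated sample of size n."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(n_samples):
        # Count the successes (S_n) in one sample of n Bernoulli trials
        s_n = sum(1 for _ in range(n) if rng.random() < true_p)
        p_hats.append(s_n / n)
    return p_hats

true_p, n = 0.16, 1000   # assumed values for illustration
p_hats = simulate_p_hats(true_p, n, n_samples=2000)

mean_sim = statistics.fmean(p_hats)      # should be close to p
var_sim = statistics.pvariance(p_hats)   # should be close to p(1-p)/n
var_theory = true_p * (1 - true_p) / n

print(f"mean of p_hat: {mean_sim:.4f}  (theory: {true_p})")
print(f"variance of p_hat: {var_sim:.6f}  (theory: {var_theory:.6f})")
```

The simulated values will not match the theory exactly, but they should get closer as the number of repeated samples grows.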

Confidence Interval of p

Now, if we are interested in the 95% confidence interval of the true proportion p, we can use the standardized version of $\hat{p}$ to say that there is a 95% probability that the standard normal variable $\frac{\hat{p}-E[\hat{p}]}{\sqrt{V[\hat{p}]}}$ is between -1.96 and 1.96.

$P(-1.96 \le \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \le 1.96) = 0.95$

We can rearrange this to obtain,

$P(\hat{p} -1.96\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}) = 0.95$

Since the true p is unknown, we use $\hat{p}$ in place of p in the variance term.

This interval $[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$ is called the 95% confidence interval of the population proportion p. The interval itself is random since it is derived from $\hat{p}$ . A different sample will have a different $\hat{p}$ and hence a different interval or range.

There is a 95% probability that this random interval $[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$ contains the true value of p.

Put another way, if we use this method to estimate the confidence interval of p for a large number of samples, we can expect that in about 95% of the samples the true value of p will be within the confidence interval obtained from that sample.
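This "about 95% of the samples" statement can be checked by simulation. The sketch below repeatedly draws samples from a population with an assumed true proportion of 0.16, builds the interval from each sample using $\hat{p}$ in the variance term, and counts how often the interval contains the true p. The function names are illustrative; this interval is commonly called the Wald interval.

```python
import math
import random

def wald_interval(p_hat, n, z=1.96):
    """Interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p, n, n_samples, seed=0):
    """Fraction of simulated samples whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        s_n = sum(1 for _ in range(n) if rng.random() < true_p)
        low, high = wald_interval(s_n / n, n)
        if low <= true_p <= high:
            hits += 1
    return hits / n_samples

# Assumed true proportion of 0.16; the result should be near 0.95
cov = coverage(true_p=0.16, n=1000, n_samples=2000)
print(f"fraction of intervals containing p: {cov:.3f}")
```

For small samples, or for p very close to 0 or 1, the fraction can fall noticeably short of 0.95 because the normal approximation is weaker there.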

Let’s now compute the 95% confidence interval for the proportions we saw in the two polls. 16% of Americans overall said that they would like to permanently move to another country if they could. 59% of adults in the US believe that 2019 will be a year of full or increasing employment. These two proportions are estimated from a selected sample of approximately 1000 U.S. adults.

95% confidence interval for the proportion of people who want to move out of the U.S.

$[\hat{p} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$

$[0.16 - 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}, 0.16 + 1.96\sqrt{\frac{(0.16)(0.84)}{1000}}]$

$[0.137 \le p \le 0.183]$

is the 95% confidence interval of the true proportion.

95% confidence interval for the proportion of people that believe that 2019 will be a year of full or increasing employment.

$[0.59 - 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}, 0.59 + 1.96\sqrt{\frac{(0.59)(0.41)}{1000}}]$

$[0.56 \le p \le 0.62]$

is the 95% confidence interval of the true proportion.
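The arithmetic for both polls can be reproduced with a few lines of Python. This is only a sketch of the formula above; the function name `proportion_ci_95` and the poll labels are made up for the example, and n = 1000 is the approximate sample size quoted for the polls.

```python
import math

def proportion_ci_95(p_hat, n):
    """95% confidence interval for a proportion using 1.96 standard errors."""
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

polls = [("would move to another country", 0.16),
         ("expect full or increasing employment", 0.59)]

for label, p_hat in polls:
    low, high = proportion_ci_95(p_hat, n=1000)
    print(f"{label}: [{low:.3f}, {high:.3f}]")

# would move to another country: [0.137, 0.183]
# expect full or increasing employment: [0.560, 0.620]
```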

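We can generalize this to any confidence level by defining a $100(1-\alpha)\%$ confidence interval for the true proportion p as $[\hat{p} - Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}]$, where $Z_{\frac{\alpha}{2}}$ is the value that a standard normal variable exceeds with probability $\frac{\alpha}{2}$. For the 95% interval, $\alpha = 0.05$ and $Z_{\frac{\alpha}{2}} = 1.96$.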
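As a sketch of the general formula, the function below looks up $Z_{\frac{\alpha}{2}}$ from the standard normal distribution using Python's built-in `statistics.NormalDist` and returns the interval for any confidence level. The function name `proportion_ci` and its `confidence` parameter are illustrative choices, not part of the lesson.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """100(1 - alpha)% confidence interval for a proportion."""
    alpha = 1 - confidence
    # Z_{alpha/2}: the standard normal quantile at 1 - alpha/2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# The first poll (p_hat = 0.16, n = 1000) at three confidence levels
for level in (0.90, 0.95, 0.99):
    low, high = proportion_ci(0.16, 1000, confidence=level)
    print(f"{level:.0%} interval: [{low:.3f}, {high:.3f}]")
```

A higher confidence level uses a larger $Z_{\frac{\alpha}{2}}$, so the interval gets wider.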

Keep in mind that the main assumption behind this is that the estimate $\hat{p}$ can be approximated by a normal distribution for a reasonably large sample size.

How do we know what sample size is sufficient? In the first graphic that showed the polls, I highlighted the margins of error. Can you guess what those are?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
