What we know so far
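The confidence interval for the true mean $\mu$ is $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$.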
If the sample size n is very large, we can substitute the sample standard deviation s in place of the unknown $\sigma$.
However, for small sample sizes, the sample standard deviation s is itself subject to error. In other words, it may be far from the true value of $\sigma$. Hence, we cannot assume that $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ will tend to a normal distribution, which was the basis for deriving the confidence intervals in the first place.
Last week’s mathematical excursion took us back in time, introduced us to “Student” and the roots of his famous contribution, the Student’s t-distribution.
“Student” derived the frequency distribution of $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ to be a t-distribution with (n-1) degrees of freedom. We paddled earnestly through a stream of functions to arrive at this.
The probability of t within any limits is fully known if we know n, the sample size of the experiment. The function is symmetric with an expected value and variance of $E[T] = 0$ and $V[T] = \frac{\nu}{\nu-2}$, where $\nu = n-1$ is the degrees of freedom (the variance is defined for $\nu > 2$).
While the t-distribution resembles the standard normal distribution Z, it has heavier tails, i.e., it has more probability in the tails than the normal distribution. As the sample size increases ($n \to \infty$), the t-distribution approaches Z.
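To make the "heavier tails" statement concrete, here is a minimal Python sketch that compares the density of the t-distribution with the standard normal density at a point in the tail (t = 3). The helper names `t_pdf` and `z_pdf` are hypothetical, written only for this illustration, and the degrees of freedom in the loop are arbitrary choices.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale so large df values do not overflow the gamma function.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def z_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

# Compare the two densities three units away from the center.
for df in (2, 9, 47, 1000):
    print(df, round(t_pdf(3.0, df), 5), round(z_pdf(3.0), 5))
```

The t density at 3 should sit above the normal density for small degrees of freedom and move toward it as the degrees of freedom grow.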
Can we derive the confidence interval from the t-distribution?
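Let’s follow the same logic used to derive the confidence interval from the normal distribution. The only difference now will be that $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ follows a t-distribution with (n-1) degrees of freedom instead of the standard normal distribution.
Suppose we are interested in deriving the 95% confidence interval for the true mean $\mu$. We can use the probability rule
$$P\left(-t_{0.025,(n-1)} \le \frac{\bar{x}-\mu}{s/\sqrt{n}} \le t_{0.025,(n-1)}\right) = 0.95$$
It is equivalent to saying that there is a 95% probability that the variable $\frac{\bar{x}-\mu}{s/\sqrt{n}}$ is between $-t_{0.025,(n-1)}$ and $t_{0.025,(n-1)}$, where $t_{0.025,(n-1)}$ is the value of t with an upper-tail probability of 2.5%.
$t_{0.025,(n-1)}$ is like $z_{0.025}$. While $z_{0.025} = 1.96$, $t_{0.025,(n-1)}$ will depend on the sample size n.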
Notice I am using (n-1), the degrees of freedom in the subscript for t to denote the fact that the value will be different for a different sample size.
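To generalize, we can define a $100(1-\alpha)\%$ confidence interval for the true mean $\mu$ using the probability equation
$$P\left(-t_{\alpha/2,(n-1)} \le \frac{\bar{x}-\mu}{s/\sqrt{n}} \le t_{\alpha/2,(n-1)}\right) = 1-\alpha$$
$\alpha$ is between 0 and 1. For a 95% confidence interval, $\alpha = 0.05$, and for a 99% confidence interval, $\alpha = 0.01$. Just as the Z-critical value is denoted using $z_{\alpha/2}$, the t-critical value can be denoted using $t_{\alpha/2,(n-1)}$.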
We can modify the inequality in the probability equation to arrive at the confidence interval.
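Multiplying throughout by $\frac{s}{\sqrt{n}}$, we get
$$P\left(-t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}} \le \bar{x}-\mu \le t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}}\right) = 1-\alpha$$
Subtracting $\bar{x}$ and multiplying by -1 throughout, we get
$$P\left(\bar{x} - t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}}\right) = 1-\alpha$$
This interval $\left[\bar{x} - t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}}\right]$ is called the $100(1-\alpha)\%$ confidence interval of the population mean.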
As we discussed before in lesson 72, the interval itself is random since it is derived from $\bar{x}$ and s. A different sample will have a different $\bar{x}$ and s, and hence a different interval or range.
Solving Jenny’s problem
Let us develop the 95% confidence intervals of the mean water quality at the Rockaway beach. Jenny was using the data from the ROCKAWAY BEACH 95TH – 116TH. She identified 48 data points (n = 48) from 2005 to 2018 that exceeded the detection limit.
The sample mean is 7.9125 counts per 100 ml. The sample standard deviation s is 6.96 counts per 100 ml.
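The 95% confidence interval of the true mean water quality ($\mu$) is
$$\bar{x} - t_{0.025,(n-1)}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{0.025,(n-1)}\frac{s}{\sqrt{n}}$$
$$7.9125 - t_{0.025,47}\left(\frac{6.96}{\sqrt{48}}\right) \le \mu \le 7.9125 + t_{0.025,47}\left(\frac{6.96}{\sqrt{48}}\right)$$
How can we get the value for $t_{0.025,(n-1)}$, the critical value with an upper-tail probability of 2.5% from the t-distribution with (n-1) degrees of freedom?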
You may already have started integrating the density function of the t-distribution to get its cumulative distribution function.
Save the effort. These are calculated already and are available in a table. It is popular as the t-table. You can find it in any statistics textbook, or simply type “t-table” in any search engine and you will get it. There may be slight differences in how the table is presented. Here is an example.
This table shows the right-sided t-distribution critical value $t_{\alpha/2,(n-1)}$. Since the t-distribution is symmetric, the left tail critical values are $-t_{\alpha/2,(n-1)}$. You must have noticed that the last row is indicating the confidence level $100(1-\alpha)\%$.
Take, for instance, a 95% confidence interval, $\alpha = 0.05$. The upper tail probability p is 0.025. From the table, you look into the sixth column under 0.025 and slide down to df = n - 1. For instance, if we had a sample size of 10, the degrees of freedom are df = 9 and the t-critical ($t_{0.025,9}$) will be 2.262.
Since our sample size is 48, the degrees of freedom df = 47.
In the sixth column under upper tail probability 0.025, we should slide down to df = 47. Since there is no value for df = 47, we should interpolate from the values 2.021 (df = 40) and 2.009 (df = 50).
The t-critical value for 95% confidence interval and df = 47 is 2.011. I got it from R. We will see how in an R lesson later.
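Until the R lesson, here is a minimal Python sketch of both routes: linear interpolation between the two table entries, and a direct numerical calculation of the critical value. The helper names `t_pdf`, `t_cdf`, and `t_critical` are hypothetical and not from any library. The calculation uses Simpson's rule for the integral and a bisection search for the quantile, which is an assumed approach chosen to keep the example dependency-free, not the method R uses.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of t."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_critical(upper_tail_p, df):
    """Value with the given upper-tail probability, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > upper_tail_p:
            lo = mid  # tail still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

# Route 1: linear interpolation between the table values for df = 40 and df = 50.
interpolated = 2.021 + (47 - 40) / (50 - 40) * (2.009 - 2.021)

# Route 2: direct numerical calculation for df = 47.
computed = t_critical(0.025, 47)

print(round(interpolated, 4), round(computed, 4))
```

The two routes should agree to about two decimal places, which is usually enough for a confidence interval on data like ours.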
This table also provides $z_{\alpha/2}$ at the end. See $z_{0.025} = 1.96$ for a 95% confidence interval.
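Let’s compute the 95% confidence interval for the mean water quality.
$$7.9125 - 2.011\left(\frac{6.96}{\sqrt{48}}\right) \le \mu \le 7.9125 + 2.011\left(\frac{6.96}{\sqrt{48}}\right)$$
$$5.89 \le \mu \le 9.93$$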
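The arithmetic for this interval can be sketched in a few lines of Python. The function name `t_interval` is a hypothetical helper for this illustration. The inputs are the values quoted in the lesson: sample mean 7.9125, sample standard deviation 6.96, n = 48, and the t-critical value 2.011.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for xbar -/+ t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Values from Jenny's Rockaway Beach sample, in counts per 100 ml.
lower, upper = t_interval(xbar=7.9125, s=6.96, n=48, t_crit=2.011)
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

The same function works for any sample, as long as the t-critical value passed in matches the sample's degrees of freedom and the desired confidence level.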
Like in the case of the confidence interval derived from the normal distribution, if we have a lot of different samples, and if we compute the 95% confidence intervals for these samples using the sample mean ($\bar{x}$), sample standard deviation (s) and the t-critical from the t-distribution, in the long run, 95% of these intervals will contain the true value of $\mu$.
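This long-run statement can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be normal with a made-up true mean of 8 and standard deviation of 7 (these are not the Rockaway Beach data), each sample has n = 10 points, and the t-critical value 2.262 for df = 9 is the one read from the table. The function name `coverage` is hypothetical.

```python
import math
import random
import statistics

def coverage(n, t_crit, true_mean=8.0, true_sd=7.0, trials=5000, seed=1):
    """Fraction of simulated t-intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample from the assumed normal population.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        xbar = statistics.mean(sample)
        s = statistics.stdev(sample)  # sample standard deviation, n-1 divisor
        margin = t_crit * s / math.sqrt(n)
        if xbar - margin <= true_mean <= xbar + margin:
            hits += 1
    return hits / trials

rate = coverage(n=10, t_crit=2.262)
print(rate)
```

The printed fraction should land close to 0.95, with some run-to-run wobble because the number of trials is finite.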
Here is how eight different 95% confidence intervals look relative to the truth. These eight intervals are constructed based on the samples from eight different locations. In the long run, 5% of the 95% confidence intervals will not contain the true mean.
I am also showing the confidence intervals derived from the normal distribution and known variance assumption. They are in green color.
Can you spot anything?
How do they compare?
What can we learn about the width of the intervals derived from the normal distribution (Z) and the t-distribution?
Is there anything that is related to the sample size?
Think about it until we come back with a lesson in R for confidence intervals.
There are other applications of the t-distribution that we will learn in due course of time.
Remember that you will cross a “t” whenever there is error.
I will end with these notes from Fisher in his paper “Student” written in 1939 in remembrance of W.S. Gosset.
“Five years, however, passed, without the writers in Biometrika, the journal in which he had published, showing any sign of appreciating the significance of his work. This weighty apathy must greatly have chilled his enthusiasm.”
“The fruition of his work was, therefore, greatly postponed by the lack of appreciation of others. It would not be too much to ascribe this to the increasing dissociation of theoretical statistics from the practical problems of scientific research.”
It is now 110 years since he published his famous work using the pseudonym “Student.” Suffice it to say that “Student” and his work will still be appreciated 110 years from now, and people will derive confidence from the “t.”
You can also follow me on Twitter @realDevineni for updates on new lessons.