Monday, February 18, 2019
In which the students come back with samples. Jenny shows them sampling distributions, the real meaning of confidence interval, and a few other exciting things using their data. She sends them off on a second task.
Where is the Mean
Samantha, John, and Christina explore
Samantha: These are the London planetrees I have data for.
The 95% confidence interval of the population mean μ is the interval [x̄ - t_{0.975, n-1}(s/√n), x̄ + t_{0.975, n-1}(s/√n)].
I have a sample of 30 trees. For these 30 trees, the sample mean is 20.6 inches and the sample standard deviation is 9.06 inches. Based on this, the 95% confidence interval of μ is [17.22 inches, 23.98 inches].
John: I collected data for 30 trees too. Here is my data.
And here is the confidence interval I came up with: [17.68 inches, 24.05 inches]. For my data, the sample mean is 20.87 inches and the sample standard deviation is 8.52 inches.
Christina: I have a sample of 30 too. Here is where I took them from.
And here is the 95% confidence interval of the mean: [19.9 inches, 24.9 inches].
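Before moving on, here is a minimal R sketch to check Samantha's arithmetic using only her summary statistics (n = 30, mean 20.6 inches, standard deviation 9.06 inches); John's and Christina's intervals follow the same recipe.
# Checking Samantha's interval from her summary statistics #
xbar = 20.6 # sample mean in inches
s = 9.06 # sample standard deviation in inches
n = 30 # sample size
t_crit = qt(0.975, df = n - 1) # 97.5th percentile of the t-distribution, about 2.045
c(xbar - t_crit*(s/sqrt(n)), xbar + t_crit*(s/sqrt(n)))
# roughly [17.22, 23.98] inches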
Their sample statistics are different. Their confidence intervals are different. They begin to wonder: does my interval contain the truth, μ? Whose interval contains the truth?
Jenny puts all the intervals in context
Jenny shares with the students an empty data file with headers for the variables used below: student_index, tree_dbh, and brnch_ligh.
The students fill out the file with the data they collected last week. Her class has 40 students, so when the file came back to Jenny, it had a total of 40*30 = 1200 row entries.
Jenny uploaded this file here, for the analysis she is going to show the class.
Jenny: I am sure you all had much fun last week visiting different places in the city and collecting data for the analysis. I am hoping that all of you randomly selected the places to visit. Based on what Sam and Christina showed, it looks like the places are fairly spread out — so we would have gotten a random sample from the population of the trees.
Sam, John, and Christina each computed the 95% confidence interval of the true mean based on their sample statistics, and they found their intervals to be different.
Let’s look at all 40 confidence intervals. Now that I have the data from all of you, I will show you how to do it in R. Essentially, I will take each of your samples, compute the sample statistics, and use them to compute the respective confidence intervals. We have 40 different samples, so we will get 40 different confidence intervals; each interval is a statement about the truth. If you remember what we have been discussing about confidence intervals, for a 95% confidence level,
There is a 95% probability of selecting a sample whose confidence interval will contain the true value of μ.
In other words, approximately 95% of the 40 samples (38 of them) may contain the truth. 5% (2 of them) may not contain the truth.
Let’s see who is covering the truth and who is not!
I want you all to work out the code with me. Here we go.
First, we will read the data file into R workspace.
# Reading the tree data file from the students #
students_data = read.csv("students_data.csv",header=T)
Next, use the following lines to compute the confidence interval for each student. I am essentially repeating the computation of the sample mean, the sample standard deviation, and the confidence interval in a loop, 40 times, once for each student.
#40 students went to collect data from random locations in NYC #
#Each student collects 30 samples #
#Each student develops the confidence intervals #
nstudents = 40
nsamples = 30
students = 1:nstudents
alpha = 0.95
t = alpha + (1-alpha)/2 # 0.975, the cumulative probability for the upper critical value of a 95% interval
sample_mean = matrix(NA,nrow=nstudents,ncol=1)
sample_sd = matrix(NA,nrow=nstudents,ncol=1)
ci_t = matrix(NA,nrow=nstudents,ncol=2)
for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
sample_data = students_data$tree_dbh[ind]
sample_mean[i] = mean(sample_data)
sample_sd[i] = sd(sample_data)
cit_lb = sample_mean[i] - qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
cit_ub = sample_mean[i] + qt(t,df=(nsamples-1))*(sample_sd[i]/sqrt(nsamples))
ci_t[i,] = c(cit_lb,cit_ub)
}
Now, let’s plot the 40 intervals to see them better. A picture is worth a thousand numbers. Use these lines. They explain themselves.
#Plot all the CIs #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")
for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],col="green", add=T, at = i)
}
The students are overjoyed looking at the pretty image.
John: This is neat. All our intervals are different, owing to each one of us bringing a different random sample. But it is still not clear how we know whose interval contains the truth and whose does not.
Jenny: We can get a close approximation based on the principle of consistency.
As n approaches infinity, the sample estimate approaches the true parameter. We can take the full data record that you all brought, all 1200 data points, and compute the overall mean.
true_mu = mean(students_data$tree_dbh)
While it is not exactly the true value, based on the idea of consistency, we can assume that it is converging to the true one.
μ ≈ 21.3575 inches.
Look at this.
# Truth Convergence Plot #
true_mean = sample_mean # placeholder vector of the same length; its values are overwritten below
for (i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
true_mean[i] = mean(students_data$tree_dbh[ind])
}
plot(true_mean,type="o",xlab="Increasing Sample",ylab="Convergence of the Sample Mean",font=2,font.lab=2,ylim=c(20,22))
My sixth sense and divine visions also tell me that this is converging to the truth!
Now, we can check which of these intervals contains the true mean μ, 21.3575 inches.
You can use this simple code to check.
false_samples = students # placeholder vector; becomes 1 if the interval misses the truth, 0 otherwise
for (i in 1:nstudents)
{
if( (ci_t[i,1] > true_mu) || (ci_t[i,2] < true_mu) )
{false_samples[i]=1} else
{false_samples[i]=0}
}
false_ind = which(false_samples == 1)
It looks like number 4 and number 26 are not covering the truth. Harry and Miguel.
Let us point them out. Here, look at another pretty plot. The thick black vertical line is the truth, 21.3575 inches. The brown line shows how the sample mean (x̄) converges to the truth (μ) as we collect more samples. Harry and Miguel stand out. That is 2 out of 40, i.e., 5%.
# Plot all the CIs; now show the false samples #
stripchart(ci_t[1,],type="l",col="green",main="",xlab="Diameter (inches)",xlim=c(15,30),ylim=c(1,nstudents))
stripchart(sample_mean[1],add=T,col="green")
for (i in 2:nstudents)
{
stripchart(ci_t[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_mean[i],add=T,col="green",at = i)
}
abline(v=true_mu,lwd=3)
lines(true_mean,students,type="o",col="brown",cex=0.5)
for (i in 1:length(false_ind))
{
j = false_ind[i]
stripchart(ci_t[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_mean[j],col="red", lwd=3, add=T, at = j)
}
text(16,4.6,"Harry")
text(26,26.6,"Miguel")
There is a 95% probability of selecting a sample whose confidence interval will contain the true value of μ.
Miguel: What happens to the confidence interval when we compute it with larger sample sizes? I mean, what is the relationship between the confidence interval and sample size?
Jenny: Look at the equation for the confidence interval of the population mean: [x̄ - t_{0.975, n-1}(s/√n), x̄ + t_{0.975, n-1}(s/√n)].
Theoretically, as n tends to infinity, the width of the interval shrinks to 0 since the sample mean converges to the true mean. So, as n gets larger, the interval gets smaller. For example, with s ≈ 9 inches, the half-width t(s/√n) is about 3.4 inches when n = 30, but only about 0.5 inches when n = 1200. Let me show you how it works. Take our data and, just as we computed the sample mean for increasing sample sizes to see the convergence, compute the confidence interval at each step.
# How does CI change with more sample size
sample_mean_update = matrix(NA,nrow=nstudents,ncol=1)
sample_sd_update = matrix(NA,nrow=nstudents,ncol=1)
ci_t_update = matrix(NA,nrow=nstudents,ncol=2)
for(i in 1:nstudents)
{
ind = which(students_data$student_index <= i)
ns = length(ind)
sample_data = students_data$tree_dbh[ind]
sample_mean_update[i] = mean(sample_data)
sample_sd_update[i] = sd(sample_data)
cit_lb = sample_mean_update[i] - qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))
cit_ub = sample_mean_update[i] + qt(t,df=(ns-1))*(sample_sd_update[i]/sqrt(ns))
ci_t_update[i,] = c(cit_lb,cit_ub)
}
Now, we can plot these confidence intervals against increasing sample size to see what happens. Just remember that each interval is based on a larger sample than the previous one. The first interval has 30 data points, the second has 60, the third 90, the fourth 120, and so on.
# Plot all the CIs #
stripchart(ci_t_update[1,],vertical=T,type="l",col="green",main="",ylab="Diameter (inches)",ylim=c(16,25),xlim=c(1,nstudents),xlab="Increasing Sample Size")
stripchart(sample_mean_update[1],vertical=T,add=T,col="green")
for (i in 2:nstudents)
{
stripchart(ci_t_update[i,],type="l",col="green",main="",add=T,at = i,vertical=T)
stripchart(sample_mean_update[i],col="green", add=T, at = i,vertical=T)
}
lines(students,true_mean,type="o",col="brown",cex=0.5)
See how the interval gets smaller and smaller as the sample size increases. If we do this enough times, the interval essentially vanishes, since we approach the entire population.
T or Z
Christina: If we have a large sample, why can’t we use [x̄ - z_{0.975}(s/√n), x̄ + z_{0.975}(s/√n)] as the confidence interval for the true mean μ? For a large sample size, we can take s, the sample standard deviation of the large sample, as σ.
Jenny: Yes, we can do that. As n increases, the difference between T and Z is minimal. However, it is better to use the t-distribution for confidence intervals of μ, so we don’t fall into traps. I will show you those traps next week.
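To see how small that difference is, you can compare the critical values yourself; this is just a quick check in base R.
# Compare t and z critical values for a 95% interval #
qt(0.975, df = 29) # about 2.045 for a sample of 30
qt(0.975, df = 1199) # about 1.962 for a sample of 1200
qnorm(0.975) # about 1.960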
Where are the Standard Deviation and Proportion
Jenny: Can you follow the same logic and develop the confidence intervals for σ? How many of you will cover the truth, and how many will not?
Again, approximately 95% of the intervals will cover the true σ; roughly 38 of your intervals will contain the truth, and 2 will not. Since (n-1)s²/σ² follows a chi-square distribution with n-1 degrees of freedom, the 95% confidence interval for σ² is [(n-1)s²/χ²_{0.975, n-1}, (n-1)s²/χ²_{0.025, n-1}]; taking square roots gives the interval for σ. Check this out.
alpha = 0.95
u = alpha + (1-alpha)/2 # 0.975
l = 1 - u # 0.025
sample_var = matrix(NA,nrow=nstudents,ncol=1)
ci_sd = matrix(NA,nrow=nstudents,ncol=2)
for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)
sample_data = students_data$tree_dbh[ind]
sample_var[i] = var(sample_data)
chi_low = ((ns-1)*sample_var[i])/qchisq(u,df=(ns-1))
chi_up = ((ns-1)*sample_var[i])/qchisq(l,df=(ns-1))
ci_sd[i,] = c(sqrt(chi_low),sqrt(chi_up))
}
true_sd = sqrt((sum((students_data$tree_dbh-true_mu)^2))/length(students_data$tree_dbh))
# False samples #
false_samples_sd = students
for (i in 1:nstudents)
{
if( (ci_sd[i,1] > true_sd) || (ci_sd[i,2] < true_sd) )
{false_samples_sd[i]=1} else
{false_samples_sd[i]=0}
}
false_sd_ind = which(false_samples_sd == 1)
# Plot all the CIs on the standard deviation #
stripchart(ci_sd[1,],type="l",col="green",main="CI on Standard Deviation",xlab="Diameter (inches)",xlim=c(5,15),ylim=c(1,nstudents))
stripchart(sample_sd[1],add=T,col="green")
for (i in 2:nstudents)
{
stripchart(ci_sd[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_sd[i],add=T,col="green",at = i)
}
abline(v=true_sd,lwd=3)
Jenny: Here, all the intervals happen to cover the truth, but in the long run, only 95% of such intervals will cover the true σ.
John: How about proportion?
Jenny: Yes, can you all develop the confidence intervals for the proportion of damaged trees? I want you to use bootstrap confidence intervals for p instead of the interval based on the assumption of a normal distribution.
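For reference, the normal-approximation interval Jenny is referring to would look something like this; a quick sketch for student 1's sample, assuming students_data is already loaded as before.
# Normal-approximation CI of p for one student's sample, for comparison #
ind = which(students_data$student_index == 1)
ns = length(ind)
p_hat = length(which(students_data$brnch_ligh[ind]=="Yes"))/ns # sample proportion of damaged trees
z = qnorm(0.975) # about 1.96
c(p_hat - z*sqrt(p_hat*(1-p_hat)/ns), p_hat + z*sqrt(p_hat*(1-p_hat)/ns))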
So the students typed away a few lines of code to finally create this.
# Confidence Interval of p #
alpha = 0.95
u = alpha + (1-alpha)/2
l = 1 - u
nboot = 1000
sample_p = matrix(NA,nrow=nstudents,ncol=1)
ci_p_boot = matrix(NA,nrow=nstudents,ncol=2)
for(i in 1:nstudents)
{
ind = which(students_data$student_index == i)
ns = length(ind)
sample_data = students_data$brnch_ligh[ind]
sample_p[i] = length(which(sample_data=="Yes"))/ns
bootstrap_replicate_proportion = matrix(NA,nrow=nboot,ncol=1)
for (j in 1:nboot)
{
ind2 = 1:ns
bootstrap_ind = sample(ind2,ns,replace=T)
bootstrap_sample_damage = sample_data[bootstrap_ind]
bootstrap_replicate_proportion[j,1] = length(which(bootstrap_sample_damage=="Yes"))/ns
}
ci_p_boot[i,] = quantile(bootstrap_replicate_proportion,c(l,u))
}
true_p = length(which(students_data$brnch_ligh=="Yes"))/length(students_data$brnch_ligh)
# False samples #
false_samples_p = students
for (i in 1:nstudents)
{
if( (ci_p_boot[i,1] > true_p) || (ci_p_boot[i,2] < true_p) )
{false_samples_p[i]=1} else
{false_samples_p[i]=0}
}
false_p_ind = which(false_samples_p == 1)
# Plot all the CIs; now show the false samples #
stripchart(ci_p_boot[1,],type="l",col="green",main="CI on Proportion",xlab="Proportion",xlim=c(0,0.5),ylim=c(1,nstudents))
stripchart(sample_p[1],add=T,col="green")
for (i in 2:nstudents)
{
stripchart(ci_p_boot[i,],type="l",col="green",main="",add=T,at = i)
stripchart(sample_p[i],add=T,col="green",at = i)
}
abline(v=true_p,lwd=3)
for (i in 1:length(false_p_ind))
{
j = false_p_ind[i]
stripchart(ci_p_boot[j,],type="l",col="red",lwd=3, main="", add = T, at = j)
stripchart(sample_p[j],col="red", lwd=3, add=T, at = j)
}
You can also check out their full code here.
As the class comes to an end, Samantha looks fulfilled. “Wow, that is quite a day,” she said. “We covered many things today, but more importantly the true meaning of the confidence intervals. There is a 95% probability of selecting a sample whose confidence interval will contain the true value.”
“Assuming our data is collected without any bias,” interjected Jenny. The students looked puzzled.
“This week, I want you to go out and collect the data again, but this time, divide yourselves into boroughs. Eight of you will belong to team Manhattan and will collect tree data from Manhattan only. You will still collect data randomly but will do it only from Manhattan. Eight of you are team Queens and you collect data randomly from Queens. The other three teams, team Bronx, team Brooklyn, and team Staten Island will do the same in their boroughs. We will again have 30 trees each, and a total of 1200 trees.”
So, the students go off on another mission. Some of them are still puzzled, or rather have a “what difference does it make” look on their faces. Some of them are, as usual, excited to hug the trees again. We will all wait and see what they come up with. More fun with data.
If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.