“In theory, there is no difference between theory and practice. In practice, there is.” — ??
By now, you recognize the pattern in this classroom. We follow up theory with practice. A set of lessons with a common theme will culminate with some experience in R. Being true to our trend, today, we leap into the world of extRemes. Put on your hard hat and bring your tools and machinery. Don’t forget your drink of the day.
On the seventeenth day of March, two thousand eighteen, we met Maximus Extremus Distributus. He was taking different forms depending on the origin (parent) distribution. You can create him too using a few simple lines of code in R.
Here is the creation code for normal origins. We generate N = 1000 normally distributed random variables with a zero mean and unit standard deviation, select the maximum value out of these 1000 values, and repeat the process 1000 times to get 1000 maximum values. These maximum values converge to the Type I extreme value distribution – Gumbel (). The code runs like an animation. You can control the speed by changing the number in Sys.sleep(). There is a plot that shows the convergence of the average values to a Normal distribution. Do you know why?
library(locfit) library(logspline) # Normal Origins # par(mfrow=c(3,1)) x = rnorm(10000,0,1) plot(logspline(x),xlim=c(-5,5),ylim=c(0,0.5),font=2,cex.lab=2,font.lab=2,xlab="Normal Distribution") N = 1000 ave = matrix(NA,nrow=N,ncol=1) max1 = matrix(NA,nrow=N,ncol=1) for (i in 1:N) { x = rnorm(N,0,1) lines(locfit(~x),col="grey") points(mean(x),0,col="blue",pch=17) points(max(x),0,col="red",pch=17) ave[i] = mean(x) max1[i] = max(x) Sys.sleep(0.01) } Sys.sleep(1) plot(locfit(~ave),xlim=c(-5,5),ylim=c(0,9),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white") lines(locfit(~ave),lwd=2,col="blue") Sys.sleep(1) plot(locfit(~max1),xlim=c(-5,5),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white") lines(locfit(~max1),lwd=2,col="red")
The creation code for exponential origins has the same procedure. We generate N = 1000 exponentially distributed random variables with as the parent. The maximum values of an exponential distribution again converge to the Gumbel distribution.
## EVD from Exponential## par(mfrow=c(3,1)) x = rexp(10000,0.5) hist(x,prob=T,xlim=c(0,25),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Exponential Distribution") plot(logspline(x),add=T) N = 1000 ave = matrix(NA,nrow=N,ncol=1) max1 = matrix(NA,nrow=N,ncol=1) for (i in 1:N) { x = rexp(N,0.5) lines(locfit(~x),col="grey") points(mean(x),0,col="blue",pch=17) points(max(x),0,col="red",pch=17) ave[i] = mean(x) max1[i] = max(x) Sys.sleep(0) } Sys.sleep(1) plot(locfit(~ave),xlim=c(0,25),ylim=c(0,4),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white") lines(locfit(~ave),lwd=2,col="blue") Sys.sleep(1) plot(locfit(~max1),xlim=c(0,25),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Gumbel)",cex.lab=2,col="white") lines(locfit(~max1),lwd=2,col="red")
Here is the creation code for uniform origins. This time the maximum values from uniform distribution converge to a different type of extreme value distribution, the Type III Weibull distribution ().
# Uniform Origins # x = runif(100000,0,1) #hist(x,prob=T,xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution") plot(density(x),xlim=c(0,1),ylab="",main="",font=2,cex.lab=2,font.lab=2,xlab="Uniform Distribution") N = 200 ave = matrix(NA,nrow=N,ncol=1) max1 = matrix(NA,nrow=N,ncol=1) for (i in 1:N) { x = runif(N,0,1) # lines(locfit(~x),col="grey") points(mean(x),0,col="blue",pch=17) points(max(x),0,col="red",pch=17) ave[i] = mean(x) max1[i] = max(x) Sys.sleep(0) } Sys.sleep(1) plot(locfit(~ave),xlim=c(0,1),ylim=c(0,30),ylab="",cex.lab=2,font=2,font.lab=2,xlab="Normal Distribution",col="white") lines(locfit(~ave),lwd=2,col="blue") Sys.sleep(1) plot(locfit(~max1),xlim=c(0,1),ylim=c(0,65),font=2,font.lab=2,ylab="",xlab="Extreme Value Distribution (Weibull)",cex.lab=2,col="white") lines(density(max1),lwd=2,col="red")
I want you to experiment with other types of origin distributions.
The three Types (Gumbel, Frechet, Weibull) and the GENERALIZED EXTREME VALUE DISTRIBUTION (GEV)
The three types of extreme value distributions have double exponential and single exponential forms. The maxima of independent random variables converge (in the limit when ) to one of the three types, Gumbel (), Frechet () or Weibull () depending on the parent distribution.
We saw last week that these three types could be combined into a single function called the generalized extreme value distribution (GEV).
is the location parameter. is the scale parameter. controls the shape of the distribution (shape parameter).
When , GEV tends to a Gumbel distribution.
When , GEV tends to the Frechet distribution.
When , GEV tends to the Weibull distribution.
Its okay if you don’t know the origin distribution for an extreme dataset. GEV folds all the three types into one form, and the parameters , , and can be estimated from the data. The function has a closed form solution to compute the quantiles and probabilities.
GEV IN R
Let’s play with some data and use GEV in R. We will use two datasets, NYC temperature, and cycles to fatigue of steel. The NYC temperature data is the same file we used in lesson 56. The cycles to fatigue is the data from our labs where we measured the maximum number of cycles before failure due to fatigue for ten steel specimens.
Read the files into the workspace using the following lines.
# fatigue data fatigue = read.table("cycles_fatigue.txt",header=F) # temperature data temperature_data = read.csv("cp_temperature.csv",header=T)
ExtRemes Package
We need to install a package in R called extRemes. Type the following lines in your code.
install.packages("extRemes") library(extRemes)
This package has functions built for GEV distribution. Let’s try a few simple things first by generating random variables of the three types. Type the following lines in your code.
x = revd(10000,loc=0,scale=1,shape=0)
This command (revd) will generate 10000 GEV random variables with a location of 0, scale of 1 and shape of 0. So, this is a Gumbel distribution.
Type these lines and see what you get.
# Generate Gumbel Random numbers x = revd(10000,loc=0,scale=1,shape=0) hist(x,prob=T,xlab="Random Variables from Gumbel (location = 0,scale = 1, shape =0)",main="Gumbel Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5) plot(logspline(x),add=T)
Now hold the shape parameter constant at 0 and alter the location and scale parameters. A change in the location parameter will shift the distribution; a change in the scale parameter will stretch or shrink the distribution.
I guess you know the chores now.
Yes, we experiment with positive and negative shape parameters to generate the Frechet and Weibull distributions.
Generate Frechet Random numbers
# Frechet distribution plot y = revd(10000,loc=0,scale=1,shape=0.2) hist(y,prob=T,ylim=c(0,0.4),xlab="Random Variables from Frechet (location = 0, scale = 1, shape = 0.2)",main="Frechet Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5) plot(logspline(y),add=T,col="red") plot(logspline(x),add=T)
Generate Weibull Random numbers
## Weibull Distribution Plot z = revd(10000,loc=0,scale=1,shape=-0.6) hist(z,prob=T,ylim=c(0,0.5),xlab="Random Variables from Weibull (location = 0, scale = 1, shape = -0.6)",main="Weibull Distribution",ylab="f(x)",font=2,family="serif",font.lab=2,cex.lab=1.5) plot(logspline(z),add=T,col="red")
If you are comfortable with this, it is time to get your hand dirty with real data.
Fitting GEV distribution to data
Let’s examine the maximum cycles to fatigue data. We do not know which extreme value distribution it follows. If we fit a GEV and observe the shape parameter, we can say with certain confidence that the data follows Type I, Type II or Type III distribution.
For this, we can use the fevd command. Let’s use the fevd function to fit a GEV to the cycles to fatigue data.
fatigue_cycles = as.matrix(fatigue) fit = fevd(fatigue_cycles,type="GEV") summary(fit) [1] "Estimation Method used: MLE" Negative Log-Likelihood Value: 132.8045 Estimated parameters: location scale shape 5.765016e+05 1.680629e+05 -5.231015e-01 Standard Error Estimates: location scale shape 6.276363e+04 6.709826e+04 4.199849e-01 Estimated parameter covariance matrix. location scale shape location 3.939273e+09 -527662537.18 -9.654182e+03 scale -5.276625e+08 4502177071.08 -2.306339e+04 shape -9.654182e+03 -23063.39 1.763873e-01 AIC = 271.6089 BIC = 272.5167
Pay attention to the red highlighted text for now. It shows the results for the estimated parameters. The shape parameter is -0.52 (). So, a Weibull distribution fits the data with high likelihood.
Block Maxima
Let’s examine the temperature data. Assume we are interested in analyzing the data for maximum temperature each year. Remember we only care about the extremes. For this, we should first extract the annual maximum temperature values. This idea is called the block maxima. Each year is a block, and we get the maximum for each year.
Type the following lines in your code to get the annual maximum temperature values from 1951 to 2017.
# Annual Maximum Ave Temperature # yr1 = 1951 yr2 = 2017 n = yr2 - yr1 + 1 annmax = matrix(NA,nrow=n,ncol=1) for (i in 1:n) { yr = 1950 + i index = which((temperature_data[,1] == yr)) temperature_yr = temperature_data[index,4] annmax[i,1] = max(temperature_yr,na.rm=T) } hist(annmax,xlab="Annual Maximum Temperature",font=2,main="",font.lab=2)
There is one value very different and far away from all other values. This phenomenon is the feature of the extreme values.
Rare events never seen before can occur. Using a model (e.g., GEV function) for these “unknowns” comes with uncertainty.
Now let’s fit GEV to the temperature data and look at a few things.
fit_temperature = fevd(annmax,type="GEV")
The fevd function will fit a GEV distribution to the data. The location, scale and shape parameters of the function are estimated based on the data.
Now type the following line in your code.
plot(fit_temperature)
You should see the following figure appear in the plot window.
For now, focus on the bottom two images. The one in the bottom left is showing how well the GEV function (blue line) is matching the observed data (black line). Pretty reasonable.
Can you tell me what is the image on the bottom right?
Okay, let me ask you this. What is the probability that the annual maximum temperature will be less than or equal to 92 degrees F?
Yes, you can compute this from the cumulative distribution function of the GEV distribution.
Substitute in the equation and you get .
I am sure you know why I substituted . These are the estimated parameters you get from the summary of the fit.
So, the probability that the annual maximum temperature will be less than or equal to 92 degrees F is 0.9.
Can I say, that the probability of exceeding 92 degrees F is 0.1?
A temperature of 92 degrees F is exceeded 10% of the times.
What do you know about exceedance probability and return period? If a random variable is exceeded with 10% probability, what is the frequency of its occurrence? Lesson 34? The language of return period?
Return Period
So if 92 degrees is the temperature that exceeds 10% of the times, its return period is ten years.
On the average, daily temperature as high as 92 degrees will occur once every ten years in New York City.
In equation form, Return Period of a quantile z is .
Now, can you answer this?
What temperature (z) occurs once in 50 years?
…
I know you are thinking of inverting the return period equation. That is the way.
Now if you substitute years and , you get z = 95.95.
The 50 year return period temperature is 95.95 degrees F.
This is what is being computed internally within the extRemes package to create the plot you see. For quantile z, extRemes package has qevd() function where you have to input probability p and other parameters. For probability , it is pevd(), and you have to input the quantile z and the other parameters.
We can get their relations from the plot.
What are those dashed lines in the return period plot?
The 500 year return period temperature is 101.3 degree F. What does it even mean when they say the return period is 500 years?
How can something happen once in 500 years when we do not have 500 years of data?
Do you want to believe it? How confident are you in this estimate?
If this does not spin your head, let me add more. Look at the summary of the model once again.
summary(fit_temperature) fevd(x = annmax, type = "GEV") [1] "Estimation Method used: MLE" Negative Log-Likelihood Value: 169.1602 Estimated parameters: location scale shape 86.14530247 2.64886498 -0.02706872 Standard Error Estimates: location scale shape 0.3552756 0.2500867 0.0663774 Estimated parameter covariance matrix. location scale shape location 0.126220730 0.028413268 -0.006875094 scale 0.028413268 0.062543342 -0.003859995 shape -0.006875094 -0.003859995 0.004405960 AIC = 344.3204 BIC = 350.9344
The shape parameter is negative but very close to 0. Can we say that the GEV is a Gumbel distribution? Why not Weibull?
Did you ask yourself how these parameters are estimated from the data? How much data is required for this?
Did you ask yourself whether the shape parameter value is significantly different from 0 to say it is not Gumbel?
You can see so many new terms, “Estimation Method used: MLE,” standard error, and so on. What are these? What about the other two images in the fevd plot? Aren’t you curious about them? Well, hold on to that suspense for a few weeks. Very soon, we will start a new journey of inference.
If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.
what is the relevance of estimated parameters co-variance matrix in the EVD results.
This is an awesome introduction to GEV for R – but there are a lot of unanswered questions at the end of the article. I’m looking forward to the further insights. Thank you very much for the effort of explaining.