Commander Joe, the time has come → execute order statistics.
No, we are not going to destruct the Jedi Order. We are going to construct useful graphics to gain better insight into the data. Also known as Exploratory Data Analysis (EDA), this procedure allows us to summarize and present the data in a concise form. It is also an excellent tool for detecting outliers or unusual values. Imagine trying to understand a table full of numbers. I bet you will be as overwhelmed and lost as I am here.
Not to worry. You can solve this puzzle using order statistics and exploratory graphics. Recall lesson four, where we visualized sets using Venn diagrams. Since our brain perceives visual information better, a useful first step is to visualize the data. I looked for a small but powerful dataset to get us started. The FORCE guided me towards the box office revenues of the STAR WARS films. Here they are.
This table presents the data in the order the films were released, but we are trying to understand the revenue they collected. Look at the table and tell me: which film had the lowest revenue?
…
Yes, it is “The Clone Wars.” It collected $39 million in revenue. Look at the table again and tell me: which film collected the highest revenue?
…
Yes, the epic “STAR WARS.”
You have just ordered the data from the smallest to the largest. You can take these ordered data and place them on a number line like this.
What you are seeing is a graphic called the dot plot. Each data point is shown as a dot at its value on the number line. You can see where each point lies relative to the other data points. You can also see the range of the data. For our small data set, the box office revenue ranges from $39 million to $1,331 million.
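If you want to try the ordering and the dot plot yourself, here is a minimal Python sketch. The revenue table is not reproduced here, so the list is a stand-in: only 39, 708 and 1331 are values mentioned in this lesson, and the other six numbers are made-up placeholders, not the actual revenues. The names `revenues_musd` and `dot_plot` are my own choices for illustration.

```python
# Revenues in millions of dollars. Only 39, 708 and 1331 come from the
# lesson; the other six values are PLACEHOLDERS made up for illustration.
revenues_musd = [708, 1331, 39, 450, 800, 400, 750, 500, 720]

# Order the data from the smallest to the largest.
ordered = sorted(revenues_musd)
lowest, highest = ordered[0], ordered[-1]
data_range = highest - lowest


def dot_plot(values, width=60):
    """Return a one-line text dot plot: one 'o' per value on a scaled axis."""
    lo, hi = min(values), max(values)
    axis = ["-"] * (width + 1)
    for v in values:
        # Scale the value to a column between 0 and width.
        pos = round((v - lo) / (hi - lo) * width)
        axis[pos] = "o"
    return "".join(axis)


print(ordered)
print(f"range: {lowest} to {highest} (spread of {data_range})")
print(dot_plot(ordered))
```

The text plot is crude, but it shows the same idea as the graphic: the lowest and highest values sit at the two ends, and you can see how the rest are spread between them.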
Okay, we have arranged our data in order and constructed a simple graphic to visualize it. Now, imagine how useful it would be if we could summarize the data in a few numbers (statistics). For example, can you tell me what the middle value in the data is, i.e., which number divides the data on the dot plot into two equal parts?
…
Yes, the fifth number (The Phantom Menace, with revenue of $708 million). The fifth number divides the nine numbers into two halves: 1, 2, 3, 4 on one side and 6, 7, 8, 9 on the other side. This middle value is called the median, or the 50th percentile → about 50% of the numbers are less than this number.
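Picking the middle value by position can be sketched in a few lines of Python. As before, this is only an illustration: 39, 708 and 1331 are from this lesson, the other six values are placeholders I made up, and `revenues_musd` is a hypothetical name. The position trick works as written because we have an odd number of values; `statistics.median` from the Python standard library handles the even case too.

```python
import statistics

# Only 39, 708 and 1331 come from the lesson; the other six values are
# PLACEHOLDERS made up for illustration.
revenues_musd = [708, 1331, 39, 450, 800, 400, 750, 500, 720]

ordered = sorted(revenues_musd)
n = len(ordered)

# With nine values, the fifth one (0-based index 4) sits in the middle.
middle_index = n // 2
median_by_position = ordered[middle_index]

# The values on either side of the middle one.
below = ordered[:middle_index]
above = ordered[middle_index + 1:]

print(f"middle value: {median_by_position}")
print(f"{len(below)} values below, {len(above)} values above")
print(f"statistics.median agrees: {statistics.median(revenues_musd)}")
```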
I threw in the “percentile” term there. Some of you may remember your SAT scores. What was your percentile score? If you have a 90th percentile score, 90% of the students who took the test have a score below yours. If you have a 75th percentile score, 75% of the students have a score below yours, and so on.
Percentiles, also called order statistics for the data, are a nice way to summarize a large data set and express it in a few numbers. For our data, these are some order statistics. I am showing the 25th, 50th, 75th and 95th percentiles on the dot plot.
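Here is a rough sketch of how a percentile can be computed from the ordered data. There are several conventions for percentiles that fall between two data points, and this lesson does not say which one it uses, so the linear interpolation below is my assumption. The helper `percentile` is a hypothetical name, and the data are stand-ins: only 39, 708 and 1331 are from this lesson, the rest are made-up placeholders.

```python
# Only 39, 708 and 1331 come from the lesson; the other six values are
# PLACEHOLDERS made up for illustration.
revenues_musd = [708, 1331, 39, 450, 800, 400, 750, 500, 720]


def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation on ordered values.

    This is one common convention (an assumption here); other conventions
    can give slightly different answers for small data sets.
    """
    ordered = sorted(values)
    # Fractional 0-based position of the percentile in the ordered data.
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    # Move part of the way from the lower neighbour to the upper one.
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


for p in (25, 50, 75, 95):
    print(f"{p}th percentile: {percentile(revenues_musd, p):.1f}")
```

With nine values, the 25th, 50th and 75th percentiles land exactly on the third, fifth and seventh ordered values, while the 95th falls between the eighth and ninth and has to be interpolated.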
Let us take one more step and construct another useful graphic by joining these order statistics. Let us put a box around the 25th and 75th percentiles. This box shows the region that holds the middle 50 percent of the data → 25th to 75th. Half of our data will be in the box. Let us also draw a line at the 50th percentile to indicate the middle data point.
Now, let us add wings (whiskers) that extend from the box toward the lower and higher values. We can stretch each whisker out from its edge of the box by up to 1.5 times the box length.
If we cannot reach a data point using the whisker extensions from the box, we give up and call the data point an outlier or unusual data point.
This graphic is called the boxplot. Like the dot plot, we get a nice visual of the data range, its percentiles or order statistics, and we can visually detect outliers in the data.
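The numbers behind a boxplot can be worked out with the Python standard library. Treat this as a sketch under assumptions: only 39, 708 and 1331 are values from this lesson, the other six are placeholders I made up, and the "inclusive" quartile method is my choice since the lesson does not name one. The variable names are hypothetical too.

```python
import statistics

# Only 39, 708 and 1331 come from the lesson; the other six values are
# PLACEHOLDERS made up for illustration.
revenues_musd = [708, 1331, 39, 450, 800, 400, 750, 500, 720]
ordered = sorted(revenues_musd)

# The box: 25th, 50th and 75th percentiles (quartile method is an assumption).
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
box_length = q3 - q1

# The whiskers may stretch up to 1.5 box lengths beyond each edge of the box.
lower_limit = q1 - 1.5 * box_length
upper_limit = q3 + 1.5 * box_length

# Each whisker stops at the farthest data point it can still reach.
reachable = [v for v in ordered if lower_limit <= v <= upper_limit]
whisker_low, whisker_high = reachable[0], reachable[-1]

# Points the whiskers cannot reach are flagged as outliers.
outliers = [v for v in ordered if v < lower_limit or v > upper_limit]

print(f"box: {q1} to {q3}, middle line at {q2}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```

With these stand-in values, the largest number lies beyond the reach of the upper whisker and is flagged, while the smallest is still within reach of the lower whisker. The exact limits depend on the placeholder values, so run it on the real table to reproduce the graphic.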
“STAR WARS” is an outlier in the data, one of a kind.
Today is not May the fourth.
It is not revenge of the fifth.
It is the graphic saber of the sixth. Use it to conquer the data.
May the FORCE be with you.
You can also follow me on Twitter @realDevineni for updates on new lessons.