The prices of golden delicious apples sold yesterday at selected U.S. cities’ terminal markets are
The prices of navel oranges sold yesterday at selected U.S. cities’ terminal markets are
I want to compare apples and oranges, needless to say, prices. I will assume the prices of apples to be a random variable and will plot the probability distribution function. I will consider the prices of oranges to be another random variable and will plot its probability distribution function. Since the prices can be defined on a continuous number scale, I am assuming continuous probability distribution functions that look like this
Notice the two random variables are on a different footing. Apples are centered on $44 with a standard deviation (spread) of $9. Oranges are centered on $25 with a standard deviation of $5.
If our interest lies in working simultaneously with data that are on different scales or units, we can standardize them, so they are placed on the same level or footing. In other words, we will move the distributions from their original scales to a new common scale. This transformation will enable an easy way to work with data that are related, but not strictly comparable. In our example, while both the prices data are in the same units, they are clearly on different scales.
We can re-express the random variable (data) as standardized anomalies so they can be compared or analyzed. This process of standardization can be achieved by subtracting the mean of the data and dividing by the standard deviation. For any random variable, if we subtract the mean and divide by the standard deviation, the expected value and the variance of the new standardized variable is 0 and 1. Here’s why.
The common scale is hence a mean of 0 and a standard deviation of 1. We just removed the influence of the location (center) and spread (standard deviation) from the data. Any variable, once standardized, will be on this scale regardless of the type of the random variable and shape of the distribution.
You must have observed that the units will cancel out → the standardized random variable is dimensionless. The standardized data for apples and oranges will look this
The first step where we subtract the mean will give anomalies, i.e. differences from the mean that are now centered on zero. Positive anomalies are the values that are greater than the mean, and negative anomalies are the values that are less than the mean. The second step where we divide by the standard deviation will provide a scaling factor; unit standard deviation for the new data that enables comparison across different distributions.
The standardized scores can also be seen as a distance measures between the original data value and the mean, the units being the standard deviation. For example, the price of apples in New York terminal market is $55, about 1.14 standard deviations from the mean of ~ $44. $44.63 + 1.14*9.
We will revisit this idea of standardization and use it to our advantage when we learn normal probability distributions. Until then, here are some examples of “standardize” in sentences as provided by Merriam — my rants attached.
“The plan is to standardize the test for reading comprehension so that we can see how students across the state compare” – One size does not fit all.
“He standardized procedures for the industry” – Interns getting coffee is not one of them. So make your own.
If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.