Boxplot: The Five Number Summary
Before we move on to the third measure of spread (standard deviation), we’ll summarize what we’ve learned so far about measuring spread and use it to introduce another graphical display of the distribution of a quantitative variable, the boxplot.
If you haven’t followed the earlier posts, no problem: click "statistics" in the tag cloud to read them.
The Five Number Summary
So far, in our discussion about measures of spread, the key players were:
- the extremes (min and Max), which provide the range covered by all the data; and
- the quartiles (Q1, M and Q3), which together provide the IQR, the range covered by the middle 50% of the data.
The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number summary, and provides a quick numerical description of both the center and spread of a distribution.
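As a minimal sketch, the five-number summary can be computed in a few lines of Python. The function name `five_number_summary` and the sample ages are made up for illustration. The quartiles are assumed to follow the "median of each half" convention (the median itself is left out of both halves when the number of observations is odd); other software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, Max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations below the median position
    upper = data[half + n % 2:]    # observations above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical ages, for illustration only
ages = [21, 25, 30, 31, 34, 35, 38, 42, 50]
print(five_number_summary(ages))   # (21, 27.5, 34, 40.0, 50)
```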
A short introduction to the statistics behind the five-number summary is available on GitHub; have a look if you are not yet familiar with it.
The boxplot graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion.
There are several ways to plot the whiskers on a boxplot. One convention is to draw the whiskers down to the minimum and up to the maximum value. We use the 1.5(IQR) criterion, also known as the Tukey method. First, calculate the IQR, the difference between the 75th and 25th percentiles (Q3 – Q1), and multiply it by 1.5. For the upper whisker, add this value to the 75th percentile. If the result is greater than or equal to the maximum value in the dataset, draw the upper whisker to the maximum value. Otherwise, stop the whisker at the largest value that does not exceed 75th percentile + 1.5 * IQR, and plot any values greater than this as individual points (suspected outliers). Similarly, for the lower whisker, subtract 1.5 * IQR from the 25th percentile. If the result is less than or equal to the minimum value in the dataset, draw the lower whisker to the minimum value. Otherwise, stop the whisker at the smallest value that is not below 25th percentile – 1.5 * IQR, and plot any values smaller than this as individual points (suspected outliers).
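The whisker rule described above can be sketched in Python as follows. The function name `tukey_whiskers` and the sample data are hypothetical, and the quartiles are again assumed to be medians of the lower and upper halves of the sorted data, so plotting libraries that use other quartile rules may place the fences slightly differently.

```python
from statistics import median

def tukey_whiskers(values):
    """Whisker ends and suspected outliers under the 1.5(IQR) criterion."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])             # median of the lower half
    q3 = median(data[half + n % 2:])     # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations still inside the fences
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value
sample = [3, 5, 7, 9, 9, 11, 13, 40]
print(tukey_whiskers(sample))
# {'lower_whisker': 3, 'upper_whisker': 13, 'outliers': [40]}
```

Here Q1 = 6 and Q3 = 12, so the fences sit at -3 and 21; the value 40 lies beyond the upper fence and is drawn as a separate point.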
Side-By-Side (Comparative) Boxplots
As we learned in the beginning of this module, the distribution of a quantitative variable is best represented graphically by a histogram. Boxplots are most useful when presented side-by-side for comparing and contrasting distributions from two or more groups.
Recall also that we found the five-number summary and means for both distributions. Here are the results for the Best Actor and Best Actress datasets:
- Actors: min = 31, Q1 = 38, M = 43.5, Q3 = 50.5, Max = 76
- Actresses: min = 21, Q1 = 30.5, M = 34.5, Q3 = 42, Max = 80
Based on the graph and numerical measures, we can make the following comparison between the two distributions:
Center: The graph reveals that the age distribution of the males is higher than the females’ age distribution. This is supported by the numerical measures. The median age for females (34.5) is lower than for the males (43.5). Actually, it should be noted that even the third quartile of the females’ distribution (42) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than actors do.
Spread: Judging by the range of the data, there is much more variability in the females’ distribution (range = 80 – 21 = 59) than there is in the males’ distribution (range = 76 – 31 = 45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see slightly more spread in the ages of males (IQR = 12.5) than females (IQR = 11.5). We conclude that among all the winners, the actors’ ages are more alike than the actresses’ ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors’ age distribution.
Outliers: We see that we have outliers in both distributions. There is only one high outlier in the actors’ distribution (76, Henry Fonda, On Golden Pond), compared with five high outliers in the actresses’ distribution.
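To see where the outlier cut-offs come from, here is a small sketch that applies the 1.5(IQR) criterion to the two five-number summaries listed above. The dictionary layout and the function name `iqr_and_fences` are illustrative choices only; the numbers are the ones given in this post.

```python
# Five-number summaries as listed in the text
summaries = {
    "Actors": {"min": 31, "Q1": 38, "M": 43.5, "Q3": 50.5, "Max": 76},
    "Actresses": {"min": 21, "Q1": 30.5, "M": 34.5, "Q3": 42, "Max": 80},
}

def iqr_and_fences(s):
    """Return (IQR, lower fence, upper fence) for one five-number summary."""
    iqr = s["Q3"] - s["Q1"]
    return iqr, s["Q1"] - 1.5 * iqr, s["Q3"] + 1.5 * iqr

for group, s in summaries.items():
    iqr, low, high = iqr_and_fences(s)
    print(f"{group}: IQR = {iqr}, fences = ({low}, {high}), "
          f"min below fence: {s['min'] < low}, Max above fence: {s['Max'] > high}")
# Actors: IQR = 12.5, fences = (19.25, 69.25), min below fence: False, Max above fence: True
# Actresses: IQR = 11.5, fences = (13.25, 59.25), min below fence: False, Max above fence: True
```

In both groups the maximum lies above the upper fence while the minimum stays inside the lower fence, which matches the high outliers seen in the boxplots.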
A simple example of creating a boxplot in R is available on GitHub; do check it out.
Standard Deviation Introduction
So far, we have introduced two measures of spread; the range (covered by all the data) and the inter-quartile range (IQR), which looks at the range covered by the middle 50% of the distribution. We also noted that the IQR should be paired as a measure of spread with the median as a measure of center. We now move on to another measure of spread, the standard deviation, which quantifies the spread of a distribution in a completely different way.
The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean, x̄. The standard deviation gives the average (or typical) distance between a data point and the mean, x̄.
There are many notations for the standard deviation: SD, s, Sd, StDev. Here, we’ll use SD as an abbreviation for standard deviation, and use s as the symbol.
In order to get a better understanding of the standard deviation, it would be useful to see an example of how it is calculated. In practice, we will use a computer to do the calculation.
Example: Video Store Customers
The following are the numbers of customers who entered a video store in 8 consecutive hours: 7, 9, 5, 13, 3, 11, 15, 9
To find the standard deviation of the number of hourly customers:
- Find the mean, x̄, of your data: x̄ = (7 + 9 + 5 + ... + 9) / 8 = 72 / 8 = 9
- Find the deviations from the mean, that is, the difference between each observation and the mean: (7 – 9), (9 – 9), (5 – 9), (13 – 9), (3 – 9), (11 – 9), (15 – 9), (9 – 9), which gives
-2, 0, -4, 4, -6, 2, 6, 0
Since the standard deviation is the average (typical) distance between the data points and their mean, it would make sense to average the deviations we got. Note, however, that the sum of the deviations from the mean, x̄, is 0 (add them up and see for yourself). This is always the case, and is the reason why we have to do a more complicated calculation to determine the standard deviation:
- Square each of the deviations: the first few are (-2)² = 4, (0)² = 0, (-4)² = 16, and the rest are 16, 36, 4, 36, 0.
- Average the squared deviations by adding them up and dividing by n – 1 (one less than the sample size): (4 + 0 + 16 + 16 + 36 + 4 + 36 + 0) / (8 – 1) = 112 / 7 = 16
- The reason why we “sort of” average the squared deviations (divide by n – 1) rather than take the actual average (divide by n) is beyond the scope of the course at this point, but will be addressed later.
- This average of the squared deviations is called the variance of the data.
- The SD of the data is the square root of the variance: SD = sqrt(16) = 4
- Why do we take the square root? Note that 16 is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in “squared customers,” which obviously cannot be interpreted. We therefore take the square root in order to compensate for the fact that we squared our deviations, and in order to go back to the original unit of measurement.
Recall that the average number of customers who enter the store in an hour is 9. The interpretation of SD = 4 is that, on average, the actual number of customers that enter the store each hour is 4 away from 9.
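The calculation above can be retraced step by step in Python. This is only a sketch of the hand calculation: the function name `sample_sd_steps` is made up for illustration, and in practice `statistics.stdev` from the standard library returns the same sample standard deviation directly.

```python
from math import sqrt

# Hourly customer counts from the example
customers = [7, 9, 5, 13, 3, 11, 15, 9]

def sample_sd_steps(data):
    """Return the mean, deviations, variance and SD, dividing by n - 1."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]       # distance of each point from the mean
    squared = [d ** 2 for d in deviations]      # squaring removes the signs
    variance = sum(squared) / (n - 1)           # "sort of" average of squared deviations
    return mean, deviations, variance, sqrt(variance)

mean, deviations, variance, sd = sample_sd_steps(customers)
print(mean, sum(deviations), variance, sd)   # 9.0 0.0 16.0 4.0
```

The printed values show the mean of 9, the deviations summing to 0, the variance of 16, and the SD of 4.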
Properties of the Standard Deviation
- It should be clear from the discussion thus far that the SD should be paired as a measure of spread with the mean as a measure of center.
- Note that the only way, mathematically, in which the SD = 0, is when all the observations have the same value (Ex: 5, 5, 5, … , 5), in which case, the deviations from the mean (which is also 5) are all 0. This is intuitive, since if all the data points have the same value, we have no variability (spread) in the data, and expect the measure of spread (like the SD) to be 0. Indeed, in this case, not only is the SD equal to 0, but the range and the IQR are also equal to 0. Do you understand why?
- Like the mean, the SD is strongly influenced by outliers in the data. Consider the example concerning video store customers: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation were wrongly recorded as 150, then the average would jump up to x̄ = 25.9, and the standard deviation would jump up to SD = 50.3. Note that in this simple example, it is easy to see that while the standard deviation is strongly influenced by outliers, the IQR is not. The IQR would be the same in both cases, since, like the median, the calculation of the quartiles depends only on the order of the data rather than the actual values.
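The effect of the wrongly recorded value can be checked with a short Python sketch. The helper `iqr` is a hypothetical function that assumes quartiles are the medians of the lower and upper halves of the ordered data; `mean`, `median` and `stdev` come from Python's standard `statistics` module.

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR with Q1 and Q3 taken as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[half + n % 2:]) - median(data[:half])

correct = [3, 5, 7, 9, 9, 11, 13, 15]
misrecorded = [3, 5, 7, 9, 9, 11, 13, 150]   # 15 wrongly entered as 150

for label, data in (("correct", correct), ("misrecorded", misrecorded)):
    print(label, round(mean(data), 1), round(stdev(data), 1), iqr(data))
# correct 9 4.0 6.0
# misrecorded 25.9 50.3 6.0
```

The mean and SD change drastically, while the IQR stays at 6 in both cases.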
Choosing Numerical Summaries
Use x̄ (the mean) and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers.
Use the five-number summary (which gives the median, IQR and range) for all other cases.
An example is included on GitHub.
The Standard Deviation Rule
In the previous activity we tried to help you develop better intuition about the concept of standard deviation. The rule that we are about to present, called “The Standard Deviation Rule” (also known as “The Empirical Rule”) will hopefully also contribute to building your intuition about this concept.
Consider a symmetric mound-shaped distribution:
For distributions having this shape (also known as the normal shape), the following rule applies:
The Standard Deviation Rule:
- Approximately 68% of the observations fall within 1 standard deviation of the mean.
- Approximately 95% of the observations fall within 2 standard deviations of the mean.
- Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
The following picture illustrates this rule:
This rule provides another way to interpret the standard deviation of a distribution, and thus also provides a bit more intuition about it.
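The percentages in the rule can be checked against an exactly normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library (available from Python 3.8); the helper name `share_within` is made up for illustration. Real data that are only roughly mound-shaped will match these figures only approximately.

```python
from statistics import NormalDist

def share_within(k, dist=NormalDist()):
    """Proportion of a normal distribution within k SDs of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on the particular mean and SD: for example, `share_within(2, NormalDist(mu=9, sigma=4))` gives the same 95% figure.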
- The standard deviation measures the spread by reporting a typical (average) distance between the data points and their average.
- It is appropriate to use the SD as a measure of spread with the mean as the measure of center.
- Since the mean and standard deviations are highly influenced by extreme observations, they should be used as numerical descriptions of the center and spread only for distributions that are roughly symmetric, and have no outliers.
- For symmetric mound-shaped distributions, the Standard Deviation Rule tells us what percentage of the observations falls within 1, 2, and 3 standard deviations of the mean, and thus provides another way to interpret the standard deviation’s value for distributions of this type.
What can you do after this post? The biggest advantage is that when you plot a histogram of any variable in your project, its shape gives you a clear idea of whether to summarize it with the mean and SD or with the five-number summary.
GitHub link: click here.