The four different scales of measurement, from least to most precise, are nominal, ordinal, interval, and ratio.
The nominal scale of measurement is a qualitative measure that uses discrete categories to describe a characteristic of the research participants. For each participant, the researcher determines the presence, absence, and type of the attribute. Nominal scales of measurement may have two categories, such as citizen status (citizen/non-citizen), or they can have more than two categories, like religious affiliation (e.g., Agnostic, Buddhist, Jewish, Muslim) or marital status (e.g., divorced, married, single). Often, as described here, the categories have names; however, researchers code them with numbers for use in statistical analyses. These categories are not ordered or ranked in any way.
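As a minimal sketch of coding nominal categories with numbers, the following uses hypothetical responses to the marital-status example. The variable names and the choice of alphabetical codes are assumptions for illustration; any assignment of numbers would serve equally well, because the codes carry no order.

```python
from collections import Counter

# Hypothetical responses; the category names follow the marital-status example.
responses = ["married", "single", "divorced", "married", "single", "married"]

# Assign an arbitrary integer code to each category. Alphabetical order is used
# here only to make the result reproducible; the codes do not rank the categories.
codes = {category: code for code, category in enumerate(sorted(set(responses)))}
coded = [codes[r] for r in responses]

# Counting how many participants fall in each category is meaningful for
# nominal data; averaging the numeric codes would not be.
counts = Counter(responses)

print(codes)
print(coded)
print(counts.most_common())
```

Because the numbers are only labels, swapping the codes for two categories changes nothing about the data except how they are stored.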
An ordinal scale of measurement rank-orders participants on some scale or attribute, but the difference between numbers does not convey fixed or equal differences. Thus, with ordinal data, we know that a one-unit increase on an ordinal scale represents “more,” but we don’t know how much more. For example, a group of participants can be rank-ordered from least to most politically active. We know that a person who is ranked as 5 is more politically active than a person who is ranked as 4, but not how much more politically active. The value of the variable is used to order participants according to the strength/presence of the attribute and not to calculate differences between participants.
The interval scale of measurement takes numerical form, and the distance between pairs of consecutive numbers is assumed to be equal. However, interval variables do not have a meaningful zero point; thus, a zero does not mean the absence of the attribute, but rather it is a particular (but arbitrary) point on the scale. A good example of an interval measure is temperature in the Fahrenheit scale: a temperature of zero degrees Fahrenheit is still a temperature, not the absence of temperature. In education, measures like achievement, motivation, and self-concept are considered interval measures; a zero on a measure of such variables does not mean the absence of the characteristic in the participant.
The ratio scale of measurement is similar to the interval scale. As with the interval scale, a number is assigned to a subject that represents the amount of the attribute that the subject has, and the difference between consecutive numbers is assumed to be equal. The main difference between interval and ratio measurements has to do with how we interpret a value of zero. For ratio measures, the zero is meaningful and tells us that the attribute is not present in the participant. Examples of ratio measures include a participant’s number of children, number of AP courses taken, or cumulative college credits: for each of these variables, a score of zero represents that the participant has none of the attribute.
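One consequence of the arbitrary zero on an interval scale can be sketched with the temperature example. The temperatures below are hypothetical, and the conversion function is the standard Fahrenheit-to-Celsius formula. The sketch suggests that ratios of interval values depend on the unit chosen, while comparisons of differences do not.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion between the two interval temperature scales."""
    return (f - 32) * 5 / 9

# Three hypothetical temperatures in degrees Fahrenheit.
a_f, b_f, c_f = 80.0, 40.0, 60.0
a_c, b_c, c_c = (fahrenheit_to_celsius(t) for t in (a_f, b_f, c_f))

# Ratios of values change with the unit, so "twice as hot" has no fixed meaning.
ratio_f = a_f / b_f
ratio_c = a_c / b_c

# Ratios of differences are the same in both units, which is why
# differences are meaningful on an interval scale.
diff_ratio_f = (a_f - b_f) / (c_f - b_f)
diff_ratio_c = (a_c - b_c) / (c_c - b_c)

print(ratio_f, round(ratio_c, 6))
print(diff_ratio_f, round(diff_ratio_c, 6))
```

For a ratio variable such as number of children, zero means none of the attribute, so a statement like "twice as many" keeps its meaning.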
To display data from one quantitative variable graphically, we can use either the histogram or the stemplot. (Another graph, the boxplot, will be covered in another section.)
Interpreting the Histogram
Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution: shape, center, spread, and outliers.
We will get a sense of the overall pattern of the data from the histogram’s center, spread and shape, while outliers will highlight deviations from that pattern.
When describing the shape of a distribution, we should consider:
- Symmetry/skewness of the distribution.
- Peakedness (modality)—the number of peaks (modes) the distribution has.
We distinguish between symmetric and skewed distributions.
Consider three distributions that are all symmetric but differ in their modality (peakedness). The first distribution is unimodal: it has one mode (roughly at 10) around which the observations are concentrated. The second distribution is bimodal: it has two modes (roughly at 10 and 20) around which the observations are concentrated. The third distribution is roughly flat, or uniform. It has no modes, that is, no value around which the observations are concentrated. Rather, the observations are roughly uniformly distributed among the different values.
Skewed Right Distributions
A distribution is called skewed right if, as in the histogram above, the right tail (larger values) is much longer than the left tail (smaller values). Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest. An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes, etc.) that are distributed along a large range (long “tail”) of higher values.
Skewed Left Distributions
A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values). Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest. An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.
- Note that skewed distributions can also be bimodal. Here is an example. A medium-size neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data. Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55. The majority of the customers spend around $25, but there is a cluster of customers who enter the store and spend around $50-55.
- If a distribution has more than two modes, we say that the distribution is multimodal.
The spread (also called variability) of the distribution can be described by the approximate range covered by the data. From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range. (More exact ways of finding measures of spread will be discussed in the next section.)
In our example:
- approximate min: 45 (the middle of the lowest interval of scores)
- approximate max: 95 (the middle of the highest interval of scores)
- approximate range: 95-45=50
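The approximation above can be sketched in code. The scores below are hypothetical (the original data behind the histogram are not given), and the interval width of 10 is an assumption chosen to match midpoints of 45 and 95. The sketch counts observations per interval, as a histogram does, and then reads the approximate min, max, and range off the interval midpoints.

```python
# Hypothetical exam scores; the real data behind the example are not available.
scores = [42, 48, 55, 57, 61, 63, 66, 68, 72, 74, 75, 77, 79, 83, 85, 88, 91, 97]
width = 10  # assumed interval width: 40-49, 50-59, ..., 90-99

# Count the observations that fall in each interval, keyed by the lower bound.
counts = {}
for s in scores:
    low = (s // width) * width
    counts[low] = counts.get(low, 0) + 1

# Print a simple text histogram: one star per observation.
for low in sorted(counts):
    print(f"{low}-{low + width - 1}: {'*' * counts[low]}")

# Approximate min and max by the midpoints of the lowest and highest intervals.
lows = sorted(counts)
approx_min = lows[0] + width / 2
approx_max = lows[-1] + width / 2
approx_range = approx_max - approx_min
print(approx_min, approx_max, approx_range)
```

With these scores the lowest interval is 40-49 and the highest is 90-99, so the approximate range matches the 95 - 45 = 50 computed in the text, even though the exact range of the hypothetical data (97 - 42 = 55) is somewhat different.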
Outliers are observations that fall outside the overall pattern. For example, the following histogram represents a distribution that has a probable high outlier:
Go back and check the histogram of scores at the top of this page. As you can see, there are no outliers.
- The histogram is a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values.
- When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).
- When describing the shape of a distribution, one should consider:
  - Symmetry/skewness of the distribution.
  - Peakedness (modality): the number of peaks (modes) the distribution has.
- Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
- It is always important to interpret what the features of the distribution (as they appear in the histogram) mean in the context of the data.
Numerical Measures Introduction
The overall pattern of the distribution of a quantitative variable is described by its shape, center, and spread. By inspecting the histogram, we can describe the shape of the distribution, but as we saw, we can only get a rough estimate for the center and spread. A description of the distribution of a quantitative variable must include, in addition to the graphical display, a more precise numerical description of the center and spread of the distribution. In this section we will learn:
- how to quantify the center and spread of a distribution with various numerical measures;
- some of the properties of those numerical measures; and
- how to choose the appropriate numerical measures of center and spread to supplement the histogram.
Intuitively speaking, the numerical measure of center tells us what a “typical value” of the distribution is.
The three main numerical measures for the center of a distribution are the mode, the mean and the median. Each one of these measures is based on a completely different idea of describing the center of a distribution. We will first present each one of the measures, and then compare their properties.
So far, when we looked at the shape of the distribution, we identified the mode as the value where the distribution has a “peak” and saw examples where distributions have one mode (unimodal distributions) or two modes (bimodal distributions). In other words, so far we identified the mode visually from the histogram.
Technically, the mode is the most commonly occurring value in a distribution. For simple datasets where the frequency of each value is available or easily determined, the value that occurs with the highest frequency is the mode.
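For a simple dataset, finding the mode by frequency can be sketched as follows. The function name `modes` and the sample data are assumptions for illustration; the function returns every value tied for the highest frequency, so a dataset with two equally common values yields two modes.

```python
from collections import Counter

def modes(values):
    """Return the value(s) that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical data: 5 occurs three times, more often than any other value.
print(modes([2, 3, 3, 5, 5, 5, 7, 8]))

# When two values tie for the highest frequency, both are reported.
print(modes([1, 1, 2, 2, 3]))
```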
The mean is the average of a set of observations (i.e., the sum of the observations divided by the number of observations). If the n observations are $x_1, x_2, \ldots, x_n$, their mean, which we denote by $\bar{x}$ (read “x-bar”), is therefore: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
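As a quick illustration of the definition, the sketch below computes the mean of a small hypothetical dataset by summing the observations and dividing by their number, then cross-checks the result against `statistics.mean` from Python's standard library.

```python
import statistics

def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

data = [3, 5, 7, 9]  # hypothetical observations
x_bar = mean(data)   # (3 + 5 + 7 + 9) / 4

print(x_bar)
# The standard-library function should agree with the hand-written version.
print(x_bar == statistics.mean(data))
```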
The median M is the midpoint of the distribution. It is the number such that half of the observations fall above, and half fall below. To find the median:
- Order the data from smallest to largest.
- Consider whether n, the number of observations, is even or odd.
- If n is odd, the median M is the center observation in the ordered list. This observation is the one “sitting” in the (n + 1) / 2 spot in the ordered list.
- If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones “sitting” in the n / 2 and n / 2 + 1 spots in the ordered list.
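The steps above can be sketched as a short function. The name `median` and the sample data are assumptions for illustration; note that the positions in the text are counted from 1, while Python lists are indexed from 0, hence the `- 1` adjustments.

```python
def median(values):
    """Find the median by ordering the data and locating the center."""
    ordered = sorted(values)          # order from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # n odd: the observation in spot (n + 1) / 2 of the ordered list
        return ordered[(n + 1) // 2 - 1]
    # n even: the mean of the observations in spots n / 2 and n / 2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([7, 1, 5, 3, 9]))  # ordered: 1 3 5 7 9, center is the 3rd value
print(median([7, 1, 5, 3]))     # ordered: 1 3 5 7, mean of the 2nd and 3rd values
```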
Inter-Quartile Range (IQR)
While the range quantifies the variability by looking at the range covered by ALL the data, the IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.
The following picture illustrates this idea. (Think of the horizontal line as the data ranging from the min to the max.)
Here is how the IQR is actually found:
- Arrange the data in increasing order, and find the median M. Recall that the median divides the data, so that 50% of the data points are below the median, and 50% of the data points are above the median.
- Find the median of the lower 50% of the data. This is called the first quartile of the distribution, and the point is denoted by Q1. Note from the picture that Q1 divides the lower 50% of the data into two halves, containing 25% of the data points in each half. Q1 is called the first quartile, since one quarter of the data points fall below it.
- Repeat this again for the top 50% of the data. Find the median of the top 50% of the data. This point is called the third quartile of the distribution, and is denoted by Q3. Note from the picture that Q3 divides the top 50% of the data into two halves, with 25% of the data points in each. Q3 is called the third quartile, since three quarters of the data points fall below it.
- The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 – Q1
- The last picture shows that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR = Q3 – Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
- We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the left (right) of the location of the overall median M. The following picture will visually illustrate this for the simple cases of n = 7 and n = 8.
Note that when n is odd (as in n = 7 above), the median is not included in either the bottom or top half of the data; when n is even (as in n = 8 above), the data are naturally divided into two halves.
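The procedure for Q1, Q3, and the IQR, including the rule that the median is left out of both halves when n is odd, can be sketched as below. The function names and the datasets (the values 1 through 7 and 1 through 8) are assumptions for illustration. Statistical software often uses other quartile conventions, so its results may differ slightly from this method.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the halves to the left and right of the median."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the median itself; when n is even, split evenly.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

for data in ([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7, 8]):
    q1, m, q3 = quartiles(data)
    print(f"n = {len(data)}: Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {q3 - q1}")
```

For n = 7 the bottom half is 1, 2, 3 and the top half is 5, 6, 7, with the median 4 belonging to neither; for n = 8 the halves are 1 through 4 and 5 through 8.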
Using the IQR to Detect Outliers
So far we have quantified the idea of center, and we are in the middle of the discussion about measuring spread, but we haven’t really talked about a method or rule that will help us classify extreme observations as outliers. The IQR is used as the basis for a rule of thumb for identifying outliers.
The 1.5(IQR) Criterion for Outliers
An observation is considered a suspected outlier if it is:
- below Q1 – 1.5(IQR) or
- above Q3 + 1.5(IQR)
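A minimal sketch of the 1.5(IQR) criterion follows, assuming the quartile method described earlier (median of each half, with the overall median excluded when n is odd). The function names and the dataset are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def suspected_outliers(values):
    """Return (lower cutoff, upper cutoff, flagged observations)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    # Flag observations strictly below the lower cutoff or above the upper one.
    flagged = [x for x in ordered if x < low_cutoff or x > high_cutoff]
    return low_cutoff, high_cutoff, flagged

# Hypothetical data with one value far above the rest.
data = [10, 12, 13, 14, 15, 16, 18, 40]
low_cutoff, high_cutoff, flagged = suspected_outliers(data)
print(low_cutoff, high_cutoff, flagged)
```

Here Q1 = 12.5 and Q3 = 17, so IQR = 4.5 and the cutoffs are 12.5 - 6.75 = 5.75 and 17 + 6.75 = 23.75; only the observation 40 falls outside them. The rule marks it as a suspected outlier, which still needs to be examined in the context of the data.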
The following picture illustrates this rule: