Scales Of Measurement:

The four scales of measurement, from least to most precise, are:

  • Nominal
  • Ordinal
  • Interval
  • Ratio

Nominal

The nominal scale of measurement is a qualitative measure that uses discrete categories to describe a characteristic of the research participants. For each participant, the researcher determines the presence, absence, and type of the attribute. Nominal scales of measurement may have two categories, such as citizen status (citizen/non-citizen), or they can have more than two categories, like religious affiliation (e.g., Agnostic, Buddhist, Jewish, Muslim) or marital status (e.g., divorced, married, single). Often, as described here, the categories have names; however, researchers code them with numbers for use in statistical analyses. These categories are not ordered or ranked in any way.
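
As a small illustration of coding nominal categories with numbers, the Python sketch below assigns an integer code to each marital-status category. The responses, the variable names, and the particular codes are assumptions made for this example; any other assignment would serve equally well, because the codes act as labels rather than quantities.

```python
# Hypothetical responses for a nominal variable (marital status).
responses = ["married", "single", "divorced", "single", "married"]

# Assign each category an arbitrary integer code. Sorting is only for
# reproducibility; it does not imply that the categories are ranked.
categories = sorted(set(responses))
codes = {category: index for index, category in enumerate(categories)}

# Re-express every response as its integer label.
coded = [codes[response] for response in responses]

print(codes)  # mapping from category name to integer label
print(coded)  # the responses expressed as integer labels
```

Because the codes are only labels, arithmetic on them (for example, averaging them) has no meaningful interpretation.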

Ordinal

An ordinal scale of measurement rank-orders participants on some scale or attribute, but the difference between numbers does not convey fixed or equal differences. Thus, with ordinal data, we know that a one-unit increase on an ordinal scale represents “more,” but we don’t know how much more. For example, a group of participants can be rank-ordered from least to most politically active. We know that a person who is ranked as 5 is more politically active than a person who is ranked as 4, but not how much more politically active. The value of the variable is used to order participants according to the strength/presence of the attribute and not to calculate differences between participants.

Interval 

The interval scale of measurement takes numerical form, and the distance between pairs of consecutive numbers is assumed to be equal. However, interval variables do not have a meaningful zero point; thus, a zero does not mean the absence of the attribute, but rather it is a particular (but arbitrary) point on the scale. A good example of an interval measure is temperature in the Fahrenheit scale: a temperature of zero degrees Fahrenheit is still a temperature, not the absence of temperature. In education, measures like achievement, motivation, and self-concept are considered interval measures; a zero on a measure of such variables does not mean the absence of the characteristic in the participant.

Ratio 

The ratio scale of measurement is similar to the interval scale. As with the interval scale, a number is assigned to a subject that represents the amount of the attribute that the subject has, and the difference between consecutive numbers is assumed to be equal. The main difference between interval and ratio measurements has to do with how we interpret a value of zero. For ratio measures, the zero is meaningful and tells us that the attribute is not present in the participant. Examples of ratio measures include a participant’s number of children, number of AP courses taken, or cumulative college credits: for each of these variables, a score of zero represents that the participant has none of the attribute.
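
One practical consequence of a meaningful zero, sketched here in Python, is that ratios of values can only be interpreted on a ratio scale. The temperatures and credit counts are made-up values, and fahrenheit_to_celsius is a helper written for this example; the point is only that the Fahrenheit "ratio" changes when the arbitrary zero moves, while the credit ratio keeps its meaning.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

# Interval scale: two hypothetical temperatures.
warm_f, cold_f = 80.0, 40.0
warm_c = fahrenheit_to_celsius(warm_f)
cold_c = fahrenheit_to_celsius(cold_f)

# The same two temperatures give different ratios in different units,
# so "twice as hot" is not a meaningful statement.
ratio_f = warm_f / cold_f
ratio_c = warm_c / cold_c
print(ratio_f, ratio_c)

# Ratio scale: zero credits means no credits at all, so saying that
# one participant has twice as many credits as another is meaningful.
credits_a, credits_b = 30, 15
print(credits_a / credits_b)
```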

Quantitative Variable

  To display data from one quantitative variable graphically, we can use either the histogram or the stemplot. (Another graph, the boxplot, will be covered in another section).
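
To make concrete what a histogram displays, namely the count of observations falling in each interval of values, here is a minimal Python sketch that tallies scores into equal-width intervals and prints a rough text histogram. The scores and the interval width of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical exam scores; the data and the bin width are assumptions.
scores = [47, 52, 58, 61, 63, 66, 68, 71, 72, 74, 75, 77, 79, 83, 88, 94]
bin_width = 10

# Map each score to the lower edge of its interval, e.g. 63 -> 60.
counts = Counter((score // bin_width) * bin_width for score in scores)

# Print one row per non-empty interval, lowest interval first.
# Intervals that contain no observations are simply not listed.
for lower in sorted(counts):
    upper = lower + bin_width
    print(f"[{lower}, {upper}): {'*' * counts[lower]}")
```

Each row of stars plays the role of one bar: its length is the number of observations in that interval.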

Interpreting the Histogram

Once the distribution has been displayed graphically, we can describe the overall pattern of the distribution and mention any striking deviations from that pattern. More specifically, we should consider the following features of the distribution:

The overall pattern of the distribution can be described by the shape, center, and spread of the histogram, while outliers highlight deviations from that pattern.

Shape

When describing the shape of a distribution, we should consider:

  1. Symmetry/skewness of the distribution.
  2. Peakedness (modality)—the number of peaks (modes) the distribution has.

We distinguish between:

Symmetric Distributions

A symmetric, Single-peaked (Unimodal) distribution. The histogram's bars start at low values close to 0 on the left and rise to a peak where the x-axis is labeled 10. Then, the values decrease as we go right, back down to nearly 0.
A symmetric, Double-peaked (Bimodal) distribution. The histogram's bars start at low values close to 0 on the left and rise to the first peak where the x-axis is labeled 10. Then, the values decrease as we go right, back down to nearly 0 at roughly where x=15. The values increase again and peak at x=20, and then, continuing right, decrease to nearly 0.
A symmetric, Uniform distribution. Throughout the entire range of the x-axis the bars are roughly the same height, meaning they are the same value.

Note that all three distributions are symmetric, but are different in their modality (peakedness). The first distribution is unimodal—it has one mode (roughly at 10) around which the observations are concentrated. The second distribution is bimodal—it has two modes (roughly at 10 and 20) around which the observations are concentrated. The third distribution is kind of flat, or uniform. The distribution has no modes, or no value around which the observations are concentrated. Rather, we see that the observations are roughly uniformly distributed among the different values.

Skewed Right Distributions

A Skewed-right histogram. As we proceed from left to right across the x-axis, the bars rapidly increase to the peak of the histogram, located at roughly x=33. From there, the values slowly decrease, and the last measurement is at x=200. The bars of the histogram are barely visible above the x-axis starting at about x=150.

A distribution is called skewed right if, as in the histogram above, the right tail (larger values) is much longer than the left tail (small values). Note that in a skewed right distribution, the bulk of the observations are small/medium, with a few observations that are much larger than the rest. An example of a real-life variable that has a skewed right distribution is salary. Most people earn in the low/medium range of salaries, with a few exceptions (CEOs, professional athletes etc.) that are distributed along a large range (long “tail”) of higher values.

Skewed Left Distributions

A Skewed-Left histogram. As we proceed from left to right across the x-axis, the bars rise slowly to the peak of the histogram, located at roughly x=78. From there, the values rapidly decrease, and the last measurement is at x=90. Since the x-axis starts at 0, the peak is offset to the right of the center of the histogram.

A distribution is called skewed left if, as in the histogram above, the left tail (smaller values) is much longer than the right tail (larger values). Note that in a skewed left distribution, the bulk of the observations are medium/large, with a few observations that are much smaller than the rest. An example of a real-life variable that has a skewed left distribution is age of death from natural causes (heart disease, cancer, etc.). Most such deaths happen at older ages, with fewer cases happening at younger ages.

Comments:

  1. Note that skewed distributions can also be bimodal. Here is an example. A medium-size neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data. Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition, it has another (smaller) “peak” (mode) around $50-55. The majority of the customers spend around $25, but there is a cluster of customers who enter the store and spend around $50-55.
  2. If a distribution has more than two modes, we say that the distribution is multimodal.

Spread

The spread (also called variability) of the distribution can be described by the approximate range covered by the data. From looking at the histogram, we can approximate the smallest observation (min), and the largest observation (max), and thus approximate the range. (More exact ways of finding measures of spread will be discussed in the next section.)

In our example:

  • approximate min: 45 (the middle of the lowest interval of scores)
  • approximate max: 95 (the middle of the highest interval of scores)
  • approximate range: 95-45=50

Outliers

Outliers are observations that fall outside the overall pattern. For example, the following histogram represents a distribution that has a probable high outlier:

A histogram with frequency on the y-axis. As we go from left to right on the x-axis, the frequency increases to a peak at x=5, then decreases. Eventually, we reach 0 at x=11. All values of x > 10 have a frequency of 0, except for x=15, which has a frequency greater than zero. This is an outlier.

Go back and check the histogram of scores at the top of this page. As you can see, there are no outliers.

Let’s Summarize

  • The histogram is a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values.
  • When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread), and any deviations from the pattern (outliers).
  • When describing the shape of a distribution, one should consider:
    • Symmetry/skewness of the distribution
    • Peakedness (modality)—the number of peaks (modes) the distribution has.
    Not all distributions have a simple, recognizable shape.
  • Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
  • It is always important to interpret what the features of the distribution (as they appear in the histogram) mean in the context of the data.

Numerical Measures Introduction

The overall pattern of the distribution of a quantitative variable is described by its shape, center, and spread. By inspecting the histogram, we can describe the shape of the distribution, but as we saw, we can only get a rough estimate for the center and spread. A description of the distribution of a quantitative variable must include, in addition to the graphical display, a more precise numerical description of the center and spread of the distribution. In this section we will learn:

  • how to quantify the center and spread of a distribution with various numerical measures;
  • some of the properties of those numerical measures; and
  • how to choose the appropriate numerical measures of center and spread to supplement the histogram.

Intuitively speaking, the numerical measure of center tells us what a “typical value” of the distribution is.

The three main numerical measures for the center of a distribution are the mode, the mean and the median. Each one of these measures is based on a completely different idea of describing the center of a distribution. We will first present each one of the measures, and then compare their properties.

Mode

So far, when we looked at the shape of the distribution, we identified the mode as the value where the distribution has a “peak” and saw examples when distributions have one mode (unimodal distributions) or two modes (bimodal distributions). In other words, so far we identified the mode visually from the histogram.

Technically, the mode is the most commonly occurring value in a distribution. For simple datasets where the frequency of each value is available or easily determined, the value that occurs with the highest frequency is the mode.
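
For a simple dataset, finding the mode amounts to counting how often each value occurs and picking the value with the highest frequency. A minimal Python sketch, using made-up data, might look like this; it returns every value that reaches the highest frequency, since a distribution can have more than one mode.

```python
from collections import Counter

# Hypothetical small dataset; the values are assumptions for illustration.
data = [3, 5, 5, 6, 7, 7, 7, 9]

# Count how often each value occurs.
frequencies = Counter(data)
highest = max(frequencies.values())

# Collect every value that reaches the highest frequency.
modes = sorted(value for value, count in frequencies.items() if count == highest)
print(modes)
```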

Mean

The mean is the average of a set of observations (i.e., the sum of the observations divided by the number of observations). If the n observations are x1, x2, … , xn, their mean, which we denote by x̄ (read “x-bar”), is therefore: x̄ = (x1 + x2 + … + xn) / n
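
As a quick worked example of this definition, the following Python lines add up five made-up observations and divide by their number. The data values are assumptions chosen so the arithmetic is easy to check by hand.

```python
# Hypothetical observations x1, ..., xn.
observations = [2, 4, 4, 5, 10]

n = len(observations)            # number of observations
total = sum(observations)        # x1 + x2 + ... + xn
mean = total / n                 # sum divided by n
print(total, n, mean)
```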


Median

The median M is the midpoint of the distribution. It is the number such that half of the observations fall above, and half fall below. To find the median:

  • Order the data from smallest to largest.
  • Consider whether n, the number of observations, is even or odd.
    • If n is odd, the median M is the center observation in the ordered list. This observation is the one “sitting” in the (n + 1) / 2 spot in the ordered list.
    • If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones “sitting” in the n / 2 and n / 2 + 1 spots in the ordered list.
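
The steps above can be sketched in Python as follows. The function name median and the sample lists are choices made for this example; the only subtlety is that the “spots” in the text are counted from 1, while Python list positions are counted from 0.

```python
def median(values):
    """Return the median using the odd/even rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: spot (n + 1) / 2 in 1-based counting is index n // 2.
        return ordered[middle]
    # Even n: average the observations in spots n / 2 and n / 2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5, 3, 9]))       # n = 5, odd
print(median([7, 1, 5, 3, 9, 11]))   # n = 6, even
```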


Inter-Quartile Range (IQR)

While the range quantifies the variability by looking at the range covered by ALL the data, the IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.

The following picture illustrates this idea: (Think about the horizontal line as the data ranging from the min to the max).

A horizontal line representing all of the data. The entire line represents the range of the data, and the leftmost point is the minimum data point. The rightmost point is the maximum data point. 25% of the range spanning the area between the leftmost point and 1/4 of the line from the leftmost point is labeled the Bottom 25% of the data. The area from the 1/4 point to the 3/4 point is labeled the middle 50% of the data. This is where the IQR is calculated. Indeed, the middle 50% represents half of the line. The rest of the line, the remaining 1/4 from the 3/4 point to the rightmost point, is the top 25% of the data.

Here is how the IQR is actually found:

  1. Arrange the data in increasing order, and find the median M. Recall that the median divides the data, so that 50% of the data points are below the median, and 50% of the data points are above the median.
  2. Find the median of the lower 50% of the data. This is called the first quartile of the distribution, and the point is denoted by Q1. Note from the picture that Q1 divides the lower 50% of the data into two halves, containing 25% of the data points in each half. Q1 is called the first quartile, since one quarter of the data points fall below it.
  3. Repeat this again for the top 50% of the data. Find the median of the top 50% of the data. This point is called the third quartile of the distribution, and is denoted by Q3. Note from the picture that Q3 divides the top 50% of the data into two halves, with 25% of the data points in each. Q3 is called the third quartile, since three quarters of the data points fall below it.
  4. The middle 50% of the data falls between Q1 and Q3, and therefore: IQR = Q3 – Q1

Comments

  1. The last picture shows that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR = Q3 – Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
  2. We can define a bit more precisely what is considered the bottom or top 50% of the data. The bottom (top) 50% of the data is all the observations whose position in the ordered list is to the left (right) of the location of the overall median M. The following picture will visually illustrate this for the simple cases of n = 7 and n = 8.
    Note that when n is odd (as in n = 7 above), the median is not included in either the bottom or top half of the data; when n is even (as in n = 8 above), the data are naturally divided into two halves.
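
The procedure for finding Q1, Q3, and the IQR, including the rule for what counts as the bottom and top halves, can be sketched in Python as below. The function names and the sample data are assumptions made for this example. Note also that statistical software often computes quartiles by interpolation instead, so its results may differ slightly from this median-of-halves method.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]            # positions left of the median
    upper = ordered[half + n % 2:]    # positions right of the median
    return median(lower), median(ordered), median(upper)

# n = 7 (odd): the median itself belongs to neither half.
q1, m, q3 = quartiles([1, 3, 5, 7, 9, 11, 13])
print(q1, m, q3, "IQR =", q3 - q1)

# n = 8 (even): the data split naturally into two halves of four.
q1, m, q3 = quartiles([1, 3, 5, 7, 9, 11, 13, 15])
print(q1, m, q3, "IQR =", q3 - q1)
```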


Using the IQR to Detect Outliers

So far we have quantified the idea of center, and we are in the middle of the discussion about measuring spread, but we haven’t really talked about a method or rule that will help us classify extreme observations as outliers. The IQR is used as the basis for a rule of thumb for identifying outliers.

The 1.5(IQR) Criterion for Outliers

An observation is considered a suspected outlier if it is:

  • below Q1 – 1.5(IQR) or
  • above Q3 + 1.5(IQR)

The following picture illustrates this rule:

A line representing all of the data. The data is ordered so that the minimum point is the leftmost on the line and the maximum point is the rightmost. At the center of the line is M, the median, and to the left of M is Q1. Even farther to the left of Q1 is Q1-1.5(IQR). Points farther left than this are suspected outliers. To the right of M is Q3, and farther to the right is Q3+1.5(IQR). Points even farther than this are also suspected outliers.
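
A minimal Python sketch of the 1.5(IQR) criterion follows. It finds Q1 and Q3 as the medians of the lower and upper halves of the ordered data, as described earlier; the function names and the sample data are assumptions made for this example.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def suspected_outliers(values):
    """Return the observations flagged by the 1.5(IQR) criterion."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(ordered[:half])           # median of the lower half
    q3 = median(ordered[half + n % 2:])   # median of the upper half
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_cutoff or x > high_cutoff]

# Hypothetical data with one unusually large observation.
data = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 15]
print(suspected_outliers(data))
```

For this data, Q1 = 4 and Q3 = 6, so the IQR is 2 and the cutoffs are 1 and 9; the observation 15 lies above the upper cutoff and is flagged as a suspected outlier.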
