In most studies involving two variables, each of the variables has a role. We distinguish between:
- the explanatory variable (also commonly referred to as the independent variable): the variable that claims to explain, predict, or affect the response; and
- the response variable (also commonly referred to as the dependent variable): the outcome of the study.
Typically the explanatory (or independent) variable is denoted by X, while the response (or dependent) variable is denoted by Y.
If we further classify each of the two relevant variables according to type (categorical or quantitative), we get the following four possibilities for “role-type classification”:
- Categorical explanatory and quantitative response
- Categorical explanatory and categorical response
- Quantitative explanatory and quantitative response
- Quantitative explanatory and categorical response
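As a small illustration of the classification above, the following sketch builds the short case labels (such as C→Q) from the type of each variable. The function name `role_type_case` and the dictionary `TYPE_CODES` are hypothetical names chosen for this example, not part of any library.

```python
# Map each variable type to the one-letter code used in the case labels.
TYPE_CODES = {"categorical": "C", "quantitative": "Q"}


def role_type_case(explanatory_type, response_type):
    """Return a label such as 'C→Q' (explanatory type first, response second)."""
    return f"{TYPE_CODES[explanatory_type]}→{TYPE_CODES[response_type]}"


# Enumerate every combination of explanatory and response type.
all_cases = [role_type_case(x, y) for x in TYPE_CODES for y in TYPE_CODES]
print(all_cases)
```

The explanatory variable always comes first in the label, which matches the way the cases are named in the rest of this discussion.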
Categorical Explanatory Variable and Quantitative Response Variable
We are now ready to start with Case C→Q, exploring the relationship between two variables where the explanatory variable is categorical, and the response variable is quantitative. As you’ll discover, exploring relationships of this type is something we’ve already discussed in this Blog, but we didn’t frame the discussion this way.
Background: People who are concerned about their health may prefer hot dogs that are low in calories. A study was conducted by a concerned health group in which 54 major hot dog brands were examined, and their calorie contents recorded. In addition, each brand was classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat). The purpose of the study was to examine whether the number of calories a hot dog has is related to (or affected by) its type. (Reference: Moore, David S., and George P. McCabe (1989). Introduction to the Practice of Statistics. Original source: Consumer Reports, June 1986, pp. 366-367.)
Answering this question requires us to examine the relationship between the categorical variable Type and the quantitative variable Calories. Because the question of interest is whether the type of hot dog affects calorie content,
- the explanatory variable is Type, and
- the response variable is Calories.
Here is what the raw data look like:
The raw data are a list of types and calorie contents, and are not very useful in that form. To explore how the number of calories is related to the type of hot dog, we need an informative visual display of the data that will compare the three types of hot dogs with respect to their calorie content.
The visual display that we’ll use is side-by-side boxplots (which we’ve seen before). The side-by-side boxplots will allow us to compare the distribution of calorie counts within each category of the explanatory variable, hot dog type:
By examining the three side-by-side boxplots and the numerical summaries, we see at once that poultry hot dogs, as a group, contain fewer calories than those made of beef or meat. The median number of calories in poultry hot dogs (113) is less than the median (and even the first quartile) of either of the other two distributions (medians 152.5 and 153). The spread of the three distributions is about the same, if IQR is considered (all slightly above 40), but the (full) ranges vary slightly more (beef: 80, meat: 88, poultry: 66). The general recommendation to the health-conscious consumer is to eat poultry hot dogs. It should be noted, though, that since each of the three types of hot dogs shows quite a large spread among brands, simply buying a poultry hot dog does not guarantee a low-calorie food.
What we learn from this example is that when exploring the relationship between a categorical explanatory variable and a quantitative response (Case C→Q), we essentially compare the distributions of the quantitative response for each category of the explanatory variable using side-by-side boxplots supplemented by descriptive statistics. Recall that we have actually done this before when we talked about the boxplot and argued that boxplots are most useful when presented side by side for comparing distributions of two or more groups. This is exactly what we are doing here!
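The numerical summaries that accompany side-by-side boxplots can be sketched in a few lines of Python. This is a minimal sketch, not the computation used in the study: the records below are made-up values for illustration (they are not the hot dog data), the function names are hypothetical, and the quartiles use the "inclusive" method of the standard library's `statistics.quantiles`, which is only one of several conventions, so other software may report slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return min, Q1, median, Q3, max, plus IQR and range, for one group."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": ordered[-1],
        "iqr": q3 - q1,
        "range": ordered[-1] - ordered[0],
    }


def summarize_by_group(records):
    """Group (category, value) records by category and summarize each group."""
    groups = {}
    for category, value in records:
        groups.setdefault(category, []).append(value)
    return {category: five_number_summary(vals) for category, vals in groups.items()}


# Made-up (type, calories) records, for illustration only.
records = [
    ("beef", 130), ("beef", 150), ("beef", 160), ("beef", 180),
    ("meat", 135), ("meat", 150), ("meat", 155), ("meat", 190),
    ("poultry", 90), ("poultry", 100), ("poultry", 120), ("poultry", 130),
]

summary = summarize_by_group(records)
for category, stats in summary.items():
    print(category, stats)
```

Comparing the per-group medians, IQRs, and ranges in `summary` is the numerical counterpart of comparing the boxes side by side.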
We are done with case C→Q, and will now move on to case C→C, where we examine the relationship between two categorical variables.
Earlier in the Blog (when we discussed the distribution of a single categorical variable), we examined the data obtained when a random sample of 1,200 U.S. college students were asked about their body image (underweight, overweight, or about right). We now return to this example to address the following question:
If we had separated our sample of 1,200 U.S. college students by gender and looked at males and females separately, would we have found a similar distribution across body-image categories? More specifically, are men and women just as likely to think their weight is about right? Among those students who do not think their weight is about right, is there a difference between the genders in feelings about body image?
Answering these questions requires us to examine the relationship between two categorical variables, gender and body image. Because the question of interest is whether there is a gender effect on body image,
- the explanatory variable is gender, and
- the response variable is body image.
Here is what the raw data look like when we include the gender of each student:
Once again, the raw data are a long list of 1,200 genders and responses, and thus not very useful in that form. To start our exploration of how body image is related to gender, we need an informative display that summarizes the data. In order to summarize the relationship between two categorical variables, we create a display called a two-way table.
Here is the two-way table for our example:
The table has the possible genders in the rows, and the possible responses regarding body image in the columns. At each intersection between row and column, we put the counts for how many times that combination of gender and body image occurred in the data. We sum across the rows to fill in the Total column, and we sum across the columns to fill in the Total row.
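The construction just described (count each combination, then sum across rows and columns) can be sketched as follows. The raw pairs below are a tiny made-up sample for illustration, not the survey data, and `two_way_table` is a hypothetical helper name.

```python
from collections import Counter


def two_way_table(pairs):
    """Build a two-way table (nested dict) with a Total row and Total column.

    `pairs` is a list of (row_value, column_value) tuples, one per individual.
    """
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})

    # One cell per combination of row value and column value.
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

    # Sum across each row to fill in the Total column.
    for r in rows:
        table[r]["Total"] = sum(counts[(r, c)] for c in cols)

    # Sum down each column to fill in the Total row.
    table["Total"] = {c: sum(counts[(r, c)] for r in rows) for c in cols}
    table["Total"]["Total"] = len(pairs)
    return table


# Made-up (gender, body image) responses, for illustration only.
raw = [
    ("female", "about right"), ("female", "overweight"),
    ("female", "about right"), ("male", "underweight"),
    ("male", "about right"), ("male", "about right"),
]

table = two_way_table(raw)
for row_label, row in table.items():
    print(row_label, row)
```

Placing the explanatory variable first in each pair puts it in the rows, which is the layout used in this example.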
So far, we have organized the raw data in a much more informative display—the two-way table:
Remember, though, that our primary goal is to explore how body image is related to gender. Exploring the relationship between two categorical variables (in this case, body image and gender) amounts to comparing the distributions of the response variable (in this case, body image) across the different values of the explanatory variable (in this case, males and females):
Note that it doesn’t make sense to compare raw counts, because there are more females than males overall. So, for example, it is not very informative to say, “There are 560 females who responded ‘about right’ compared to only 295 males,” since the 560 females are out of a total of 760, and the 295 males are out of a total of only 440.
We need to supplement our display, the two-way table, with some numerical summaries that will allow us to compare the distributions. These numerical summaries are found by simply converting the counts to percentages within (or restricted to) each value of the explanatory variable separately.
In our example, we look at each gender separately and convert the counts to percentages within that gender. Let’s start with females:
Note that each count is converted to a percentage by dividing by the total number of females, 760. These numerical summaries are called conditional percentages, since we find them by “conditioning” on one of the genders.
Now find the conditional percentages for the males.
- In our example, we chose to organize the data with the explanatory variable gender in rows and the response variable body image in columns, and thus our conditional percentages were row percentages, calculated within each row separately. Similarly, if the explanatory variable happens to sit in columns and the response variable in rows, our conditional percentages will be column percentages, calculated within each column separately. For an example, see the “Did I Get This?” exercises below.
- Another way to visualize the conditional percentages, instead of in a table, is to use a double bar chart. This display is quite common in newspapers.
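The conversion from counts to conditional (row) percentages can be sketched as below. The counts for "about right" and the gender totals (560 of 760 females, 295 of 440 males) come from the text above; collapsing the remaining responses into a single "not about right" category is a simplification made for this sketch, and `row_percentages` is a hypothetical helper name.

```python
def row_percentages(row_counts):
    """Convert the counts in one row to percentages of that row's total."""
    total = sum(row_counts.values())
    return {label: 100 * count / total for label, count in row_counts.items()}


# Counts taken from the text; the second category is total minus "about right".
females = {"about right": 560, "not about right": 760 - 560}
males = {"about right": 295, "not about right": 440 - 295}

female_pct = row_percentages(females)
male_pct = row_percentages(males)

print("females:", {k: round(v, 1) for k, v in female_pct.items()})
print("males:  ", {k: round(v, 1) for k, v in male_pct.items()})
```

Because each row is divided by its own total, the percentages within a gender add up to 100, which is what makes the two genders comparable despite their different sample sizes.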
Now that we have summarized the relationship between the categorical variables gender and body image, let’s go back and interpret the results in the context of the questions that we posed.
Two Quantitative Variables
In the previous two cases we had a categorical explanatory variable, and therefore exploring the relationship between the two variables was done by comparing the distribution of the response variable for each category of the explanatory variable:
- In case C→Q we compared distributions of the quantitative response.
- In case C→C we compared distributions of the categorical response.
Case Q→Q is different in the sense that both variables (in particular the explanatory variable) are quantitative, and therefore, as you’ll discover, this case will require a different kind of treatment and tools.
Example: Highway Signs
A Pennsylvania research firm conducted a study in which 30 drivers (of ages 18 to 82 years old) were sampled, and for each one, the maximum distance (in feet) at which he or she could read a newly designed sign was determined. The goal of this study was to explore the relationship between a driver’s age and the maximum distance at which signs were legible, and then use the study’s findings to improve safety for older drivers. (Reference: Utts and Heckard, Mind on Statistics (2002). Original source: Data collected by Last Resource, Inc, Bellfonte, PA.)
Since the purpose of this study is to explore the effect of age on maximum legibility distance,
- the explanatory variable is Age, and
- the response variable is Distance.
Here is what the raw data look like:
Note that the data structure is such that for each individual (in this case driver 1, …, driver 30) we have a pair of values (in this case representing the driver’s age and distance). We can therefore think about these data as 30 pairs of values: (18, 510), (32, 410), (55, 420), … , (82, 360).
The first step in exploring the relationship between driver age and sign legibility distance is to create an appropriate and informative graphical display. The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot. Here is how a scatterplot is constructed for our example:
To create a scatterplot, each pair of values is plotted, so that the value of the explanatory variable (X) is plotted on the horizontal axis, and the value of the response variable (Y) is plotted on the vertical axis. In other words, each individual (driver, in our example) appears on the scatterplot as a single point whose X-coordinate is the value of the explanatory variable for that individual, and whose Y-coordinate is the value of the response variable. Here is an illustration:
And here is the completed scatterplot:
It is important to mention again that when creating a scatterplot, the explanatory variable should always be plotted on the horizontal X-axis, and the response variable should be plotted on the vertical Y-axis. If in a specific example we do not have a clear distinction between explanatory and response variables, each of the variables can be plotted on either axis.
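The plotting convention can be sketched in code. This is a minimal sketch under stated assumptions: only the four pairs quoted in the text are used (the remaining 26 pairs are elided there and are not filled in here), the helper names are hypothetical, and `plot_scatter` assumes the third-party matplotlib library is installed.

```python
def split_pairs(pairs):
    """Split (explanatory, response) pairs into X values and Y values."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return xs, ys


def plot_scatter(pairs, x_label, y_label):
    """Draw a scatterplot with the explanatory variable on the horizontal axis.

    Assumes matplotlib is installed; it is imported here so that the rest
    of the example runs without it.
    """
    import matplotlib.pyplot as plt

    xs, ys = split_pairs(pairs)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)        # one point per individual
    ax.set_xlabel(x_label)    # explanatory variable (X)
    ax.set_ylabel(y_label)    # response variable (Y)
    return fig


# Only the four (age, distance) pairs listed in the text.
listed_pairs = [(18, 510), (32, 410), (55, 420), (82, 360)]
ages, distances = split_pairs(listed_pairs)
print(ages, distances)
```

Calling `plot_scatter(listed_pairs, "Age (years)", "Distance (feet)")` would then place Age on the horizontal axis and Distance on the vertical axis, as the convention requires.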
Interpreting the Scatterplot
How do we explore the relationship between two quantitative variables using the scatterplot? What should we look at, or pay attention to?
Recall that when we described the distribution of a single quantitative variable with a histogram, we described the overall pattern of the distribution (shape, center, spread) and any deviations from that pattern (outliers). We do the same thing with the scatterplot. The following figure summarizes this point:
As the figure explains, when describing the overall pattern of the relationship we look at its direction, form and strength.
- The direction of the relationship can be positive, negative, or neither:
A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other.
A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.
Not all relationships can be classified as either positive or negative.
- The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. Here are a couple that are quite common:
Relationships with a linear form are most simply described as points scattered about a line:
Relationships with a curvilinear form are most simply described as points dispersed around the same curved line:
There are many other possible forms for the relationship between two quantitative variables, but linear and curvilinear forms are quite common and easy to identify. Another form-related pattern that we should be aware of is clusters in the data:
- The strength of the relationship is determined by how closely the data follow the form of the relationship. Let’s look, for example, at the following two scatterplots displaying positive, linear relationships:
We can see that in the top scatterplot the data points follow the linear pattern quite closely. This is an example of a strong relationship. In the bottom scatterplot, the points also follow the linear pattern, but much less closely, and therefore we can say that the relationship is weaker. In general, though, assessing the strength of a relationship just by looking at the scatterplot is quite problematic, and we need a numerical measure to help us with that. We will discuss that later in this section.
Data points that deviate from the pattern of the relationship are called outliers. We will see several examples of outliers during this section. Two outliers are illustrated in the scatterplot below:
Let’s go back now to our example, and use the scatterplot to examine the relationship between the age of the driver and the maximum sign legibility distance. Here is the scatterplot:
The direction of the relationship is negative, which makes sense in context, since as you get older your eyesight weakens, and in particular older drivers tend to be able to read signs only at lesser distances. An arrow drawn over the scatterplot illustrates the negative direction of this relationship:
The form of the relationship seems to be linear. Notice how the points tend to be scattered about the line. Although, as we mentioned earlier, it is problematic to assess the strength without a numerical measure, the relationship appears to be moderately strong, as the data are fairly tightly scattered about the line. Finally, all the data points seem to “obey” the pattern; there do not appear to be any outliers.
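The negative direction read off the scatterplot can also be checked crudely in code. The sketch below is an assumption-laden illustration, not the numerical measure of strength promised for later: it looks only at the sign of the summed cross-deviations from the means, it uses only the four pairs quoted in the text rather than all 30, and `direction` is a hypothetical helper name.

```python
def direction(pairs):
    """Classify direction from the sign of the summed cross-deviations.

    For each point, (x - mean_x) * (y - mean_y) is positive when x and y are
    both above or both below their means, and negative otherwise.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "neither"


# Only the four (age, distance) pairs listed in the text.
listed_pairs = [(18, 510), (32, 410), (55, 420), (82, 360)]
print(direction(listed_pairs))
```

This check says nothing about form or strength, and a curved relationship can produce a misleading sign, so it supplements the scatterplot rather than replacing it.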
Next, we will explore different relationships and move on to more statistical measures.