Causation and Lurking Variables, with Simpson’s Paradox

One of the most fundamental principles in statistics is:

Principle: Association does not imply causation!

The scatterplot below illustrates how the number of firefighters sent to fires (X) is related to the amount of damage caused by fires (Y) in a certain city.

A scatterplot in which the horizontal axis is labeled "# Of Firefighters", and the vertical axis is labeled "Damage ($)". The vertical axis ranges from $0 to $2500000 and the horizontal axis ranges from 0 to 40.

The scatterplot clearly displays a fairly strong (slightly curved) positive relationship between the two variables. Would it, then, be reasonable to conclude that sending more firefighters to a fire causes more damage, or that the city should send fewer firefighters to a fire, in order to decrease the amount of damage done by the fire? Of course not! So what is going on here?

There is a third variable in the background—the seriousness of the fire—that is responsible for the observed relationship. More serious fires require more firefighters, and also cause more damage.

The following figure will help you visualize this situation:

A flowchart. "Seriousness of the fire" is the lurking variable. It is a cause of both "Number of firefighters (X)" and "Amount of damage (Y)". As a result, we see an observed association between "Number of firefighters (X)" and "Amount of damage (Y)", even though neither one causes the other.

Here, the seriousness of the fire is a lurking variable. A lurking variable is a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables.
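To make the firefighter example concrete, here is a minimal simulation sketch in Python. Every number in it (five severity levels, the coefficients, the noise sizes) is an assumption chosen for illustration, not data from the city in the scatterplot. By construction, severity drives both the number of firefighters and the damage, and the two never influence each other.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate_fires(per_level=1000, seed=0):
    """Synthetic fires (all parameters are made up for illustration)."""
    rng = random.Random(seed)
    fires = []
    for severity in range(1, 6):  # hypothetical severity scale 1..5
        for _ in range(per_level):
            # Severity determines how many firefighters are sent ...
            firefighters = max(1, round(5 * severity + rng.gauss(0, 2)))
            # ... and, independently, how much damage the fire does.
            damage = 100_000 * severity ** 2 + rng.gauss(0, 50_000)
            fires.append((severity, firefighters, damage))
    return fires

fires = simulate_fires()

# Correlation when severity is ignored
overall_r = pearson([f[1] for f in fires], [f[2] for f in fires])

# Correlation among fires of the same severity
within_r = {
    s: pearson([f[1] for f in fires if f[0] == s],
               [f[2] for f in fires if f[0] == s])
    for s in range(1, 6)
}

print(f"all fires: r = {overall_r:.2f}")
for s, r in within_r.items():
    print(f"severity {s}: r = {r:.2f}")
```

Across all fires the correlation is strongly positive, but among fires of the same severity it is close to zero. The association comes entirely from the variable left out of the scatterplot.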


In particular, as in our example, the lurking variable might have an effect on both the explanatory and the response variables. This common effect creates the observed association between the explanatory and response variables, even though there is no causal link between them. This possibility, that a lurking variable (which we might not be thinking about) is responsible for the observed relationship, leads to our principle:

Principle: Association does not imply causation!

Example: SAT Test

For U.S. colleges and universities, a standard entrance examination is the SAT test. The side-by-side boxplots below provide evidence of a relationship between the student’s country of origin (the United States or another country) and the student’s SAT Math score.

A side-by-side boxplot. The vertical axis is labeled "SAT Math Score", and it ranges from 450 to 800. The horizontal axis is labeled "Country" and has two categories, "Other" and "US".

The distribution of international students’ scores is higher than that of U.S. students. The international students’ median score (about 700) exceeds the third quartile of U.S. students’ scores. Can we conclude that the country of origin is the cause of the difference in SAT Math scores, and that students in the United States are weaker at math than students in other countries?

No, not necessarily. While it might be true that U.S. students differ in math ability from other students (for example, due to differences in educational systems), we can’t conclude that a student’s country of origin is the cause of the disparity. One important lurking variable that might explain the observed relationship is the educational level of the two populations taking the SAT Math test. In the United States, the SAT is a standard test, and therefore a broad cross-section of all U.S. students (in terms of educational level) takes this test. Among international students, on the other hand, only those who plan on coming to the U.S. to study take the test, and they are usually a more selected subgroup.

The following figure will help you visualize this explanation:

A flowchart. There are two possible causes of "SAT-Math score (Y)". One is "Education level of SAT takers", which is the lurking variable. The other is "Nationality (X)". We have observed an association between "Nationality (X)" and "SAT-Math score (Y)", and the link between these two variables is marked as a suspected causal relationship.

Here, the explanatory variable (X) may have a causal relationship with the response variable (Y), but the lurking variable might be a contributing factor as well, which makes it very hard to isolate the effect of the explanatory variable and prove that it has a causal link with the response variable. In this case, we say that the lurking variable is confounded with the explanatory variable, since their effects on the response variable cannot be distinguished from each other.
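The selection effect described in this example can be sketched with a small simulation. This is an illustration under invented assumptions, not a model of real SAT data: the score formula, the `cutoff` value, and the latent "preparation" variable are all hypothetical. Nationality has no effect on the score here. The only difference between the groups is which students end up taking the test.

```python
import random
import statistics

def simulate_scores(n=10_000, cutoff=0.5, seed=1):
    """Synthetic SAT Math scores; every parameter is an illustrative assumption."""
    rng = random.Random(seed)

    def draw():
        # Latent preparation level: the lurking variable.
        prep = rng.gauss(0, 1)
        # The score depends on preparation only; there is no nationality term.
        score = 520 + 80 * prep + rng.gauss(0, 40)
        return prep, min(800, max(200, score))  # keep scores on the 200-800 scale

    # U.S. test takers: a broad cross-section, so everyone drawn takes the test.
    us = [score for _, score in (draw() for _ in range(n))]
    # International test takers: only the better-prepared subgroup takes the test.
    intl = [score for prep, score in (draw() for _ in range(n)) if prep > cutoff]
    return us, intl

us_scores, intl_scores = simulate_scores()

us_q3 = statistics.quantiles(us_scores, n=4)[2]   # third quartile of U.S. scores
intl_median = statistics.median(intl_scores)

print(f"U.S. third quartile:  {us_q3:.0f}")
print(f"International median: {intl_median:.0f}")
```

In this sketch the international median exceeds the U.S. third quartile, the same pattern as in the boxplots, even though country of origin plays no role in how the scores are generated. This does not show that nationality has no effect in reality, only that the observed gap by itself cannot separate the two explanations.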


Note that in each of the above two examples, the lurking variable interacts differently with the variables studied. In the first example, the lurking variable has an effect on both the explanatory and the response variables, creating the illusion that there is a causal link between them. In the second example, the lurking variable is confounded with the explanatory variable, making it hard to assess the isolated effect of the explanatory variable on the response variable.

The distinction between these two types of interactions is not as important as the fact that in either case, the observed association can be at least partially explained by the lurking variable. The most important message from these two examples is therefore: An observed association between two variables is not enough evidence that there is a causal relationship between them.

Simpson’s Paradox

So far, we have:

  • discussed what lurking variables are,
  • demonstrated different ways in which the lurking variables can interact with the two studied variables, and
  • understood that the existence of a possible lurking variable is the main reason why we say that association does not imply causation.

As you recall, a lurking variable, by definition, is a variable that was not included in the study, but could have a substantial effect on our understanding of the relationship between the two studied variables.

What if we did include a lurking variable in our study? What kind of effect could that have on our understanding of the relationship? These are the questions we are going to discuss next.

Background: A government study collected data on the death rates in nearly 6,000 hospitals in the United States. These results were then challenged by researchers, who said that the federal analyses failed to take into account the variation among hospitals in the severity of patients’ illnesses when they were hospitalized. As a result, said the researchers, some hospitals were treated unfairly in the findings, which named hospitals with higher-than-expected death rates. What the researchers meant is that when the federal government explored the relationship between the two variables—hospital and death rate—it also should have included in the study (or taken into account) the lurking variable—severity of illness.

We will use a simplified version of this study to illustrate the researchers’ claim, and see what the possible effect could be of including a lurking variable in a study. (Reference: Moore and McCabe (2003). Introduction to the Practice of Statistics.)

Consider the following two-way table, which summarizes the data about the status of patients who were admitted to two hospitals in a certain city (Hospital A and Hospital B). Note that since the purpose of the study is to examine whether there is a “hospital effect” on patients’ status, “Hospital” is the explanatory variable, and “Patient’s Status” is the response variable.

Hospital      Died   Survived   Total
Hospital A      63       2037    2100
Hospital B      16        784     800
Total           79       2821    2900

When we supplement the two-way table with the conditional percents within each hospital:

Hospital      Died   Survived   Total
Hospital A      3%        97%    100%
Hospital B      2%        98%    100%

we find that Hospital A has a higher death rate (3%) than Hospital B (2%). Should we jump to the conclusion that a sick patient admitted to Hospital A is 50% more likely to die than if he/she were admitted to Hospital B? Not so fast …
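As a quick check of the arithmetic, the short Python sketch below recomputes the conditional percents from the counts in the two-way table and takes their ratio. The variable names are just illustrative choices.

```python
# (died, total admitted) for each hospital, taken from the two-way table
admissions = {"Hospital A": (63, 2100), "Hospital B": (16, 800)}

# Conditional death rate within each hospital
death_rate = {h: died / total for h, (died, total) in admissions.items()}

# Ratio of the two rates (often called the relative risk)
relative_risk = death_rate["Hospital A"] / death_rate["Hospital B"]

for hospital, rate in death_rate.items():
    print(f"{hospital}: {rate:.1%}")
print(f"Hospital A rate / Hospital B rate = {relative_risk:.2f}")
```

The ratio of 3% to 2% is 1.5, which is where the "50% more likely" figure comes from. It describes the raw association only and says nothing yet about why the rates differ.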

Maybe Hospital A gets most of the severe cases, and that explains why it has a higher death rate. In order to explore this, we need to include (or account for) the lurking variable “severity of illness” in our analysis. To do this, we go back to the two-way table and split it up to look separately at patients who are severely ill, and patients who are not.

Patients severely ill:

Hospital      Died   Survived   Total
Hospital A      57       1443    1500
Hospital B       8        192     200
Total           65       1635    1700

Patients not severely ill:

Hospital      Died   Survived   Total
Hospital A       6        594     600
Hospital B       8        592     600
Total           14       1186    1200

As we can see, Hospital A did admit many more severely ill patients than Hospital B (1,500 vs. 200). In fact, from the way the totals were split, we see that in Hospital A, severely ill patients were a much higher proportion of the patients—1,500 out of a total of 2,100 patients. In contrast, only 200 out of 800 patients at Hospital B were severely ill. To better see the effect of including the lurking variable, we need to supplement each of the two new two-way tables with its conditional percentages:

Patients severely ill:

Hospital      Died   Survived   Total
Hospital A    3.8%      96.2%    100%
Hospital B    4.0%      96.0%    100%

Patients not severely ill:

Hospital      Died   Survived   Total
Hospital A    1.0%      99.0%    100%
Hospital B    1.3%      98.7%    100%

Note that despite our earlier finding that overall Hospital A has a higher death rate (3% vs. 2%), when we take into account the lurking variable, we find that actually it is Hospital B that has the higher death rate both among the severely ill patients (4% vs. 3.8%) and among the not severely ill patients (1.3% vs. 1%). Thus, we see that adding a lurking variable can change the direction of an association.

Whenever including a lurking variable causes us to rethink the direction of an association, this is called Simpson’s paradox.
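The reversal can be verified directly from the counts in the split tables. The sketch below (names are illustrative) computes the death rate within each severity group, the pooled rate that ignores severity, and the share of each hospital's patients in each group. It also checks that each pooled rate is a weighted average of the within-group rates, with those shares as weights, which is why the pooled comparison can point the other way.

```python
# (died, total) per hospital within each severity group, from the split tables
counts = {
    "severely ill":     {"Hospital A": (57, 1500), "Hospital B": (8, 200)},
    "not severely ill": {"Hospital A": (6, 600),   "Hospital B": (8, 600)},
}
hospitals = ("Hospital A", "Hospital B")

# Death rate within each severity group
within = {
    group: {h: died / total for h, (died, total) in by_hospital.items()}
    for group, by_hospital in counts.items()
}

def pooled_rate(hospital):
    """Death rate for a hospital when severity is ignored."""
    died = sum(counts[g][hospital][0] for g in counts)
    total = sum(counts[g][hospital][1] for g in counts)
    return died / total

def group_shares(hospital):
    """Fraction of a hospital's patients that fall in each severity group."""
    total = sum(counts[g][hospital][1] for g in counts)
    return {g: counts[g][hospital][1] / total for g in counts}

overall = {h: pooled_rate(h) for h in hospitals}

# Pooled rate rebuilt as a weighted average of the within-group rates
weighted = {
    h: sum(group_shares(h)[g] * within[g][h] for g in counts)
    for h in hospitals
}

b_worse_in_every_group = all(
    within[g]["Hospital B"] > within[g]["Hospital A"] for g in counts
)
a_worse_overall = overall["Hospital A"] > overall["Hospital B"]
is_simpsons_paradox = b_worse_in_every_group and a_worse_overall

for h in hospitals:
    share = group_shares(h)["severely ill"]
    print(f"{h}: pooled {overall[h]:.1%}, severely ill share {share:.1%}")
print("Direction reverses:", is_simpsons_paradox)
```

Hospital A's pooled rate is pulled toward its severely ill rate because about 71% of its patients (1,500 of 2,100) are severely ill, while only 25% of Hospital B's patients (200 of 800) are. The unequal mix of patients, not worse care, produces Hospital A's higher pooled rate.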

Let’s Summarize

  • A lurking variable is a variable that was not included in your analysis, but that could substantially change your interpretation of the data if it were included.
  • Because of the possibility of lurking variables, we adhere to the principle that association does not imply causation.
  • Including a lurking variable in our exploration may:
    • help us to gain a deeper understanding of the relationship between variables, or
    • lead us to rethink the direction of an association.
  • Whenever including a lurking variable causes us to rethink the direction of an association, this is an instance of Simpson’s paradox.
