We are now moving to the other kind of inference, hypothesis testing. We say that hypothesis testing is “the other kind” because, unlike the inferential methods we presented so far, where the goal was estimating the unknown parameter, the idea, logic and goal of hypothesis testing are quite different.
In the first part of this section we will discuss the idea behind hypothesis testing, explain how it works, and introduce new terminology that emerges in this form of inference. The next two parts will be more specific and will discuss hypothesis testing for the population proportion (p), and for the population mean (μ).
General Idea and Logic
The purpose of this section is to gradually build your understanding about how statistical hypothesis testing works. We start by explaining the general logic behind the process of hypothesis testing. Once we are confident that you understand this logic, we will add some more details and terminology.
General Idea and Logic of Hypothesis Testing
To start our discussion about the idea behind statistical hypothesis testing, consider the following example:
A case of suspected cheating on an exam is brought in front of the disciplinary committee at a certain university.
There are two opposing claims in this case:
- The student’s claim: I did not cheat on the exam.
- The instructor’s claim: The student did cheat on the exam.
Adhering to the principle “innocent until proven guilty,” the committee asks the instructor for evidence to support his claim. The instructor explains that the exam had two versions, and shows the committee members that on three separate exam questions, the student used in his solution numbers that were given in the other version of the exam.
The committee members all agree that it would be extremely unlikely to get evidence like that if the student’s claim of not cheating had been true. In other words, the committee members all agree that the instructor brought forward strong enough evidence to reject the student’s claim, and conclude that the student did cheat on the exam.
What does this example have to do with statistics?
While it is true that this story seems unrelated to statistics, it captures all the elements of hypothesis testing and the logic behind it. Before you read on to understand why, it would be useful to read the example again. Please do so now.
Statistical hypothesis testing is defined as:
Assessing evidence provided by the data in favor of or against some claim about the population.
Here is how the process of statistical hypothesis testing works:
- We have two claims about what is going on in the population. Let’s call them for now claim 1 and claim 2. Much like the story above, where the student’s claim is challenged by the instructor’s claim, claim 1 is challenged by claim 2. (Comment: as you’ll see in the examples that follow, these claims are usually about the value of population parameter(s) or about the existence or nonexistence of a relationship between two variables in the population.)
- We choose a sample, collect relevant data and summarize them (this is similar to the instructor collecting evidence from the student’s exam).
- We figure out how likely it is to observe data like the data we got, had claim 1 been true. (Note that the wording “how likely …” implies that this step requires some kind of probability calculation.) In the story, the committee members assessed how likely it is to observe evidence like that which the instructor provided, had the student’s claim of not cheating been true.
- Based on what we found in the previous step, we make our decision:
- If we find that if claim 1 were true it would be extremely unlikely to observe the data that we observed, then we have strong evidence against claim 1, and we reject it in favor of claim 2.
- If we find that if claim 1 were true observing the data that we observed is not very unlikely, then we do not have enough evidence against claim 1, and therefore we cannot reject it in favor of claim 2.
In our story, the committee decided that it would be extremely unlikely to find the evidence that the instructor provided had the student’s claim of not cheating been true. In other words, the members felt that it is extremely unlikely that it is just a coincidence that the student used the numbers from the other version of the exam on three separate problems. The committee members therefore decided to reject the student’s claim and concluded that the student had, indeed, cheated on the exam. (Wouldn’t you conclude the same?)
Hopefully this example helped you understand the logic behind hypothesis testing.
More Details and Terminology
Hypothesis testing step 1: Stating the claims.
Our aim is to decide between two opposing points of view, Claim 1 and Claim 2. In hypothesis testing, Claim 1 is called the null hypothesis (denoted “H0”), and Claim 2 plays the role of the alternative hypothesis (denoted “Ha”). As we saw in the three examples, the null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship. In contrast, the alternative hypothesis disagrees with this, stating that something is going on, or there is a change from the status quo, or there is a difference from the traditional state of affairs. The alternative hypothesis, Ha, usually represents what we want to check or what we suspect is really going on.
Hypothesis testing step 2: Choosing a sample and collecting data.
This step is pretty obvious. This is what inference is all about. You look at sampled data in order to draw conclusions about the entire population. In the case of hypothesis testing, based on the data, you draw conclusions about whether or not there is enough evidence to reject Ho.
There is, however, one detail that we would like to add here. In this step we collect data and summarize it. Go back and look at the second step in our three examples. Note that in order to summarize the data we used simple sample statistics such as the sample proportion (pˆ), sample mean (x¯) and the sample standard deviation (s).
In practice, you go a step further and use these sample statistics to summarize the data with what’s called a test statistic. We are not going to go into any details right now, but we will discuss test statistics when we go through the specific tests.
Hypothesis testing step 3: Assessing the evidence.
As we saw, this is the step where we calculate how likely it is to get data like those observed when H0 is true. In a sense, this is the heart of the process, since we draw our conclusions based on this probability. If this probability is very small (see example 2), then it would be very surprising to get data like those observed if H0 were true. The fact that we did observe such data is therefore evidence against H0, and we should reject it. On the other hand, if this probability is not very small (see example 3), then observing data like those observed is not very surprising if H0 were true, so the fact that we observed such data does not provide evidence against H0. This crucial probability, therefore, has a special name. It is called the p-value of the test.
Obviously, the smaller the p-value, the more surprising it is to get data like ours when H0 is true, and therefore, the stronger the evidence the data provide against H0. Looking at the three p-values of our three examples, we see that the data that we observed in example 2 provide the strongest evidence against the null hypothesis, followed by example 1, while the data in example 3 provide the least evidence against H0.
Right now we will not go into specific details about p-value calculations, but just mention that since the p-value is the probability of getting data like those observed when H0 is true, it would make sense that the calculation of the p-value will be based on the data summary, which, as we mentioned, is the test statistic. Indeed, this is the case. In practice, we will mostly use software to provide the p-value for us.
It should be noted that in the past, before statistical software was such an integral part of intro stats courses, it was common to use critical values (rather than p-values) in order to assess the evidence provided by the data. While this course focuses on p-values, we will provide some details about the critical values approach later in this module for those students who are interested in learning more about it.
Hypothesis testing step 4: Making conclusions.
Since our conclusion is based on how small the p-value is, or in other words, how surprising our data are when Ho is true, it would be nice to have some kind of guideline or cutoff that will help determine how small the p-value must be, or how “rare” (unlikely) our data must be when Ho is true, for us to conclude that we have enough evidence to reject Ho.
This cutoff exists, and because it is so important, it has a special name. It is called the significance level of the test and is usually denoted by the Greek letter α. The most commonly used significance level is α = 0.05 (or 5%). This means that:
- if the p-value < α (usually 0.05), then the data we got is considered to be “rare (or surprising) enough” when Ho is true, and we say that the data provide significant evidence against Ho, so we reject Ho and accept Ha.
- if the p-value > α (usually 0.05), then our data are not considered to be “surprising enough” when Ho is true, and we say that our data do not provide enough evidence to reject Ho (or, equivalently, that the data do not provide enough evidence to accept Ha).
Important comment about wording.
Another common wording (mostly in scientific journals) is:
“The results are statistically significant” – when the p-value < α.
“The results are not statistically significant” – when the p-value > α.
- Although the significance level provides a good guideline for drawing our conclusions, it should not be treated as an incontrovertible truth. There is a lot of room for personal interpretation. What if your p-value is 0.052? You might want to stick to the rules and say “0.052 > 0.05 and therefore I don’t have enough evidence to reject Ho”, but you might decide that 0.052 is small enough for you to believe that Ho should be rejected. It should be noted that scientific journals do consider 0.05 to be the cutoff point for which any p-value below the cutoff indicates enough evidence against Ho, and any p-value above it, or even equal to it, indicates there is not enough evidence against Ho.
- It is important to draw your conclusions in context. It is never enough to say: “p-value = …, and therefore I have enough evidence to reject Ho at the .05 significance level.” You should always add: “… and conclude that … (what it means in the context of the problem)”.
- Let’s go back to the issue of the nature of the two types of conclusions that I can make. Either I reject Ho and accept Ha (when the p-value is smaller than the significance level) or I cannot reject Ho (when the p-value is larger than the significance level).
As we mentioned earlier, note that the second conclusion does not imply that I accept Ho, but just that I don’t have enough evidence to reject it. Saying (by mistake) “I don’t have enough evidence to reject Ho so I accept it” would imply that the data provide evidence that Ho is true, which is not necessarily the case.
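The decision rule and the wording cautions above can be sketched in a few lines of Python. This is only an illustration: the function name decide is ours, and treating a p-value exactly equal to α as "not significant" follows the journal convention mentioned above rather than any universal rule.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level and phrase the conclusion."""
    if p_value < alpha:
        return "reject Ho"
    # Failing to reject is deliberately NOT worded as "accept Ho":
    # a large p-value only means the evidence against Ho is insufficient.
    return "do not reject Ho"

print(decide(0.023))  # reject Ho
print(decide(0.052))  # do not reject Ho
```

In a real analysis the returned phrase would still need to be followed by a conclusion stated in the context of the problem.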
Hypothesis Testing for the Population Proportion p Overview
Now that we understand the process we go through in hypothesis testing and the logic behind it, we are ready to start learning about specific statistical tests (also known as significance tests).
The first test we are going to learn is the test about the population proportion (p). This test is widely known as the z-test for the population proportion (p). (We will understand later where the “z-test” part comes from.)
When we conduct a test about a population proportion, we are working with a categorical variable. After we have learned a variety of hypothesis tests, we will need to be able to identify which test is appropriate for which situation. Identifying the variable as categorical or quantitative is an important component of choosing an appropriate hypothesis test.
A machine is known to produce 20% defective products, and is therefore sent for repair. After the machine is repaired, 400 products produced by the machine are chosen at random and 64 of them are found to be defective. Do the data provide enough evidence that the proportion of defective products produced by the machine (p) has been reduced as a result of the repair?
The following figure displays the information, as well as the question of interest:
The question of interest helps us formulate the null and alternative hypotheses in terms of p, the proportion of defective products produced by the machine following the repair:
Ho: p = 0.20 (No change; the repair did not help).
Ha: p < 0.20 (The repair was effective).
Recall that there are basically 4 steps in the process of hypothesis testing:
1. State the null and alternative hypotheses.
2. Collect relevant data from a random sample and summarize them (using a test statistic).
3. Find the p-value, the probability of observing data like those observed assuming that Ho is true.
4. Based on the p-value, decide whether we have enough evidence to reject Ho (and accept Ha), and draw our conclusions in context.
We are now going to go through these steps as they apply to the hypothesis testing for the population proportion p. It should be noted that even though the details will be specific to this particular test, some of the ideas that we will add apply to hypothesis testing in general.
1. Stating the Hypotheses
Here again are the hypotheses being tested in our example:
Has the proportion of defective products been reduced as a result of the repair?
Ho: p = 0.20 (No change; the repair did not help).
Ha: p < 0.20 (The repair was effective).
2. Collecting and Summarizing the Data (Using a Test Statistic)
After the hypotheses have been stated, the next step is to obtain a sample (on which the inference will be based), collect relevant data, and summarize them.
It is extremely important that our sample is representative of the population about which we want to draw conclusions. This is ensured when the sample is chosen at random. Beyond the practical issue of ensuring representativeness, choosing a random sample has theoretical importance that we will mention later.
In the case of hypothesis testing for the population proportion (p), we will collect data on the relevant categorical variable from the individuals in the sample and start by calculating the sample proportion, pˆ (the natural quantity to calculate when the parameter of interest is p).
Let’s go back to our three examples and add this step to our figures.
As we mentioned earlier without going into details, when we summarize the data in hypothesis testing, we go a step beyond calculating the sample statistic and summarize the data with a test statistic. Every test has a test statistic, which to some degree captures the essence of the test. In fact, the p-value, which so far we have looked upon as “the king” (in the sense that everything is determined by it), is actually determined by (or derived from) the test statistic. We will now gradually introduce the test statistic.
The test statistic is a measure of how far the sample proportion pˆ is from the null value p0, the value that the null hypothesis claims is the value of p. In other words, since pˆ is what the data estimates p to be, the test statistic can be viewed as a measure of the “distance” between what the data tells us about p and what the null hypothesis claims p to be.
Let’s use our example to understand this:
The parameter of interest is p, the proportion of defective products following the repair.
The data estimate p to be pˆ=0.16
The null hypothesis claims that p = 0.20
The data are therefore 0.04 (or 4 percentage points) below the null hypothesis with respect to what they each tell us about p.
It is hard to evaluate whether this difference of 4 percentage points in defective products is enough evidence to say that the repair was effective, but clearly, the larger the difference, the more evidence it is against the null hypothesis. So if, for example, our sample proportion of defective products had been, say, 0.10 instead of 0.16, then I think you would all agree that cutting the proportion of defective products in half (from 20% to 10%) would be extremely strong evidence that the repair was effective.
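One common way to put this distance on a standard scale is the one-proportion z statistic, which divides the difference between pˆ and p0 by the standard deviation that pˆ would have if Ho were true. The sketch below assumes that standard formula, z = (pˆ − p0) / sqrt(p0(1 − p0)/n); the function name z_statistic is ours and is used only for illustration.

```python
import math

def z_statistic(successes, n, p0):
    """Standardized distance between the sample proportion and the null value p0."""
    p_hat = successes / n
    # Standard deviation of p-hat, computed as if Ho (p = p0) were true.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

# Defective-products example: 64 defective out of 400, null value 0.20.
z = z_statistic(64, 400, 0.20)
print(round(z, 2))  # -2.0
```

A value of -2 says that the sample proportion 0.16 lies 2 standard deviations below the null value 0.20. Had the sample proportion been 0.10, the same calculation would give -5, a much larger distance.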
Comments About the Test Statistic
1. We mentioned earlier that to some degree, the test statistic captures the essence of the test. In this case, the test statistic measures the difference between pˆ and p0 in standard deviations. This is exactly what this test is about. Get data, and look at the discrepancy between what the data estimate p to be (represented by pˆ) and what Ho claims about p (represented by p0).
2. You can think about this test statistic as a measure of evidence in the data against Ho. The larger the test statistic is in magnitude, the “further the data are from Ho” and therefore the more evidence the data provide against Ho.
3. It should now be clear why this test is commonly known as the z-test for the population proportion. The name comes from the fact that it is based on a test statistic that is a z-score.
4. Recall fact 1 that we used for constructing the z-test statistic. Here is part of it again: When we take a random sample of size n from a population with population proportion p, the possible values of the sample proportion (pˆ) (when certain conditions are met) have approximately a normal distribution with a mean of … and a standard deviation of …. This result provides the theoretical justification for constructing the test statistic the way we did, and therefore the assumptions under which this result holds are the conditions that our data need to satisfy so that we can use this test. These two conditions are:
(i) The sample has to be random.
(ii) The conditions under which the sampling distribution of pˆ is normal are met. In other words:
5. Here we will pause to say more about condition (i) above, the need for a random sample. In the Probability Unit we discussed sampling plans based on probability (such as a simple random sample, cluster, or stratified sampling) that produce a non-biased sample, which can be safely used in order to make inferences about a population. We noted in the Probability Unit that, in practice, other (non-random) sampling techniques are sometimes used when random sampling is not feasible. It is important though, when these techniques are used, to be aware of the type of bias that they introduce, and thus the limitations of the conclusions that can be drawn from them. For our purpose here, we will focus on one such practice, the situation in which a sample is not really chosen randomly, but in the context of the categorical variable that is being studied, the sample is regarded as random. For example, say that you are interested in the proportion of students at a certain college who suffer from seasonal allergies. For that purpose, the students in a large engineering class could be considered as a random sample, since there is nothing about being in an engineering class that makes you more or less likely to suffer from seasonal allergies. Technically, the engineering class is a convenience sample, but it is treated as a random sample in the context of this categorical variable. On the other hand, if you are interested in the proportion of students in the college who have math anxiety, then the class of engineering students clearly could not be viewed as a random sample, since engineering students probably have a much lower incidence of math anxiety than the college population overall.
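The second condition listed above (approximate normality of the sampling distribution of pˆ) is often checked with a rule of thumb on the expected counts. The sketch below is an assumption on our part, not something stated in this text: it uses the common convention that both n·p0 and n·(1 − p0) should be at least 10, and some textbooks use a different threshold. The function name is hypothetical.

```python
def normal_approximation_ok(n, p0, threshold=10):
    """Rule-of-thumb check that p-hat is roughly normal when Ho (p = p0) is true.

    threshold=10 is an assumed convention, not a value taken from the text.
    """
    expected_successes = n * p0
    expected_failures = n * (1 - p0)
    return expected_successes >= threshold and expected_failures >= threshold

print(normal_approximation_ok(400, 0.20))  # True  (about 80 and 320 expected)
print(normal_approximation_ok(20, 0.20))   # False (only about 4 expected successes)
```

Note that this only addresses normality; no calculation can verify that the sample was chosen at random.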
3. Finding the P-value of the Test
So far we’ve talked about the p-value at the intuitive level: understanding what it is (or what it measures) and how we use it to draw conclusions about the significance of our results. We will now go more deeply into how the p-value is calculated.
It should be mentioned that eventually we will rely on technology to calculate the p-value for us (as well as the test statistic), but in order to make intelligent use of the output, it is important to first understand the details, and only then let the computer do the calculations for us. Let’s start.
Recall that so far we have said that the p-value is the probability of obtaining data like those observed assuming that Ho is true. Like the test statistic, the p-value is, therefore, a measure of the evidence against Ho. In the case of the test statistic, the larger it is in magnitude (positive or negative), the further pˆ is from p0, and the more evidence we have against Ho. In the case of the p-value, it is the opposite; the smaller it is, the more unlikely it is to get data like those observed when Ho is true, and the more evidence it is against Ho. One can actually draw conclusions in hypothesis testing just using the test statistic, and as we’ll see the p-value is, in a sense, just another way of looking at the test statistic. The reason that we actually take the extra step in this course and derive the p-value from the test statistic is that even though in this case (the test about the population proportion) and some other tests, the value of the test statistic has a very clear and intuitive interpretation, there are some tests where its value is not as easy to interpret. On the other hand, the p-value keeps its intuitive appeal across all statistical tests.
How is the p-value calculated?
Intuitively, the p-value is the probability of observing data like those observed assuming that Ho is true. Let’s be a bit more formal:
- Since this is a probability question about the data, it makes sense that the calculation will involve the data summary, the test statistic.
- What do we mean by “like” those observed? By “like” we mean “as extreme or even more extreme.”
Putting it all together, we get that in general:
The p-value is the probability of observing a test statistic as extreme as that observed (or even more extreme) assuming that the null hypothesis is true.
By “extreme” we mean extreme in the direction of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
- If the alternative hypothesis is Ha: p < p0 (less than), then “extreme” means small, and the p-value is: the probability of observing a test statistic as small as that observed or smaller if the null hypothesis is true.
- If the alternative hypothesis is Ha: p > p0 (greater than), then “extreme” means large, and the p-value is: the probability of observing a test statistic as large as that observed or larger if the null hypothesis is true.
- If the alternative hypothesis is Ha: p ≠ p0 (different from), then “extreme” means extreme in either direction, either small or large (i.e., large in magnitude), and the p-value therefore is: the probability of observing a test statistic as large in magnitude as that observed or larger if the null hypothesis is true.
(Examples: If z = -2.5: p-value = probability of observing a test statistic as small as -2.5 or smaller or as large as 2.5 or larger.
If z = 1.5: p-value = probability of observing a test statistic as large as 1.5 or larger, or as small as -1.5 or smaller.)
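The three cases above translate directly into areas under the standard normal curve. Here is a minimal sketch using NormalDist from Python's standard library (available from Python 3.8); the function name p_value and the labels 'less', 'greater' and 'two-sided' are our own choices for illustration.

```python
from statistics import NormalDist

def p_value(z, alternative):
    """p-value of a z-test; alternative is 'less', 'greater', or 'two-sided'."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":
        return standard_normal.cdf(z)            # area to the left of z
    if alternative == "greater":
        return 1 - standard_normal.cdf(z)        # area to the right of z
    if alternative == "two-sided":
        return 2 * standard_normal.cdf(-abs(z))  # both tails beyond |z|
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# The two "different from" examples above:
print(round(p_value(-2.5, "two-sided"), 4))  # 0.0124
print(round(p_value(1.5, "two-sided"), 4))   # 0.1336
```

The two-sided case doubles one tail because the standard normal curve is symmetric, so the areas beyond -|z| and beyond +|z| are equal.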
The Critical Value Method
Concepts of the Critical Value Method
There are two concepts that are important to understand in the critical value method: 1) the critical value and 2) the critical region. As shown in the graph below, the critical value is the value that cuts off an area referred to as the critical region (or area of rejection), as applied to the z-test.
When z-test statistics fall in the critical region (the blue shaded areas in the above graph), they are far enough from the mean that they are significantly different from the mean; therefore, in these instances, the null hypothesis would be rejected. The critical region is determined by a critical value that is based on two things: 1) the significance level of the test (commonly 0.05 or 0.01) and 2) the direction of the test (e.g., left-tailed, right-tailed, or two-tailed).
Not Equal To
For a two-tailed z test, there will be critical regions on both sides of the distribution. For a two-tailed test using a 0.05 level of significance, we need to determine a value that would put 0.025 or 2.5%, in each tail. We can determine this value by using the Normal Table.
First, we need to look in the body of the normal table, where we will see that the exact value 0.0250 is associated with the z score of -1.96; thus, -1.96 is the critical value that puts 0.025 (or 2.5%) in the left tail of the distribution. Since the standard normal distribution is symmetrical, +1.96 is the critical value that puts 0.025 (or 2.5%) in the right tail. Thus, the critical values of -1.96 and +1.96 would define the critical regions for a two-tailed z-test using a 0.05 significance level.
In order to test the null hypothesis, we need to look at where the z-test statistic falls in relation to the critical regions formed by the critical values. In a two-tailed z-test, a z-test statistic of -1.5 would not fall in a critical region. Therefore, we know that the p-value would be more than 0.05. Thus, we would not reject the null hypothesis, since the p-value is greater than 0.05 (or, stated another way, p-value > 0.05).
On the other hand, a z-test statistic of -2.5 would fall within the critical region on the left-hand side of the distribution; therefore, we know that the p-value would be less than 0.05. In this instance, we would reject the null hypothesis at the 0.05 significance level (p-value < 0.05). Furthermore, it is possible to figure out the exact p-value for the z-test statistic of -2.5 by using the Normal Table: the area to the left of -2.5 is 0.0062, and because the test is two-tailed, the p-value is twice that area, 2 × 0.0062 = 0.0124.
The same logic applies to one-tailed z-tests. For the one-tailed “less than” z-test, the critical value for a 0.05 significance level is -1.645 (note: since the left-tail area for -1.64 is 0.0505 and the left-tail area for -1.65 is 0.0495, the critical value for 0.0500 would be between the two z scores, or -1.645).
With a “less than” one-tailed z test, any z-test statistic that is less than -1.645, would fall in the critical region and therefore, would have a p-value less than 0.05. For instance, -2.5 would be less than -1.645 and would fall in the critical region. Thus, it would have a p-value less than 0.05 and the null hypothesis would be rejected.
Any z-test statistic that is larger than -1.645 would have a probability level of greater than 0.05 (or p-value > 0.05). For instance, -1.5 would be greater than -1.645 and, therefore, would not fall in the critical region. Thus, it would have a p-value greater than 0.05 and the null hypothesis would not be rejected.
For the one-tailed “greater than” z-test, the critical value for a 0.05 significance level is 1.645.
Thus, with a “greater than” one-tailed z-test, any z-test statistic larger than 1.645 would fall in the critical region and therefore would have a p-value less than 0.05. For instance, 2.5 would be greater than 1.645 and would fall in the critical region. Thus, it would have a p-value less than 0.05 (or p-value < 0.05) and the null hypothesis would be rejected.
Any z-test statistic that is less than 1.645 would have a probability level of greater than 0.05 (or p-value > 0.05). For instance, 1.5 would be less than 1.645 and, therefore, would not fall in the critical region. Thus, it would have a p-value greater than 0.05 and the null hypothesis would not be rejected.
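The table lookups above can also be done in software. The following sketch uses the inverse of the standard normal CDF from Python's standard library to find the critical values and then checks whether a z-test statistic falls in the critical region. The function names and the direction labels are hypothetical, chosen only for this illustration.

```python
from statistics import NormalDist

def critical_values(alpha, alternative):
    """Return (lower, upper) critical values; None marks a tail with no critical region."""
    standard_normal = NormalDist()
    if alternative == "less":
        return (standard_normal.inv_cdf(alpha), None)
    if alternative == "greater":
        return (None, standard_normal.inv_cdf(1 - alpha))
    if alternative == "two-sided":
        # Split alpha evenly between the two tails.
        return (standard_normal.inv_cdf(alpha / 2),
                standard_normal.inv_cdf(1 - alpha / 2))
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

def in_critical_region(z_stat, alpha, alternative):
    """True when the test statistic is beyond a critical value (reject Ho)."""
    lower, upper = critical_values(alpha, alternative)
    below = lower is not None and z_stat < lower
    above = upper is not None and z_stat > upper
    return below or above

lower, upper = critical_values(0.05, "two-sided")
print(round(lower, 2), round(upper, 2))             # -1.96 1.96
print(in_critical_region(-1.5, 0.05, "two-sided"))  # False
print(in_critical_region(-2.5, 0.05, "two-sided"))  # True
```

Passing 0.01 instead of 0.05 gives the critical values for the stricter significance level without any table.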
The critical value method uses two concepts: 1) the critical value and 2) the critical region. The critical value is used to determine the critical region and is based on two things: 1) the significance level of the test (commonly 0.05 or 0.01) and 2) the direction of the test (e.g., left-tailed, right-tailed, or two-tailed).
When the z-test statistic falls in the critical region, it is far enough from the mean that it is significantly different from the mean. Therefore, in this instance, the null hypothesis would be rejected at the significance level used to determine the critical region (either 0.05 or 0.01). Furthermore, the actual p-value can be determined by using the Normal Table.
When the z-test statistic does not fall in the critical region, it indicates that it is not far enough from the mean to be significantly different from the mean. In this instance, the null hypothesis would not be rejected.
The critical value method has been traditionally used for hypothesis testing (note: there are different critical values and tables for t-tests, ANOVAs, and Chi Square tests). The emphasis now, however, is on the use of exact p-values, which are obtained through the use of statistical software packages.
4. Drawing Conclusions Based on the P-Value
This last part of the four-step process of hypothesis testing is the same across all statistical tests, and actually, we’ve already said basically everything there is to say about it, but it can’t hurt to say it again.
The p-value is a measure of how much evidence the data present against Ho. The smaller the p-value, the more evidence the data present against Ho.
We already mentioned that what determines what constitutes enough evidence against Ho is the significance level (α), a cutoff point below which the p-value is considered small enough to reject Ho in favor of Ha. The most commonly used significance level is 0.05.
It is important to mention again that this step has essentially two sub-steps:
(i) Based on the p-value, determine whether or not the results are significant (i.e., the data present enough evidence to reject Ho).
(ii) State your conclusions in the context of the problem.
Let’s go back to our examples and draw conclusions.
(Has the proportion of defective products been reduced from 0.20 as a result of the repair?)
We found that the p-value for this test was 0.023.
Since 0.023 is small (in particular, 0.023 < 0.05), the data provide enough evidence to reject Ho and conclude that as a result of the repair the proportion of defective products has been reduced to below 0.20. The following figure is the complete story of this example, and includes all the steps we went through, starting from stating the hypotheses and ending with our conclusions:
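As a check on the numbers in this example, the four steps can be retraced in a short script. This is a sketch under the assumption that the test statistic is the usual one-proportion z statistic, (pˆ − p0) / sqrt(p0(1 − p0)/n), and that the p-value is the left-tail area under the standard normal curve; the variable names are ours.

```python
import math
from statistics import NormalDist

# Step 1: state the hypotheses. Ho: p = 0.20, Ha: p < 0.20.
p0 = 0.20
alpha = 0.05

# Step 2: summarize the data with the sample proportion and the test statistic.
n, defective = 400, 64
p_hat = defective / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Step 3: p-value. Ha is "less than", so "extreme" means small: left-tail area.
p_value = NormalDist().cdf(z)

# Step 4: compare with the significance level.
reject = p_value < alpha

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, p-value = {p_value:.3f}")
# p_hat = 0.16, z = -2.00, p-value = 0.023
print("reject Ho" if reject else "do not reject Ho")  # reject Ho
```

The script reproduces the p-value of 0.023 reported above; the conclusion in context (the repair reduced the proportion of defective products) still has to be written by the analyst.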
Why is 5% often selected as the significance level in hypothesis testing, and why is 1% the next most typical level?
Answer:
This is largely due to just convenience and tradition.
When Ronald Fisher (one of the founders of modern statistics) published one of his tables, he used a mathematically convenient scale that included 5% and 1%. Later, these same 5% and 1% levels were used by other people, in part just because Fisher was so highly esteemed. But mostly these are arbitrary levels.
The idea of selecting some sort of relatively small cutoff was historically important in the development of statistics; but it’s important to remember that there is really a continuous range of increasing confidence towards the alternative hypothesis, not a single all-or-nothing value. There isn’t much meaningful difference, for instance, between a p-value of 0.049 and one of 0.051, and it would be foolish to declare one case definitely a “real” effect and to declare the other case definitely a “random” effect. In either case, the study results were roughly 5% likely by chance if there’s no actual effect.
Whether such a p-value is sufficient for us to reject a particular null hypothesis ultimately depends on the risk of making the wrong decision, and the extent to which the hypothesized effect might contradict our prior experience or previous studies.
The Effect of Sample Size on Hypothesis Testing
We have already seen the effect that the sample size has on inference, when we discussed point and interval estimation for the population mean (μ) and population proportion (p). Intuitively …
Larger sample sizes give us more information to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, we can report a smaller margin of error, and get a narrower confidence interval. What we’ve seen, then, is that larger sample size gives a boost to how much we trust our sample results. In hypothesis testing, larger sample sizes have a similar effect. The following two examples will illustrate that a larger sample size provides more convincing evidence, and how the evidence manifests itself in hypothesis testing.
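To see this effect numerically, the sketch below keeps the sample proportion of defective products fixed at 0.16 and changes only the sample size. The sample sizes 100 and 1600 are hypothetical values chosen for illustration (400 is the size used in the machine example), and the calculation assumes the usual one-proportion z statistic with a left-tailed alternative.

```python
import math
from statistics import NormalDist

def left_tailed_test(p_hat, n, p0):
    """Return (z, p-value) for Ha: p < p0, using the one-proportion z statistic."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return z, NormalDist().cdf(z)

# Same observed proportion, three different (partly hypothetical) sample sizes.
for n in (100, 400, 1600):
    z, p = left_tailed_test(0.16, n, 0.20)
    print(f"n = {n:4d}  z = {z:6.2f}  p-value = {p:.4f}")
```

With the same 4-percentage-point gap, the z statistic moves from about -1 to -2 to -4 as n grows, and the p-value falls from roughly 0.16 to 0.023 to well below 0.001. Only the two larger samples would lead us to reject Ho at the 0.05 level.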
There are other topics in hypothesis testing, such as type I and type II errors, that we will cover later in this blog.