Recall again the Big Picture, the four-step process that encompasses statistics: data production, exploratory data analysis, probability, and inference.

We are about to start the fourth part of the process and the final section of this course, where we draw on principles learned in the other units (exploratory data analysis, producing data, and probability) in order to accomplish what has been our ultimate goal all along: use a sample to infer (or draw conclusions) about the population from which it was drawn. The specific form of inference called for depends on the type of variables involved—either a single categorical or quantitative variable, or a combination of two variables whose relationship is of interest.

The purpose of this introduction is to review how we got here and how the previous sections fit together to allow us to make reliable inferences. Also, we will introduce the various forms of statistical inference that will be discussed in this section, and give a general outline of how this section is organized.

In the **Exploratory Data Analysis** sections, we learned to display and summarize data that were obtained from a sample. Regardless of whether we had one variable and we examined its distribution, or whether we had two variables and we examined the relationship between them, it was always understood that these summaries applied *only* to the data at hand; we did not attempt to make claims about the larger population from which the data were obtained.

Such generalizations were, however, a long-term goal from the very beginning of the course. For this reason, in the **Producing Data** sections, we took care to establish principles of sampling and study design that would be essential in order for us to claim that, to some extent, what is true for the sample should be also true for the larger population from which the sample originated. These principles should be kept in mind throughout this section on statistical inference, since the results that we will obtain will not hold if there was bias in the sampling process, or flaws in the study design under which variables’ values were measured.

Perhaps the most important principle stressed in the Producing Data unit was that of randomization. Randomization is essential not only because it prevents bias but also because it permits us to rely on the laws of probability, the scientific study of random behavior.

In the **Probability** sections, we established basic laws for the behavior of random variables. We ultimately focused on two random variables of particular relevance: the sample mean (X¯) and the sample proportion (pˆ), and the last module of the Probability unit was devoted to exploring their sampling distributions. We learned what probability theory tells us to expect from the values of the sample mean and the sample proportion, given that the corresponding population parameters, the population mean (μ) and the population proportion (p), are known.

As we mentioned in that section, the value of such results is more theoretical than practical, since in real-life situations we seldom know what is true for the entire population. All we know is what we see in the sample, and we want to use this information to say something concrete about the larger population. Probability theory has set the stage to accomplish this: learning what to expect from the value of the sample mean, given that the population mean takes a certain value, teaches us (as we’ll soon learn) what to expect from the value of the unknown population mean, given that a particular value of the sample mean has been observed. Similarly, since we have established how the sample proportion behaves relative to the population proportion, we will now be able to turn this around and say something about the value of the population proportion, based on an observed sample proportion. This process, inferring something about the population based on what is measured in the sample, is (as you know) called **statistical inference**.

We introduce three forms of statistical inference in this unit, each one representing a different way of using the information obtained in the sample to draw conclusions about the population. These forms are:

- Point estimation
- Interval estimation
- Hypothesis testing

Obviously, each one of these forms of inference will be discussed at length in this section, but it would be useful to get at least an intuitive sense of the nature of each of these inference forms, and the difference between them in terms of the type of conclusions they draw about the population based on the sample results.

In **point estimation**, we estimate an unknown parameter using a *single number* that is calculated from the sample data.

In **interval estimation**, we estimate an unknown parameter using an **interval of values** that is likely to contain the true value of that parameter (and state how confident we are that this interval indeed captures the true value of the parameter).

In **hypothesis testing**, we have some claim about the population, and we check **whether or not the data** obtained from the sample **provide evidence against this claim.**

As we mentioned at the end of the introduction, the first part of Inference will deal with inference for one variable. Recall that in the Exploratory Data Analysis (EDA) sections, when we learned about summarizing the data obtained from one variable (in the Examining Distributions module), we distinguished between two cases: categorical data and quantitative data.

We will make a similar distinction here in Inference. In EDA, the type of variable determined the displays and numerical measures we used to summarize the data. In Inference, the type of variable of interest (categorical or quantitative) will determine what population parameter we are going to do inference for.

- When the variable of interest is **categorical**, the population parameter that we will infer about is the **population proportion (p)** associated with that variable. For example, if we are interested in studying opinions about the death penalty among U.S. adults, and thus our variable of interest is “death penalty (in favor/against),” we’ll choose a sample of U.S. adults and use the collected data to make an inference about p, the proportion of U.S. adults who support the death penalty.
- When the variable of interest is **quantitative**, the population parameter that we infer about is the **population mean (μ)** associated with that variable. For example, if we are interested in studying the annual salaries in the population of teachers in a certain state, we’ll choose a sample from that population and use the collected salary data to make an inference about μ, the mean annual salary of all teachers in that state.

# Point Estimation: Introduction

Point estimation is the form of statistical inference in which, based on the sample data, we estimate the unknown parameter of interest using a **single** value (hence the name **point** estimation). As the following example illustrates, this form of inference is quite intuitive.

Suppose that we are interested in studying the IQ levels of students at Smart University (SU). In particular (since IQ level is a quantitative variable), we are interested in estimating μ, the mean IQ level of all the students at SU.

A random sample of 100 SU students was chosen, and their (sample) mean IQ level was found to be x¯=115.

If we wanted to estimate μ, the population mean IQ level, by a single number based on the sample, it would make intuitive sense to use the corresponding quantity in the sample, the sample mean x¯=115. We say that 115 is the **point estimate** for μ, and in general, we’ll always use x¯ as the **point estimator** for μ. (Note that when we talk about the **specific value** (115), we use the term **estimate**, and when we talk in general about the **statistic** x¯, we use the term **estimator**.) The following figure summarizes this example:

# Unbiased Estimators

You may feel that since it is so intuitive, you could have figured out point estimation on your own, even without the benefit of an entire course in statistics. Certainly, our intuition tells us that the best estimator for μ should be x¯, and the best estimator for p should be pˆ.

Probability theory does more than this; it actually gives an explanation (beyond intuition) of **why** x¯ and pˆ are good choices as point estimators for μ and p, respectively. In the Sampling Distributions module of the Probability unit, we learned about the sampling distribution of X¯ and found that **as long as a sample is taken at random**, the distribution of sample means is exactly centered at the value of the population mean.

X¯ is therefore said to be an **unbiased estimator** for μ. Any particular sample mean might turn out to be less than the actual population mean, or it might turn out to be more. But in the long run, such sample means are “on target” in that they will not underestimate any more or less often than they overestimate.

Note: Our point estimates are truly unbiased estimates for the population parameter only if the **sample is random and the study design is not flawed.**

Likewise, we learned that the sampling distribution of the sample proportion, pˆ, is centered at the population proportion p (as long as the sample is taken at random), thus making pˆ an **unbiased estimator** for p.
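The unbiasedness of both estimators can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the two populations are made up (they are not data from this course), and the helper name `average_of_estimates` is hypothetical. It draws many random samples and checks that the estimates average out close to the population value.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical quantitative population (10,000 made-up values).
population = [rng.gauss(100, 15) for _ in range(10_000)]
mu = statistics.fmean(population)

# Hypothetical categorical population: 1 = "in favor", 0 = "against".
opinions = [1] * 6_000 + [0] * 4_000
p = statistics.fmean(opinions)  # population proportion, 0.6

def average_of_estimates(values, n, reps):
    """Average the sample mean over `reps` random samples of size n.

    For 0/1 data the sample mean is the sample proportion.
    """
    estimates = [statistics.fmean(rng.sample(values, n)) for _ in range(reps)]
    return statistics.fmean(estimates)

avg_of_means = average_of_estimates(population, n=25, reps=5_000)
avg_of_props = average_of_estimates(opinions, n=50, reps=5_000)

print(f"mu = {mu:.2f}, average of sample means = {avg_of_means:.2f}")
print(f"p  = {p:.2f}, average of sample proportions = {avg_of_props:.2f}")
```

Individual samples still miss the target in either direction; it is only their long-run average that lands on the parameter.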

As stated in the introduction, probability theory plays an essential role as we establish results for statistical inference. Our assertion above that sample mean and sample proportion are unbiased estimators is the first such instance.

Intuitively, larger sample sizes give us more information with which to pin down the true nature of the population. We can therefore expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. In the extreme, when we sample the whole population (which is called a census), the sample mean and sample proportion will exactly coincide with the population mean and population proportion.

There is another layer here that, again, comes from what we learned about the sampling distributions of the sample mean and the sample proportion. Let’s use the sample mean for the explanation.

Recall that the sampling distribution of the sample mean X¯ is, as we mentioned before, centered at the population mean μ and has a standard deviation of σ/sqrt(n). As a result, as the sample size n increases, the sampling distribution of X¯ gets less spread out. This means that values of X¯ that are based on a larger sample are more likely to be closer to μ (as the figure below illustrates):

Similarly, since the sampling distribution of pˆ is centered at p and has a standard deviation of sqrt(p(1−p)/n), which decreases as the sample size gets larger, values of pˆ are more likely to be closer to p when the sample size is larger.

In short, increasing the sample size reduces the spread of the sampling distribution: an estimate based on more observations is less affected by any single unusual value.
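The shrinking spread can be checked numerically. The following sketch compares the theoretical standard deviation σ/sqrt(n) with the standard deviation of simulated sample means for a few sample sizes. It assumes σ = 15 and a made-up population mean of 115 with normally distributed values; the function names are hypothetical.

```python
import math
import random
import statistics

mu, sigma = 115, 15  # assumed values, used only to generate simulated data
rng = random.Random(1)  # fixed seed so the sketch is repeatable

def simulated_sd_of_means(n, reps=2_000):
    """Standard deviation of `reps` simulated sample means, each from a sample of size n."""
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.stdev(means)

results = {}
for n in (25, 100, 400):
    theoretical = sigma / math.sqrt(n)
    simulated = simulated_sd_of_means(n)
    results[n] = (theoretical, simulated)
    print(f"n = {n:3d}: theoretical SD = {theoretical:.2f}, simulated SD = {simulated:.2f}")

def sd_of_sample_proportion(p, n):
    """Standard deviation of the sample proportion: sqrt(p(1 - p)/n)."""
    return math.sqrt(p * (1 - p) / n)
```

Because n sits under a square root, quadrupling the sample size only halves the spread.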

# Interval Estimation

Point estimation is simple and intuitive, but also a bit problematic. Here is why:

When we estimate, say, μ by the sample mean x¯, we are almost guaranteed to make some kind of error. Even though we know that the values of x¯ fall around μ, it is very unlikely that the value of x¯ will fall exactly at μ.

Given that such errors are a fact of life for point estimates (by the mere fact that we are basing our estimate on one sample that is a small fraction of the population), these estimates are in themselves of limited usefulness, unless we are able to quantify the extent of the estimation error. Interval estimation addresses this issue. The idea behind **interval estimation** is, therefore, to enhance the simple point estimates by supplying information about the size of the error attached.

In this introduction, we’ll provide examples that will give you a solid intuition about the basic idea behind interval estimation.

## Example

Consider the example that we discussed in the point estimation section:

Suppose that we are interested in studying the IQ levels of students at Smart University (SU). In particular (since IQ level is a quantitative variable), we are interested in estimating μ, the mean IQ level of all the students at SU. A random sample of 100 SU students was chosen, and their (sample) mean IQ level was found to be x¯=115.

In point estimation we used x¯=115 as the point estimate for μ. However, we had no idea of what the estimation error involved in such an estimation might be. Interval estimation takes point estimation a step further and says something like:

“I am 95% confident that by using the point estimate x¯=115 to estimate μ, I am off by no more than 3 IQ points. In other words, I am 95% confident that μ is within 3 of 115, or between 112 (115 – 3) and 118 (115 + 3).”

Yet another way to say the same thing is: I am 95% confident that μ is somewhere in (or covered by) the interval (112,118). (**Comment:** At this point you should not worry about, or try to figure out, how we got these numbers. We’ll do that later. All we want to do here is make sure you understand the idea.)

Note that while point estimation provided just one number as an estimate for μ (115), interval estimation provides a whole interval of “plausible values” for μ (between 112 and 118), and also attaches the level of our confidence that this interval indeed includes the value of μ to our estimation (in our example, 95% confidence). The interval (112,118) is therefore called “a 95% confidence interval for μ.”

### Let’s Summarize

The example showed us that the idea behind interval estimation is, instead of providing just one number for estimating an unknown parameter of interest, to provide an interval of plausible values of the parameter plus a level of confidence that the value of the parameter is covered by this interval.

We are now going to go into more detail and learn how these confidence intervals are created. As you’ll see, the ideas that were developed in the “Sampling Distributions” module of the Probability unit will, again, be very important (as they were in point estimation).

We’ll start by discussing confidence intervals for the population mean μ, and later discuss confidence intervals for the population proportion p.

# Confidence Intervals for the Population Mean: Overview

### Overview

As we mentioned in the introduction to interval estimation, we start by discussing interval estimation for the population mean μ. Here is a quick overview of how we introduce this topic.

- Learn how a 95% confidence interval for the population mean μ is constructed and interpreted.
- Generalize to confidence intervals with other levels of confidence (for example, what if we want a 99% confidence interval?).
- Understand more broadly the structure of a confidence interval and the importance of the margin of error.
- Understand how the precision of interval estimation is affected by the confidence level and sample size.
- Learn under which conditions we can safely use the methods that are introduced in this section.

Recall the IQ example:

## Example

Suppose that we are interested in studying the IQ levels of students at Smart University (SU). In particular (since IQ level is a quantitative variable), we are interested in estimating μ, the mean IQ level of all the students at SU.

We will assume that from past research on IQ scores in different universities, it is known that the IQ standard deviation in such populations is σ=15. In order to estimate μ, a random sample of 100 SU students was chosen, and their (sample) mean IQ level was calculated (let’s not assume, for now, that the value of this sample mean is 115, as before).

We will now show the rationale behind constructing a 95% confidence interval for the population mean μ.

* We learned in the “Sampling Distributions” module of probability that according to the central limit theorem, the sampling distribution of the sample mean X¯ is approximately normal with a mean of μ and standard deviation of σ/sqrt(n). In our example, then (where σ=15 and n=100), the possible values of X¯, the sample mean IQ level of 100 randomly chosen students, are approximately normally distributed, with mean μ and standard deviation 15/sqrt(100) = 1.5.

* Next, we recall and apply the Standard Deviation Rule for the normal distribution, and in particular its second part:

There is a 95% chance that the sample mean we get in our sample falls within 2 * 1.5 = 3 of μ.

* Obviously, if there is a certain distance between the sample mean and the population mean, we can describe that distance by starting at either value. So, if the sample mean (x¯) falls within a certain distance of the population mean μ, then the population mean μ falls within the same distance of the sample mean.

Therefore, the statement, “There is a 95% **chance** that the **sample** mean x¯ falls within 3 units of μ” can be rephrased as: “We are 95% **confident** that the **population** mean μ falls within 3 units of x¯.”

So, if we happen to get a sample mean of x¯=115, then we are 95% confident that μ falls within 3 of 115, or in other words that μ is covered by the interval (115 – 3, 115 + 3) = (112,118).
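The arithmetic of this example can be traced step by step. This is a minimal sketch using only the numbers from the example (σ = 15, n = 100, x¯ = 115) and the multiplier 2 from the Standard Deviation Rule; the variable names are chosen here for illustration.

```python
import math

sigma = 15   # known population standard deviation
n = 100      # sample size
xbar = 115   # observed sample mean

sd_of_xbar = sigma / math.sqrt(n)   # standard deviation of the sample mean: 1.5
margin = 2 * sd_of_xbar             # Standard Deviation Rule: within 2 SDs, so 3.0
interval = (xbar - margin, xbar + margin)

print(interval)  # (112.0, 118.0)
```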

(On later pages, we will use similar reasoning to develop a general formula for a confidence interval.)

### Comment

Note that the first phrasing is about x¯, which is a random variable; that’s why it makes sense to use probability language. But the second phrasing is about μ, which is a parameter, and thus is a “fixed” value that doesn’t change, and that’s why we shouldn’t use probability language to discuss it. This point will become clearer after you do the activities on the next page.
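A simulation can make this distinction concrete. In the sketch below we pretend, unlike in practice, to know the true μ, so that we can check how often the interval x¯ ± 3 covers it across many repeated samples. The value μ = 115 and the normally distributed population are assumptions made purely for the simulation.

```python
import math
import random
import statistics

mu, sigma, n = 115, 15, 100          # mu is an assumed "true" mean for the simulation
margin = 2 * sigma / math.sqrt(n)    # 3 IQ points, as in the example

rng = random.Random(7)  # fixed seed so the sketch is repeatable
reps = 5_000
covered = 0
for _ in range(reps):
    # Draw one random sample of size n and compute its mean.
    xbar = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
    # mu never changes; the interval around xbar changes from sample to sample.
    if xbar - margin <= mu <= xbar + margin:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals covering mu: {coverage:.3f}")
```

The fixed quantity here is μ, while the interval is what varies. About 95% of the intervals capture μ, which is what the confidence level describes.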

### The General Case

Let’s generalize the IQ example. Suppose that we are interested in estimating the unknown population mean (μ) based on a random sample of size n. Further, we assume that the population standard deviation (σ) is known.

The values of x¯ approximately follow a normal distribution with (unknown) mean μ and standard deviation σ/sqrt(n) (known, since both σ and n are known). By the (second part of the) Standard Deviation Rule, this means that:

There is a 95% chance that our sample mean (x¯) will fall within 2*σ/sqrt(n) of μ,

which means that:

We are 95% confident that μ falls within 2*σ/sqrt(n) of our sample mean (x¯).

Or, in other words, a 95% confidence interval for the population mean μ is:

(x¯ − 2*σ/sqrt(n), x¯ + 2*σ/sqrt(n))

Here, then, is the **general result:**

Suppose a random sample of size n is taken from a normal population of values for a quantitative variable whose mean (μ) is unknown and whose standard deviation (σ) is known. A 95% confidence interval (CI) for μ is:

x¯± 2*σ/sqrt(n)

### Comment

Note that for now we require the population standard deviation (σ) to be known. Practically, σ is rarely known, but for some cases, especially when a lot of research has been done on the quantitative variable whose mean we are estimating (such as IQ, height, weight, scores on standardized tests), it is reasonable to assume that σ is known. Eventually, we will see how to proceed when σ is unknown, and must be estimated with sample standard deviation (s).

### Other Levels of Confidence

The most commonly used level of confidence is 95%. However, we may wish to increase our level of confidence and produce an interval that’s almost certain to contain μ. Specifically, we may want to report an interval for which we are 99% confident that it contains the unknown population mean, rather than only 95%.

Using the same reasoning as in the last comment, in order to create a 99% confidence interval for μ, we should ask: There is a probability of 0.99 that any normal random variable takes values within how many standard deviations of its mean? The answer is approximately 2.576, and therefore, a 99% confidence interval for μ is x¯±2.576*σ/sqrt(n).

Another commonly used level of confidence is a 90% level of confidence. Since there is a probability of 0.90 that any normal random variable takes values within 1.645 standard deviations of its mean, the 90% confidence interval for μ is x¯±1.645*σ/sqrt(n).
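These multipliers can be recovered from the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the function names `z_multiplier` and `confidence_interval` are hypothetical helpers introduced here. Note that the exact 95% multiplier is about 1.96, which the Standard Deviation Rule rounds to 2.

```python
import math
from statistics import NormalDist

def z_multiplier(confidence):
    """Number of standard deviations that capture the central `confidence` area."""
    # The central area leaves (1 - confidence)/2 in each tail.
    return NormalDist().inv_cdf((1 + confidence) / 2)

def confidence_interval(xbar, sigma, n, confidence):
    """Confidence interval for mu when sigma is known."""
    margin = z_multiplier(confidence) * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(115, 15, 100, level)
    print(f"{level:.0%}: z = {z_multiplier(level):.3f}, interval = ({low:.2f}, {high:.2f})")
```

Applied to the IQ example, the interval widens as the confidence level rises: more confidence costs precision.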

### When Is It Safe to Use the Confidence Interval We Developed?

One of the most important things to learn with any inference method is the conditions under which it is safe to use it. It is very tempting to apply a certain method, but if the conditions under which this method was developed are not met, then using this method will lead to unreliable results, which can then lead to wrong and/or misleading conclusions. As you’ll see throughout this section, we always discuss the conditions under which each method can be safely used.

In particular, the confidence interval for μ (when σ is known), x¯ ± z* · σ/sqrt(n), was developed assuming that the sampling distribution of X¯ is normal; in other words, that the Central Limit Theorem applies. This assumption is what allowed us to determine the values of z*, the confidence multiplier, for different levels of confidence.

First, **the sample must be random.** Assuming that the sample is random, recall from the Probability unit that the Central Limit Theorem works when the **sample size is large** (a common rule of thumb for “large” is n > 30), or, for **smaller sample sizes**, if it is known that the quantitative **variable** of interest is **distributed normally** in the population. The only situation in which we cannot use the confidence interval, then, is when the sample size is small and the variable of interest is not known to have a normal distribution. In that case, other methods, called nonparametric methods, which are beyond the scope of this course, need to be used. This can be summarized in the following table: