Already on several occasions we have pointed out the important distinction between a population and a sample. In Exploratory Data Analysis, we learned to summarize and display values of a variable for a sample, such as displaying the blood types of 100 randomly chosen U.S. adults using a pie chart, or displaying the heights of 150 males using a histogram and supplementing it with the sample mean (X¯) and sample standard deviation (S).
In our study of Probability and Random Variables, we discussed the long-run behavior of a variable, considering the population of all possible values taken by that variable. For example, we talked about the distribution of blood types among all U.S. adults and the distribution of the random variable X, representing a male’s height. In this module, we focus directly on the relationship between the values of a variable for a sample and its values for the entire population from which the sample was taken. This module is the bridge between probability and our ultimate goal of the course, statistical inference. In inference, we look at a sample and ask what we can say about the population from which it was drawn. In this module, we’ll pose the reverse question: If I know what the population looks like, what can I expect the sample to look like? Clearly, inference poses the more practical question, since in practice we can look at a sample, but rarely do we know what the whole population looks like. This module will be more theoretical in nature, since it poses a problem which is not really practical, but will present important ideas which are the underpinnings for statistical inference.
Parameters vs. Statistics
To better understand the relationship between sample and population, let’s consider the two examples that were mentioned in the introduction.
Example #1: Blood Type
In the probability section, we presented the distribution of blood types in the entire U.S. population:
Assume now that we take a sample of 500 people in the United States, record their blood type, and display the sample results:
Note that the percentages (or proportions) that we got in our sample are slightly different than the population percentages. This is really not surprising. Since we took a sample of just 500, we cannot expect that our sample will behave exactly like the population, but if the sample is random (as it was), we expect to get results which are not that far from the population (as we did). If we took yet another sample of size 500:
we again get sample results that are slightly different from the population figures, and also different from what we got in the first sample. This very intuitive idea, that sample results change from sample to sample, is called sampling variability.
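Sampling variability can be illustrated with a small simulation. The sketch below is only an illustration, not part of the original example: it assumes the population proportion of blood type A is 0.42 (the figure used for Example 1), and the helper name `sample_proportion`, the seed, and the number of repeated samples are arbitrary choices.

```python
import random


def sample_proportion(p, n, rng):
    """Draw n individuals at random and return the fraction in the category."""
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n


rng = random.Random(1)  # arbitrary seed, fixed only so the run is repeatable
p_type_a = 0.42         # assumed population proportion of blood type A
n = 500                 # sample size used in the example

# Take several independent samples of size 500 and compare their proportions.
proportions = [sample_proportion(p_type_a, n, rng) for _ in range(5)]
for i, p_hat in enumerate(proportions, start=1):
    print(f"sample {i}: proportion of type A = {p_hat:.3f}")
```

Each run of the loop plays the role of one new sample of 500 people: the printed proportions should all land near 0.42 but typically differ from one another.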
Let’s look at another example:
Example #2: Heights of Adult Males
Heights among the population of all adult males follow a normal distribution with a mean μ = 69 inches and a standard deviation σ = 2.8 inches. Here is a probability display of this population distribution:
A sample of 200 males was chosen, and their heights were recorded. Here are the sample results:
The sample mean is x¯=68.7 inches and the sample standard deviation is s = 2.95 inches.
Again, note that the sample results are slightly different from the population. The histogram we got resembles the normal distribution, but is not as fine, and also the sample mean and standard deviation are slightly different from the population mean and standard deviation. Let’s take another sample of 200 males:
The sample mean is x¯=69.065 inches and the sample standard deviation is s = 2.659 inches.
Again, as in Example 1 we see the idea of sampling variability. Again, the sample results are pretty close to the population, and different from the results we got in the first sample.
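The same idea can be sketched for the heights example. The code below assumes the population described above (normal with mean 69 inches and standard deviation 2.8 inches) and draws a few samples of 200; the function name `draw_sample_summary` and the seed are illustrative choices, not part of the original material.

```python
import random
import statistics


def draw_sample_summary(rng, mu, sigma, n):
    """Draw n heights from a normal population; return (sample mean, sample SD)."""
    heights = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.mean(heights), statistics.stdev(heights)


rng = random.Random(2)   # arbitrary seed for repeatability
mu, sigma = 69.0, 2.8    # population mean and standard deviation (inches)
n = 200                  # sample size used in the example

summaries = [draw_sample_summary(rng, mu, sigma, n) for _ in range(3)]
for i, (x_bar, s) in enumerate(summaries, start=1):
    print(f"sample {i}: mean = {x_bar:.2f} in, SD = {s:.2f} in")
```

The sample means and standard deviations should stay close to 69 and 2.8 while changing from sample to sample, which is the behavior described in the two samples above.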
In both examples, we have numbers that describe the population, and numbers that describe the sample. In Example 1, the number 42% is the population proportion of blood type A, and 39.6% is the sample proportion (in sample 1) of blood type A. In Example 2, 69 and 2.8 are the population mean and standard deviation, and (in sample 1) 68.7 and 2.95 are the sample mean and standard deviation.
A parameter is a number that describes the population; a statistic is a number that is computed from the sample.
In Example 1: 42% is the parameter and 39.6% is a statistic.
In Example 2: 69 and 2.8 are the parameters and 68.7 and 2.95 are the statistics.
In this course, as in the examples above, we focus on the following parameters and statistics:
- population proportion and sample proportion
- population mean and sample mean
- population standard deviation and sample standard deviation
The following table summarizes the three pairs, and gives the notation:
| ||(Population) Parameter||(Sample) Statistic|
|Proportion||p||$\hat{p}$|
|Mean||μ||$\bar{x}$|
|Standard deviation||σ||s|
The only new notation here is p for the population proportion (p = 0.42 for type A in Example 1), and $\hat{p}$ for the sample proportion ($\hat{p}$ = 0.396 for type A in Example 1).
1. Parameters are usually unknown, because it is impractical or impossible to know exactly what values a variable takes for every member of the population.
2. Statistics are computed from the sample, and vary from sample to sample due to sampling variability.
Behavior of Sample Proportion: The Sampling Distribution
Again, the simulations on the previous page reinforced what makes sense to our intuition. Larger random samples will better approximate the population proportion. When the sample size is large, sample proportions will be closer to p. In other words, the sampling distribution for large samples has less variability. Advanced probability theory confirms our observations and gives a more precise way to describe the standard deviation of the sample proportions. This is described next.
The Sampling Distribution of the Sample Proportion
If repeated random samples of a given size n are taken from a population of values for a categorical variable, where the proportion in the category of interest is p, then the mean of all sample proportions ($\hat{p}$) is the population proportion (p). As for the spread of all sample proportions, theory dictates the behavior much more precisely than saying that there is less spread for larger samples. In fact, the standard deviation of all sample proportions ($\hat{p}$) is exactly $\sqrt{\frac{p(1-p)}{n}}$.
Since sample size n appears in the denominator of the square root, the standard deviation does decrease as sample size increases. Finally, the shape of the distribution of $\hat{p}$ will be approximately normal as long as the sample size n is large enough. The convention is to require both np and n(1 - p) to be at least 10.
We can summarize all of the above by the following:
$\hat{p}$ has an approximately normal distribution with a mean of $\mu_{\hat{p}} = p$ and a standard deviation of $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$, as long as np and n(1 - p) are both at least 10.
Let’s apply this result to our example and see how it compares with our simulation.
In our example, n = 25 (sample size) and p = 0.6. Note that np = 15 ≥ 10 and n(1 - p) = 10 ≥ 10. Therefore we can conclude that $\hat{p}$ is approximately normally distributed with mean p = 0.6 and standard deviation $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{25}} \approx 0.098$ (which is very close to what we saw in our simulation).
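The calculation for n = 25 and p = 0.6 can be checked with a short script. This is a sketch only: it does not reproduce the course's own simulation, and the function names, seed, and number of repetitions (10,000) are assumptions made for illustration.

```python
import math
import random
import statistics


def sd_of_sample_proportion(p, n):
    """Theoretical standard deviation of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def normal_condition_met(p, n):
    """Rule of thumb from the text: both np and n(1-p) should be at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10


p, n = 0.6, 25
theory_sd = sd_of_sample_proportion(p, n)

# Simulate many samples of size n and record each sample proportion.
rng = random.Random(3)   # arbitrary seed for repeatability
reps = 10_000            # number of simulated samples (an arbitrary choice)
p_hats = [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

sim_mean = statistics.mean(p_hats)
sim_sd = statistics.pstdev(p_hats)

print(f"theoretical mean = {p:.3f}, simulated mean = {sim_mean:.3f}")
print(f"theoretical SD   = {theory_sd:.3f}, simulated SD   = {sim_sd:.3f}")
print("normal approximation condition met:", normal_condition_met(p, n))
```

With this many repetitions, the simulated mean and standard deviation should agree with the theoretical values to about two decimal places.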
Behavior of Sample Proportion: Applying the Standard Deviation Rule
The above results for the distribution of the sample proportion $\hat{p}$ are directly related to the results already obtained for the distribution of the sample count X in a binomial experiment. Remember that X had mean np, standard deviation $\sqrt{np(1-p)}$, and a shape that allowed for normal approximations as long as both np and n(1 - p) were at least 10. Since the sample proportion is $\hat{p} = X/n$, we could derive the mean and standard deviation of $\hat{p}$ by applying the Rules for Means and Variances:
$\mu_{\hat{p}} = \mu_{X/n} = \frac{1}{n}\mu_X = \frac{1}{n}(np) = p$ and $\sigma_{\hat{p}}^2 = \sigma_{X/n}^2 = \left(\frac{1}{n}\right)^2\sigma_X^2 = \frac{1}{n^2}(np)(1-p) = \frac{1}{n}p(1-p)$, so $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$.
The requirements that np and n(1 - p) be at least 10 are the same, whether we are focusing on the distribution of sample count or the distribution of sample proportion. After all, the shape of the distribution of $\hat{p}$ is the same as the shape of the distribution of X: the scale of the horizontal axis is just uniformly divided by n.
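The link between the sample count X and the sample proportion X/n can also be verified exactly, without simulation, by summing over the binomial distribution. The sketch below assumes X is binomial with n = 25 and p = 0.6 (the values from the earlier example); the helper names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count X with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_mean_and_sd_of_p_hat(n, p):
    """Mean and SD of X/n, computed directly from the binomial probabilities."""
    values = [k / n for k in range(n + 1)]            # possible values of X/n
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(v * w for v, w in zip(values, probs))
    variance = sum((v - mean) ** 2 * w for v, w in zip(values, probs))
    return mean, math.sqrt(variance)


n, p = 25, 0.6
mean, sd = exact_mean_and_sd_of_p_hat(n, p)
formula_sd = math.sqrt(p * (1 - p) / n)

print(f"exact mean of sample proportion = {mean:.6f} (formula: {p})")
print(f"exact SD of sample proportion   = {sd:.6f} (formula: {formula_sd:.6f})")
```

Because the possible values of X/n are just the values of X divided by n, with the same probabilities, the directly computed mean and standard deviation match p and the square-root formula up to floating-point rounding.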
Let’s compare and contrast what we now know about the sampling distributions for sample means and sample proportions.
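|Variable type||Parameter||Statistic||Mean of sampling distribution||Standard deviation of sampling distribution||Shape of sampling distribution|
|Categorical (example: left-handed or not)||p = population proportion||$\hat{p}$ = sample proportion||p||$\sqrt{\frac{p(1-p)}{n}}$||Normal IF np ≥ 10 and n(1 - p) ≥ 10|
|Quantitative (example: age)||μ = population mean, σ = population standard deviation||$\bar{x}$ = sample mean||μ||$\frac{\sigma}{\sqrt{n}}$||When will the distribution of sample means be approximately normal?|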
Now we will investigate the shape of the sampling distribution of sample means. When we were discussing the sampling distribution of sample proportions, we said that this distribution is approximately normal if np ≥ 10 and n(1 – p) ≥ 10. In other words, we had a guideline based on sample size for determining the conditions under which we could use normal probability calculations for sample proportions.
When will the distribution of sample means be approximately normal? Does this depend on the size of the sample?
It seems reasonable that a population with a normal distribution will have sample means that are normally distributed even for very small samples.
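This intuition can be probed with a small simulation. The sketch below assumes a normal population with the mean and standard deviation from the heights example (69 and 2.8 inches) and a very small sample size of n = 4. The seed and number of repetitions are arbitrary. It compares the spread of the simulated sample means with σ/√n, and checks how many means fall within one standard deviation of μ, which should be about 68% if the sample means are normally distributed.

```python
import math
import random
import statistics

rng = random.Random(5)    # arbitrary seed for repeatability
mu, sigma = 69.0, 2.8     # normal population (heights example)
n = 4                     # a deliberately small sample size
reps = 20_000             # number of simulated samples (an arbitrary choice)

# Compute the mean of each simulated sample of size n.
means = [
    statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

theory_sd = sigma / math.sqrt(n)      # standard deviation of sample means
sim_sd = statistics.pstdev(means)

# Fraction of sample means within one theoretical SD of the population mean.
within_one = sum(abs(m - mu) <= theory_sd for m in means) / reps

print(f"theoretical SD of sample means = {theory_sd:.3f}")
print(f"simulated SD of sample means   = {sim_sd:.3f}")
print(f"fraction within one SD of mu   = {within_one:.3f}")
```

A fraction close to 0.68 is consistent with a normal shape even at n = 4, though a single check like this is only suggestive, not a proof.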
As mentioned in the introduction, this last section in probability is the bridge between the probability sections and inference. It focuses on the relationship between sample values (statistics) and population values (parameters). Statistics vary from sample to sample due to sampling variability, and therefore can be regarded as random variables whose distribution we call the sampling distribution. In this module we focused on two statistics, the sample proportion, $\hat{p}$, and the sample mean, $\bar{X}$. Our goal was to explore the sampling distribution of these two statistics relative to their respective population parameters (p and μ), and we found in both cases that under certain conditions the sampling distribution is approximately normal. This result is known as the Central Limit Theorem. As we’ll see in the next section, the Central Limit Theorem is the foundation for statistical inference.