Chapter 4 Sampling Distributions

In this chapter, we will introduce the abstract concept of sampling distributions and their importance to statistics. In an earlier lecture, we learned about common theoretical distributions, such as the Binomial distribution and the Normal distribution. We also learned about population parameters that define these distributions, such as \(p\), \(\mu\), and \(\sigma\).

Theoretical distributions and their parameters are useful when we are trying to understand the probability of observing various outcomes in random samples given a known characteristic of the population.

For example, suppose that in the population of people who suffer from occasional migraine headaches, 60% get some relief from taking ibuprofen. This 60% is assumed to be a known characteristic (or parameter) of the population. One question we could answer based on this information is: if we have a random sample of 50 people who suffer from occasional migraines, what is the chance that 10 of them will get relief from ibuprofen? We can use the Binomial distribution to answer this question. In this example, we are working from the known population parameter to tell us information about the unknown sample.

However, when analyzing data, we are trying to make an inference about the population given the result or estimate from the sample. This essentially works in the opposite direction from the example just described.
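Before turning to inference, the forward calculation in the migraine example can be made concrete. The sketch below is a minimal illustration, not part of the lecture: it reads "10 of them" as exactly 10 (an assumption, since the question could also mean "at most 10"), and the helper name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p (the Binomial formula)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 50, 0.60  # sample size and known population proportion from the example

# Reading 1 (assumed): exactly 10 of the 50 people get relief.
prob_exactly_10 = binomial_pmf(10, n, p)

# Reading 2 (alternative): 10 or fewer of the 50 people get relief.
prob_at_most_10 = sum(binomial_pmf(k, n, p) for k in range(11))

print(f"P(X = 10)  = {prob_exactly_10:.3e}")
print(f"P(X <= 10) = {prob_at_most_10:.3e}")
```

Either way the probability is tiny, which makes sense: with 60% of the population getting relief, we would expect about 30 of 50 people to get relief, so 10 is far from what is typical.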

Using the same example, suppose we are interested in investigating how effective ibuprofen is for people who suffer from occasional migraine headaches. To do this, we obtain a sample from the population and get an estimate of the proportion who get relief after taking ibuprofen. We could then use the results from the sample to make an inference about the population parameter of interest, the proportion who get relief. So the unknown characteristic here is the population parameter and the known part is the result from our sample. Theoretical distributions are still used when making an inference, but as a way to model the behavior, or distribution, of the sample statistic.

What do we mean by this last sentence? This lecture addresses that question, which is the foundation for statistical inference.

Suppose we want to know the proportion of people in the United States who have diabetes. One approach would be to survey everybody in the United States and calculate the proportion of them that have diabetes. Suppose this population proportion is 0.093 or [9.3%][CDC2014].

But, suppose that we didn’t know the population proportion (it’s a black box) and that it isn’t possible to collect data from everybody. Instead, we have to obtain a representative sample of people from the population, make measurements on them, and then use that information to infer something about the proportion in the entire population. How can we use information from a single sample to say something about the population? We have to understand a key thing about samples: sampling variability.

What is sampling variability? Sampling variability is the term we use to explain what happens when the random sampling process is repeated. It tells us what we intuitively know. If many random samples were collected, the measured quantities, or sample statistics, from those samples will vary from one another. Additionally, the sample statistic we obtain will (most likely) be different from the population quantity, or parameter, of interest.

We need to be able to understand the behavior of how the sample statistics vary in order to be able to make an inference about the population parameter. This information about how much a statistic varies from sample to sample is key in helping us know how accurate an estimate is. How can we do this? We can simulate what would happen if random samples of the same size were repeatedly taken.

Let’s go back to the Diabetes example to examine the variability of the proportion from sample to sample.

Suppose researchers obtain a random sample of 100 people from the United States. They calculate the proportion in their sample (denoted \(\hat{p}\), read “p-hat”) who have diabetes and find that 0.110 or 11% of those in their sample have diabetes. Two other researchers do the exact same thing. In their random samples of 100 people, one finds that 7% have diabetes and the other finds that 12% have diabetes. Each of these values is an estimate of the true population proportion.

Just by chance, these researchers collect samples that have a higher or lower value than the true population value of 0.093, and the sample statistics are each different from one another. This is an example of sampling variability. However, taking only three samples does not give us a very good understanding of the behavior of the sample proportions in repeated sampling. We need to see what happens if we repeat the sampling process many, many more times.

If the study with random samples of size n = 100 is repeated 1,000 times (that is, collect 1,000 samples), and the calculated sample proportion for each sample is plotted, a dot plot of the sample proportions would look like this. If we repeat our study over and over indefinitely, the collected sample proportions would form the sampling distribution of sample proportions – defined as the distribution of proportions from all possible samples of this size. In general, whenever a distribution is made up of sample statistics (e.g., means, medians, standard deviations), the distribution is called a sampling distribution of that statistic.
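The repeated-sampling process described above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the code behind the lecture's dot plot: each person is treated as having diabetes independently with probability 0.093, the seed is arbitrary, and the function names `sample_proportion` and `simulate_p_hats` are hypothetical.

```python
import random
import statistics


def sample_proportion(n, p, rng):
    """Simulate one random sample of n people and return p-hat,
    the fraction of the sample that has the condition."""
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n


def simulate_p_hats(n, p, reps, seed=2024):
    """Repeat the sampling process `reps` times and collect each p-hat."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sample_proportion(n, p, rng) for _ in range(reps)]


# 1,000 samples of size n = 100 from a population with p = 0.093
p_hats = simulate_p_hats(n=100, p=0.093, reps=1000)

center = statistics.mean(p_hats)      # center of the simulated distribution
std_error = statistics.stdev(p_hats)  # spread of the sample proportions

print(f"center    = {center:.3f}")
print(f"std error = {std_error:.3f}")
print(f"range     = {min(p_hats):.2f} to {max(p_hats):.2f}")
```

Each element of `p_hats` plays the role of one dot in the dot plot. The exact numbers depend on the seed, so they will not match the plot digit for digit.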

What do we notice about the behavior of the sample proportions over repeated random samples? Recall that when we evaluate distributions of numerical data, we look at the shape, the center, and the spread.

The shape of the distribution of sample proportions is approximately bell-shaped, with the center at approximately 0.093 (the population proportion) and values ranging from 0.02 to 0.18. The standard deviation of the sample proportions, which measures how far they typically fall from their mean, is 0.029. When we are describing a sampling distribution, we call the standard deviation of the sample statistics the standard error (denoted std. error in the plot).

But what would have happened to the sampling distribution of sample proportions if the samples each had n = 500 instead of n = 100 observations? The dot plot above shows the sampling distribution of that scenario.

From this dot plot, we see that the shape is still approximately bell-shaped and the center is still at the population proportion of 0.093. However, compared to the sampling distribution of sample proportions from samples of size 100, the sampling distribution for samples of size 500 has a much smaller spread. The sample proportions from samples of size 100 ranged from 0.02 to 0.18, whereas in this case the sample proportions range from around 0.06 to about 0.14 and have a standard error of 0.013.

To drive the point home, let’s examine one more scenario. What do you think would happen if samples each had n = 1,000 observations, instead of n = 500 or n = 100? The dot plot above shows the sampling distribution of that scenario.

Similar to the last two dot plots, the shape is approximately bell-shaped (even more so than in the previous two) and the center is at the population proportion of 0.093. But now, the spread of the sample proportions is even smaller. The sample proportions range from around 0.064 to 0.125 and have a standard error of 0.0093.

What general behaviors did we notice as we examined the three sampling distributions of sample proportions?
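One way to line the three scenarios up side by side is to rerun the simulation for each sample size. The sketch below does this under assumptions that are not part of the lecture: independent draws with p = 0.093, an arbitrary seed, and hypothetical names such as `simulate_p_hats`. It also prints \(\sqrt{p(1-p)/n}\), the standard formula for the standard error of a sample proportion, which the lecture has not introduced and which is included here only as a cross-check.

```python
import math
import random
import statistics

P_TRUE = 0.093  # population proportion from the diabetes example
REPS = 1000     # number of simulated samples per scenario


def simulate_p_hats(n, p, reps, rng):
    """Return `reps` sample proportions, each from a random sample of size n."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


rng = random.Random(7)  # arbitrary seed for reproducibility
summary = {}
for n in (100, 500, 1000):
    p_hats = simulate_p_hats(n, P_TRUE, REPS, rng)
    summary[n] = {
        "center": statistics.mean(p_hats),
        "sim_se": statistics.stdev(p_hats),  # simulated standard error
        "formula_se": math.sqrt(P_TRUE * (1 - P_TRUE) / n),
        "low": min(p_hats),
        "high": max(p_hats),
    }

for n, row in summary.items():
    print(
        f"n={n:>5}  center={row['center']:.4f}  "
        f"sim SE={row['sim_se']:.4f}  formula SE={row['formula_se']:.4f}  "
        f"range={row['low']:.3f} to {row['high']:.3f}"
    )
```

For p = 0.093 the formula gives roughly 0.029, 0.013, and 0.0092 for n = 100, 500, and 1,000, close to the simulated standard errors reported above. Small differences are expected because each scenario uses only 1,000 samples.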

The first behavior pertains to shape. As the sample size (n) increased, the sampling distribution of sample proportions looked more bell-shaped. This bell-shaped pattern is observed in the sampling distributions for many statistics (but not all) and will play an important role in the majority of inferential methods we use in this course.

The second pattern is that the center of the sampling distribution was always located near the population parameter that we are trying to estimate. It would be exactly at the population parameter had we sampled an infinite number of times (which is the precise definition of a sampling distribution), but because there were only 1,000 samples in each of the dot plots (for demonstration’s sake), the center was not always quite equal to the population parameter. This idea of the center of the sampling distribution being equal to the population parameter is important because we want to know that we are “hitting” our target (the parameter of interest) on average across all possible samples. That is, our statistic is neither systematically overestimating nor systematically underestimating the value it is trying to estimate; rather, it is hitting the parameter of interest on average.
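The claim that the center settles on the population parameter as more samples are taken can be checked by simulation. The sketch below is illustrative only: the repetition counts (10, 1,000, 20,000), the seed, and the helper name `mean_of_p_hats` are assumptions chosen for the demonstration.

```python
import random
import statistics


def mean_of_p_hats(n, p, reps, rng):
    """Average of `reps` sample proportions, each from a sample of size n."""
    p_hats = [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]
    return statistics.mean(p_hats)


rng = random.Random(11)  # arbitrary seed for reproducibility
p_true = 0.093           # population proportion from the diabetes example

gaps = {}
for reps in (10, 1000, 20000):
    center = mean_of_p_hats(100, p_true, reps, rng)
    gaps[reps] = abs(center - p_true)  # distance between center and parameter
    print(f"reps={reps:>6}  center={center:.4f}  gap={gaps[reps]:.4f}")
```

With only a handful of samples the center can miss 0.093 noticeably. With many thousands of samples the gap is typically very small, which is what "hitting the target on average" means in practice.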

Lastly, as the sample size increased, the width or spread of the sampling distribution decreased. That is, as n increased, the sample statistics tended to be closer to the true population parameter value, thus making the variability of the sample statistics smaller. Besides the shape of the distribution, knowing how the statistics vary from sample to sample (i.e., the spread of the sampling distribution) is what we really care about. We use this information in making an inference from the sample to the population because it tells us how precise an estimate is.

Let’s review the different distribution types that have been presented thus far: population distributions, sample distributions, and sampling distributions.

Recall from a previous lecture that a population distribution describes the distribution of a characteristic of that population. It displays the proportion or probability of the values that make up the characteristic. As we learned in this lecture, this distribution is typically unknown, and information about that distribution (e.g., population parameters such as the population proportion p) is often what we want to estimate. In the Diabetes example, we assumed the population proportion of people in the United States who have diabetes was 0.093.

However, what we most likely have access to is a sample of data from the population. If we plot the data from our sample, we create a sample distribution, which is a distribution of a characteristic of the sample. It is similar to the population distribution but it most likely will have different proportions of the values than the population. We use the information from the known and observed sample to help us estimate the unknown population value/parameter of interest.

Unlike population and sample distributions, which are distributions of cases or observations, sampling distributions are distributions of statistics, where each “dot” (a.k.a., sample statistic) is aggregate information from a sample of observations and many, many, many “dots” (a.k.a., sample statistics) are obtained so we can understand the behavior of the “dots”. While we can simulate what the sampling distribution will look like (as we have in this lecture), we do not observe this distribution directly in real life. This distribution is an abstraction of what would happen if we could mimic the sampling behavior over and over again. It helps us understand the sampling variability of the statistics so that we can make an inference about the population from the sample.
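The difference between the three distribution types is easiest to see in what each one is made of. The sketch below is a minimal illustration using the diabetes example, with each person coded as 1 (has diabetes) or 0 (does not). The variable names and the seed are assumptions made for the demonstration.

```python
import random
from collections import Counter

rng = random.Random(3)  # arbitrary seed for reproducibility
p_true = 0.093          # population proportion assumed in the diabetes example

# Population distribution: probabilities of each value across the population.
population_distribution = {1: p_true, 0: 1 - p_true}

# Sample distribution: individual observations from ONE sample of 100 people.
one_sample = [1 if rng.random() < p_true else 0 for _ in range(100)]
sample_distribution = Counter(one_sample)

# Sampling distribution (approximated): one p-hat per sample, over many samples.
p_hats = [
    sum(rng.random() < p_true for _ in range(100)) / 100
    for _ in range(1000)
]

print("population:", population_distribution)
print("one sample:", dict(sample_distribution))
print("first five p-hats:", p_hats[:5])
```

The first two objects describe people (cases), while `p_hats` holds one summary number per sample. That is why only the last one is a distribution of statistics.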

While sample distributions are an important part of the data analysis process, sampling distributions are the foundation for statistical inference (as mentioned earlier).

The concept of sampling distributions for sample proportions was presented in this lecture. But, any statistic that we calculate from a sample has its own sampling distribution. For example, there is a sampling distribution for sample means (which will be discussed in a future lecture), or for sample relative risks. As we discussed earlier, many statistics have sampling distributions that are bell-shaped, but some statistics have sampling distributions that are not bell-shaped (e.g., sample median, sample standard deviation, sample relative risk, sample odds ratio).
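To see that different statistics can have differently shaped sampling distributions, one can simulate two statistics from the same samples. The sketch below is a hedged illustration, not an example from the lecture: it assumes a Normal population with mean 0 and standard deviation 1 and a small sample size of n = 5, both chosen purely for demonstration. The helper `skewness` is hypothetical and measures asymmetry (values near 0 indicate a symmetric shape).

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness: about 0 for a symmetric, bell-shaped
    distribution, positive when the right tail is longer."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(5)  # arbitrary seed for reproducibility
n, reps = 5, 5000       # assumed sample size and number of simulated samples

sample_means, sample_sds = [], []
for _ in range(reps):
    # Assumed population: Normal with mean 0 and standard deviation 1.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))
    sample_sds.append(statistics.stdev(sample))

print(f"skewness of sample means: {skewness(sample_means):.2f}")
print(f"skewness of sample SDs:   {skewness(sample_sds):.2f}")
```

Under these assumptions the sample means form a symmetric distribution, while the sample standard deviations are noticeably right-skewed, because a standard deviation cannot fall below zero but can occasionally be large.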

References:

[CDC2014]: https://www.cdc.gov/media/releases/2014/p0610-diabetes-report.html