Chapter 9 Hypothesis Testing of Means

Now we will formally present how to carry out a hypothesis test by comparing a single mean to a ‘standard’ value. The context we will use to learn about hypothesis testing is Mercury Content in Fish.

Mercury is highly toxic, and high levels are harmful to ecosystems and humans. As a result, many groups (including state health departments) monitor mercury levels in lakes. In particular, they monitor mercury content in fish, as high mercury levels can be a health hazard to humans who consume this food item (especially to women who are pregnant, nursing mothers, and young children).

A study was carried out on 53 lakes in Florida to understand the mercury concentration in Florida sport fish. Samples of fish (largemouth bass) were collected from each of the 53 lakes, and the average mercury level in the fish for each lake was recorded (in parts per million; ppm). For simplicity’s sake, ‘mercury level’ will be used instead of ‘average mercury level’ for the remainder of this lecture, but remember that the mercury level in a lake is an aggregate measure from ALL of the fish sampled from that lake, and not for ONE fish. Summary statistics and plots of these data are presented above.

Recall that when we examine a distribution, we focus on the shape, center, and spread. Multiple summary measures and displays are presented here to give as complete a picture of the data as a researcher might need (and to review previous topics). The mean mercury level in the 53 lakes was 0.53 ppm, and the average deviation away from the mean (the standard deviation, SD) was 0.34 ppm. The data appear to have a right, or positive, skew.

References: Study: Lange, T. R., Royals, H. E., & Connor, L. L. (1994). Mercury accumulation in largemouth bass (Micropterus salmoides) in a Florida lake. Archives of Environmental Contamination and Toxicology, 27(4), 466-471. Dataset: Lock, R. H., Lock, P. F., Lock Morgan, K., Lock, E. F., & Lock, D. F. (2013). Statistics: Unlocking the power of data (1st ed.). Hoboken, NJ: John Wiley & Sons, Inc.

Guidelines for ‘safe’ limits of mercury content in fish are set by the Food and Drug Administration (FDA) or its equivalent. The FDA in the US has determined that a ‘safe’ limit for mercury content in fish is 1.0 ppm (whereas the limit set by the Canadian equivalent is 0.5 ppm).

Because the data were collected in the US, the researchers were interested in determining whether the average mercury level in largemouth bass in Florida lakes is acceptable by the US FDA standard. That is, is the average mercury level in largemouth bass in Florida lakes less than 1.0 ppm?

We saw that the study sample had a mean mercury level of 0.53 ppm, which is less than 1.0 ppm, but this may be due to sampling variability. The key question is: Is it a lot lower than what we would expect, taking sampling variability into account, if the population from which this sample was drawn truly had a mean mercury level of 1.0 ppm? We will use hypothesis testing to help us answer this question. Before we can carry out a hypothesis test, we need to first evaluate the assumptions of the test. These assumptions are the same as those we’ve seen before, in calculating a confidence interval for a mean. That is:

The sample should be a random (or representative) sample from the population, to allow us to generalize the results from this sample to the population it came from;

The observations should be independent of one another (otherwise, we would use a different statistical inferential method that took dependent or correlated data into account); and

The sampling distribution of sample means should be approximately Normal. How can we check this assumption? We can check to see if the conditions for the Central Limit Theorem (CLT) hold. Recall that if the underlying population distribution is approximately Normal (which we can check by plotting the sample distribution), then the sampling distribution of sample means is approximately Normal. Or, even if the underlying population distribution is not Normal, if the sample size is ‘large enough’ (and what is ‘large enough’ depends on how heavily skewed the population distribution is), then the sampling distribution will still be approximately Normal. If this assumption is not met, use other methods for carrying out a hypothesis test for a mean (such as re-randomization tests).
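To make the ‘large enough’ idea concrete, here is a minimal simulation sketch in Python (an assumed tool choice; the skewed population is a made-up stand-in, not the mercury data): repeated samples from a right-skewed population still produce sample means whose distribution is close to Normal.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)  # seeded for reproducibility

# A right-skewed stand-in "population" (exponential); illustrative only,
# NOT the mercury data.
n = 53           # sample size, matching the mercury example
n_sims = 10_000  # number of simulated samples

# Draw 10,000 samples of size 53 and keep each sample's mean
sample_means = rng.exponential(scale=0.5, size=(n_sims, n)).mean(axis=1)

# The population is clearly skewed, but the sample means are much less so
population_draws = rng.exponential(scale=0.5, size=100_000)
print(f"skewness of population draws: {skew(population_draws):.2f}")  # ~2
print(f"skewness of sample means:     {skew(sample_means):.2f}")      # ~0.3
```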

If these assumptions are not met, then the results of a hypothesis test will not be valid. Let’s check the assumptions for the Mercury Content in Fish example.

Are the lakes a random or representative sample of all Florida lakes? Based on the brief description of the dataset, it doesn’t seem that the lakes were randomly selected. Are they nevertheless representative of all lakes in Florida? It seems reasonable to argue that the study authors would likely have chosen the 53 lakes to be representative of all lakes in Florida.

Are the observations independent of one another? Again, without more information, it is reasonable to assume that the mercury content in a given lake is not affected by the mercury level in the other lakes, so the measurements from the 53 lakes are independent of one another.

Lastly, are the conditions met for the sampling distribution of sample means to be approximately Normally distributed? Recall that the sample distribution of mercury levels in fish in the 53 lakes appeared to be skewed to the right. However, the sample size is 53, which is ‘large enough’. Thus, we can reasonably assume that the sampling distribution of sample means is approximately Normal.

Now that we have examined the data and checked the assumptions for the test, we can move on to formally defining the hypotheses: the null and the alternative. We write them in terms of the population parameter of interest: the population mean, \(\mu\), in this case.

As mentioned in a previous lecture, the null hypothesis defines the skeptical perspective or the ‘no difference’ situation. It is written as: the population mean, \(\mu\), is equal to some specified value, \(\mu_0\) (mu-naught), the null value. Where does this null value come from? It is typically defined by the research question.

The alternative hypothesis, on the other hand, is the competing claim. It is typically the statement we are interested in demonstrating.

There are three different forms the alternative hypothesis can take. It can be written as either:

- \(\mu\) is not equal to \(\mu_0\), or
- \(\mu\) is greater than \(\mu_0\), or
- \(\mu\) is less than \(\mu_0\).

The last two statements, “greater than” and “less than”, result in one-tailed tests, because we are only interested in differences in one particular direction. In contrast, the “not equal to” hypothesis results in a two-tailed test, because we are interested in differences in either direction. The majority of the tests in research articles are two-tailed tests.

How do we know which alternative hypothesis to choose for our hypothesis test? Again, it is defined by the research question. In the Mercury Content in Fish example, recall that the research question is: Is the mean mercury level in largemouth bass in Florida lakes less than 1.0 ppm? Let’s use information from this research question to define the null value and the direction of the alternative hypothesis.

We can infer from the wording of the research question that we are interested in comparing a mean to a ‘standard’ value, and that the standard or null value of interest is 1.0 ppm. Therefore, the null hypothesis for this example is that the true mean mercury level in largemouth bass in all Florida lakes is equal to 1.0 ppm. Or, using notation, \(\mu = 1.0\) ppm.

Because the researchers were interested in determining whether there is evidence for ‘less than 1.0 ppm’, the alternative hypothesis is that the true mean mercury level in largemouth bass in all Florida lakes is less than 1.0 ppm. Or, using notation, \(\mu < 1.0\) ppm.

To evaluate the claims in our hypotheses, we need to gather evidence (data). We first summarize the data using exploratory data analysis (summary statistics, tables, graphs) and then calculate a test statistic to measure the compatibility between the result from the data and the null hypothesis. In general, this test statistic standardizes the result from the study to a known statistical distribution.

When carrying out a test for a single mean (given that the assumptions are met), the appropriate test statistic is the t-statistic, which has a t-distribution with n - 1 degrees of freedom. It is found by calculating the difference between the sample mean and the null value and dividing by the standard error (which is the sample standard deviation, s, divided by the square root of the sample size, n). We can think of this value as measuring how far the result we found from our sample data is from what we would expect (given sampling variability) IF the null hypothesis were really true. It is expressed in units of standard error: for example, a t-statistic of +2.0 means that the observed sample mean is 2 standard errors above the null hypothesized population mean. Large test statistic values represent large (relative) differences between the sample result and the null value, and small test statistic values represent small (relative) differences between the sample result and the null value. Because we are using the t-distribution to carry out this hypothesis test, it is often referred to as a t-test (specifically, it is a one-sample t-test when a single mean is being tested).
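In symbols, writing \(\bar{x}\) for the sample mean, \(s\) for the sample standard deviation, \(n\) for the sample size, and \(\mu_0\) for the null value:

\[
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad \text{with } n - 1 \text{ degrees of freedom.}
\]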

Does this t formula look similar to something we have seen before? Remember the formula \(z = \frac{\text{value} - \text{mean}}{\text{SD}}\)? The two distributions, the t- and z- (or Standard Normal) distributions, are very similar. When we take a value, subtract a mean from that value, and divide the whole thing by some standard deviation measure, we are standardizing the original value. That is, we are measuring how far a value is from the mean, in SD units (for z-scores) or SE units (for t-statistics). So why the t-distribution?

In a previous lecture about confidence intervals for a mean, the t-distribution was used because the sample standard deviation, s, was used in the equation instead of the population standard deviation, \(\sigma\) (sigma). Using s as an estimate of \(\sigma\) adds extra uncertainty, because s is a statistic and will vary from sample to sample. Recall that the distribution that accounts for this extra uncertainty is the t-distribution.
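To see this extra uncertainty concretely, here is a small sketch in Python (scipy is an assumed tool choice) comparing the cutoffs that capture the central 95% of each distribution; the t cutoffs are wider for small samples and approach the Normal value of 1.96 as the degrees of freedom grow.

```python
from scipy import stats

# 97.5th percentile = cutoff for a central 95% interval
print(f"Standard Normal: {stats.norm.ppf(0.975):.3f}")  # 1.960
for df in (5, 15, 52, 500):
    print(f"t with df={df:3d}:  {stats.t.ppf(0.975, df):.3f}")
# df=5 gives ~2.571 and df=52 gives ~2.007: the heavier tails of the
# t-distribution encode the extra uncertainty from estimating sigma with s,
# and fade away as the sample size (and df) grows.
```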

A similar argument is made when carrying out hypothesis testing for a mean. When we are trying to carry out inference for a single mean (using either confidence intervals or hypothesis testing), the t-distribution with n - 1 degrees of freedom is the appropriate statistical distribution.

Returning to the Mercury Content in Fish example, let’s walk through all of the aspects within the evidence piece.

Data were collected and we explored the data via summary statistics and various plots. We observed that the mean mercury level in largemouth bass in our sample of n=53 lakes was 0.53 ppm. Then we wanted to test whether this value is unusual if we assume that the true mean mercury level in Florida lakes is 1.0 ppm (the claim we are testing against).

To answer this question, we compute a t-test statistic using the data and the null value. For this example,

\[ t = \frac{0.53 - 1.0}{0.34/\sqrt{53}} = -10.06, \]

and this test statistic has a t-distribution with n - 1 = 52 degrees of freedom. It tells us that the mean mercury level in our sample of lakes (0.53 ppm) is about 10 standard errors lower than the hypothesized null value of 1.0 ppm. When we plot the t-distribution with 52 degrees of freedom to see where our test statistic falls, we see that a t-value of -10.06 is very far out in the left tail of the distribution. It appears that this sample result is very unlikely, or very unusual, if we assume that 1.0 ppm is the true population mean mercury level in fish in all Florida lakes. But how unlikely is it? Can we quantify this value?

To quantify how unusual the evidence is compared to what is assumed to be true, we calculate a p-value. The p-value is the probability that you would obtain a sample result this “unusual” if the null hypothesis were really true and any observed difference were simply due to sampling variability. To put it another way, it is the probability of getting our sample result (or one even more extreme) if the null hypothesis were true. As with any probability, the value can take on numbers between 0.0 and 1.0. A sample result would be deemed “unusual” or unlikely if the probability of it occurring (under the assumption that the null hypothesis is really true) is small (that is, the p-value is small), and a sample result would be deemed likely if the probability is large (that is, the p-value is large). The smaller the p-value, the less consistent or compatible the data are with the null hypothesis.
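Here is a minimal sketch of the test-statistic calculation above, in Python (variable names are my own):

```python
import math

x_bar = 0.53  # sample mean mercury level (ppm)
mu_0 = 1.0    # null value: the FDA guideline (ppm)
s = 0.34      # sample standard deviation (ppm)
n = 53        # number of lakes

se = s / math.sqrt(n)    # standard error of the mean
t = (x_bar - mu_0) / se  # one-sample t-statistic
df = n - 1               # degrees of freedom

print(f"SE = {se:.4f}, t = {t:.2f}, df = {df}")  # t is about -10.06
```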

But how do we figure out what would be considered small? By using something called the significance level, which is denoted by the Greek letter \(\alpha\) (alpha). The significance level, or level of ‘unusualness’, is typically set at a probability of \(\alpha = 0.05\), or 5%. So if something has a chance of occurring 5% of the time or less, then we would consider it to be ‘unusual’. The significance level for a study is chosen by the researcher at the beginning of the study, before collecting any data. Note that there is nothing magical about the typical significance level of 0.05. We could just as well set it at \(\alpha = 0.01\), or even \(\alpha = 0.0001\). So how do we decide what \(\alpha\) should be? Some research areas have standard accepted values of \(\alpha\) in their field, but most studies will have a significance level of 0.05.

Putting together the evaluation pieces, the last piece of the framework is to make a conclusion about the strength of evidence we have against the claim by comparing the p-value to the significance level, \(\alpha\).

Let’s visualize how to find the p-value for a test statistic using its distribution. Recall that for continuous distributions (such as the Normal or t-distribution), the area under the curve tells us the probability that a value will occur within a specific range of values. We determine the appropriate area under the curve by the test statistic (and its distribution) and the alternative hypothesis.

For a t-test, if the alternative hypothesis is that \(\mu\) is less than the null value (a one-sided test), then we focus our interest on the lower (left) tail beyond our test statistic. So we locate the test statistic on the t-distribution and calculate the area under the curve to the left of that test statistic. This provides the probability of seeing the result we saw in our study (or less) if the null were really true. (Keep in mind that the sample mean might occasionally happen to be larger than the null value, making the test statistic positive, but we still want the area to the left of our test statistic.)

If the alternative hypothesis is that \(\mu\) is not equal to the null value (a two-sided test), then we are interested in both tails beyond our test statistic. A common way to calculate this is to take the absolute value of our test statistic, calculate the area under the curve to the right of that value, and then multiply this area by two. This provides the probability of seeing the result we observed (or one more extreme, in either direction) if the null were really true.

Lastly, if the alternative hypothesis is that \(\mu\) is greater than the null value (again, a one-sided test), then we are interested in the upper tail beyond our test statistic, and a similar procedure is carried out. We calculate the area under the curve to the right of the test statistic, and this is the probability of seeing the result we saw (or more) if the null were really true. (Again, keep in mind that the sample mean might occasionally happen to be smaller than the null value, making the test statistic negative, but we still want the area to the right of our test statistic.)

Note that we will always use software to calculate p-values from our test statistic and the appropriate t-distribution. Once the p-value is found, then we can make a formal decision about the test.
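As an illustration of what that software is doing, here is a sketch in Python (scipy is an assumed tool choice) of all three p-value calculations, using the mercury test statistic:

```python
from scipy import stats

t, df = -10.06, 52  # test statistic and degrees of freedom from the example

p_less = stats.t.cdf(t, df)         # area to the left  (Ha: mu < mu_0)
p_greater = stats.t.sf(t, df)       # area to the right (Ha: mu > mu_0)
p_two = 2 * stats.t.sf(abs(t), df)  # both tails        (Ha: mu != mu_0)

print(f"one-sided, less:    {p_less:.1e}")  # about 4e-14
print(f"one-sided, greater: {p_greater:.4f}")
print(f"two-sided:          {p_two:.1e}")
```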

If the sample result is unlikely to occur (taking sampling variability into account) if the null hypothesis is true, then the p-value will be less than alpha and we say we reject the null hypothesis. That is, we have evidence against the null hypothesis and in favor of the alternative hypothesis. The sample data provide statistically significant evidence in support of the alternative hypothesis.

If the sample result is likely to occur (taking sampling variability into account) if the null hypothesis is true, then the p-value will be greater than alpha and we say we do not reject the null hypothesis. That is, we lack evidence against the null. We do not have sufficient evidence to reject the null hypothesis.

In either case, remember to always state the conclusion in the context of the problem, and not just conclude ‘reject the null’ or ‘the result is statistically significant’.

By the way, note that it is a quirk of statistical practice to never accept the null hypothesis, but only fail to reject it. This is because lacking strong evidence against the null hypothesis is not the same as having strong evidence for the null hypothesis. In the trial-by-jury example, the jury does not find the defendant ‘innocent’ but only ‘not guilty’. As the colloquial saying goes, “Absence of evidence is not evidence of absence.”

Let’s evaluate the evidence for the Mercury Content in Fish example.

Recall that the test statistic for this example was -10.06. We noticed that it was in the extreme left tail of the t-distribution. So what is the probability of seeing a t-value of -10.06 (or less, because of the alternative hypothesis) if the null hypothesis were really true? To find this, we calculate the area under the t-distribution curve to the left of -10.06. This value turns out to be \(4 \times 10^{-14}\) (in scientific notation), or really, really, really unlikely. Whenever you find a p-value that is this small, it is best to report it as ‘p < 0.001’.

Assuming we set the significance level, \(\alpha\), to 0.05, our p-value of < 0.001 means that we will reject the null hypothesis and conclude that there is evidence that the mean mercury level in largemouth bass in all Florida lakes is less than the guideline mercury level of 1.0 ppm. This is good news: the evidence suggests that the fish in Florida lakes are safe to eat.

Testing a single mean against a standard or skeptical claim is not as common in practice as comparing two groups or comparing two measurements on one group (paired data). Comparing two groups occurs when we have two distinct groups and the observations in each group are independent of one another (to be discussed in more detail in a future lecture). In contrast, paired data arise when we have two measurements of something on all members of a single group, and those measurements are dependent on one another. Examples of paired data include:

- Before and after (or pre and post) measurements are taken on each participant in a study;
- A new treatment is tested on one side of the body and the placebo is tested on the opposite side of the body of the same participant;
- Study participants are matched based on demographic variables (such as age and gender), and one of each pair is assigned to the treatment group and the other is assigned to the control group; and
- Study participants are twins or siblings recruited as pairs.

In all of these cases, the two measurements are likely to be related to one another, or dependent on one another, and are not independent. (For example, twins are likely to be more similar to each other than to an unrelated person.) When the variable of interest (the outcome variable) is a continuous variable, we can carry out a paired t-test. However, a paired t-test is essentially a one-sample test on a variable that is a paired difference. Let’s look at an example to understand how this is the case.

A study on patients with cystic fibrosis looked to see if patients’ pulmonary function (measured as FEV1, % of predicted) improved while on a new treatment during the course of their hospital stay. The researchers measured the FEV1 for 18 patients at admission (Pre FEV1), provided all of them the new treatment, and measured the FEV1 again on those same 18 patients at discharge (Post FEV1). They were really interested in the mean difference, or mean change, in FEV1 values from admission to discharge. So they calculated the difference (Post FEV1 - Pre FEV1) for each patient (so that positive differences indicate improved lung function) and then calculated summary statistics (sample mean and sample standard deviation) for that difference, or change, variable. They found that the mean change in FEV1 was 12.2 (% of predicted) and the standard deviation of the change in FEV1 was 9.1 (% of predicted).
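Before setting up the hypotheses for this study, here is a short sketch in Python (scipy is an assumed tool choice; the small pre/post arrays are made-up placeholders, not the study data) showing the equivalence claimed above: a paired t-test gives exactly the same answer as a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

# Made-up FEV1 (% of predicted) values for 6 patients -- NOT the study data
pre = np.array([55, 62, 48, 70, 66, 59])   # at admission
post = np.array([68, 71, 60, 75, 80, 64])  # at discharge

diff = post - pre  # positive differences indicate improved lung function

paired = stats.ttest_rel(post, pre)              # paired t-test
one_sample = stats.ttest_1samp(diff, popmean=0)  # one-sample t-test on diffs

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)   # identical results
```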

References: Data: Pezzulo, A. A., Stoltz, D. A., Hornick, D. B., & Durairaj, L. (2012). Inhaled hypertonic saline in adults hospitalised for exacerbation of cystic fibrosis lung disease: a retrospective study. BMJ Open, 2(2), e000407.

Because the researchers were interested in whether there was a mean change in FEV1 from admission to discharge, the null hypothesis for this example is \(\mu\) is equal to 0, or the true mean change in FEV1 is equal to 0 (remember, the null hypothesis is always stated as ‘no effect’ or ‘nothing going on’). The alternative hypothesis in this case would be that the true mean change in FEV1 is not equal to 0, because the researchers were interested in any change (positive or negative) and didn’t specify a direction.

Carrying out the evidence and evaluation pieces of the hypothesis testing framework, it turns out that the mean change of 12.2 is 5.7 standard errors above the hypothesized mean of 0. Using a t-distribution with 17 degrees of freedom, the probability of seeing a result like this (or more extreme) if there really was no change is \(p = 1 \times 10^{-5}\), or p < 0.001. Because the p-value is less than the significance level of 0.05, we reject the null hypothesis and can say there is evidence that the mean change is different from 0. Based on the data, the mean change appears to be greater than 0, meaning the new treatment appears to improve lung function in cystic fibrosis patients.
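Repeating that calculation from the reported summary statistics, a sketch in Python (variable names are my own):

```python
import math
from scipy import stats

d_bar = 12.2  # mean change in FEV1 (% of predicted)
s_d = 9.1     # standard deviation of the change
n = 18        # number of patients

t = (d_bar - 0) / (s_d / math.sqrt(n))  # about 5.7
df = n - 1                              # 17
p_two = 2 * stats.t.sf(abs(t), df)      # two-sided p-value, far below 0.001

print(f"t = {t:.2f}, df = {df}, p = {p_two:.1e}")
```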

To close out the formal lecture on hypothesis testing, here are a few words of advice:

- Use two-tailed tests in most situations, rather than one-tailed tests. Why? As previously mentioned, to define a one-tailed test, a researcher must predict which direction the data will go prior to collecting the data, when planning a study. A one-tailed test can sometimes be useful because it gives a more focused hypothesis and reduces the necessary sample size. However, if the data end up going in the opposite direction than expected, then one would end up with a very large p-value and not be able to reject the null hypothesis, even if the difference was very large. For this reason, we encourage using two-tailed tests in nearly all cases, except in the rare case when the researchers truly have no interest at all in one of the directions.
- Report the actual p-value. Avoid reporting p-values as an inequality (e.g., p < 0.05), unless the value is really, really small. In that case, report the p-value as p < 0.001. Also, don’t report p-values to more than 3 decimal places; being that precise does not add any relevant information for the reader.
- There are no sharp distinctions between p-value increments. For example, a p-value of 0.06 provides about the same degree of evidence against the null hypothesis as a p-value of 0.05.
- Use multiple sources of evidence to make decisions; do not rely solely on the p-value to make a statistical (and practical) conclusion. The p-value does not indicate the magnitude, direction, or clinical importance of an observed result. It merely estimates the role of ‘just by chance’ as an explanation for the observed result in comparison to the hypothesized value.
- Remember that we don’t conclude ‘we accept the null’ when we obtain a large p-value. A large p-value just means that our sample result is consistent (given sampling variability) with the null hypothesized value.
- To provide a better picture of the other plausible values that the true population value can take, we recommend you also supply a confidence interval, as illustrated in the sketch below. The null value would then be one of many plausible values that the truth can take (based on our sample result as the ‘best guess’).
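For instance, a sketch in Python of the 95% confidence interval that could accompany the mercury test result (scipy is an assumed tool choice; variable names are my own):

```python
import math
from scipy import stats

x_bar, s, n = 0.53, 0.34, 53        # mercury summary statistics (ppm)
se = s / math.sqrt(n)               # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)  # ~2.007 for 52 degrees of freedom

lo, hi = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% CI: ({lo:.2f}, {hi:.2f}) ppm")  # about (0.44, 0.62), well below 1.0
```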