5.11 One Sample and Two Sample Tests

One-sample tests are used to compare the data set to a fixed criterion (for example, population mean, population percentile). Examples of one-sample tests have already been implicitly presented in Section 5.3 (Tolerance Limits) and Section 5.4 (Prediction Limits), as well as Section 5.6 (Distributional Tests). Other examples are goodness-of-fit tests, where, for example, you would like to know if the data support predictions regarding the value of the population mean. The null hypothesis would be:

H₀: µ = µ₀

where

µ = actual true population mean
µ₀ = hypothesized population mean (under H₀)

and the alternative hypothesis is Hᴀ: µ ≠ µ₀.

A one-sample t-test can be applied in this case if the following assumptions hold:

the data are normally distributed
the sample drawn from the population is random
the cases of the samples are independent
the hypothesized population mean (µ₀) is known
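The one-sample t-statistic can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a prescribed implementation: the function name `one_sample_t`, the data values, and the criterion `mu0` are hypothetical. The resulting statistic would still need to be compared against a tabulated critical t-value for n − 1 degrees of freedom.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """Return (t statistic, degrees of freedom) for H0: mu = mu0."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical concentrations and criterion, for illustration only
t_stat, df = one_sample_t([4.0, 5.0, 6.0, 5.0], mu0=4.0)
print(t_stat, df)
```

The sketch does not check the assumptions listed above (normality, randomness, independence); those would need to be evaluated separately.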

However, many groundwater monitoring scenarios require the comparison of two populations, such as a population of compliance (potentially impacted) data to a population of spatial or temporal background (unimpacted) data. The statistical tests used for these comparisons are referred to as two-sample tests and are used to determine if the two populations are statistically different at a specified level of significance. Examples of parametric two-sample tests include Welch’s t-test and the pooled variance t-test. Nonparametric tests include the Wilcoxon rank-sum test, the signed rank test, and the Tarone-Ware two-sample test for censored data. These two-sample tests and their applications are described briefly below. Table F-3 includes information about checking assumptions for two-sample tests.

5.11.1 Welch’s T-test

Welch’s t-test assumes that each population is normally distributed and requires that no temporal trends exist in the data, no spatial variability is present, and samples are statistically independent. One advantage of Welch’s t-test is that it does not require you to assume that population variances are equal. Another advantage is that while Welch’s t-test provides statistical power comparable to other two-sample tests, it is much simpler to use than other similar tests. The only calculations required are computing the mean, standard deviation, variance, t-statistic, and degrees of freedom. Many statistical software packages offer Welch’s t-test, but most do not determine if the requirements and assumptions are met.

When applying Welch’s t-test, the calculated t-value is compared to a critical t-value, which is based on the selected significance level of the test and on the number of degrees of freedom. If the calculated t-value is less than or equal to the critical value, then no evidence exists for a statistically significant difference between the two population means at the selected confidence level. The equations for the necessary calculations, including the critical t-values for common significance levels, can be found in most statistical texts and in the Unified Guidance.
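As a rough sketch of the calculations involved, the following Python snippet computes Welch's t-statistic and the approximate (Welch-Satterthwaite) degrees of freedom from two data sets. The function name `welch_t`, the sign convention (compliance mean minus background mean), and the data values are assumptions made for illustration. The critical t-value would still be taken from a table or statistical software.

```python
import math
import statistics

def welch_t(background, compliance):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_b, n_c = len(background), len(compliance)
    mean_b, mean_c = statistics.mean(background), statistics.mean(compliance)
    # Sample variances (n - 1 denominator); no equal-variance assumption is made
    var_b, var_c = statistics.variance(background), statistics.variance(compliance)
    se2_b, se2_c = var_b / n_b, var_c / n_c  # squared standard errors of each mean
    t = (mean_c - mean_b) / math.sqrt(se2_b + se2_c)
    # Welch-Satterthwaite approximation; usually not a whole number
    df = (se2_b + se2_c) ** 2 / (se2_b ** 2 / (n_b - 1) + se2_c ** 2 / (n_c - 1))
    return t, df

# Hypothetical background and compliance concentrations, for illustration only
t_stat, df = welch_t([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(t_stat, df)
```

Because the approximate degrees of freedom are generally fractional, some references round the value before looking up the critical t-value; the sketch leaves it unrounded.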

5.11.2 Pooled Variance T-test

The pooled variance t-test shares the same underlying assumptions and requirements as Welch’s t-test but provides greater statistical power and therefore is helpful in identifying smaller differences. However, the pooled variance t-test has the added requirement that the variances of the two populations be equal; this requirement can be evaluated using box plots or more robust methods such as Levene’s test for equal variances (see Section 11.2, Unified Guidance). If these assumptions are met, the t-statistic can be calculated. Many statistical software packages offer versions of the pooled variance t-test, but most do not determine if the requirements and assumptions are met.

As with Welch’s t-test, the calculated t-value is compared to a critical t-value, which is based on the selected significance level of the test and on the number of degrees of freedom. If the calculated t-value is less than or equal to the critical value, then no evidence exists of a statistically significant difference between the two population means at the specified confidence level. The equations for the necessary calculations, including the critical t-values for common significance levels, can be found in most statistical texts and in the Unified Guidance.
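For comparison with Welch's t-test, a minimal Python sketch of the pooled variance t-statistic is shown below. The function name `pooled_variance_t`, the sign convention (compliance mean minus background mean), and the data values are hypothetical. The sketch assumes the equal-variance requirement has already been checked by other means.

```python
import math
import statistics

def pooled_variance_t(background, compliance):
    """Return (t statistic, degrees of freedom) for the pooled variance t-test."""
    n_b, n_c = len(background), len(compliance)
    mean_b, mean_c = statistics.mean(background), statistics.mean(compliance)
    var_b, var_c = statistics.variance(background), statistics.variance(compliance)
    df = n_b + n_c - 2
    # Weighted average of the two sample variances (assumes equal population variances)
    pooled_var = ((n_b - 1) * var_b + (n_c - 1) * var_c) / df
    t = (mean_c - mean_b) / math.sqrt(pooled_var * (1 / n_b + 1 / n_c))
    return t, df

# Hypothetical data with equal sample variances, for illustration only
t_stat, df = pooled_variance_t([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
print(t_stat, df)
```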

5.11.3 Wilcoxon Rank-sum Test

The Wilcoxon rank-sum test is a nonparametric two-sample test that may be used to compare two populations when the groundwater data are not normally distributed and cannot be normalized by transformation. The Wilcoxon rank-sum test is equivalent to the Mann-Whitney U-test. Requirements for the Wilcoxon rank-sum test include the assumption of equal variances, the assumption of a common (unknown) distribution, a lack of spatial variability, and temporal stability. The Wilcoxon rank-sum test can handle data sets with a limited number of nondetects (10 to 15%) with uniform reporting limits.

As the name implies, the Wilcoxon rank-sum test is performed by ordering the combined data from smallest to largest and ranking the values from 1 to N. Tied values receive a midrank which is the average of the ranks they would receive were they not tied. The resulting numerical ranks of the background samples are denoted as B_i and the compliance samples are C_i. The Wilcoxon statistic (W) is computed as the sum of the compliance ranks and the result is standardized to compute a Z-score for comparison to a tabulated critical statistic. Calculations for W, the expected value E(W), standard deviation SD(W), and the test statistic Z, for data with no ties are available in most statistical references and the Unified Guidance.

A computed Z greater than the tabulated critical Z at the selected significance level indicates that the compliance well concentrations are statistically different from the background at that significance level.
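The ranking and standardization steps described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and data are hypothetical, the standard deviation uses the no-ties formula, and no continuity correction is applied (some references subtract 0.5 from W − E(W) before dividing by SD(W)).

```python
import math

def midranks(values):
    """Rank values from 1 to N; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_rank_sum(background, compliance):
    """Return (W, E(W), SD(W), Z) using the no-ties variance formula."""
    n, m = len(background), len(compliance)
    total = n + m
    ranks = midranks(list(background) + list(compliance))
    w = sum(ranks[n:])                      # sum of the compliance ranks
    expected = m * (total + 1) / 2          # E(W) under the null hypothesis
    sd = math.sqrt(n * m * (total + 1) / 12)
    z = (w - expected) / sd                 # no continuity correction (assumption)
    return w, expected, sd, z

# Hypothetical background and compliance concentrations, for illustration only
w, expected, sd, z = wilcoxon_rank_sum([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(w, expected, sd, z)
```

When the data contain many ties, the no-ties standard deviation used here is only approximate, and a tie-adjusted formula from a statistical reference would be more appropriate.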

The Wilcoxon rank-sum test is available in most statistical software packages as a default selection for data that are not normally distributed; however, most packages do not automatically evaluate for compliance with the necessary underlying requirements or assumptions.

5.11.4 Sign or Signed Rank Test

The signed rank test is used to evaluate differences between groups of “paired” data such as analytical results from a group of wells before and after remediation efforts. The signed rank test evaluates whether a statistically significant difference exists between the medians of two groups by evaluating the difference between each pair of observations. The pairs are ranked in ascending order of the absolute value of their difference, and each rank is multiplied by the sign of the paired difference. The sum of those products is the test statistic W, which is compared to a tabulated critical value that is based on the selected significance level of the test and the number of sample pairs (differences). A computed test statistic W greater than the tabulated critical W at the selected significance level indicates that the two groups of data are statistically different at that significance level. The signed rank test is available in some statistical software packages and is relatively straightforward to implement in spreadsheet software.
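The signed rank calculation described above can be sketched in Python as shown below. The function names and data are hypothetical. Two choices in the sketch are assumptions rather than requirements stated here: pairs with a zero difference are dropped before ranking, and tied absolute differences receive midranks. Some references instead define the statistic as the sum of the positive ranks only, so the tabulated critical values must match the definition used.

```python
def midranks(values):
    """Rank values from 1 to N; tied values receive the average of their ranks."""
    sorted_vals = sorted(values)
    rank_of = {}
    for v in set(values):
        first = sorted_vals.index(v) + 1   # 1-based position of first occurrence
        count = sorted_vals.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def signed_rank_statistic(before, after):
    """Return (W, number of nonzero pairs) for paired observations."""
    diffs = [a - b for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]   # assumption: zero differences are dropped
    ranks = midranks([abs(d) for d in diffs])
    # Each rank carries the sign of its paired difference
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    return w, len(diffs)

# Hypothetical before/after results for five wells, for illustration only
w, n_pairs = signed_rank_statistic([10, 12, 9, 15, 11], [8, 13, 5, 9, 11])
print(w, n_pairs)
```

With differences computed as after minus before, a negative W suggests concentrations were generally lower after remediation.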

5.11.5 Tarone-Ware Two-sample Test for Censored Data

The Tarone-Ware two-sample test provides the added versatility of dealing with nondetect data. Like other nonparametric tests, Tarone-Ware assumes identical distribution of background and compliance populations and requires equal variances. As with the other tests, the Tarone-Ware two-sample test also requires temporal stability and lack of spatial variability. To perform this test, the two data sets (for example, background and compliance data) are combined and the distinct (unique) detect values are ordered from lowest to highest. The number of values (including nondetects) less than or equal to each ordered value is computed for compliance, background, and combined data. The Tarone-Ware statistic is then calculated using equations found in some statistical references, including the Unified Guidance. Variations of this test (such as Gehan’s (1965) generalized Wilcoxon test) are also found in some statistical software packages, although compliance with the underlying assumptions and requirements is generally not automatically evaluated.

A computed Tarone-Ware statistic (TW) greater than the tabulated critical value at the selected significance level indicates, given the example above of comparing background and compliance data, that the compliance well concentrations are statistically different from (greater than) the background at that significance level.

Publication Date: December 2013

Permission is granted to refer to or quote from this publication with the customary acknowledgment of the source (see suggested citation and disclaimer).