## C.2 Study Question 2: Are concentrations greater than background concentrations?

Determining whether a site's groundwater has been impacted usually requires either comparing site data to a single criterion derived from concentrations measured in background samples, or directly comparing the site data to the background data set. To determine which statistical tools will meet your specific needs, identify the type of comparison to be made (that is, comparison to a single criterion, or a comparison of two data sets). The distribution assumed for the data set, and whether interwell or intrawell tests will be used, also determine the selection of the proper statistical methods. Note that the site data and the background data to which they are compared must share the same hydrogeologic and geochemical parameters (see Section 4.3.1: Physical Site Conditions and Section 4.2.1: Background Conditions).

This question is usually relevant in the release detection, site characterization, monitoring, and closure stages of the project life cycle.

Selecting and Characterizing the Data Set

Determine that the data sets to be compared meet the assumptions of the test to be used. Verify that the background data set is representative (see Study Question 1). Refer to Section 3.4 for further discussion of how the following requirements may impact statistical analysis results.

- Check that no autocorrelation exists between successive sampling events, consistent with the assumption of random samples (see Section 5.8.3).
- Confirm that no significant trends are present in the data set (see Section 5.8).
- Examine the variance and the stability of the mean; similarly, ensure that seasonality is accounted for and considered in the analysis (see Section 5.8).
- Identify outliers. Use box plots, probability plots, Dixon's test, or Rosner's test to confirm outliers.
- Address nondetects in the data set appropriately (see Section 5.7).
- Determine the data distribution and use it to inform selection of the statistical methods (see Section 5.6).
- See also Section 4.1: Considerations for Statistical Analysis.
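
As a rough illustration of the independence check in the list above, the following sketch computes the sample lag-1 autocorrelation of a time-ordered series from one well. The data values and the screening bound of 2/sqrt(n) are assumptions for illustration (the bound is a common rule of thumb, not a method prescribed by this document); Section 5.8.3 describes the recommended procedures.

```python
from math import sqrt
from statistics import mean

def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation coefficient of a time-ordered series."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations for each pair of successive results.
    num = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return num / denom

# Hypothetical quarterly results (mg/L) from one well, oldest first.
series = [4.1, 4.3, 4.0, 4.4, 4.2, 4.5, 4.1, 4.3]
r1 = lag1_autocorrelation(series)

# Rule-of-thumb screen (an assumption, not from this document):
# values of |r1| above about 2/sqrt(n) suggest the samples may not be independent.
bound = 2 / sqrt(len(series))
print(f"r1 = {r1:.3f}, screening bound = {bound:.3f}, flag = {abs(r1) > bound}")
```

With only 8 to 10 results such a screen has little power, so a value inside the bound does not demonstrate independence.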

Statistical Methods and Tools

After checking that the data meet prerequisites common to most statistical tests, determine which tests will provide the information needed using the data you have or data that you will collect. Depending on the source of the background data, comparisons will either be interwell or intrawell. Background data set development is discussed in Study Question 1.

There are two general approaches for analyzing the site data and determining whether site chemical concentrations are above those measured in the background: individual compliance samples can be compared to pooled background results, or pooled compliance samples can be compared to pooled background. Site-specific and regulatory considerations usually determine which approach is used. In either case, parametric and nonparametric methods are available, so determining the distribution of the data is a typical initial data examination step (see Section 3.4.3). Prediction limits, tolerance limits, and control charts allow individual samples (for control charts, an individual well sampled over time) to be compared to pooled background samples. T-tests and one-way analysis of variance (ANOVA) type tests only allow pooled compliance samples to be compared to pooled background samples.

Tests That Support Examination of Individual Sample Points

Prediction limits (PLs) estimate an interval in which future observations will fall, with a defined probability, given the data that have been collected. The calculation of PLs takes into consideration the number of future results to be compared as well as the number of retests required to confirm a release. Once a background data set is established, prediction limits based on the data are used as the criteria for comparison of compliance samples.

- PLs are typically projected around means or medians (Section 5.4).
- An upper prediction limit represents a level that is predicted to equal or exceed future sample values based on past results.
- The number of future samples must be specified.
- The confidence level of a prediction limit represents the probability that a specified number of future samples drawn from the same population will be below the prediction limit.
- Prediction limits increase (or, if viewed graphically, “widen”) as the number of future testing events increases.
- A test can be constructed to examine a single compliance well using a single sample.

Prediction limits depend on a PL factor, K. The value of K depends on the selected site-wide significance level, the number of background measurements, the anticipated number of resamples, and the number of chemicals examined. As the number of chemicals examined increases or the number of resampling instances increases, or both, the upper prediction limit increases because the K value increases. An increased K value corresponds to a decrease in the power of the test; that is, the probability of missing a true exceedance in a well increases. To reduce this source of error, it is important to limit the number of chemicals examined.

Site data can be compared to interwell prediction limits to evaluate whether site data are above background, or upgradient, concentrations. Interwell prediction limits may be useful during site characterization or closure project stages. Intrawell prediction limits, calculated based on historical data collected from a single well, can be compared to current concentrations in that well to evaluate whether a statistically significant increase has occurred. Intrawell comparisons may be useful for release detection. Recommendations for use of prediction limits to calculate a fixed groundwater protection criterion for compliance monitoring are discussed in Section 0.1.

If the individual site well data or selected statistic are less than the prediction limit, then you have evidence that the compliance data are consistent with background, or at least not inconsistent with background. If the site data are above the prediction limit, then you should conclude that, within the confidence level of the prediction limit, the data are inconsistent with background. Resampling to verify the result is appropriate.
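
The comparison described above can be sketched as follows for the parametric case, where the upper prediction limit is the background mean plus K times the background standard deviation. This is a minimal sketch under stated assumptions: the background values are invented, the K value is a placeholder that would in practice be taken from published tables or software for the chosen significance level, background sample size, number of comparisons, and retesting plan, and the helper names are hypothetical.

```python
from statistics import mean, stdev

def upper_prediction_limit(background, k):
    """Parametric upper prediction limit: mean + K * sample standard deviation."""
    if len(background) < 2:
        raise ValueError("need at least two background results")
    return mean(background) + k * stdev(background)

def confirmed_exceedance(initial, resamples, upl):
    """True only if the initial result and every resample exceed the limit.

    This mirrors a verification-resampling plan in which one result at or
    below the limit is enough to treat the well as consistent with background.
    """
    return initial > upl and all(r > upl for r in resamples)

# Hypothetical background results (mg/L) pooled from upgradient wells.
background = [12.1, 11.8, 12.6, 13.0, 12.3, 11.9, 12.8, 12.4]
k = 2.5  # placeholder K factor; look up the real value for your design

upl = upper_prediction_limit(background, k)
print(f"UPL = {upl:.2f} mg/L")
print("exceedance confirmed:", confirmed_exceedance(14.9, [14.6], upl))
```

The K factor and the retesting plan belong together: a K value derived for a plan with one resample is not valid if a different number of resamples is actually collected.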

Control Charts (individual wells over time)

- Control charts compare data collected sequentially in time to historical background data.
- Control charts can evaluate either intrawell or interwell data.
- Control charts are a parametric procedure.
- Individual samples collected over time must be sufficiently separated in time so as to support that the samples are independent, that is, you are not sampling essentially the same water multiple times.
- Collect a sufficient number of samples (8 to 10) to support a reliable estimation of the mean and standard deviation. A larger data set may be needed if the data are skewed or there are nondetects.

Intrawell control charts are a useful tool for release detection at sites where historical data from the compliance well exist. Interwell control charts evaluate whether site data are above background, or upgradient, concentrations. Control charts may be useful during site characterization or closure project stages. Recommendations for the use of control charts are discussed in Section 5.13.

Individual site well data are compared to the background data using a control limit, the calculation of which depends on the mean and standard deviation. If the data from an individual well are less than the control limit then you may conclude that the data set is consistent with background. If the data are above the control limit then resampling to verify the result is appropriate. The result could indicate that the mean concentration of a contaminant has increased or that the result was a chance occurrence.
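
One common form of control chart for groundwater data is the combined Shewhart-CUSUM chart, sketched below. The sketch standardizes each compliance result using the background mean and standard deviation, then flags a result when either the standardized value or the cumulative sum exceeds its limit. The data are invented, and the parameter values (k = 1, Shewhart limit 4.5, CUSUM limit 4.5, in standard deviation units) are illustrative assumptions; confirm the values appropriate to your program against Section 5.13 and the applicable guidance.

```python
from statistics import mean, stdev

def shewhart_cusum(background, compliance, k=1.0, scl=4.5, h=4.5):
    """Return (result, z, cusum, flagged) for each compliance result in time order.

    z      : result standardized by the background mean and standard deviation
    cusum  : running sum of (z - k), reset to zero whenever it goes negative
    flagged: True if z exceeds the Shewhart limit or cusum exceeds h
    """
    m, s = mean(background), stdev(background)
    cusum = 0.0
    rows = []
    for x in compliance:
        z = (x - m) / s
        cusum = max(0.0, cusum + z - k)
        rows.append((x, z, cusum, z > scl or cusum > h))
    return rows

# Hypothetical historical (background) and recent results from one well.
background = [10, 12, 11, 13, 9, 11, 10, 12]
recent = [11, 12, 20]

for x, z, s_i, flagged in shewhart_cusum(background, recent):
    print(f"x={x:5.1f}  z={z:6.2f}  cusum={s_i:6.2f}  flagged={flagged}")
```

The Shewhart component responds to a single large result, while the cumulative sum responds to a sustained smaller shift in the mean; a flagged result would still be verified by resampling.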

Tolerance Limits

- Tolerance limits (TLs) are designed to contain a percentage (typically 90%, 95%, or 99%) of the background data set with a specified level of confidence (typically 95%; see Section 5.3).
- Tolerance limits can be used in lieu of PLs or combined with PLs for retesting to control false negatives.
- Tolerance limits can evaluate either intrawell or interwell data.
- Tolerance limits are typically calculated around means or medians (Section 5.3).
- The confidence level of a tolerance limit represents the probability that a specified percentage of the population is captured.
- A test can be constructed to examine a single compliance well using a single sample.

Individual site well data can be compared to tolerance limits developed using upgradient wells or within-well data for the background data set. An upper tolerance limit can serve as an alternate groundwater protection criterion. However, tolerance limits by definition do not cover the full range of the background data set. Therefore, use of tolerance limits for decision making should incorporate an acceptable failure rate. Compared with using PLs as the criterion, using TLs would probably require more retesting following exceedances. Recommendations for use of tolerance limits are provided in Section 5.3.

Individual site well data are compared to the background data using a tolerance limit. If the data from an individual well are greater than the tolerance limit, there is reason to suspect that the site is impacted, and resampling to verify the result is appropriate.
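
The relationship between coverage, confidence, and background sample size can be illustrated with a nonparametric upper tolerance limit, where the largest background value is used as the limit. For n independent background results, the confidence that the maximum covers a proportion p of the population is 1 - p^n. The sketch below is an illustration under that assumption, with invented data and hypothetical function names; it is not the parametric calculation, which uses a tabulated factor with the mean and standard deviation.

```python
def nonparametric_utl(background, coverage=0.95):
    """Use the background maximum as the upper tolerance limit.

    Returns (limit, confidence), where confidence = 1 - coverage ** n is the
    probability that the maximum of n independent results lies above the
    coverage-th quantile of the background population.
    """
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be between 0 and 1")
    n = len(background)
    return max(background), 1.0 - coverage ** n

def min_samples_for_confidence(coverage=0.95, confidence=0.95):
    """Smallest n for which the background maximum achieves the stated confidence."""
    if not (0.0 < coverage < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("coverage and confidence must be between 0 and 1")
    n = 1
    while 1.0 - coverage ** n < confidence:
        n += 1
    return n

# Hypothetical background results (mg/L).
background = [3.2, 2.9, 3.8, 3.1, 4.4, 3.0, 3.6, 3.3, 2.8, 3.9]
limit, conf = nonparametric_utl(background, coverage=0.95)
print(f"UTL = {limit} mg/L with confidence {conf:.2f} for 95% coverage")
print("samples needed for 95%/95%:", min_samples_for_confidence(0.95, 0.95))
```

With the 8 to 10 samples typical of small background data sets, the achieved confidence is well below 95%, which is one reason parametric limits are often preferred when the distributional assumption is defensible.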

Tests That Support Examination of Pooled Data

In many situations, such as site characterization, it is desirable or advantageous to compare pooled data sets. A key assumption when pooling data from multiple sampling points is that variability between wells is minimal; however, in many natural systems this spatial variability is too great to be ignored, and it should therefore be tested before pooling data. This is particularly true when pooling data to represent background concentrations.

Sometimes, pooling background data is appropriate. For example, when building a background data set, it may be possible to combine data sets that are thought to be background, but which were collected at different times or were spatially separated from one another. Although it would be exceptional given the typical variability in groundwater chemical concentrations, monitoring networks with low natural spatial variability, such as sand aquifers or artificial systems, may be examined by pooling data.

t-Tests

- These tests are used to compare two data sets for equality of means.
- The tests require normally-distributed data.
- Eight to ten samples are recommended.
- Nondetects must be assigned values; see Section 5.7.5 and Section 5.7.6.
- Welch’s t-test does not assume equality of variance; see Section 5.11.1.
- Pooled variance t-test assumes equality of variance; see Section 5.11.2.

Pool background data from one or more wells and compare to pooled site characterization data from a single well (or multiple wells) to determine if the means are significantly different. Each data set should contain at least 8 to 10 samples and sample sizes in each data set should be similar for the most robust test. The greater the inequality in data set sizes, the lower the accuracy of the estimated probability of erroneously concluding that background data are significantly different from site data.

A calculated t-statistic greater than the critical t-value indicates a statistically-significant difference between the means of the two data sets; this difference indicates that impact may have occurred.
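
As an illustration of the calculation behind this comparison, the sketch below computes Welch's t-statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) for site data versus background data. The data are invented and the function name is hypothetical. The critical t-value is not computed here; it would be read from a t table or statistical software for the chosen significance level and the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(site, background):
    """Welch's t-statistic (site minus background) and approximate degrees of freedom."""
    n1, n2 = len(site), len(background)
    if n1 < 2 or n2 < 2:
        raise ValueError("each data set needs at least two results")
    # Squared standard error contributed by each data set.
    v1, v2 = variance(site) / n1, variance(background) / n2
    t = (mean(site) - mean(background)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical pooled results (mg/L), eight from each data set.
site = [5.9, 6.4, 6.1, 7.0, 6.6, 6.2, 6.8, 6.5]
background = [5.1, 5.6, 5.3, 5.8, 5.0, 5.5, 5.4, 5.7]

t_stat, df = welch_t(site, background)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Because the question is whether site concentrations are greater than background, a one-sided comparison of the t-statistic against the critical value is the natural choice in this sketch.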

One-Way Analysis of Variance (ANOVA)

- This test compares two or more data sets for equality of means.
- The test requires a normal data distribution.
- Nondetect values should be assigned (see Section 5.7.5 and Section 5.7.6).
- The test assumes the populations have equal variances.
- The test assumes samples are spatially and temporally independent.
- Eight to ten samples are recommended.

Background data from one or more wells are combined and then compared to site data by examining the variance between separate wells relative to the variance between multiple samples taken from the same well. If the well-to-well variability is no greater than the within-well variability, the data are consistent with equal means. If the means are not equal, the well-to-well variability is greater than the within-well variability.

There are a number of reasons for variability in data taken from multiple wells and in data collected from individual wells: for example, seasonality and other temporal effects for within-well samples, and spatial variability for multiple wells. However, spatial variability is often large relative to temporal and analytical variability. As a result, ANOVA will often conclude that the ratio of between-well variability to within-well variability is significant, and the hypothesis of equal means will be rejected.

An F statistic greater than the tabulated critical value (based on degrees of freedom for between-well and within-well samples) indicates that the means are not equal. In that case, a follow-up test is needed to determine which mean is outside expectations.
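
The F statistic described above is the ratio of the between-well mean square to the within-well mean square. The following minimal sketch computes it for a list of wells, each with its own list of results. The data and function name are illustrative assumptions, and the critical value would still be taken from an F table or software for the returned degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-well result lists."""
    k = len(groups)
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two wells and more results than wells")
    grand_mean = mean(all_values)
    # Between-well sum of squares: well means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-well sum of squares: results around their own well mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        raise ValueError("no within-well variability; F is undefined")
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical results (mg/L): two background wells and one site well.
wells = [
    [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1],
    [5.0, 5.3, 4.9, 5.2, 5.1, 5.4, 5.0, 5.2],
    [5.9, 6.3, 6.0, 6.4, 6.1, 5.8, 6.2, 6.0],
]
f_stat, df_b, df_w = one_way_anova_f(wells)
print(f"F = {f_stat:.1f} with ({df_b}, {df_w}) degrees of freedom")
```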

Nonparametric Two-Sample Tests

Wilcoxon rank sum test (Mann-Whitney U-test)

- This test compares two populations using ranking methods when nondetects are present but have a common reporting limit.
- This test can accommodate a limited number of nondetects (typically 10% to 15%) in the data sets with a single reporting limit.
- The test assumes equal population variances.
- The test assumes the two data sets share a common, though unknown, distribution.
- A minimum of 8 to 10 samples is recommended.

Pooled background data from one or more wells are compared to pooled site data by use of ordered ranking to determine if the medians are equal.

A calculated W-statistic greater than the critical W-value indicates that the medians of the two data sets are not equal.
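
A minimal sketch of the rank sum calculation follows. It pools the two data sets, assigns ranks (tied values share the average of their ranks), sums the ranks of the site data to obtain W, and uses the large-sample normal approximation for an upper-tail p-value. The data and function names are hypothetical. Under the assumption used here, nondetects with a common reporting limit would be entered as one shared value below all detected results so that they tie at the lowest ranks. The sketch omits the tie correction to the variance and the continuity correction, and for small samples exact tables are preferable to the normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 through j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def wilcoxon_rank_sum(site, background):
    """Return (W, z, upper-tail p) where W is the sum of the site data ranks."""
    n, m = len(site), len(background)
    ranks = midranks(list(site) + list(background))
    w = sum(ranks[:n])
    expected = n * (n + m + 1) / 2
    sd = sqrt(n * m * (n + m + 1) / 12)  # no tie correction in this sketch
    z = (w - expected) / sd
    return w, z, 1 - NormalDist().cdf(z)

# Hypothetical results (ug/L); 0.5 stands in for nondetects at a common limit.
site = [2.1, 3.4, 0.5, 2.8, 3.9, 2.5, 3.1, 4.2]
background = [0.5, 1.2, 1.9, 0.5, 1.5, 2.2, 1.1, 1.7]
w, z, p = wilcoxon_rank_sum(site, background)
print(f"W = {w}, z = {z:.2f}, one-sided p = {p:.4f}")
```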

Tarone-Ware Two-Sample Test for Censored Data

- This test compares two data sets using ranking methods when nondetects and variable reporting limits are present in the data sets.
- Nondetect data with multiple reporting limits are acceptable.
- This is a nonparametric test.
- The test assumes the two populations have equal variances.
- This test assumes samples are spatially and temporally independent.
- A minimum of 8 to 10 samples is recommended.

Pooled background data from one or more wells are compared to pooled site data by use of ordered ranking to determine if the difference in ranking between the two data sets is greater than that which would have occurred had the ordering occurred by chance.

A Tarone-Ware statistic (TW) greater than the tabulated critical value corresponding to the desired level of confidence indicates that the test data are significantly different from the background data.

Nonparametric Kruskal-Wallis test

The Kruskal-Wallis test is a nonparametric counterpart to ANOVA that does not require normality of the ANOVA residuals (see Chapter 17.1.2 of the Unified Guidance, and Section 5.8.2). The interpretation of this test is similar to that of the parametric F-test.
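
For illustration, the Kruskal-Wallis statistic H can be computed from the rank sums of the pooled data as sketched below. The data and function name are assumptions, and the sketch omits the correction for ties. H is commonly compared with a chi-square critical value with (number of groups minus 1) degrees of freedom, an approximation that is rough for very small samples.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of per-well result lists."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    # Map each distinct value to its 1-based rank, averaging over ties.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1
    # Sum over groups of (rank sum squared) / (group size).
    term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)
    return h, len(groups) - 1

# Hypothetical results (mg/L) from two background wells and one site well.
wells = [
    [4.8, 5.1, 5.0, 4.9, 5.2, 4.7],
    [5.3, 5.6, 5.4, 5.5, 5.7, 5.8],
    [6.0, 6.3, 6.1, 6.4, 5.9, 6.2],
]
h_stat, df = kruskal_wallis_h(wells)
print(f"H = {h_stat:.2f} with {df} degrees of freedom")
```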

Related Study Question

Study Question 1: What are the background concentrations?

Key Words: Background, Compliance Monitoring, Interwell, Intrawell, Concentration Comparisons, Release Detection, Site Characterization, Monitoring, Closure

Publication Date: December 2013