## C.3 Study Question 3: Are concentrations above or below a criterion?

To ensure a valid test, it is important to understand how the criterion used for comparison was derived and, based on that understanding, to define the null hypothesis clearly. The criterion can be an MCL, a risk-based value, or a fixed background limit, and it may represent a single regulatory value, the mean of a population, or a percentile. Take this into account when defining the null hypothesis (and ultimately the comparison method) so that the selected test reflects the intent of the criterion. For example, if the criterion is a not-to-exceed value, individual sample results can be compared to it in much the same manner as is done with prediction limits; alternatively, if the criterion is derived to represent an average concentration ceiling, an upper confidence limit around the mean of the compliance data is the appropriate test statistic.

This question is relevant during release detection, site characterization, monitoring, and closure stages of the project life cycle.

Selecting and Characterizing the Data Set

Examine the site data set to determine whether you are going to compare intrawell or interwell data to a criterion (see Section 3.6.5). Intrawell comparisons are most common. If interwell data are going to be used, ensure that the sample data share the same hydrogeologic and geochemical characteristics before combining these data, and test for significant spatial variability. In either case, examine the site data to determine what distributional assumption should inform selection of statistical tests (see Section 4.3.1: Physical Site Conditions and Section 4.2.1: Background Conditions). Refer to Section 3.4: Common Statistical Assumptions for further discussion concerning how the following requirements may impact statistical analysis results.

- Use box plots, probability plots, Dixon's test, or Rosner's test to check for outliers.
- Check that mean and variance are stable over the time frame (time series plot).
- Check that no autocorrelation exists between successive sampling events.
- Check that no significant trends exist (time series plot). If the data set exhibits significant trends, it may be appropriate to select a subset of the data to represent current concentrations.
- Determine the distribution of the data (for example, normal or lognormal) using the skewness coefficient, Shapiro-Wilk test, or censored probability plots.
- Estimate the mean and standard deviation of a left-censored sample using Kaplan-Meier when 50% or less of the data set is nondetect.
- See also Section 4.1: Considerations for Statistical Analysis.
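
As one illustration of the independence check in the list above, the sketch below computes the sample lag-1 autocorrelation of a series ordered by sampling date. The function name and the example concentrations are hypothetical, and this simple coefficient is only a screening aid under the assumption of evenly spaced sampling events; it does not replace the formal tests discussed in the referenced sections.

```python
from statistics import mean


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of results ordered oldest to newest."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    m = mean(values)
    # Total sum of squared deviations from the mean
    denom = sum((x - m) ** 2 for x in values)
    if denom == 0:
        raise ValueError("series is constant")
    # Sum of products of deviations for consecutive sampling events
    numer = sum((values[i] - m) * (values[i + 1] - m) for i in range(n - 1))
    return numer / denom


# Hypothetical quarterly results (ug/L), oldest first
steadily_rising = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1]

r_rising = lag1_autocorrelation(steadily_rising)
r_alternating = lag1_autocorrelation(alternating)
print(r_rising, r_alternating)
```

Values near zero are consistent with independent sampling events; strongly positive values, as in the rising series, suggest that successive results carry overlapping information or that a trend is present.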

Statistical Methods and Tools

There are two broad approaches for analyzing well data and answering the question as to whether chemical concentrations are above a criterion. These two approaches are comparison of pooled interwell compliance data to the criterion and comparison of intrawell compliance data to the criterion.

The statistical tests most commonly used are confidence intervals or limits, tolerance limits, prediction limits, and the one-sample t-test. Confidence intervals are constructed around a statistic of interest (for example, mean, median, or a certain percentile), while prediction and tolerance limits are extreme values beyond which only a small portion of the data population is expected to fall. The one-sample t-test compares a statistic of interest from the data to a criterion based on the same statistic of interest derived from the background. Site-specific considerations or regulatory requirements usually determine which parameters and tests are appropriate.

Limits are most often used to compare sampling data to a fixed criterion. Two questions can be asked: whether the groundwater concentration of a specific chemical has exceeded a criterion, or whether it has fallen below a criterion. In determining if a criterion has been exceeded, the lower confidence limit is of primary interest. In determining if the concentration has fallen below a criterion, the upper confidence limit, tolerance limit, or prediction limit is most important.

As an example of limit selection, if the criterion being used is a health-based concentration, and the mean exposure should not exceed the criterion, then select a predetermined confidence level and require that the upper confidence limit on the mean (UCL) be below the standard. Likewise, if you are examining groundwater data that have historically been above a criterion, you want the UCL to be below the standard. The scenario is different when assuming that the well being monitored is not contaminated. In this case, retain the assumption until the lower confidence limit is above the criterion.

If the fixed criterion is an average concentration, the appropriate statistical parameter to compare to is the mean or median concentration from site data by use of either a confidence interval or a one-sided t-test.

Parametric Confidence Intervals

Confidence intervals can be calculated for normal, lognormal or nonparametric distributions (see next section) using the methods below:

- confidence interval around a mean (see Section 5.2.2, Section 5.2.3, and Section 5.2.4).
- confidence interval around an upper percentile (see Section 5.2.5).
- robust confidence interval around a mean, which modifies the nonrobust calculations so that outlying observations in a data set can be accommodated (USEPA 1999).

Data must be normal or capable of being transformed so that they are normal.

- Use lognormal methods when the underlying population is heavily right-skewed, meaning that a majority of lower concentration data are combined with fewer but much higher concentration data. When the data are transformed, the data should become reasonably symmetric about the mean or normally distributed (check with skewness coefficient, Shapiro-Wilk Test, probability plot).
- Some nondetects are acceptable. Use simple substitution when nondetects make up approximately 10-15% of the data set. If nondetects are less than or equal to 50%, create a censored probability plot to check for normality.
- The parametric methods depend on t values from a Student's t-table. Small numbers of samples correspond to large t values, which in turn make the intervals wide and the corresponding limits extreme; therefore, a minimum of 8 samples is recommended.
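
The parametric limits around a mean described above can be sketched as follows, assuming the data are already approximately normal (or have been transformed). The function name, the example concentrations, and the criterion of 6.0 are hypothetical. The critical value is passed in by the user, as it would be when read from a Student's t-table; 1.895 is the one-sided 95% value for 7 degrees of freedom (n = 8).

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_limits(values, t_crit):
    """Return (LCL, UCL) around the mean: mean -/+ t * s / sqrt(n).

    t_crit is the Student's t critical value for n - 1 degrees of freedom
    at the chosen confidence level, taken from a t-table.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    half_width = t_crit * stdev(values) / sqrt(n)
    m = mean(values)
    return m - half_width, m + half_width


# Hypothetical compliance results (ug/L), n = 8
data = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
criterion = 6.0  # hypothetical criterion representing an average concentration

lcl, ucl = mean_confidence_limits(data, t_crit=1.895)
below_criterion = ucl < criterion      # supports "mean is below the criterion"
exceeds_criterion = lcl > criterion    # supports "mean exceeds the criterion"
print(round(lcl, 3), round(ucl, 3), below_criterion, exceeds_criterion)
```

Because each limit here is one-sided, the LCL and UCL are each read at 95% confidence on their own; together they form a 90% two-sided interval.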

Nonparametric Confidence Interval

Nonparametric confidence intervals can be calculated for non-normal data and data which cannot reasonably be transformed so as to become normally distributed. They can also be used when the data set contains a high number of nondetects. Use of nonparametric confidence intervals in determining if a criterion has been exceeded is similar to the parametric confidence interval. As with parametric confidence intervals, the assumption is that like parameters are being compared, for example, median to median. When data are ranked using nonparametric methods, it is relatively simple to estimate the percentiles in which the data fall, but it is more difficult to estimate parameters such as the mean and variance. Thus, nonparametric confidence intervals are built around the median (50th percentile) as opposed to the mean.

Use nonparametric confidence intervals for data sets that do not fit a normal or lognormal distribution.

- May be useful when the data includes a large number of nondetects
- Nonparametric confidence intervals are typically built around a median but can also be estimated at other percentiles such as a 90th or 95th.
- Confidence intervals for small data sets may be too large to be useful, so the number of samples you need may be greater than typically needed using parametric methods.
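
A minimal sketch of a nonparametric confidence interval around the median follows. It uses order statistics with binomial probabilities, which is one standard construction and is not necessarily the exact procedure of the sections referenced in this document. It assumes independent observations from a continuous distribution (ties, such as many nondetects at one reporting limit, would need additional handling). The function name and data are hypothetical.

```python
from math import comb


def median_confidence_interval(values, confidence=0.95):
    """Return (lower, upper, achieved_confidence) using order statistics.

    The interval [x_(k), x_(n-k+1)] covers the population median with
    probability 1 - 2 * P(Bin(n, 0.5) <= k - 1).
    """
    x = sorted(values)
    n = len(x)
    best = None
    # Move inward from the extremes; keep the narrowest interval that
    # still meets the requested confidence level.
    for k in range(1, n // 2 + 1):
        tail = sum(comb(n, i) for i in range(k)) / 2 ** n
        achieved = 1 - 2 * tail
        if achieved >= confidence:
            best = (x[k - 1], x[n - k], achieved)
        else:
            break
    if best is None:
        raise ValueError("too few samples for the requested confidence")
    return best


# Hypothetical results (ug/L), n = 10
results = [3, 1, 4, 1.5, 9, 2.6, 5, 3.5, 8, 7]
lower, upper, achieved = median_confidence_interval(results, confidence=0.95)
print(lower, upper, achieved)
```

With 10 samples the narrowest qualifying interval runs from the 2nd to the 9th ordered value, which illustrates how wide these intervals can be for small data sets. With 5 samples, no pair of order statistics reaches 95% two-sided confidence and the function raises an error.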

When a fixed criterion is an upper percentile or maximum, and no more than a small specified fraction of the individual concentration measurements should exceed the limit, a tolerance limit is a possible appropriate statistic. As with confidence limits, a tolerance limit is one side of a tolerance interval. Tolerance limits, as with confidence limits, may be calculated based on either parametric or nonparametric assumptions.

Using the tolerance limit for testing, you can state that, “I’m 95 percent confident that a particular tolerance interval brackets some percentage, say 99 percent, of the population.” Similarly, for the upper tolerance limit (UTL) you could say that “I’m 95 percent confident that 99 percent of all data will be less than the UTL.” Note that this statement is independent of the specific number of future samples, and this is what distinguishes tolerance limits from prediction limits.

It may also be useful to note that there is no difference between a 95 percent upper confidence limit on the 95th percentile and an upper tolerance limit with 95 percent coverage at 95 percent confidence.

- Tolerance limits (TLs) are designed to contain a large fraction or coverage of the data set (typically 90%, 95% or 99%) with a specified level of confidence (typically 95%, see Section 5.3)
- TLs can be used in lieu of PLs or combined with PLs for re-testing to control false negatives.
- Individual compliance well samples (intrawell) or samples collected from multiple wells (interwell) can be used to calculate tolerance limits.
- Test can be constructed to examine a single compliance well using a single sample.
- Tolerance limits can be calculated based on either parametric or nonparametric assumptions.

When you must determine if a criterion has been exceeded, the lower tolerance limit (LTL) is compared to the criterion. If the LTL is greater than the criterion, then you can conclude that the data are higher than the criterion at the confidence level used to calculate the LTL. Similarly, if you are trying to determine whether data have fallen below a criterion, the UTL is compared to the criterion. If the UTL is below the criterion, then you can conclude that the data are below the criterion. Note, however, that tolerance limits by definition do not cover 100% of the data. Therefore, use of tolerance limits for decision making must incorporate an acceptable failure rate and a plan for retesting.
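
To illustrate how coverage, confidence, and sample size interact, the sketch below uses one common nonparametric construction in which the sample maximum serves as the UTL. Under the assumption of independent samples from a continuous distribution, the confidence that the maximum bounds a fraction p of the population is 1 - p^n. The function names are hypothetical, and this is not presented as the procedure of Section 5.3.

```python
def max_utl_confidence(n, coverage):
    """Confidence that the sample maximum bounds at least `coverage`
    of the population, given n independent samples."""
    return 1 - coverage ** n


def min_samples_for_max_utl(coverage, confidence):
    """Smallest n for which the sample maximum achieves the requested
    coverage at the requested confidence."""
    n = 1
    while max_utl_confidence(n, coverage) < confidence:
        n += 1
    return n


n_95_95 = min_samples_for_max_utl(coverage=0.95, confidence=0.95)
n_90_95 = min_samples_for_max_utl(coverage=0.90, confidence=0.95)
conf_with_20 = max_utl_confidence(20, coverage=0.95)
print(n_95_95, n_90_95, round(conf_with_20, 3))
```

Under these assumptions, a nonparametric UTL with 95% coverage at 95% confidence requires 59 samples, while 20 samples provide only about 64% confidence for the same coverage.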

Prediction limits (PLs) estimate an interval in which future observations will fall, with a defined probability, given the collected data. The calculation of PLs takes into consideration the number of future data to be compared, as well as the number of retests required to confirm a release.

As the number of chemicals increases, and the number of resampling instances increases, the upper prediction limit also increases. A corresponding decrease will occur in the power of the test (the probability of detecting a true exceedance). To reduce this source of error, limit the number of chemicals examined.
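
The arithmetic behind this tradeoff can be sketched as follows. If each comparison is run at a false positive rate alpha and the comparisons are assumed to be independent (a simplifying assumption that rarely holds exactly for groundwater data), the chance of at least one false positive across k comparisons is 1 - (1 - alpha)^k. Holding the overall rate fixed therefore forces a smaller per-test alpha, which pushes each limit higher. The function names are hypothetical.

```python
def sitewide_false_positive_rate(per_test_alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


def per_test_alpha_for_target(target_rate, n_tests):
    """Per-test alpha that keeps the overall rate at target_rate."""
    return 1 - (1 - target_rate) ** (1 / n_tests)


# Ten chemicals, each tested at alpha = 0.05
overall = sitewide_false_positive_rate(0.05, 10)

# Per-test alpha needed to hold the overall rate at 0.05 for ten chemicals
adjusted = per_test_alpha_for_target(0.05, 10)
print(round(overall, 3), round(adjusted, 4))
```

Ten independent tests at alpha = 0.05 give roughly a 40% chance of at least one false positive, which is why reducing the number of chemicals examined helps preserve power.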

Typically, background data are collected and PLs are developed for that data. A set number of future site samples are then compared to the PLs (see Study Question 2). When a criterion is based on an upper percentile or is a maximum, it is possible to develop PLs around site data and then ask if the upper prediction limit has exceeded the criterion.

- PLs are typically projected around means or median (Section 5.4).
- While TLs permit a specified percent of statistical failures (false negatives), PLs are designed with the intent of no statistical failures.
- An upper prediction limit represents a level that is predicted to equal or exceed future sample values based on past results.
- The number of future samples must be specified.
- The confidence level of a prediction limit represents the probability that a specified number of future samples drawn from the same population will be below the prediction limit.
- Prediction limits increase (or, if viewed graphically, “widen”) as the number of testing events is increased into the future.
- Test can be constructed to examine a single compliance well using a single sample.
- Interwell or intrawell compliance data can be used to construct PLs.

To determine if site data have fallen below a criterion, the upper prediction limit (UPL) may be used. If the UPL exceeds the criterion then you have an indication that the data set used to calculate the UPL may not be consistent with the criterion. Resampling to verify the result is appropriate.
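
As a minimal sketch of the comparison just described, the code below computes a parametric UPL for a single future observation, UPL = mean + t * s * sqrt(1 + 1/n), assuming approximately normal data with no trend. Limits for several future samples or for retesting schemes need adjusted critical values and are not covered here. The function name, data, and criterion are hypothetical; 1.895 is the one-sided 95% t value for 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def upper_prediction_limit(values, t_crit):
    """Parametric UPL for one future observation.

    t_crit is the one-sided Student's t critical value for n - 1
    degrees of freedom, taken from a t-table.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    return mean(values) + t_crit * stdev(values) * sqrt(1 + 1 / n)


# Hypothetical site results (ug/L), n = 8
site_data = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
criterion = 6.0  # hypothetical not-to-exceed value

upl = upper_prediction_limit(site_data, t_crit=1.895)
upl_exceeds_criterion = upl > criterion
print(round(upl, 3), upl_exceeds_criterion)
```

In this hypothetical case the UPL (about 6.52) exceeds the criterion even though no individual result does, so the data do not yet demonstrate that the next sample is expected to fall below the criterion.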

The one sample t-test compares a statistic of interest (generally the mean) from the data to a criterion representing the same statistic from the background population. The test can be used on either interwell data or intrawell data. It is a parametric test.

All the assumptions which apply to the two sample t-test apply to the one sample t-test (see Section 5.12 and Study Question 2).

A calculated t-statistic greater than the critical t-value indicates a statistically significant difference between the statistic of interest of the site data and the criterion; this difference indicates that the statistic of interest of the site data is greater than the criterion. Typically, the statistic of interest is the mean of the site data, and the criterion represents an average concentration. The critical t-value is determined by the size of the data set and by the significance level, which is the chance that the test will return an incorrect result.
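
The comparison described above can be sketched as follows, with the t-statistic computed as (mean - criterion) / (s / sqrt(n)) and the critical value supplied from a t-table. The function name, data, and criterion are hypothetical; 1.895 is the one-sided critical value for a 0.05 significance level and 7 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(values, criterion):
    """t-statistic for testing whether the mean exceeds a fixed criterion."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    standard_error = stdev(values) / sqrt(n)
    return (mean(values) - criterion) / standard_error


# Hypothetical site results (ug/L), n = 8
site_data = [12.0, 14.0, 13.0, 15.0, 11.0, 13.0, 14.0, 12.0]
criterion = 10.0  # hypothetical criterion representing an average concentration
t_crit = 1.895    # one-sided, alpha = 0.05, 7 degrees of freedom

t_stat = one_sample_t_statistic(site_data, criterion)
mean_exceeds_criterion = t_stat > t_crit
print(round(t_stat, 3), mean_exceeds_criterion)
```

This one-sided form tests only whether the mean is above the criterion; testing whether the mean has fallen below the criterion reverses the sign of the comparison.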

Interpretation of Results and Associated Uncertainty

In selecting the statistical method, understand what the groundwater criterion represents and the consequences of exceeding that criterion. The statistical methods selected and the interpretation of their results may vary depending on the null hypothesis selected (for example, site data are above the criterion or site data are below the criterion). When using a risk-based criterion or background, typically the UCL of the mean or median concentration is compared to the criterion.

Closure determination is supported only when the entire confidence interval (UCL) is below the criterion. Small sample size can result in a wide confidence interval, such that the interval is not useful in identifying a difference. In such cases, additional samples will need to be collected to increase sample size to narrow the interval. Chapter 21 and Chapter 22 of the Unified Guidance provide additional information regarding use of confidence intervals in monitoring for compliance and closure.

See also Section 4.2.4: Statistical Methods for Release Detection Objectives, Section 4.6.1: Compliance with Criteria, and Section 5.13: Control Charts.

Related Study Questions

Study Question 4: When will contaminant concentrations reach a criterion?

Study Question 5: Is there a trend in contaminant concentrations?

Key Words: Compliance, Comparison to Standards, Release Detection, Site Characterization, Monitoring, Closure, Target Levels

References

USEPA. 1999. "Robust Statistical Intervals for Performance Evaluations." Las Vegas, NV: Office of Research and Development.

Publication Date: December 2013