5.12 Correlation Tests
Correlation tests can be used to assess whether two groundwater variables have a linear relationship with each other. Correlation is an estimate of the degree to which two sets of variables vary together, with no distinction between dependent and independent variables (USEPA 2013b). Correlation tests may be used to evaluate both positive correlations (when one variable increases, the other variable increases) and negative correlations (when one variable increases, the other variable decreases). An example of a positive correlation would be an observation that chemical concentrations in a well increase when water levels in the well increase. An example of a negative correlation would be an observed decrease in concentrations when the pumping rate for a groundwater extraction system is increased. These tests may also be used to test for monotonic trends or to compare trends.
5.12.1 Pearson Correlation Test
The Pearson correlation test is a parametric test (a statistical test that depends upon or assumes observations from a particular probability distribution or distributions; Unified Guidance) that provides a measure of the linear association between two continuous variables. To conduct the test, a single correlation coefficient is calculated from the paired (x, y) observations using the measured values themselves; replacing the values of x and y with their ranks instead yields the Spearman rank correlation coefficient (Section 5.12.2). The coefficient ranges from -1 to 1. Its sign indicates the direction of the relationship (that is, negative values imply an inverse relationship or a decreasing trend), and its absolute value indicates the strength of the relationship, with larger absolute values indicating stronger linear relationships.
Applications and Relevant Study Questions
- The Pearson correlation coefficient is a common numerical measure of the degree of linear association between two continuous variables.
- Study Question 5: Is there a trend in contaminant concentrations?
Assumptions
- Linear relationship between variables should hold.
- Variables should be identically distributed (but not necessarily independently).
- The data are assumed to follow a parametric distribution.
Requirements and Tips
- A minimum of two variables with at least three observations for each variable is needed for the test to be meaningful.
- Use 8 to 10 paired observations, although a larger data set may be needed if the data sets are skewed or contain nondetects (laboratory analytical results known only to be below the method detection limit (MDL) or reporting limit (RL); see "censored data" in the Unified Guidance).
- The degree of confidence in detecting patterns in the data increases with larger sample sizes.
- See Section 5.7 for information regarding the handling of nondetects.
- You may need to standardize each variable for plotting purposes in order to preserve the scales.
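As an illustration of the standardization tip above, the following is a minimal sketch that converts each variable to z-scores so that two series with different units can share one plot axis. The z-score is only one common choice of standardization and is an assumption here; the function name `standardize` and the data values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (value - sample mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - m) / s for v in values]

# Hypothetical paired observations with very different units and scales
water_level_ft = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
conc_ug_per_l = [210.0, 290.0, 420.0, 480.0, 610.0, 720.0, 780.0, 900.0]

# After standardization both series are centered on 0 with unit spread
z_level = standardize(water_level_ft)
z_conc = standardize(conc_ug_per_l)
```

Standardizing in this way is a linear rescaling, so it changes only how the data are plotted and leaves the Pearson correlation coefficient unchanged.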
Strengths and Weaknesses
This test does not recognize nonlinear relationships between variables.
Further Information
A description of how to construct scatter plots is found in Chapter 9.4, Unified Guidance. Formula [3.5] of Chapter 3.3, Unified Guidance shows how to construct the Pearson correlation coefficient.
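The calculation of the Pearson correlation coefficient can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook formula (sum of cross-products divided by the square root of the product of the sums of squares), not a reproduction of the Unified Guidance text; the function name `pearson_r` and the data values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations (x[i], y[i])."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need paired data with at least three observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: water level (ft) and concentration (ug/L) in one well
water_level = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
concentration = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.0]

# Positive correlation: concentrations rise as water levels rise
r = pearson_r(water_level, concentration)

# Reversing the direction of one variable flips the sign of the coefficient
r_negative = pearson_r(water_level, [-c for c in concentration])
```

For these made-up values the coefficient is close to 1, which corresponds to a strong positive linear relationship; the sign-flipped series gives the same magnitude with a negative sign.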
5.12.2 Spearman Rank Correlation Coefficient
The Spearman rank correlation test is essentially the nonparametric version of the Pearson correlation coefficient test (a nonparametric test is a statistical test that does not depend on knowledge of the distribution of the sampled population; Unified Guidance), and provides a measure of the monotonic association between two variables. Spearman’s rank correlation coefficient rho (ρ) is a nonparametric correlation coefficient that can be used to test for monotonic trends. To calculate the correlation coefficient ρ for any pair of variables x and y, each value of x is replaced with its rank R(x) and each corresponding value of y is replaced with its rank R(y). For concentrations sequentially measured over time (such as those from a monitoring well), the x variable denotes time and R(x) is the sampling event order (R(x) = 1 for the first sampling event). The rank of the smallest concentration measurement is 1 (when it is not tied with other values).
Spearman’s ρ is Pearson’s r calculated for the paired ranked results (1, R(y1)), (2, R(y2)), … (n, R(yn)) (for instance, using Equation [3.5] in Chapter 3.3, Unified Guidance). Like Pearson’s r, Spearman’s ρ ranges from -1 to 1 and can be tested to determine whether it is significantly different from zero; a positive value indicates an increasing trend and a negative value indicates a decreasing trend. The absolute value of the coefficient indicates its strength, with larger absolute values indicating stronger monotonic relationships.
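The ranking procedure described above can be sketched as follows. This is a minimal illustration, assuming that tied values share the average of the ranks they occupy (a common convention, not stated in the text above); the function names and the concentration values are hypothetical.

```python
from math import sqrt

def midranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(midranks(x), midranks(y))

# Hypothetical concentrations from eight sequential sampling events
conc = [5.2, 4.8, 4.9, 4.1, 3.9, 3.5, 3.6, 2.8]
events = list(range(1, len(conc) + 1))  # R(x) is the sampling event order

rho = spearman_rho(events, conc)  # about -0.952: a strong decreasing trend
```

Because the event order is already a set of ranks, only the concentrations change when ranked; the negative coefficient reflects concentrations that generally fall over time despite two small reversals.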
When the sample size n is large (n > 20), the test statistic $t = \rho\sqrt{n-2}\,/\sqrt{1-\rho^{2}}$ approximately follows the Student’s t distribution with n - 2 degrees of freedom. To test whether there is a significant trend, the statistic t is compared with upper and lower percentiles of the Student’s t distribution. A large value of t (for example, greater than the 95th percentile of the Student’s t distribution with n - 2 degrees of freedom) suggests a significant increasing trend; a negative value of t less than the 5th percentile suggests a significant decreasing trend. For small sample sizes, statistical tables can be used to determine whether ρ is significantly different from zero.
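The large-sample comparison described above can be sketched as follows. This is a minimal illustration, not a complete testing procedure: the function names are hypothetical, the values of ρ and n are made up, and the critical value 1.714 is the 95th percentile of the Student's t distribution with 23 degrees of freedom as read from a standard t table. By symmetry of the t distribution, the 5th percentile is the negative of the 95th percentile.

```python
from math import sqrt

def spearman_t(rho, n):
    """Large-sample test statistic t = rho * sqrt(n - 2) / sqrt(1 - rho^2)."""
    if n <= 2:
        raise ValueError("need more than two paired observations")
    if abs(rho) >= 1:
        raise ValueError("statistic is undefined when |rho| equals 1")
    return rho * sqrt(n - 2) / sqrt(1 - rho ** 2)

def classify_trend(t, t_crit):
    """Compare t with the upper percentile t_crit and lower percentile -t_crit."""
    if t > t_crit:
        return "increasing"
    if t < -t_crit:
        return "decreasing"
    return "no significant trend"

# Hypothetical result: rho = 0.5 from n = 25 sampling events (n > 20)
n = 25
rho = 0.5
t = spearman_t(rho, n)  # 0.5 * sqrt(23) / sqrt(0.75), about 2.769

# Assumed table value: 95th percentile of Student's t with n - 2 = 23 df
T_CRIT_95_DF23 = 1.714
result = classify_trend(t, T_CRIT_95_DF23)
```

Here t exceeds the tabulated 95th percentile, so the sketch reports an increasing trend. The critical value must be looked up again for each different n.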
Applications and Relevant Study Questions
- The Spearman correlation coefficient is a common numerical measure of the degree of monotonic association between two variables.
- Use this test to evaluate stationarity of the mean (the absence of a trend) for parametric data sets, which is a requirement for many statistical methods. Stationarity exists when the population being sampled has a constant mean and variance across time and space; the mean is the arithmetic average of a sample set that estimates the middle of a statistical distribution (Unified Guidance). A correlation coefficient differing significantly from zero may indicate the presence of a trend.
- Study Question 5: Is there a trend in contaminant concentrations?
Assumptions
- This test assumes a monotonic relationship between two variables (that is, as one variable increases, the other variable either increases or decreases, but does not fluctuate).
- This test assumes that no seasonal trends are present; seasonal trends generally require more sophisticated evaluations.
- Variables should be identically distributed (but not necessarily independently).
Requirements and Tips
- A minimum of two variables with at least 8 to 10 observations for each variable is recommended. Although it is possible to apply the test with fewer observations, such applications may provide a less meaningful result. A greater number of measurements may be needed if data sets are skewed or contain nondetects.
- The degree of confidence in detecting patterns in the data increases with larger sample sizes.
- Each variable may need to be standardized, for plotting purposes, in order to preserve the scales.
- See Section 5.7 for information regarding the treatment of nondetect data.
- Data should be matched pairs.
- This test does not recognize non-monotonic relationships between variables.
Strengths and Weaknesses
- This test does not require a particular data distribution.
- This test can be used with data sets that contain nondetects. Nondetects result in tied ranks when ρ is calculated.
- This test is not sensitive to outliers (values unusually discrepant from the rest of a series of observations; Unified Guidance).
- This test can be used to detect nonlinear (monotonic) trends.
- Transformation of the data using logarithms (and other monotonic functions) does not alter the value of ρ.
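The invariance of ρ under monotonic transformations noted in the list above can be checked with a short sketch. The helper functions below are hypothetical illustrations (ranks with ties averaged, then Pearson's r on the ranks), and the concentration values are made up; the natural logarithm stands in for any strictly increasing transformation.

```python
from math import log, sqrt

def midranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(midranks(x), midranks(y))

# Hypothetical, strongly skewed concentrations from eight sampling events
events = [1, 2, 3, 4, 5, 6, 7, 8]
conc = [1.0, 2.4, 1.5, 4.0, 12.5, 7.1, 20.3, 41.0]
log_conc = [log(c) for c in conc]

# The log transform preserves the ordering, so the ranks and rho are unchanged
rho_raw = spearman_rho(events, conc)
rho_log = spearman_rho(events, log_conc)

# Pearson's r uses the values themselves, so it does change
r_raw = pearson_r(events, conc)
r_log = pearson_r(events, log_conc)
```

In this sketch the two Spearman coefficients agree, while the two Pearson coefficients differ because the log transform changes the spacing of the values but not their order.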
Further Information
A description of how to construct scatter plots is found in Chapter 9.4, Unified Guidance. A scatter plot is a graphical representation of multiple observations from a single point used to illustrate the relationship between two or more variables; an example would be concentrations of one chemical on the x-axis and a second chemical on the y-axis. Scatter plots are a typical exploratory data analysis tool to identify linear versus nonlinear relationships between variables (Unified Guidance). In all cases, the values of each variable are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks. Additional information is also available in Statistical Methods in Water Resources (Helsel and Hirsch 2002).
Publication Date: December 2013