D.15 R FOR STATISTICS
Approximate Cost: Free
Source: http://www.r-project.org
Operating System Needs: Operates on Windows, Mac OS, and most versions of UNIX.
Input Structure: Scripts can be written in R to read and analyze data from a wide variety of sources, including but not limited to text and binary files, spreadsheets, and databases.
Overview
According to the R FAQ (Hornik 2013), "R is a system for statistical computation and graphics consisting of a programming language plus a runtime environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files." "R is an integrated suite of software facilities for data manipulation, calculation, and graphical display" according to the “An Introduction to R” document (Venables et al. 2013).
The statistical functions in R support linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing, analysis of spatial data, and Bayesian analysis, among others. In addition to storing and manipulating data, a mature collection of functions helps in the production of report-quality graphics. R can be downloaded free from the Comprehensive R Archive Network (CRAN; http://www.r-project.org/). It is distributed under a GNU-style copyleft (http://www.gnu.org/copyleft/copyleft.html) license and is part of the GNU project (http://www.gnu.org).
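As a minimal sketch of the parametric and nonparametric tests mentioned above, the following compares two small groups with a two-sample t test and with a Wilcoxon rank-sum test. The group names and all values are hypothetical and chosen only for illustration; both functions come from the stats package that ships with R.

```r
# Hypothetical concentrations (mg/L) for two groups of wells
background <- c(2.1, 2.8, 3.0, 2.5, 3.3, 2.7)
compliance <- c(3.9, 4.6, 3.5, 5.1, 4.2, 4.8)

# Parametric: Welch two-sample t test (assumes approximately normal data)
t_res <- t.test(compliance, background)

# Nonparametric: Wilcoxon rank-sum test (no distributional assumption)
w_res <- wilcox.test(compliance, background)

print(t_res$p.value)
print(w_res$p.value)
```

Which test is appropriate depends on the data; the sketch only shows that both are available without add-on packages.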
Functions and corresponding data sets are typically organized in units called ‘packages’. The directory where packages are stored is called the library. R comes with a standard set of packages in the standard library; the list of these packages, with detailed descriptions and documentation for each, can be found at http://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html. Additional contributed packages can be downloaded and installed as needed from the CRAN website at http://CRAN.R-project.org/ and from related sites such as Bioconductor (http://www.bioconductor.org/) and Omegahat (http://www.omegahat.org/). Once installed, a package must be loaded into the session before it can be used. Advanced users can program their own packages for custom applications.
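The install-then-load workflow described above might look like the following sketch. The package name "somePackage" is a hypothetical placeholder, and the install step is commented out because it needs a network connection; splines is used for the loading step only because it ships with R.

```r
# Directories that make up the library on this machine
lib_dirs <- .libPaths()
print(lib_dirs)

# Packages attached to the current session (the search path)
attached <- search()
print(attached)

# Load a package from the standard library
library(splines)

# A contributed package must be installed once before it can be loaded.
# "somePackage" is a hypothetical name used only for illustration.
# install.packages("somePackage")   # downloads from a CRAN mirror
# library(somePackage)

# Confirm that the package is now attached
splines_loaded <- "package:splines" %in% search()
print(splines_loaded)
```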
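Statistical Method | Capability As Is | Capability with Scripts/Add-Ins
Handling of NDs | |
[method name missing] | ● | N/A
[method name missing] | ◒ | ●
[method name missing] | ◒ | ●
[method name missing] | ◒ | ●
Exploratory/Diagnostic Tools | |
Summary Statistics | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
Data transformations | ● | N/A
Statistical Design | |
Statistical Power | ● | N/A
[method name missing] | ● | N/A
Contaminant ranking | ● | N/A
[method name missing] | | ◒
Statistical Limits | |
[method name missing] | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
Testing Compliance Limits | ● | N/A
Graphics | |
Plots/Charts | ● | N/A
Batch plots | ● | N/A
Tweaking of graphics | ● | N/A
Statistical Comparisons | |
[method name missing] | ● | N/A
[method name missing] | ● | N/A
Spatial Analysis | |
Geostatistics/Mapping | ◒ | ●
[method name missing] | ◒ | ●
[method name missing] | ◒ | ●
Regression/Time Series | |
[method name missing] | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
[method name missing] | ● | N/A
Multivariate Analysis | |
Multiple regression | ● | N/A
Factor/Discriminant analysis | ● | N/A
[method name missing] | ● | N/A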
Capability Ratings:
N/A = Not applicable or not available
● = Full capability
◒ = Some capability
(blank cell) = No capability
Add-Ins Available
Several existing add-on packages extend the functionality of R. A partial list can be found at http://cran.r-project.org/doc/FAQ/R-FAQ.html#Add_002don-packages-from-CRAN.
Ease of Use and Data Import
The most common data structure in R is the vector. Higher-order data structures such as lists and data frames are also available for more advanced analysis. The R environment may challenge a new user; however, an interactive user interface and comprehensive help documentation are provided. In addition, graphical user interfaces that give access to commonly used functions are under active development.
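The data structures named above can be sketched as follows. The well identifiers, site name, and concentrations are hypothetical and serve only to show how a vector, a data frame, and a list relate to one another.

```r
# A numeric vector: hypothetical concentrations in mg/L
conc <- c(1.2, 0.8, 2.5, 1.9)

# A data frame holds equal-length columns that may differ in type
wells <- data.frame(
  well = c("MW-1", "MW-2", "MW-3", "MW-4"),
  conc = conc,
  stringsAsFactors = FALSE
)

# A list can hold objects of different types and lengths
site <- list(name = "Example Site", data = wells, n = nrow(wells))

# Column access with $ and row subsetting with a logical condition
mean_conc <- mean(wells$conc)
high <- wells[wells$conc > 1.5, ]

# Compact display of the structure of an object
str(site)

# Help pages are opened from the console with, for example, ?mean
```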
Types of Distributions
R can be used to calculate properties of probability distributions and to check whether a given data set fits a standard distribution. A number of distributions and distributional tests are supported in R, including: beta, binomial, Cauchy, chi-squared, exponential, F, gamma, geometric, hypergeometric, lognormal, logistic, negative binomial, normal, Poisson, Student's t, uniform, and Weibull.
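A brief sketch of both uses follows. In the stats package each distribution has functions prefixed d (density), p (cumulative probability), q (quantile), and r (random generation); the normal and lognormal distributions are used here. The simulated sample and the choice of the Shapiro-Wilk test are assumptions made for illustration, not a recommended procedure.

```r
# Properties of the standard normal distribution
d0      <- dnorm(0)       # density at 0
p_upper <- pnorm(1.96)    # P(Z <= 1.96)
q975    <- qnorm(0.975)   # 97.5th percentile

# Simulate a lognormal sample; the seed makes the draw repeatable
set.seed(1)
x <- rlnorm(50, meanlog = 0, sdlog = 1)

# Shapiro-Wilk test of normality on the raw and log-transformed data
sw_raw <- shapiro.test(x)
sw_log <- shapiro.test(log(x))

print(sw_raw$p.value)
print(sw_log$p.value)
```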
Visualization
R has a mature graphics library and can produce presentation-quality graphics for most of the commonly used plots, such as stem-and-leaf plots, box plots, scatter plots, histograms, and contours.
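The plot types listed above are all available in the base graphics package. The sketch below draws four of them into a single PDF file and prints a stem-and-leaf display to the console. The data are simulated, the output path is a temporary file, and volcano is an example data set that ships with R.

```r
# Simulated data, for illustration only
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)
y <- 0.5 * x + rnorm(100)

# Write a 2 x 2 panel of plots to a temporary PDF file
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 6)
par(mfrow = c(2, 2))
boxplot(x, main = "Box plot")
hist(x, main = "Histogram", xlab = "x")
plot(x, y, main = "Scatter plot")
contour(volcano, main = "Contour plot")
invisible(dev.off())   # close the device so the file is written

# Stem-and-leaf display is text output rather than a graphic
stem(x)
```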
Primary Uses for Groundwater Data Analysis
R is commonly used to perform the following tasks:
 calculate summary statistics
 perform distributional tests
 get point estimates of the population mean
 get interval estimates of the population mean with known and unknown variance
 determine the sample size needed to estimate the population mean
 calculate point and interval estimates of population proportion
 test hypotheses
 perform linear and nonlinear regression
 perform analysis on time-series and spatial data
 analyze nondetects in data using substitution-type methods and also more advanced maximum likelihood estimator methods
 develop custom applications
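Several of the tasks listed above can be sketched with base R functions alone. All values below (concentrations, detection limit, effect size, and standard deviation) are hypothetical, and the nondetect step shows only simple substitution at half the detection limit, not the maximum likelihood methods the list also mentions.

```r
# Hypothetical concentrations (mg/L); not real monitoring data
conc <- c(4.1, 5.3, 3.8, 6.2, 4.9, 5.5, 4.4, 5.0)

# Summary statistics and point estimates
print(summary(conc))
xbar <- mean(conc)
s    <- sd(conc)

# Hypothesis test of mean = 4 and a 95% confidence interval
# for the mean when the variance is unknown
tt <- t.test(conc, mu = 4, conf.level = 0.95)
ci <- tt$conf.int

# Sample size for a one-sample t test; delta and sd are assumed values
n_needed <- power.t.test(delta = 1, sd = 1, sig.level = 0.05,
                         power = 0.8, type = "one.sample")$n

# Linear regression of concentration on sampling event number
event <- seq_along(conc)
fit   <- lm(conc ~ event)
slope <- coef(fit)[["event"]]

# Simple substitution for nondetects: results flagged as not detected
# are replaced by half of an assumed detection limit
dl          <- 1.0
result      <- c(2.3, 1.0, 3.1, 1.0)
detected    <- c(TRUE, FALSE, TRUE, FALSE)
substituted <- ifelse(detected, result, dl / 2)

print(ci)
print(ceiling(n_needed))
print(slope)
print(substituted)
```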
Benefits
 provides a flexible, interactive, and powerful environment for data analysis and visualization
 free
 built-in support for a variety of statistical analyses, from simple to complex
 scripts for performing complex analysis
 easily produces presentation-quality graphics and automated reports
 active and knowledgeable online community for support issues
 detailed online documentation
Limitations and Data Requirements
 The program provides the functions and libraries to read and process data from a variety of sources including, but not limited to, ASCII files, binary files, spreadsheets, and databases.
 As long as the data format and structure are known, data can be imported into the R environment.
 The environment is challenging to the first-time user and presents a steep initial learning curve.
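As a minimal sketch of importing data whose format and structure are known, the following writes a small comma-separated text file to a temporary location and reads it back. The column names and values are hypothetical; a real script would point read.csv at an existing file instead.

```r
# Create a small example file; contents are hypothetical
csv_file <- tempfile(fileext = ".csv")
writeLines(c("well,date,conc",
             "MW-1,2013-01-15,1.2",
             "MW-2,2013-01-15,0.8",
             "MW-1,2013-04-15,1.5"),
           csv_file)

# Read the file into a data frame; the header row supplies column names
dat <- read.csv(csv_file, stringsAsFactors = FALSE)

# Dates are read as text and converted explicitly
dat$date <- as.Date(dat$date)

# Inspect column types and the first few values
str(dat)
```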
References
Faraway, J. 2002. Practical Regression and ANOVA Using R. http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf.
Hornik, K. 2013. The R FAQ. http://CRAN.R-project.org/doc/FAQ/R-FAQ.html.
R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. http://www.r-project.org.
Venables, W.N., D.M. Smith, and the R Core Team. 2013. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 3.0.1.
Publication Date: December 2013