Tuesday, June 1, 2010

Non Detects and Data Analysis

This is from the PracticalStats.com newsletter, by Dennis Helsel.

"Nondetects And Data Analysis" (both the course and textbook) spend considerable time on survival analysis methods applied to censored environmental data. In spite of the many documented cases of errors arising from substituting values for nondetects (see http://dx.doi.org/10.1093/annhyg/mep092 for a new explanation of the dangers), substitution remains popular because it is so easy. Yet there are two other easy solutions that do not introduce the errors inherent in substitution.

-- For computing descriptive statistics
What is a measure of the center for the following dataset?
<5 <5 8 15 19 24 27 33 41
To compute the mean we would need to use a survival analysis procedure like maximum likelihood estimation (MLE) to avoid the 'invasive data' problem of substitution. But there are two easier solutions.
a) use the median. The median of these 9 values is the 5th observation from the bottom, or 19.
b) use the percent of data above the detection limit. There are 7/9, or 78% of the data above 5.

And for this second dataset with two detection limits,
<5 <5 8 15 <20 24 27 33 41
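the two options remain the same. The median is still the 5th observation from the bottom, which is <20: five of the nine values are known to lie below 20, so the median does too. And 4/9, or 44%, of the observations are above 20.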
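
The two summaries above can be sketched in a few lines of Python. This is only an illustration, not code from the newsletter: the function names are made up, a nondetect is written as a string such as "<5", and the sketch assumes an odd sample size with at least one nondetect. It treats every value below the highest detection limit as known only to be below that limit.

```python
def parse(text):
    """Turn a string such as '<5 8 15' into (value, is_nondetect) pairs."""
    return [(float(tok.lstrip("<")), tok.startswith("<")) for tok in text.split()]


def median_and_fraction_above(text):
    """Return (median label, fraction of data at or above the highest limit).

    Sketch only: assumes an odd sample size and at least one nondetect.
    """
    obs = parse(text)
    n = len(obs)
    if n % 2 == 0:
        raise ValueError("this sketch only handles odd sample sizes")
    highest = max(value for value, nondetect in obs if nondetect)
    # Nondetects, and detects below the highest limit, are only known to be < highest.
    n_below = sum(1 for value, nondetect in obs if nondetect or value < highest)
    above = sorted(value for value, nondetect in obs if not nondetect and value >= highest)
    middle = (n + 1) // 2  # 1-based position of the median
    if middle <= n_below:
        median = f"<{highest:g}"
    else:
        median = f"{above[middle - n_below - 1]:g}"
    return median, len(above) / n


print(median_and_fraction_above("<5 <5 8 15 19 24 27 33 41"))   # median 19, 7/9 above 5
print(median_and_fraction_above("<5 <5 8 15 <20 24 27 33 41"))  # median <20, 4/9 above 20
```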

-- We can generalize these two procedures to other statistical applications. First, we can use methods based on percentiles, or ranks. The rank i divided by (n+1) gives the percentile position of that observation. The 6th ranked observation (24) above, for example, is at (6/10) or the 60th percentile. Methods based on ranks, such as nonparametric hypothesis tests, are procedures that analyze the percentiles of the data.
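
As a quick check of the i/(n+1) arithmetic, here is a minimal Python sketch. The function name is hypothetical; only the formula comes from the text.

```python
def percentile_position(rank, n):
    """Percentile position of the observation with the given rank: i / (n + 1)."""
    return rank / (n + 1)


n = 9  # number of observations in the example dataset
print(percentile_position(6, n))  # the 6th ranked observation (24) sits at 6/10
```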

Second, we can treat the data as being either above or below the (highest) detection limit, and interpret the proportion of data falling above the limit. Binomial procedures allow us to discern changes in such proportions.

-- For hypothesis testing, nonparametric hypothesis tests are based on percentiles, or ranks. To compare two sets of data, the Mann-Whitney (also called rank-sum) test can always be used without requiring substitution. The test determines whether the cumulative distribution frequencies (the set of percentiles) in the two data sets are similar, or different. If multiple detection limits are present, all data below the highest limit are coded as being tied in order to use the simple version of this test. For example, the two data sets:
<5 <5 8 15 <20 24 27 33 41
and
<5 <5 <5 <5 6 9 10 12 16 21

are re-coded to
<20 <20 <20 <20 <20 24 27 33 41
and
<20 <20 <20 <20 <20 <20 <20 <20 <20 21

and their ranks are:
7.5 7.5 7.5 7.5 7.5 16 17 18 19
and
7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 15

The identical procedure would precede the Kruskal-Wallis test when there are three or more groups. This simple method is far superior to substitution because no artificial pattern alien to the original data is placed into the data, as it is with substitution. Unlike substitution followed by t-tests or ANOVA, if differences are found by the Mann-Whitney or Kruskal-Wallis tests, they can be believed. If patterns are obscured by re-censoring at the highest limit, more complicated survival analysis methods are available. But we're trying here to keep it simple.
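
The re-coding and ranking steps shown above can be reproduced with a short Python sketch. The helper names are made up for illustration, and negative infinity is just a convenient stand-in for "below the highest limit" so that all re-censored values tie with each other and rank below every retained detect. The sketch stops at the joint ranks and the Mann-Whitney U statistic for the first group; it does not compute a p-value.

```python
def recensor(tokens, limit):
    """Code every nondetect, and every detect below `limit`, as one tied low value."""
    coded = []
    for tok in tokens:
        value = float(tok.lstrip("<"))
        if tok.startswith("<") or value < limit:
            coded.append(float("-inf"))  # stand-in for "<limit"; all of these tie
        else:
            coded.append(value)
    return coded


def midranks(values):
    """1-based ranks in which tied values share the average of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


group_a = "<5 <5 8 15 <20 24 27 33 41".split()
group_b = "<5 <5 <5 <5 6 9 10 12 16 21".split()

# Highest detection limit across both groups (20 here).
limit = max(float(t.lstrip("<")) for t in group_a + group_b if t.startswith("<"))

coded_a, coded_b = recensor(group_a, limit), recensor(group_b, limit)
ranks = midranks(coded_a + coded_b)  # rank the two groups jointly
ranks_a, ranks_b = ranks[:len(coded_a)], ranks[len(coded_a):]

# U statistic for group A: its rank sum minus n_a * (n_a + 1) / 2.
u_a = sum(ranks_a) - len(ranks_a) * (len(ranks_a) + 1) / 2

print(ranks_a)
print(ranks_b)
print(u_a)
```

Standard software can take the re-coded values directly; SciPy's `scipy.stats.mannwhitneyu`, for instance, should accept them once the tied group is replaced by any single number below the smallest retained detect.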

The second test procedure is the binomial-based "test of proportions" or contingency table test. Here the proportions of data above the (highest) detection limit are tested for similarity or difference. Data are coded into two groups, above and below the highest detection limit. For the above data, the first group has 4/9 or 44% of values above 20, while the second group has only 1/10 or 10% above the value of 20. The test determines whether these two percentages are significantly different.
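
For a table this small, one way to carry out the comparison is Fisher's exact test on the 2x2 table of counts above and below 20. That choice is an assumption made for this sketch; the newsletter does not say which contingency-table test it has in mind. The function below is written out by hand for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k "above" counts in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Group 1: 4 above 20, 5 below.  Group 2: 1 above 20, 9 below.
p_value = fisher_exact_two_sided(4, 5, 1, 9)
print(round(p_value, 3))  # about 0.141 for these counts
```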

-- For correlation and regression
Nonparametric correlation coefficients Kendall's tau and Spearman's rho may be computed on data with nondetects, without substitution. Kendall's tau easily handles data with multiple detection limits, though the software is not usually written to do so. The nonparametric Theil-Sen line (used for the Mann-Kendall trend test, for example) may end up with a
One approach to perform regression with a binary Y variable is called logistic regression. Here the probability of being in the higher category, say the probability of recording a detected value, is predicted from one or more explanatory variables. Interpretation of the results is very similar to ordinary regression.
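
To make the idea concrete, here is a minimal sketch of a one-variable logistic regression fitted by Newton-Raphson in plain Python. The data are entirely hypothetical and exist only to exercise the code: x is some explanatory variable and y is 1 for a detected value, 0 for a nondetect. In practice standard statistical software would do the fitting.

```python
from math import exp


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # 1 = detected, 0 = nondetect

b0, b1 = fit_logistic(x, y)
prob_detect = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
print(b0, b1)
```

A positive slope b1 means the predicted probability of a detected value rises as x increases, which is read much like the sign of a slope in ordinary regression.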

So all in all, these simple methods do not require substitution, can be computed with standard statistical software, and avoid the pitfalls of "invasive data" that result from fabricating data by substitution. If you can't justify going to the more complicated procedures in Nondetects And Data Analysis that handle nondetects at multiple levels, these simpler methods should meet your requirements.
