Monday, June 28, 2010

Add Screening Levels to EQuIS-generated Excel charts

The AMEC data management team is often asked to add a horizontal line to time-concentration plots representing screening levels such as prediction limits, regulatory limits, or risk screening levels.  The screening level is a handy reference for the viewer to compare against the plotted concentrations, but manually adding screening levels to dozens of charts after I've whipped them out with EQuIS is enough to make me cry.

The easiest way to get the screening level line onto an Excel chart is to add a new series composed of two end points (min date at the screening level, and max date at the screening level). Adding a new series works well because it moves with the data should you readjust the x and y axes.

A comparison of my dummy data set below to the RSL in red shows we've got a little regulatory problem here....

On a recent project I posted 20 single-parameter time-series charts representing upgradient-downgradient well clusters on one sheet, with the upgradient well cluster on the left and the four successive downgradient clusters to the right, ending with the most downgradient cluster on the far right. Each well cluster consists of four wells, each screened in a different water-bearing zone.  The chart representing the shallowest zone was at the top of the sheet, with each deeper zone beneath it. I used EQuIS to generate about 30 Excel workbooks containing the 20 charts, one book for each parameter.  Then I wrote a couple of macros that looped through each workbook, formatting the charts and posting each set on one sheet. (Data and identifying info have been removed.)
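
The production macros didn't make it into this post, but here's a stripped-down sketch of the grouping step; the sheet name, grid size, and chart dimensions below are placeholders rather than the real values:

'Rough sketch only: tiles every chart sheet in the active workbook onto
'one worksheet, five across (clusters) by four down (zones)
Sub group_charts()

Dim tgt As Worksheet, co As ChartObject
Dim i As Long
Const PER_ROW As Long = 5        'well clusters, left to right
Const CHT_W As Single = 330      'chart width in points (placeholder)
Const CHT_H As Single = 200      'chart height in points (placeholder)

Set tgt = ActiveWorkbook.Worksheets.Add
tgt.Name = "ChartGrid"

'Move chart sheets one at a time; relocating a chart removes it from
'the Charts collection, so don't use For Each here
Do While ActiveWorkbook.Charts.Count > 0
  ActiveWorkbook.Charts(1).Location Where:=xlLocationAsObject, Name:=tgt.Name
Loop

'Tile the now-embedded charts into the grid
i = 0
For Each co In tgt.ChartObjects
  co.Left = (i Mod PER_ROW) * CHT_W
  co.Top = (i \ PER_ROW) * CHT_H
  co.Width = CHT_W
  co.Height = CHT_H
  i = i + 1
Next co

End Sub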


The project team liked the layout and formatting, but then asked me to add risk-screening levels... yikes!! I schemed for a day about how I could get EQuIS to generate a series of screening levels with the chemical data. My first thought was to modify EQuIS' analytical results stored procedure to update each result set with the risk data, then use the EQuIS chart utility to create each chart with the two series: my target constituent and the risk data. However, when I realized what a hassle it would be to figure out the min and max dates for each result set to apply to the RSL series, I came up with the Excel VBA solution posted below.  My chart grouping macro calls this sub procedure prior to adding the charts to one sheet. It prompts the user to enter the screening level into an input box, then adds the series to each chart in the workbook.  Quick, simple, and I can re-run it for a variety of screening levels without having to regenerate the whole chart set. I almost broke my arm patting myself on the back....

Let me know if you have any questions.

'adds Risk Screening Level chart series to existing charts
Private Sub add_RSLs()


Dim lowDate As Date, highDate As Date
Dim SL As Variant   'screening level entered by the user
Dim cht As Chart


'removes old RSL data if it exists
On Error Resume Next
Application.DisplayAlerts = False
Sheets("RSL").Delete
Application.DisplayAlerts = True
On Error GoTo 0

'picks the date ranges for existing data to use with the soon-to-be created RSL chart series data
lowDate = ActiveWorkbook.Charts(1).Axes(xlCategory).MinimumScale
highDate = ActiveWorkbook.Charts(1).Axes(xlCategory).MaximumScale


'user enters the RSL
SL = InputBox("Enter Screening Level")


'InputBox returns a string: bail out on Cancel or a non-numeric entry
If SL = "" Or Not IsNumeric(SL) Then
  MsgBox "No SL Entered"
  Exit Sub
End If
SL = CDbl(SL)


'adds a new worksheet along with the data to be used for the RSL chart series
Sheets.Add before:=Sheets(1)
ActiveSheet.Name = "RSL"
Sheets("RSL").Cells(1, 1) = lowDate
Sheets("RSL").Cells(2, 1) = highDate
Sheets("RSL").Cells(1, 2) = SL
Sheets("RSL").Cells(2, 2) = SL


'Assumes every chart in workbook is the same compound and will have the RSL series added to it
'Loops through every chart in the workbook
For Each cht In ActiveWorkbook.Charts
  cht.Activate


'deletes the old RSL series if it exists
  On Error Resume Next
  ActiveChart.SeriesCollection("RSL").Delete
  On Error GoTo 0


'adds the new RSL chart series to each chart
  With ActiveChart.SeriesCollection.NewSeries
    .Name = "RSL"
    .Values = Sheets("RSL").Range("B1:B2")
    .XValues = Sheets("RSL").Range("A1:A2")
  End With


'formats the new series
  ActiveChart.SeriesCollection("RSL").Select
  Selection.Border.Weight = xlThin
  Selection.Border.LineStyle = xlDash
  Selection.Border.ColorIndex = 3   'red
  Selection.MarkerStyle = xlNone    'line only, no point markers


Next cht
End Sub
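
Since add_RSLs is declared Private it won't show up in the macro dialog; it gets called from the grouping macro in the same module, along these lines (the caller and the group_charts name are from the sketch above, not the production code):

'hypothetical caller, same module as add_RSLs
Sub make_chart_set()
  add_RSLs        'add the screening level series first
  group_charts    'then arrange the charts on one sheet
End Sub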

Tuesday, June 15, 2010

Using Excel in place of ArcGIS- Really....

Who needs ArcGIS when you have MS Excel...?

Here's a new approach to map making.

Tuesday, June 1, 2010

T-SQL Reference Books

I've found the following book titles invaluable for modifying stored procedures and functions in SQL Server. I've used info from the books to help me analyze data too. Neither one is really a newbie how-to, but if you know just a few basics (SELECT, FROM, WHERE) and you're curious, you'll be golden.

1. SQL Pocket Guide by Jonathan Gennick (O'Reilly). This one's a great reference when you want to know something quickly, like a date-format code or what the dang in-string function is (in VBA it's InStr() and in T-SQL it's CHARINDEX; see the quick example after this list).

2. SQL Cookbook by Anthony Molinaro (O'Reilly). This one helps you dig into T-SQL. It explains complex steps in a simple manner and gives the rationale behind the author's approach. I've learned a lot from this book.
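
For instance, here's the in-string lookup in both dialects (the T-SQL is shown as a comment so the snippet stays runnable in the VBA editor):

'quick sanity check of the in-string function in each dialect
Sub instr_demo()
  'VBA: InStr(string, substring) returns a 1-based position (0 = not found)
  Debug.Print InStr("benzene", "zen")          'prints 4
  'T-SQL flips the argument order:
  '  SELECT CHARINDEX('zen', 'benzene');       -- returns 4
End Sub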

Nondetects and Data Analysis

This is from the PracticalStats.com newsletter, by Dennis Helsel.

"Nondetects And Data Analysis" (both the course and textbook) spend considerable time on survival analysis methods applied to censored environmental data. In spite of the many documented cases of errors arising from substituting values for nondetects (see
http://dx.doi.org/10.1093/annhyg/mep092 for a new explanation of the dangers), substitution remains popular because it is so easy. Yet there are two other easy solutions that do not introduce the errors inherent in substitution.

-- For computing descriptive statistics
What is a measure of the center for the following dataset?
<5 <5 8 15 19 24 27 33 41
To compute the mean we would need to use a survival analysis procedure like maximum likelihood estimation (MLE) to avoid the 'invasive data' problem of substitution. But there are two easier solutions.
a) use the median. The median of these 9 values is the 5th observation from the bottom, or 19.
b) use the percent of data above the detection limit. There are 7/9, or 78%, of the data above 5.
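
A quick VBA sketch of those two calculations (mine, not the newsletter's), with the nine values hard-coded and nondetects flagged:

'minimal sketch: the nine values above, sorted, with a censoring flag
Sub censored_descriptives()

Dim vals As Variant, cens As Variant
Dim n As Long, hits As Long, i As Long

vals = Array(5, 5, 8, 15, 19, 24, 27, 33, 41)
cens = Array(True, True, False, False, False, False, False, False, False)
n = UBound(vals) + 1

'median: fewer than half the values are censored, so the middle
'observation is a detected value and no substitution is needed
Debug.Print "median:"; vals((n - 1) \ 2)          'prints 19

'percent of data above the detection limit
For i = 0 To n - 1
  If Not cens(i) Then hits = hits + 1
Next i
Debug.Print "detected:"; Format(hits / n, "0%")   'prints 78%

End Sub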

And for this second dataset with two detection limits,
<5 <5 8 15 <20 24 27 33 41
the two options remain the same. The median is <20. And there are 4/9 or 44% of the observations above 20.

-- In general, these two procedures can be extended to other statistical applications. First, we can use methods based on percentiles, or ranks. The rank i divided by (n+1) is the position for the percentile of that observation. The 6th ranked observation (24) above, for example, is at (6/10) or the 60th percentile. Methods based on ranks, such as nonparametric hypothesis tests, are procedures analyzing the percentiles of the data.

Second, we can treat the data as being either above or below the (highest) detection limit, and interpret the proportion of data falling above the limit. Binomial procedures allow us to discern changes in such proportions.

-- For hypothesis testing, nonparametric hypothesis tests are based on percentiles, or ranks. To compare two sets of data, the Mann-Whitney (also called rank-sum) test can always be used without requiring substitution. The test determines whether the cumulative distribution frequencies (the set of percentiles) in the two data sets are similar, or different. If multiple detection limits are present, all data below the highest limit are coded as being tied in order to use the simple version of this test. For example, the two data sets:
<5 <5 8 15 <20 24 27 33 41
and
<5 <5 <5 <5 6 9 10 12 16 21

are re-coded to
<20 <20 <20 <20 <20 24 27 33 41
and
<20 <20 <20 <20 <20 <20 <20 <20 <20 21

and their ranks are:
7.5 7.5 7.5 7.5 7.5 16 17 18 19
and
7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 15

The identical procedure would precede the Kruskal-Wallis test when there are three or more groups. This simple method is far superior to substitution because no artificial pattern alien to the original data is placed into the data, as it is with substitution. Unlike substitution followed by t-tests or ANOVA, if differences are found by the Mann-Whitney or Kruskal-Wallis tests, they can be believed. If patterns are obscured by re-censoring at the highest limit, more complicated survival analysis methods are available. But we're trying here to keep it simple.
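
Here's my own brute-force sketch of that recoding and ranking step (not from the newsletter); the two data sets are hard-coded after recoding everything below the highest limit to 20:

'sketch: average ranks with ties, then the rank sum for data set 1
Sub rank_sum_recode()

Dim dat As Variant, grp As Variant
Dim rnk() As Double
Dim n As Long, i As Long, j As Long
Dim below As Long, tied As Long
Dim W As Double

dat = Array(20, 20, 20, 20, 20, 24, 27, 33, 41, _
            20, 20, 20, 20, 20, 20, 20, 20, 20, 21)
grp = Array(1, 1, 1, 1, 1, 1, 1, 1, 1, _
            2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
n = UBound(dat) + 1
ReDim rnk(n - 1)

'average rank = (count strictly below) + (count tied + 1) / 2
For i = 0 To n - 1
  below = 0: tied = 0
  For j = 0 To n - 1
    If dat(j) < dat(i) Then below = below + 1
    If dat(j) = dat(i) Then tied = tied + 1
  Next j
  rnk(i) = below + (tied + 1) / 2
Next i

'rank sum for the first data set: 5 * 7.5 + 16 + 17 + 18 + 19 = 107.5
For i = 0 To n - 1
  If grp(i) = 1 Then W = W + rnk(i)
Next i
Debug.Print "rank sum W:"; W

End Sub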

The second test procedure is the binomial-based "test of proportions" or contingency table test. Here the proportions of data above the (highest) detection limit are tested for similarity or difference. Data are coded into two groups, above and below the highest detection limit. For the above data, the first group has 4/9 or 44% of values above 20, while the second group has only 1/10 or 10% above the value of 20. The test determines whether these two percentages are significantly different.
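
And a sketch (again mine) of the large-sample version of that comparison; with counts this small you'd really want Fisher's exact test, so treat this as illustration only:

'sketch: pooled two-proportion z statistic on the counts above
Sub proportion_test()

Dim x1 As Double, n1 As Double, x2 As Double, n2 As Double
Dim p1 As Double, p2 As Double, p As Double, z As Double

x1 = 4: n1 = 9      'first data set: 4 of 9 above 20
x2 = 1: n2 = 10     'second data set: 1 of 10 above 20

p1 = x1 / n1
p2 = x2 / n2
p = (x1 + x2) / (n1 + n2)                        'pooled proportion
z = (p1 - p2) / Sqr(p * (1 - p) * (1 / n1 + 1 / n2))

Debug.Print "z ="; Format(z, "0.00")             'about 1.70

End Sub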

-- For correlation and regression
Nonparametric correlation coefficients Kendall's tau and Spearman's rho may be computed on data with nondetects, without substitution. Kendall's tau easily handles data with multiple detection limits, though the software is not usually written to do so. The nonparametric Theil-Sen line (used for the Mann-Kendall trend test, for example) may end up with a...

One approach to perform regression with a binary Y variable is called logistic regression. Here the probability of being in the higher category, say the probability of recording a detected value, is predicted from one or more explanatory variables. Interpretation of the results is very similar to ordinary regression.

So all in all, these simple methods do not require substitution, can be computed with standard statistical software, and avoid the pitfalls of "invasive data" that result from fabricating data by substitution. If you can't justify going to the more complicated procedures in Nondetects And Data Analysis that handle nondetects at multiple levels, these simpler methods should meet your requirements.