Re: [R-sig-eco] Multiple comparisons among predictors generated from same data

2012-05-25 Thread Gavin Simpson
On Thu, 2012-05-24 at 15:00 -0700, J Straka wrote:
 Hello,
 
 I'm planning on using a regression model to describe seed set of plants (my
 response) using some sort of predictor based on temperature.  I have a
 number of temperature variables calculated from the same set of data
 (hourly temperatures for the growing season, converted to variables such as
 average temperature, maximum temperature, minimum temperature, degree-days
 above zero Celsius, degree days above ten Celsius, etc...), and I want to
 decide which one should be included in my model. I know that I would
 ideally select one based on prior knowledge of the system (e.g. so-called
 planned comparisons or choosing a temperature threshold that is known to
 be important for the development of seeds), but not much is known about
 this system.

What is the model for? Is it for understanding, so that you want to
interpret the coefficients directly as something meaningful, or is it
for prediction?

If the latter I would say it doesn't really matter; choose the model
that gives the best out-of-sample predictions (lowest error etc), or
average predictions over a set of best/good models. Simply choosing the
best model via some sort of selection procedure may result in a model
with high variance (change the data a bit and different variables would
be selected). If so, consider a regression method that applies shrinkage
to the coefficients such as the lasso or the elastic net; this will lead
to a small bit of bias in the estimates of the coefficients but should
reduce the variance of the final model because you are considering the
selection of variables as part of the model itself.
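
To make the out-of-sample comparison concrete, here is a minimal sketch in
base R using leave-one-out cross-validation. The data are simulated and
every name in it (seed_set, mean_temp, max_temp, gdd5, loo_rmse) is a
made-up placeholder, not something from the original study.

```r
# Simulated stand-in data; all names and numbers are assumptions.
set.seed(1)
n <- 40
mean_temp <- rnorm(n, mean = 12, sd = 2)
dat <- data.frame(
  mean_temp = mean_temp,
  max_temp  = mean_temp + rnorm(n, mean = 8, sd = 1),
  gdd5      = pmax(mean_temp - 5, 0) * 90 + rnorm(n, sd = 40)
)
dat$seed_set <- 5 + 0.02 * dat$gdd5 + rnorm(n, sd = 3)

# Leave-one-out cross-validated RMSE for a single-predictor linear model
loo_rmse <- function(predictor, data) {
  form <- reformulate(predictor, response = "seed_set")
  errs <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(form, data = data[-i, ])          # fit without observation i
    data$seed_set[i] - predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  sqrt(mean(errs^2))
}

predictors <- c("mean_temp", "max_temp", "gdd5")
cv_rmse <- vapply(predictors, loo_rmse, numeric(1), data = dat)
sort(cv_rmse)   # lower is better
```

With predictors as strongly related as these, the cross-validated errors
will often be close to each other, which is itself useful to know. For the
shrinkage route, the glmnet package (cv.glmnet()) is one implementation of
the lasso and elastic net.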

If you want to interpret the model coefficients as something real then
you have to be very careful doing any form of selection; the stepwise
procedures and best subsets all can potentially lead to strong bias in
the model coefficients. By removing a variable from the model, in effect
you are saying that the sample estimate of the effect of that variable
on the response is 0, not some small (statistically insignificant)
value.

This is a very tricky thing to get right and I'm not sure I know the
right answer (or even if there is one!?).

 I've been warned against testing the significance of multiple predictors
 using p-values, unless I use Bonferroni correction (or some equivalent).
 Unfortunately, using Bonferroni correction would result in something like p
 = 0.05/7 (for seven different temperature variables); a rather small value
 for detecting anything! I was wondering whether it would be appropriate to
 instead use likelihood-based techniques (direct comparisons of
 log-likelihoods or AIC scores) to compare a series of models using each of
 the alternative predictors in turn, and choose the most relevant
 temperature variable (i.e. predictor) based on that.

Choosing models by AIC or BIC is just the same as doing it using
p-values; the selection procedure has all the problems I mention above.
LRTs require a significance test of the ratio of the two likelihoods, so
you are still doing a series of sequential tests whose overall error
rate you might want to control.
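
A small simulation can illustrate the point. This is only a sketch under
stated assumptions: seven independent noise predictors (real temperature
variables would be correlated, so the inflation would be smaller), a
response unrelated to all of them, and made-up sample sizes. In each run
the single-predictor model with the lowest AIC is picked and the p-value
of its slope is recorded.

```r
set.seed(42)
n_obs  <- 30    # observations per simulated data set (assumption)
n_pred <- 7     # candidate predictors, as in the question
n_sim  <- 500   # number of simulated data sets

p_selected <- replicate(n_sim, {
  x <- matrix(rnorm(n_obs * n_pred), ncol = n_pred)
  y <- rnorm(n_obs)                          # unrelated to every column of x
  fits <- lapply(seq_len(n_pred), function(j) lm(y ~ x[, j]))
  best <- which.min(vapply(fits, AIC, numeric(1)))
  summary(fits[[best]])$coefficients[2, 4]   # p-value of the chosen slope
})

# Proportion of runs in which the AIC-selected predictor looks "significant"
rate <- mean(p_selected < 0.05)
rate
```

For independent predictors the proportion should sit near
1 - 0.95^7, or about 0.30, rather than 0.05: picking the best model by AIC
and then reading its p-value at face value is an uncorrected multiple
comparison.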

There are other corrections for multiple testing. For example, see the
p.adjust() function in R for some options.
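
As a rough illustration of p.adjust(), the sketch below adjusts seven
p-values, one per temperature variable. The p-values and variable names
are invented for the example.

```r
# Seven hypothetical p-values, one per temperature variable (made-up numbers)
p_raw <- c(mean_temp = 0.004, max_temp = 0.020, min_temp = 0.300,
           gdd0 = 0.011, gdd5 = 0.008, gdd10 = 0.045, range = 0.600)

adjusted <- cbind(
  raw        = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),  # p * 7, capped at 1
  holm       = p.adjust(p_raw, method = "holm"),        # step-down, family-wise
  BH         = p.adjust(p_raw, method = "BH")           # false discovery rate
)
round(adjusted, 3)
```

Comparing a Bonferroni-adjusted p-value with 0.05 is the same as comparing
the raw p-value with 0.05/7. Holm controls the same family-wise error rate
and is never more conservative than Bonferroni; BH controls the false
discovery rate instead, which is a weaker guarantee.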

HTH

G

 Thoughts on the validity of this approach? Would any adjustments have to be
 made for multiple comparisons if I used this strategy?
 
 Jason Straka
 University of Victoria
 
 ___
 R-sig-ecology mailing list
 R-sig-ecology@r-project.org
 https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


Re: [R-sig-eco] Multiple comparisons among predictors generated from same data

2012-05-25 Thread Bob O'Hara

On 05/25/2012 10:18 AM, Gavin Simpson wrote:

This is a very tricky thing to get right and I'm not sure I know the
right answer (or even if there is one!?).

An additional complication here is that the variables are going to be
correlated, so a model with all or most of them in it could be unstable.
If a single temperature variable is enough, then I'd suggest either
trying your best to pick one, or using what everyone else uses (GDD5?),
so that the study is comparable with others.
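
To see how strongly such variables can be correlated, here is a sketch
that simulates hourly temperatures for a set of sites and derives summary
variables from them. Everything in it is an assumption for illustration:
the site count, the shape of the daily cycle, and the simple degree-day
definition (sum over days of the daily mean above a base temperature).

```r
set.seed(7)
n_sites <- 25; n_days <- 100; hours <- 0:23

derive <- function(site_mean) {
  # hourly temperature = site mean + daily cycle + day-to-day noise + hourly noise
  day_effect <- rep(rnorm(n_days, sd = 3), each = 24)
  temp <- site_mean + 5 * sin(2 * pi * (rep(hours, n_days) - 9) / 24) +
    day_effect + rnorm(n_days * 24, sd = 1)
  daily_mean <- tapply(temp, rep(seq_len(n_days), each = 24), mean)
  c(mean_temp = mean(temp), max_temp = max(temp), min_temp = min(temp),
    gdd0  = sum(pmax(daily_mean - 0, 0)),    # degree-days above 0 C
    gdd5  = sum(pmax(daily_mean - 5, 0)),    # degree-days above 5 C
    gdd10 = sum(pmax(daily_mean - 10, 0)))   # degree-days above 10 C
}

site_means <- runif(n_sites, min = 6, max = 14)
temp_vars  <- t(vapply(site_means, derive, numeric(6)))  # one row per site
temp_cor   <- cor(temp_vars)
round(temp_cor, 2)
```

When most pairwise correlations are high, the variables carry largely the
same information, and putting several of them in one regression will tend
to give unstable coefficients.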


Once you have a model, it might be worth checking to see if the other 
variables tell a different story. If it's the same story but with 
different p-values, you might as well stick to the original analysis.
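
One way to run that check is to refit the model with each alternative
variable in turn, on a standardised scale, and compare the direction and
size of the slopes. The sketch below uses simulated data; the variable
names and the data-generating numbers are placeholders only.

```r
# Simulated stand-in data; all names and numbers are assumptions.
set.seed(3)
n <- 40
mean_temp <- rnorm(n, mean = 12, sd = 2)
dat <- data.frame(
  mean_temp = mean_temp,
  max_temp  = mean_temp + rnorm(n, mean = 8, sd = 1),
  gdd5      = pmax(mean_temp - 5, 0) * 90 + rnorm(n, sd = 40)
)
dat$seed_set <- 5 + 0.02 * dat$gdd5 + rnorm(n, sd = 3)

# Slope per standard deviation of each predictor, with its p-value
story <- t(vapply(c("mean_temp", "max_temp", "gdd5"), function(v) {
  fit <- lm(dat$seed_set ~ scale(dat[[v]]))
  est <- summary(fit)$coefficients[2, ]
  c(slope_per_sd = unname(est["Estimate"]),
    p_value      = unname(est["Pr(>|t|)"]))
}, numeric(2)))
story
```

If the slopes share a sign and are of similar size, the variables are
telling the same story and only the p-values differ.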


Bob

--

Bob O'Hara

Biodiversity and Climate Research Centre
Senckenberganlage 25
D-60325 Frankfurt am Main,
Germany

Tel: +49 69 798 40226
Mobile: +49 1515 888 5440
WWW:   http://www.bik-f.de/root/index.php?page_id=219
Blog: http://blogs.nature.com/boboh
Journal of Negative Results - EEB: www.jnr-eeb.org


[R-sig-eco] Multiple comparisons among predictors generated from same data

2012-05-24 Thread J Straka
Hello,

I'm planning on using a regression model to describe seed set of plants (my
response) using some sort of predictor based on temperature.  I have a
number of temperature variables calculated from the same set of data
(hourly temperatures for the growing season, converted to variables such as
average temperature, maximum temperature, minimum temperature, degree-days
above zero Celsius, degree days above ten Celsius, etc...), and I want to
decide which one should be included in my model. I know that I would
ideally select one based on prior knowledge of the system (e.g. so-called
planned comparisons or choosing a temperature threshold that is known to
be important for the development of seeds), but not much is known about
this system.

I've been warned against testing the significance of multiple predictors
using p-values, unless I use Bonferroni correction (or some equivalent).
Unfortunately, using Bonferroni correction would result in something like p
= 0.05/7 (for seven different temperature variables); a rather small value
for detecting anything! I was wondering whether it would be appropriate to
instead use likelihood-based techniques (direct comparisons of
log-likelihoods or AIC scores) to compare a series of models using each of
the alternative predictors in turn, and choose the most relevant
temperature variable (i.e. predictor) based on that.

Thoughts on the validity of this approach? Would any adjustments have to be
made for multiple comparisons if I used this strategy?

Jason Straka
University of Victoria
