Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-26 Thread Nicholas Lewin-Koh
Hi,
to weigh in on this:
@Aitor, Harrell's rules of thumb assume independent predictors without
any fancy covariance function. To model the covariance of the residuals
you are now estimating extra second-order parameters from the data, so
even more data is needed to stabilize the parameter estimates. The good
news is that in the residual space it is the numbers of adjacent 0's or
1's that matter.

However, if the goal is prediction of species occurrence at unoccupied
sites, then you may want to think about the problem differently and use
either indicator kriging, a kind of spatial tobit model that predicts
probabilities of occurrence based on Gaussian random fields, or you
might want to look at geoRglm, for geostatistics in the glm framework.
The problem here, as another poster mentioned, is that you may have more
of a network than a continuous random field; you may get around that by
using an anisotropic variogram.

Otherwise, in a prediction model in a regression context, overfitting
is going to be more of an issue than autocorrelation of the residuals.
Putting the spatial coordinates, or the principal components of the
spatial weight matrix, among the predictors may be good enough. Spatial
autocorrelation really affects the estimates of the variance, and comes
into play if you want to do inference or estimate confidence
intervals/prediction intervals.

Again, all this assumes you are more interested in prediction than in
modeling mechanism.

Nicholas

 

Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-26 Thread Aitor Gastón

Hi,

Nicholas, I understand that the rule of thumb of 10 events per parameter 
comes from model predictive performance assessments and may not apply if you 
are doing inference.


However, I'm not sure that the rule assumes independent predictors. 
Collinearity can inflate the standard errors of the regression 
coefficients, but it does not affect predictions made on new data that 
have the same degree of collinearity as the training data, as long as 
extreme extrapolation is not attempted (Harrell, 2001, page 65).


Perhaps 10 events per parameter is too general a recommendation and larger 
samples may be required in some situations. In small-sample scenarios, doing 
variable reduction before model fitting (without using the response 
variable) and applying some kind of penalization to reduce the effective 
degrees of freedom will produce better predictive performance.


Aitor



Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-25 Thread Dunbar, Michael J.
Hi Tim

You haven't really explained where your group variable in the glmm has come 
from. Moving from glm to glmm you've changed two things, adding the grouping 
and the autocorrelation as well. 

You have to be very careful when using the autocorrelation function. As it 
stands the model will assume that the points on your gradient are evenly spaced 
and sorted in order. 
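
As a rough illustration of why row order matters here, the sketch below uses simulated data only (the names pos and latent and every number are made up; none of this is Tim's data). It fits an intercept-only binomial glm and compares the lag-1 autocorrelation of the Pearson residuals when the rows are stored in arbitrary order and when they are sorted by position along the track.

```r
set.seed(42)
n <- 1000

# Hypothetical transect: one point every 200 m, with a smooth unobserved
# driver (AR(1) process) that makes neighbouring points alike
pos    <- seq(0, by = 200, length.out = n)
latent <- as.numeric(arima.sim(list(ar = 0.8), n = n))
dat    <- data.frame(pos = pos, PA = rbinom(n, 1, plogis(-1 + latent)))

# Same data, but with the rows stored in arbitrary order
dat_shuffled <- dat[sample(n), ]

# Lag-1 autocorrelation of Pearson residuals, taken in row order
lag1 <- function(d) {
  fit <- glm(PA ~ 1, family = binomial, data = d)
  acf(residuals(fit, type = "pearson"), plot = FALSE)$acf[2]
}

lag1_shuffled <- lag1(dat_shuffled)
lag1_sorted   <- lag1(dat_shuffled[order(dat_shuffled$pos), ])
c(shuffled = lag1_shuffled, sorted = lag1_sorted)
```

acf() never looks at pos, only at the order of the rows, so the structure along the track only shows up once the rows are sorted.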

Regards
Mike


-----Original Message-----
From: r-sig-ecology-boun...@r-project.org 
[mailto:r-sig-ecology-boun...@r-project.org] On Behalf Of Tim Seipel
Sent: 25 August 2011 10:04
To: r-sig-ecology@r-project.org
Subject: [R-sig-eco] logistic regression and spatial autocorrelation


Dear List,
I am trying to determine the best environmental predictors of the 
presence of a species along an elevational gradient.
Elevation ranges from 400 to 2050 m a.s.l. and the ratio of presences to 
absences is low (132 presences out of 2800 samples).

So to start I fit the full model with the variables of interest.

sc.m <- glm(PA ~ sp.max+su.mmin+su.max+fa.mmin+fa.max+Slope+Haupt4+Pop_density+Dist_G+Growi_sea, data=sc.pa, family='binomial')

First, I performed univariate and backward selection using the Akaike 
Information Criterion, and the fit was good and realistic given my 
knowledge of the environment, though the D^2 was low (0.08). My final 
model was:
-
glm(formula = PA ~ Slope + sp.mmin + su.max + fa.mmin + Haupt4,
 family = binomial, data = sc.pa)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.5415  -0.3506  -0.2608  -0.1762   3.0768

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -73.45212   23.13842  -3.174  0.00150 **
Slope        -0.03834    0.01174  -3.265  0.00109 **
sp.mmin     -15.34594    5.30360  -2.893  0.00381 **
su.max        5.09712    1.70332   2.992  0.00277 **
fa.mmin      13.52262    4.64021   2.914  0.00357 **
Haupt42      -0.72237    0.27710  -2.607  0.00914 **
Haupt43      -0.95730    0.37762  -2.535  0.01124 *
Haupt44      -0.25357    0.24330  -1.042  0.29731
---
    Null deviance: 958.21  on 2784  degrees of freedom
Residual deviance: 896.10  on 2777  degrees of freedom
AIC: 912.1

--

I then realized that my residuals were all highly correlated (0.8-0.6) 
when I plotted them using the acf() function.

So to account for this I used glmmPQL to fit the full model:

model.sc.c <- glmmPQL(PA ~ 
sp.mmin+su.mmin+su.max+fa.mmin+Slope+Haupt4+Pop_density+Dist_G+Growi_sea, 
random = ~1|group.sc, data=sc.dat, family=binomial, correlation=corAR1())

However, the algorithm failed to converge: all the p-values were 
either 0 or 1 and the coefficient estimates approached infinity. 
Additionally, the grouping factor of the random effect is slightly 
arbitrary and accounts for a tiny amount of variation.

---
So now I feel stuck between a rock and a hard place: on the one hand I 
know I have a lot of autocorrelation, and on the other hand I don't have 
a clear way to include it in the model.

I would appreciate any advice on the matter.

Sincerely,

Tim

[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-25 Thread Tim Seipel


Thank you for the replies.

To clarify, the points are generally ordered by geographic distance and 
increasing elevation (they are a converted GPS track, spaced evenly 
every 200 m), though there are some ups and downs in elevation. Ordering 
the points becomes more difficult at high elevation. The sampling 
followed river valleys which merge to form the Rhine river in 
Switzerland. My grouping factor reflects this: my random factor consists 
of three groups, 'chur', which is the primary valley, and 'vord' and 
'hint', the two secondary tributaries into which the valley splits.


Given that my points become less well ordered toward high elevation, 
should I use form= ~1|group?
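
To make the question concrete, here is a minimal sketch of what the two ways of writing the form argument imply, using nlme directly on a toy data frame. The column names valley and pos, the five points per valley and the correlation of 0.6 are all hypothetical; nothing here is estimated from the real data. With ~ 1 | valley the AR(1) lag is the row order within each valley; with ~ pos | valley it is an explicit integer position, which corAR1() requires to be integer-valued.

```r
library(nlme)

# Hypothetical toy data: three valleys, five points each, rows already
# sorted along the track within each valley
dat <- data.frame(
  valley = factor(rep(c("chur", "vord", "hint"), each = 5)),
  pos    = rep(1:5, times = 3)   # integer position along each valley
)

# AR(1) with an illustrative (not estimated) correlation of 0.6
cs_order <- Initialize(corAR1(0.6, form = ~ 1 | valley), data = dat)
cs_pos   <- Initialize(corAR1(0.6, form = ~ pos | valley), data = dat)

# One correlation matrix per valley; points in different valleys are
# treated as uncorrelated
cm_order <- corMatrix(cs_order)
cm_pos   <- corMatrix(cs_pos)
round(cm_pos[[1]], 3)
```

When the rows are sorted, both forms give the same within-valley matrix, so the choice only matters if the row order does not follow the track. Either structure could then be passed as the correlation argument of the glmmPQL call quoted in the original post.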






Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-25 Thread Dunbar, Michael J.
Hi Tim

The problem you have is firstly that you only have three levels for your 
random effect, leading to considerable uncertainties, and secondly that, as 
you've essentially got a dendritic network, chur is linked to both vord and 
hint. There isn't any simple answer to this problem as far as I'm aware; you 
may want to consider modelling each valley separately, or modelling chur and 
vord, and separately chur and hint.

Another issue to think about, if you haven't already, is whether you think 
that the autocorrelation is caused by some common unknown environmental 
factors, or by some ecological process such as limited dispersal. This can 
help frame the modelling. But whatever the cause, it's still tricky.

Regards
Mike


Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-25 Thread cpar...@pdx.edu
Tim pointed out that he has only 132 samples out of 2800 with a species present 
and I am curious what people think about how well we can model that with 
logistic regression.
-Chris




On Aug 25, 2011, at 10:36 AM, Pedro Lima Pequeno pacol...@gmail.com wrote:

 Hi Tim,
 
 there are several ways of dealing with spatial autocorrelation in
 ecological models (see e.g. Dormann 2007: Methods to account for
 spatial autocorrelation in the analysis of species distributional
 data: a review; and Beale et al. 2010: Regression analysis of spatial
 data). As always, this is an area of active research, so the right or
 wrong thing to do is not as clear as it may seem. Some have even
 concluded  that changes in coefficients between spatial and
 non-spatial methods depend on the method used and are largely
 idiosyncratic, so that researchers may have little choice but to be
 more explicit about the uncertainty of models and more cautious in
 their interpretation (Bini et al. 2009: Coefficient shifts in
 geographical ecology: an empirical evaluation of spatial and
 non-spatial regression). Thus, new methods are emerging faster than
 people can study and compare their properties. Nonetheless,
 I think some observations are useful:
 1) there are methods explicitly designed to detect spatial
 autocorrelation such as Moran's autocorrelograms or variograms
 (available in several R packages). As already pointed out, the
 autocorrelation function is well behaved with linear, equally spaced
 series in time or space;
 2) minimal adequate model selection with the AIC is sensitive to
 residual autocorrelation; it tends to generate unstable and overfitted
 models. Thus, when applying any model selection procedure, you should
 account for the uncertainty in the process by averaging the model set
 (or model predictions) with respect to the relative support of each
 model (e.g. Akaike weights). Since you have a large sample, you could
 account for residual spatial autocorrelation using eigenvector
 filtering, which produces synthetic variables that capture spatial
 patterns and can be included in linear models as explanatory variables
 - a more intuitive approach if you don't want to mess with random
 factors (see e.g. Diniz-Filho et al. 2008: Model selection and
 information theory in geographical ecology). This can be implemented
 with packages spacemakeR and vegan;
 3) autocorrelation in model residuals is not the only - nor the most
 important - problem in biological modeling; model misspecification is
 the major issue. Residual autocorrelation often arises from omitting
 relevant explanatory variables or interaction terms, assuming
 an inappropriate response shape and/or an inadequate variance
 structure, or any combination of these. All these things need to be
 checked for proper model validation, for instance by partial
 regression plots (or added-variable plots), which help you see the
 shape of the response to each explanatory variable after accounting for
 the variation in the remaining explanatory set.
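
Point 2 above recommends averaging over the model set with Akaike weights. A minimal base-R sketch of that calculation follows, assuming three candidate binomial glms fitted to simulated data; the predictors x1 to x3 and the coefficients are invented for illustration and do not correspond to the variables in this thread.

```r
set.seed(1)
n <- 500

# Simulated presence/absence data with three hypothetical predictors
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + 0.8 * dat$x1 + 0.4 * dat$x2))

# Candidate model set
models <- list(
  m1 = glm(y ~ x1,           family = binomial, data = dat),
  m2 = glm(y ~ x1 + x2,      family = binomial, data = dat),
  m3 = glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
)

# Akaike weights: relative support of each model within the set
aic   <- sapply(models, AIC)
delta <- aic - min(aic)
w     <- exp(-delta / 2) / sum(exp(-delta / 2))

# Model-averaged predicted probabilities (one column of preds per model)
preds    <- sapply(models, predict, type = "response")
avg_pred <- as.vector(preds %*% w)
round(w, 3)
```

The weights sum to one by construction, and avg_pred is a weighted mean of the per-model predicted probabilities rather than the prediction of any single selected model.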
 At the same time, you could plot the residuals against both model
 predictions and explanatory variables. Since residuals are the
 stochastic component of the model (noise), their relation with the
 systematic components should be random; clear patterns in these plots
 are indications of misspecification.
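
The same check can be made numerically with Moran's I, mentioned in point 1. The sketch below computes it by hand in base R for points along a single evenly spaced track, assuming a binary neighbour matrix that links consecutive points. The names moran_i and W are illustrative, and the two series are simulated stand-ins for model residuals.

```r
# Moran's I for values x and a spatial weight matrix W (zero diagonal)
moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

set.seed(1)
n <- 200

# Neighbours: consecutive points along an evenly spaced track
W <- matrix(0, n, n)
W[abs(row(W) - col(W)) == 1] <- 1

smooth <- sin(seq_len(n) / 20)   # spatially structured series
noise  <- rnorm(n)               # unstructured series

i_smooth <- moran_i(smooth, W)
i_noise  <- moran_i(noise, W)
c(smooth = i_smooth, noise = i_noise)
```

A value near zero is what residuals from an adequately specified model should give; a value well above zero points to remaining spatial structure.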
 Finally, all this model tinkering is based on two fundamental
 premises: you want to model a mean tendency of response and the pattern
 of variation around it. These are strictly statistical properties of
 data - they have nothing to do with biology. If you don't really
 believe the biological process you are studying implies a mean
 response, but rather e.g. a maximum one (such as in population or
 abundance limitation), then all these methods will actually induce you
 to misspecify the model, but there are alternatives (see e.g. Cade et
 al. 2005 - Quantile regression reveals hidden bias and uncertainty in
 habitat models).
 

Re: [R-sig-eco] logistic regression and spatial autocorrelation

2011-08-25 Thread Aitor Gastón
The limiting sample size in logistic regression is the smaller of the 
numbers of positive and negative cases; in Tim's data that is the 132 
positive cases (species occurrences). A minimum of 10 events per estimated 
parameter is recommended, based on external validation studies, to avoid 
overfitting (see Harrell, 2001. Regression Modeling Strategies. Springer). 
Therefore, with Tim's data up to 13 parameters could be estimated (e.g., 13 
variables without nonlinear terms, or 6 variables in quadratic form, or 4 
variables using restricted cubic splines with 4 knots...).
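
The arithmetic behind these numbers can be written out in a few lines of R. The counts are the ones reported in the thread; the variable names are only illustrative, and the count of 3 parameters per variable for a 4-knot restricted cubic spline follows the usual k - 1 convention.

```r
# Counts reported in the thread
n_total    <- 2800
n_presence <- 132

# Limiting sample size: the rarer of the two outcome classes
n_limiting <- min(n_presence, n_total - n_presence)

# Rule of thumb: at least 10 events per estimated parameter
events_per_param <- 10
max_params <- floor(n_limiting / events_per_param)
max_params

# Parameters used by each example model form
# (a restricted cubic spline with k knots uses k - 1 parameters per variable)
params_used <- c(linear = 13 * 1, quadratic = 6 * 2, rcs_4_knots = 4 * (4 - 1))
params_used
```

All three example forms stay within the budget of 13 parameters (13, 12 and 12 respectively).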


Aitor

2011/8/25, Tim Seipel t.sei...@env.ethz.ch:


Dear List,
I am trying to determine the best environmental predictors of the
presence of a species along an elevational gradient.
Elevation