Re: [R-sig-eco] logistic regression and spatial autocorrelation
Hi, to weigh in on this:

@Aitor, Harrell's rules of thumb assume independent predictors without any fancy covariance function. To model the covariance of the residuals you are now estimating extra second-order parameters from the data, so even more data are needed to stabilize the parameter estimates. The good news is that in the residual space it is the numbers of adjacent 0's or 1's that matter.

However, if the goal is prediction of species occurrence at unoccupied sites, then you may want to think about the problem differently and use either indicator kriging (a kind of spatial tobit model that predicts probabilities of occurrence based on Gaussian random fields), or you might want to look at geoRglm, for geostatistics in the GLM framework. The problem here, as another poster mentioned, is that you may have more of a network than a continuous random field; you may get around that by using an anisotropic variogram.

Otherwise, in a prediction model in a regression context, overfitting is going to be more of an issue than autocorrelation of the residuals. Putting the spatial coordinates, or the principal components of the spatial weight matrix, among the predictors may be good enough. Spatial autocorrelation really affects the estimates of the variance, and comes into play if you want to do inference or estimate confidence intervals/prediction intervals. Again, all this assumes you are more interested in prediction than in modeling mechanism.

Nicholas

--
Message: 8
Date: Thu, 25 Aug 2011 23:22:34 +0200
From: Aitor Gastón aitor.gas...@upm.es
To: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] logistic regression and spatial autocorrelation

The limiting sample size in logistic regression is the smaller of the numbers of positive and negative cases; in Tim's data, 132 positive cases (species occurrences). A minimum of 10 events per estimated parameter is recommended, based on external validation studies, to avoid overfitting (see Harrell, 2001. Regression Modeling Strategies. Springer). Therefore, with Tim's data up to 13 parameters could be estimated (e.g., 13 variables without nonlinear terms, or 6 variables in quadratic form, or 4 variables using restricted cubic splines with 4 knots...).

Aitor
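Nicholas's suggestion of putting the principal components of the spatial weight matrix among the predictors could be sketched roughly as below. This is only an illustration in base R on simulated data: the transect `xy`, the covariate `env`, the response `pa`, the 250 m neighbour cut-off, and the choice of three eigenvectors are all assumptions, not Tim's data or anyone's recommended settings.

```r
# Simulated transect: 200 points spaced 200 m apart (hypothetical data)
set.seed(42)
n   <- 200
xy  <- cbind(x = seq_len(n) * 200, y = 0)
env <- rnorm(n)                                   # made-up covariate
pa  <- rbinom(n, 1, plogis(-1 + env))             # made-up presence/absence

# Binary spatial weights: neighbours are points within 250 m
d <- as.matrix(dist(xy))
W <- (d > 0 & d <= 250) * 1

# Eigenvectors of the doubly centred weight matrix
M   <- diag(n) - matrix(1 / n, n, n)
eig <- eigen(M %*% W %*% M, symmetric = TRUE)
mem <- eig$vectors[, eig$values > 1e-8, drop = FALSE]  # positive-autocorrelation patterns

# Add the first few (broadest-scale) eigenvectors as extra predictors
sf  <- mem[, 1:3]
fit <- glm(pa ~ env + sf, family = binomial)
coef(fit)
```

Each eigenvector added this way costs one parameter, so it counts against the events-per-parameter budget discussed in the thread.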
Re: [R-sig-eco] logistic regression and spatial autocorrelation
Hi, Nicholas,

I understand that the rule of thumb of 10 events per parameter comes from assessments of model predictive performance and may not apply if you are doing inference. However, I'm not sure that the rule assumes independent predictors: collinearity can inflate the standard errors of the regression coefficients, but it does not affect predictions made on new data that have the same degree of collinearity as the training data, as long as extreme extrapolation is not attempted (Harrell, 2001, page 65).

Perhaps 10 events per parameter is too general a recommendation, and larger samples may be required in some situations. In small-sample scenarios, doing variable reduction before model fitting (without using the response variable) and applying some kind of penalization to reduce the effective degrees of freedom will produce better predictive performance.

Aitor

--
From: Nicholas Lewin-Koh ni...@hailmail.net
Sent: Friday, August 26, 2011 7:01 PM
To: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] logistic regression and spatial autocorrelation

Hi, to weigh in on this: @Aitor, Harrell's rules of thumb are assuming independent predictors without any fancy covariance function. [...]
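The variable reduction Aitor describes (done without looking at the response) could look something like the sketch below, which replaces a set of correlated predictors with their first few principal components before fitting. The data are simulated, the names are placeholders, and keeping three components is an arbitrary choice; penalization is not shown.

```r
set.seed(1)
n <- 500
# Ten correlated, made-up predictors driven by two latent gradients
z <- matrix(rnorm(n * 2), n, 2)
X <- z %*% matrix(rnorm(2 * 10), 2, 10) + matrix(rnorm(n * 10, sd = 0.5), n, 10)
colnames(X) <- paste0("v", 1:10)
y <- rbinom(n, 1, plogis(-2 + z[, 1]))            # made-up response

# Reduce the predictors using only X (the response y is not used here)
pc     <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:3]                             # keep 3 components (arbitrary)

# Fit on the scores: 4 parameters instead of 11
fit <- glm(y ~ scores, family = binomial)
length(coef(fit))
```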
Re: [R-sig-eco] logistic regression and spatial autocorrelation
Hi Tim

You haven't really explained where your group variable in the glmm has come from. Moving from glm to glmm you've changed two things: adding the grouping and adding the autocorrelation. You have to be very careful when using the autocorrelation function. As it stands, the model will assume that the points on your gradient are evenly spaced and sorted in order.

Regards
Mike

-----Original Message-----
From: r-sig-ecology-boun...@r-project.org [mailto:r-sig-ecology-boun...@r-project.org] On Behalf Of Tim Seipel
Sent: 25 August 2011 10:04
To: r-sig-ecology@r-project.org
Subject: [R-sig-eco] logistic regression and spatial autocorrelation

Dear List,

I am trying to determine the best environmental predictors of the presence of a species along an elevational gradient. Elevation ranges from 400 to 2050 m a.s.l. and the ratio of presences to absences is low (132 presences out of 2800 samples).

So to start I fit the full model with the variables of interest:

    sc.m <- glm(PA ~ sp.max + su.mmin + su.max + fa.mmin + fa.max + Slope + Haupt4 + Pop_density + Dist_G + Growi_sea, data = sc.pa, family = binomial)

First, I performed univariate and backward selection using the Akaike Information Criterion, and the fit was good and realistic given my knowledge of the environment, though the D^2 was low (0.08). My final model was:

    glm(formula = PA ~ Slope + sp.mmin + su.max + fa.mmin + Haupt4, family = binomial, data = sc.pa)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.5415  -0.3506  -0.2608  -0.1762   3.0768

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -73.45212   23.13842  -3.174  0.00150 **
    Slope        -0.03834    0.01174  -3.265  0.00109 **
    sp.mmin     -15.34594    5.30360  -2.893  0.00381 **
    su.max        5.09712    1.70332   2.992  0.00277 **
    fa.mmin      13.52262    4.64021   2.914  0.00357 **
    Haupt42      -0.72237    0.27710  -2.607  0.00914 **
    Haupt43      -0.95730    0.37762  -2.535  0.01124 *
    Haupt44      -0.25357    0.24330  -1.042  0.29731

        Null deviance: 958.21 on 2784 degrees of freedom
    Residual deviance: 896.10 on 2777 degrees of freedom
    AIC: 912.1

I then realized that my residuals were all highly correlated (0.8-0.6) when I plotted them using the acf() function. So to account for this I used glmmPQL to fit the full model:

    model.sc.c <- glmmPQL(PA ~ sp.mmin + su.mmin + su.max + fa.mmin + Slope + Haupt4 + Pop_density + Dist_G + Growi_sea, random = ~1|group.sc, data = sc.dat, family = binomial, correlation = corAR1())

However, the algorithm failed to converge: all the p-values were either 0 or 1 and the coefficient estimates approached infinity. Additionally, the grouping factor of the random effect is slightly arbitrary and accounts for a tiny amount of variation.

So now I feel stuck between a rock and a hard place: on the one hand I know I have a lot of autocorrelation, and on the other hand I don't have a clear way to include it in the model. I would appreciate any advice on the matter.

Sincerely,
Tim
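For reference, the residual check Tim describes might be reproduced along these lines. The data are simulated so that a smooth unmeasured gradient is left in the residuals; `pos`, `env`, `hidden`, and `pa` are made-up names. Note that acf() treats the rows as an evenly spaced, ordered series, which is the same assumption Mike warns about.

```r
set.seed(7)
n      <- 1000
pos    <- seq_len(n)                               # position along the track (ordered)
env    <- rnorm(n)                                 # made-up covariate
hidden <- sin(pos / 50)                            # smooth unmeasured gradient
pa     <- rbinom(n, 1, plogis(-1 + env + 2 * hidden))

fit <- glm(pa ~ env, family = binomial)            # 'hidden' deliberately left out
r   <- residuals(fit, type = "pearson")

# Autocorrelation of residuals in track order; lag 1 = neighbouring points
a <- acf(r, lag.max = 20, plot = FALSE)
round(a$acf[2:6], 2)                               # lags 1 to 5
```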
Re: [R-sig-eco] logistic regression and spatial autocorrelation
Thank you for the replies.

To clarify, the points are generally ordered by geographic distance and increasing elevation (they are a converted GPS track, spaced evenly every 200 m), though there are some ups and downs in elevation. The ordering of the points becomes more difficult at high elevation. The sampling followed river valleys which merge to form the Rhine river in Switzerland. My grouping factor reflects this: my random factor consists of three groups, 'chur', which is the primary valley, and 'vord' and 'hint', the two secondary tributaries into which the valley splits.

Given that my points become less well ordered toward high elevation, should I use form = ~1|group?

On 25.08.11 11:21, Dunbar, Michael J. wrote:

Hi Tim

You haven't really explained where your group variable in the glmm has come from. Moving from glm to glmm you've changed two things, adding the grouping and the autocorrelation as well. You have to be very careful when using the autocorrelation function. As it stands the model will assume that the points on your gradient are evenly spaced and sorted in order.

Regards
Mike

[...]
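On the form = question: in nlme, corAR1(form = ~ 1 | group) takes the row order within each group as the position, so the data would need to be sorted along the track within each valley, whereas form = ~ pos | group reads the position from an integer covariate. A minimal sketch with made-up data (the column names `pos` and `grp` and the 0.6 correlation are assumptions) shows the correlation matrix this implies for one group:

```r
library(nlme)

# Made-up data: two groups, four ordered points each
d <- data.frame(grp = factor(rep(c("chur", "vord"), each = 4)),
                pos = rep(1:4, times = 2))

# AR(1) with lag-1 correlation 0.6, position given by 'pos' within 'grp'
cs <- corAR1(0.6, form = ~ pos | grp)
cs <- Initialize(cs, data = d)

m <- corMatrix(cs)[[1]]    # correlation matrix for the first group
round(m, 3)
```

Correlation decays as 0.6 to the power of the lag, and points in different groups are treated as uncorrelated, which is why a branching network does not fit this structure neatly.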
Re: [R-sig-eco] logistic regression and spatial autocorrelation
Hi Tim

The problem you have is firstly that you only have three levels for your random effect, leading to considerable uncertainty, and secondly that you've essentially got a dendritic network: chur is linked to both vord and hint. There isn't any simple answer to this problem as far as I'm aware; you may want to consider modelling each valley separately, or modelling chur and vord, and separately chur and hint.

Another issue to think about, if you haven't already, is whether you think that the autocorrelation is caused by some common unknown environmental factors, or by some ecological process such as limited dispersal. This can help frame the modelling. But whatever, it's still tricky.

Regards
Mike

-----Original Message-----
From: r-sig-ecology-boun...@r-project.org [mailto:r-sig-ecology-boun...@r-project.org] On Behalf Of Tim Seipel
Sent: 25 August 2011 12:04
Cc: r-sig-ecology@r-project.org
Subject: Re: [R-sig-eco] logistic regression and spatial autocorrelation

Thank you for the replies. To clarify, the points are generally ordered by geographic distance and increasing elevation (they are a converted GPS track, spaced evenly every 200 m), though there are some ups and downs in elevation. [...]
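Mike's suggestion of modelling the valleys separately could be set up roughly as follows. The data frame, the single covariate `env`, and the response `pa` are simulated placeholders; only the three valley labels come from the thread.

```r
set.seed(3)
# Made-up data with the three valley labels from the thread
d <- data.frame(group = factor(rep(c("chur", "vord", "hint"), each = 300)),
                env   = rnorm(900))
d$pa <- rbinom(900, 1, plogis(-1.5 + 0.8 * d$env))

# One model per valley
fits <- lapply(split(d, d$group), function(s)
  glm(pa ~ env, family = binomial, data = s))

# Or the main valley paired with each tributary in turn
fit_cv <- glm(pa ~ env, family = binomial,
              data = subset(d, group %in% c("chur", "vord")))
fit_ch <- glm(pa ~ env, family = binomial,
              data = subset(d, group %in% c("chur", "hint")))

sapply(fits, coef)
```

Splitting the data also splits the 132 presences, so each sub-model can support fewer parameters than the pooled one.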
Re: [R-sig-eco] logistic regression and spatial autocorrelation
Tim pointed out that he has only 132 samples out of 2800 with a species present and I am curious what people think about how well we can model that with logistic regression.

-Chris

On Aug 25, 2011, at 10:36 AM, Pedro Lima Pequeno pacol...@gmail.com wrote:

Hi Tim,

there are several ways of dealing with spatial autocorrelation in ecological models (see e.g. Dormann 2007: Methods to account for spatial autocorrelation in the analysis of species distributional data: a review; and Beale et al. 2010: Regression analysis of spatial data). As always, this is an area of active research, so the right or wrong thing to do is not as clear as it may seem. Some have even concluded that changes in coefficients between spatial and non-spatial methods depend on the method used and are largely idiosyncratic, so that researchers may have little choice but to be more explicit about the uncertainty of models and more cautious in their interpretation (Bini et al. 2009: Coefficient shifts in geographical ecology: an empirical evaluation of spatial and non-spatial regression). Thus, new methods are emerging faster than people can study and compare their properties. Nonetheless, I think some observations are useful:

1) there are methods explicitly designed to detect spatial autocorrelation, such as Moran's I correlograms or variograms (available in several R packages). As already pointed out, the autocorrelation function is well behaved with linear, equally spaced series in time or space;

2) minimal adequate model selection with the AIC is sensitive to residual autocorrelation; it tends to generate unstable and overfitted models. Thus, when applying any model selection procedure, you should account for the uncertainty in the process by averaging the model set (or model predictions) with respect to the relative support of each model (e.g. Akaike weights). Since you have a large sample, you could account for residual spatial autocorrelation using eigenvector filtering, which produces synthetic variables that capture spatial patterns and can be included in linear models as explanatory variables, a more intuitive approach if you don't want to mess with random factors (see e.g. Diniz-Filho et al. 2008: Model selection and information theory in geographical ecology). This can be implemented with packages spacemakeR and vegan;

3) autocorrelation in model residuals is not the only, nor the most important, problem in biological modeling; model misspecification is the major issue. Residual autocorrelation often arises from not including relevant explanatory variables or interaction terms, from assuming an inappropriate response shape and/or an inadequate variance structure, or from any combination of these. All these things need to be checked for proper model validation, for instance with partial regression plots (or added-variable plots), which help you see the shape of the response to each explanatory variable after accounting for the variation in the remaining explanatory set. At the same time, you could plot the residuals against both model predictions and explanatory variables. Since residuals are the stochastic component of the model (noise), their relation with the systematic components should be random; clear patterns in these plots are indications of misspecification.

Finally, all this model tinkering is based on two fundamental premises: you want to model a mean tendency of response and the pattern of variation around it. These are strictly statistical properties of data; they have nothing to do with biology. If you don't really believe the biological process you are studying implies a mean response, but rather e.g. a maximum one (such as in population or abundance limitation), then all these methods will actually induce you to misspecify the model, but there are alternatives (see e.g. Cade et al. 2005: Quantile regression reveals hidden bias and uncertainty in habitat models).

2011/8/25, Tim Seipel t.sei...@env.ethz.ch:

Dear List, I am trying to determine the best environmental predictors of the presence of a species along an elevational gradient. [...]
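The Akaike weights mentioned in Pedro's point 2 are straightforward to compute from a set of AIC values. The sketch below uses three made-up candidate models fitted to simulated data; the variable names and the candidate set are placeholders, not a recommendation for Tim's predictors.

```r
set.seed(11)
n  <- 400
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.3 * x2))   # made-up response

# A small candidate set
models <- list(m1 = glm(y ~ x1,           family = binomial),
               m2 = glm(y ~ x1 + x2,      family = binomial),
               m3 = glm(y ~ x1 + x2 + x3, family = binomial))

aic   <- sapply(models, AIC)
delta <- aic - min(aic)                            # AIC differences
w     <- exp(-delta / 2) / sum(exp(-delta / 2))    # Akaike weights

# Model-averaged predicted probabilities, weighted by w
p_all <- sapply(models, fitted)                    # n x 3 matrix
p_avg <- as.vector(p_all %*% w)
round(w, 3)
```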
Re: [R-sig-eco] logistic regression and spatial autocorrelation
The limiting sample size in logistic regression is the smaller of the numbers of positive and negative cases; in Tim's data, 132 positive cases (species occurrences). A minimum of 10 events per estimated parameter is recommended, based on external validation studies, to avoid overfitting (see Harrell, 2001. Regression Modeling Strategies. Springer). Therefore, with Tim's data up to 13 parameters could be estimated (e.g., 13 variables without nonlinear terms, or 6 variables in quadratic form, or 4 variables using restricted cubic splines with 4 knots...).

Aitor

--
From: cpar...@pdx.edu
Sent: Thursday, August 25, 2011 8:08 PM
To: Pedro Lima Pequeno pacol...@gmail.com
Cc: r-sig-ecology@r-project.org; Tim Seipel t.sei...@env.ethz.ch
Subject: Re: [R-sig-eco] logistic regression and spatial autocorrelation

Tim pointed out that he has only 132 samples out of 2800 with a species present and I am curious what people think about how well we can model that with logistic regression.

-Chris
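Aitor's arithmetic can be written out explicitly. The figures 132 and 2800 come from Tim's message; the parameter counts per variable (2 for a quadratic term, 3 for a restricted cubic spline with 4 knots) are the usual ones and are stated here as assumptions.

```r
n_total    <- 2800
n_presence <- 132

# Limiting sample size: the rarer of the two outcomes
limiting <- min(n_presence, n_total - n_presence)

# Rule of thumb: at least 10 events per estimated parameter
max_params <- floor(limiting / 10)
max_params

# Number of variables affordable under different functional forms
vars <- c(linear    = max_params %/% 1,   # 1 parameter per variable
          quadratic = max_params %/% 2,   # linear + squared term
          rcs4      = max_params %/% 3)   # 4-knot restricted cubic spline = 3 parameters
vars
```

This reproduces the 13, 6, and 4 variables quoted in the message; any spatial terms added to the model would draw on the same budget.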