Re: [R-sig-eco] anova.cca question / missing data in constraining matrix

2013-05-31 Thread Jari Oksanen

On 01/06/2013, at 05:20 AM, Jari Oksanen wrote:
>> 
>> The CCA seems to run just fine, but when I attempt to do the posthoc tests
>> such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error
>> message: "Error in anova.ccabyterm(object, step = step, ...) : number of
>> rows has changed: remove missing values?"  What exactly is occurring here to
>> cause this error - I suspect it must be related to the fact that the
>> environmental data matrix has a lot of missing data.  I don't quite
>> understand why it states that the number of rows has changed...changed from
>> what?  
> 
> The number of rows has changed from term to term. That is, you have different 
> numbers of missing values in each term (= explanatory variable), and when 
> rows with missing values are removed for the current model, the accepted 
> observations change from term to term. I admit the error message is not the 
> most obvious one. I must see where it comes from, and how to make it more 
> informative. However, it does give a hint to "remove missing values", doesn't 
> it?
> 
> If you want to have a term-wise test with missing values in terms, you must 
> refit your model for complete.cases.  Use argument 'subset' to select a 
> subset with no missing values. Currently I don't know any nice short cut to 
> do this with the current mode, but the following may work (untested), 
> although it is not nice:
> 
> keep <- rep(TRUE, nrow(tooliken.s)
> keep[toolik250.cca$na.action] <- FALSE
> m2 <- update(toolik250.cca, subset = keep)
> anova(m2, by="terms", perm=999)

Actually, there is a bit easier way of doing this, because 'subset' can also be 
a vector of indices, and negative indices acn be used to remove observations. 
If 'toolik250.cca' is a result object with missing observations, then

m2 <- update(toolik250.cca, subset = -toolik250.cca$na.action)

will remove items listed as removed in 'na.action' (NB. the minus sign in 
'subset'). The update()d model will be equal to the original model, but missing 
data removed.

Cheers, Jari Oksanen 

-- 
Jari Oksanen, Dept Biology, Univ Oulu, 90014 Finland
jari.oksa...@oulu.fi, Ph. +358 400 408593, http://cc.oulu.fi/~jarioksa

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


Re: [R-sig-eco] anova.cca question / missing data in constraining matrix

2013-05-31 Thread Jari Oksanen
Hello, 
On 01/06/2013, at 02:10 AM, ckellogg wrote:

> Hello,
> I am using the cca function in Vegan to examine the relationship between
> microbial community structure and a (large) suite of environmental
> variables.  My constraining/environmental data matrix as a lot of holes in
> it so I have been exploring using the na.action argument. 
> This is the command I am currently using:
> toolik250.cca<-cca(toolikotus250.ra~julianday+logwindspd_max_1dayprior+lograin_3dayprior+sqrtwindspd_1dayprior+windspd_3dayprior+days_since_thaw+days_since_iceout+days_btw_iceoutandthaw+toolikepitemp_24h+logtemp+conductivity+pH,
> toolikenv.s, na.action=na.omit)
> 
> The CCA seems to run just fine, but when I attempt to do the posthoc tests
> such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error
> message: "Error in anova.ccabyterm(object, step = step, ...) : number of
> rows has changed: remove missing values?"  What exactly is occurring here to
> cause this error - I suspect it must be related to the fact that the
> environmental data matrix has a lot of missing data.  I don't quite
> understand why it states that the number of rows has changed...changed from
> what?  

The number of rows has changed from term to term. That is, you have different 
numbers of missing values in each term (= explanatory variable), and when rows 
with missing values are removed for the current model, the accepted 
observations change from term to term. I admit the error message is not the 
most obvious one. I must see where it comes from, and how to make it more 
informative. However, it does give a hint to "remove missing values", doesn't 
it?

If you want to have a term-wise test with missing values in terms, you must 
refit your model for complete.cases.  Use argument 'subset' to select a subset 
with no missing values. Currently I don't know any nice short cut to do this 
with the current mode, but the following may work (untested), although it is 
not nice:

keep <- rep(TRUE, nrow(tooliken.s)
keep[toolik250.cca$na.action] <- FALSE
m2 <- update(toolik250.cca, subset = keep)
anova(m2, by="terms", perm=999)

> Is there any way to get around having missing data when running the
> anovas as you can when running the CCA itself?
> 
> One other question I have is when I try and run the CCA on all the data in
> my environmental data matrix (toolikenv.s), not just a subset of variables
> as I do above, using this command:
> toolik250.cca <-cca(toolikotus250.ra~., toolikenv.s, na.action=na.omit)
> I get the following error message.  "Error in svd(Xbar, nu = 0, nv = 0) : a
> dimension is zero" What might be causing this error message to be thrown?
> 
What does sum(complete.cases(toolikenv.s)) give as a result? Does it give 0?

I suspect you have so many holes that nothing is left when you remove rows with 
any missing values. The message is about an attempt to analyse zero-dimensional 
matrix.

> Thank you so much for your help.  Maybe I will just have to filter out the
> samples with missing environmental data (or filter out some of the variables
> themselves if they have too much missing data), but I was just hoping to
> avoid having to do this.


The functions can handle missing values, but they handle them by removing the 
observation. Do you want to lose a huge number of rows? We won't invent values 
to replace missing data in cca(). Some people have suggested ways to do that, 
and that is not difficult: just search for imputation in R (for instance, 
package mice). However, the real problem is how to compare and summarize the 
multivariate results after imputation. Further, if you have a lot of missing 
values, nothing may be very reliable. It could be possible to collect together 
and combine permutation test results after multiple imputation, but better 
consult a statistician before trying to do this.

Cheers, Jari Oksanen

-- 
Jari Oksanen, Dept Biology, Univ Oulu, 90014 Finland

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


[R-sig-eco] Question on GLMM

2013-05-31 Thread Luis Fernando García Hernández
Dear Friends,

I am new on R so I ask you to excuse me if this question sounds fool. I
want to see if there is a significativa relationship between the mating
(response variable) and several explanatory variables such as individual
number (categorical), leg movemente (continous) and the reuse of
individuals (categorical). My data looks like this

 Mating ID Replication Shaking  1 M1 R1 10  0 M1 R2 14  0 M2 R1 15  0 M2 R2
12  1 M3 R1 14  0 M3 R2 17  1 M4 R1 19  0 M4 R2 22  1 M5 R1 18  0 M5 R2 16
1 M6 R1 17



1 means mating happened, 0 means did not occur.

I am trying to organize the data to apply a GLMM to the data, but have not
 been able to do it.I followed the model proposed by CRawley for binary
response with pseudorreplication but does not wok. The script goes like this

model2<-lmer(Mating~Replication+(1ID),family=binomial,method=”PQL”)

Can somebody tell me what part is wrong and how could I fix it? If you know
a better method, will be really welcomed!

All the best!

[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


[R-sig-eco] anova.cca question / missing data in constraining matrix

2013-05-31 Thread ckellogg
Hello,
I am using the cca function in Vegan to examine the relationship between
microbial community structure and a (large) suite of environmental
variables.  My constraining/environmental data matrix as a lot of holes in
it so I have been exploring using the na.action argument. 
This is the command I am currently using:
toolik250.cca<-cca(toolikotus250.ra~julianday+logwindspd_max_1dayprior+lograin_3dayprior+sqrtwindspd_1dayprior+windspd_3dayprior+days_since_thaw+days_since_iceout+days_btw_iceoutandthaw+toolikepitemp_24h+logtemp+conductivity+pH,
toolikenv.s, na.action=na.omit)

The CCA seems to run just fine, but when I attempt to do the posthoc tests
such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error
message: "Error in anova.ccabyterm(object, step = step, ...) : number of
rows has changed: remove missing values?"  What exactly is occurring here to
cause this error - I suspect it must be related to the fact that the
environmental data matrix has a lot of missing data.  I don't quite
understand why it states that the number of rows has changed...changed from
what?  Is there any way to get around having missing data when running the
anovas as you can when running the CCA itself?

One other question I have is when I try and run the CCA on all the data in
my environmental data matrix (toolikenv.s), not just a subset of variables
as I do above, using this command:
toolik250.cca <-cca(toolikotus250.ra~., toolikenv.s, na.action=na.omit)
I get the following error message.  "Error in svd(Xbar, nu = 0, nv = 0) : a
dimension is zero" What might be causing this error message to be thrown?

Thank you so much for your help.  Maybe I will just have to filter out the
samples with missing environmental data (or filter out some of the variables
themselves if they have too much missing data), but I was just hoping to
avoid having to do this.

Colleen



--
View this message in context: 
http://r-sig-ecology.471788.n2.nabble.com/anova-cca-question-missing-data-in-constraining-matrix-tp7578175.html
Sent from the r-sig-ecology mailing list archive at Nabble.com.

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


Re: [R-sig-eco] Random Forest classification on species counts

2013-05-31 Thread Jesse Ross
I have also had some luck with using unsupervised random forest to get a
proximity matrix. This proximity matrix can then be transformed into a
distance matrix, and used as an input to PCO or NMDS.

e.g.

require(ecodist)
urf = randomForest(FY09Bugs[, 2:146])
site_distances = as.dist(1 - urf$proximity)
site_nmds = nmds(site_distances)
site_nmin = nmds.min(site_nmds)
plot(site_nmin, type="n")
text(site_nmin[, 1], site_nmin[, 2], FY09Bugs[, 1])

Cheers,
Jesse Ross


On Fri, May 31, 2013 at 3:33 AM, Gonzalez-Mirelis Genoveva <
genoveva.gonzalez-mire...@imr.no> wrote:

> Hi there,
> I think I can help a little bit with this: From the description of your
> problem it would seem that you want to run random forest in unsupervised
> mode; for that you have to omit the y in your formula (ie provide only your
> matrix of species abundances). Otherwise you are using species composition
> to predict the identity of each site, which of course is a very difficult
> classification problem (possibly the reason why you're getting such high
> error rate). Once you have done this, you should be able to get the
> proximities between the sites, which you then can use further to classify
> your sites into clusters of similar species compostion (as far as I know,
> the random forest function does not deal with that part of the problem, but
> you might want to seek more help there).
> Hope this helps!
> Geno
>
> -Original Message-
> From: r-sig-ecology-boun...@r-project.org [mailto:
> r-sig-ecology-boun...@r-project.org] On Behalf Of Hall, Kyle
> Sent: May-30-13 22:37
> To: r-sig-ecology@r-project.org
> Subject: [R-sig-eco] Random Forest classification on species counts
>
> First time poster, please forgive me for errors.
>
> I have a data set of 23 sites with 145 different species counts for
> macroinvertebrate communities for a given year. each species is represented
> at least once per site and there are a lot of 0's for some species. I have
> been applying a variety of vegan functions to the data set to get a better
> understanding of the structure and I would like to classify sites based on
> species using randomForest. My thought is that this will give me a more
> understandable classification based on species that I can use to cluster my
> sites and also see which species are of more importance in classification.
>
> Question 1. Am I barking up the wrong...tree (pun intended) with
> randomForest for this purpose?
>
> With PCA two sites are typically separated from the rest but the other 21
> sites show no discernible structure; spread like white noise over both axes.
> When I perform NMDS I tend to get a shot gun look and there are not tight
> groupings on these reduced axes. However with Wards clustering (in hclust)
> I do see some clusters plotting heavily to one side of an axis or another
> (albeit spread wide on the orthogonal axis).
>
> Question 2. Is it possible that my data set just doesn't have enough
> structure to neatly classify Sites by species count or am I simply a newbie
> that is applying randomForest incorrectly?
>
> Example of data structure:
> Site ABLA.MAL   ABLA.PAR   ACEN.SPP   ACRO.MEL
> MC14A 1120
> MC17 4200
> MC22A 8000
> MC25 13   300
> MC27 0000
> MC29A1 1000
> MC30A 1000
> MC31A 4100
> MC31B 4000
> MC33 8000
> MC38 7000
> MC40A 12   300
> MC42 0000
> MC45 9000
> MC47A 0000
> MC49A 5000
> MC50 2001
> MC51 13   000
> MC66 4000
> MY11B 13   100
> MY13 0000
> MY7B 1000
> MY8  3201
>
>
> This is my call to randomForest:
>
> FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500,
> mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE,
> norm.votes=TRUE, keep.forest=TRUE, do.trace=100)
>
> I am following the iris data example with my formula but the print data on
> FY09BUGS.rf returns 100% OOB error rate and the summary returns:
>
> summary(FY09BUGS.rF)
> Length Class  Mode
> call   11  -none- call
> type1  -none- character
> predicted  23  factor numeric
> err.rate12000  -none- numeric
> confusion 552  -none- numeric
> votes 529  matrix numeric
> oob.times  23  -none- numeric
> classes23  -none- character
> importance   3625  -none- numeric
> importanceSD 3480  -none- numeric
> localImportance 0  -none- NULL
> proximity 529  -none- numeric
> ntree   1  -none- numeric
> mtry1  -none- numeric
> forest 14  -none- list
> y  23  factor numeric
> test0  -none- NULL
> inbag   0  -none- NULL
> terms   3  terms  call
>
> One concern I have is that the iris example does not appear to give a
> training data 

Re: [R-sig-eco] Random Forest classification on species counts

2013-05-31 Thread Gonzalez-Mirelis Genoveva
Hi there,
I think I can help a little bit with this: From the description of your problem 
it would seem that you want to run random forest in unsupervised mode; for that 
you have to omit the y in your formula (ie provide only your matrix of species 
abundances). Otherwise you are using species composition to predict the 
identity of each site, which of course is a very difficult classification 
problem (possibly the reason why you're getting such high error rate). Once you 
have done this, you should be able to get the proximities between the sites, 
which you then can use further to classify your sites into clusters of similar 
species compostion (as far as I know, the random forest function does not deal 
with that part of the problem, but you might want to seek more help there).
Hope this helps!
Geno

-Original Message-
From: r-sig-ecology-boun...@r-project.org 
[mailto:r-sig-ecology-boun...@r-project.org] On Behalf Of Hall, Kyle
Sent: May-30-13 22:37
To: r-sig-ecology@r-project.org
Subject: [R-sig-eco] Random Forest classification on species counts

First time poster, please forgive me for errors.

I have a data set of 23 sites with 145 different species counts for 
macroinvertebrate communities for a given year. each species is represented at 
least once per site and there are a lot of 0's for some species. I have been 
applying a variety of vegan functions to the data set to get a better 
understanding of the structure and I would like to classify sites based on 
species using randomForest. My thought is that this will give me a more 
understandable classification based on species that I can use to cluster my 
sites and also see which species are of more importance in classification.

Question 1. Am I barking up the wrong...tree (pun intended) with randomForest 
for this purpose?

With PCA two sites are typically separated from the rest but the other 21 sites 
show no discernible structure; spread like white noise over both axes.
When I perform NMDS I tend to get a shot gun look and there are not tight 
groupings on these reduced axes. However with Wards clustering (in hclust) I do 
see some clusters plotting heavily to one side of an axis or another (albeit 
spread wide on the orthogonal axis).

Question 2. Is it possible that my data set just doesn't have enough structure 
to neatly classify Sites by species count or am I simply a newbie that is 
applying randomForest incorrectly?

Example of data structure:
Site ABLA.MAL   ABLA.PAR   ACEN.SPP   ACRO.MEL
MC14A 1120
MC17 4200
MC22A 8000
MC25 13   300
MC27 0000
MC29A1 1000
MC30A 1000
MC31A 4100
MC31B 4000
MC33 8000
MC38 7000
MC40A 12   300
MC42 0000
MC45 9000
MC47A 0000
MC49A 5000
MC50 2001
MC51 13   000
MC66 4000
MY11B 13   100
MY13 0000
MY7B 1000
MY8  3201


This is my call to randomForest:

FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, 
mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, 
norm.votes=TRUE, keep.forest=TRUE, do.trace=100)

I am following the iris data example with my formula but the print data on 
FY09BUGS.rf returns 100% OOB error rate and the summary returns:

summary(FY09BUGS.rF)
Length Class  Mode
call   11  -none- call
type1  -none- character
predicted  23  factor numeric
err.rate12000  -none- numeric
confusion 552  -none- numeric
votes 529  matrix numeric
oob.times  23  -none- numeric
classes23  -none- character
importance   3625  -none- numeric
importanceSD 3480  -none- numeric
localImportance 0  -none- NULL
proximity 529  -none- numeric
ntree   1  -none- numeric
mtry1  -none- numeric
forest 14  -none- list
y  23  factor numeric
test0  -none- NULL
inbag   0  -none- NULL
terms   3  terms  call

One concern I have is that the iris example does not appear to give a training 
data set and so I don't believe I have done that either. I feel like there is 
potential here but I can't seem to find the solution searching online so I put 
the questions to you! Thanks in advance for any assistance or constructive 
criticism.

Kyle


Kyle Hall .
City of Charlotte Storm Water Services
Water Quality Modeler
600 East Fourth Street
Charlotte, NC 28202
704.336.4110
Fax: 704.353.0473

[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.eth