Re: [R-sig-eco] anova.cca question / missing data in constraining matrix
On 01/06/2013, at 05:20 AM, Jari Oksanen wrote: >> >> The CCA seems to run just fine, but when I attempt to do the posthoc tests >> such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error >> message: "Error in anova.ccabyterm(object, step = step, ...) : number of >> rows has changed: remove missing values?" What exactly is occurring here to >> cause this error - I suspect it must be related to the fact that the >> environmental data matrix has a lot of missing data. I don't quite >> understand why it states that the number of rows has changed...changed from >> what? > > The number of rows has changed from term to term. That is, you have different > numbers of missing values in each term (= explanatory variable), and when > rows with missing values are removed for the current model, the accepted > observations change from term to term. I admit the error message is not the > most obvious one. I must see where it comes from, and how to make it more > informative. However, it does give a hint to "remove missing values", doesn't > it? > > If you want to have a term-wise test with missing values in terms, you must > refit your model for complete.cases. Use argument 'subset' to select a > subset with no missing values. Currently I don't know any nice short cut to > do this with the current mode, but the following may work (untested), > although it is not nice: > > keep <- rep(TRUE, nrow(tooliken.s) > keep[toolik250.cca$na.action] <- FALSE > m2 <- update(toolik250.cca, subset = keep) > anova(m2, by="terms", perm=999) Actually, there is a bit easier way of doing this, because 'subset' can also be a vector of indices, and negative indices acn be used to remove observations. If 'toolik250.cca' is a result object with missing observations, then m2 <- update(toolik250.cca, subset = -toolik250.cca$na.action) will remove items listed as removed in 'na.action' (NB. the minus sign in 'subset'). The update()d model will be equal to the original model, but missing data removed. Cheers, Jari Oksanen -- Jari Oksanen, Dept Biology, Univ Oulu, 90014 Finland jari.oksa...@oulu.fi, Ph. +358 400 408593, http://cc.oulu.fi/~jarioksa ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] anova.cca question / missing data in constraining matrix
Hello, On 01/06/2013, at 02:10 AM, ckellogg wrote: > Hello, > I am using the cca function in Vegan to examine the relationship between > microbial community structure and a (large) suite of environmental > variables. My constraining/environmental data matrix as a lot of holes in > it so I have been exploring using the na.action argument. > This is the command I am currently using: > toolik250.cca<-cca(toolikotus250.ra~julianday+logwindspd_max_1dayprior+lograin_3dayprior+sqrtwindspd_1dayprior+windspd_3dayprior+days_since_thaw+days_since_iceout+days_btw_iceoutandthaw+toolikepitemp_24h+logtemp+conductivity+pH, > toolikenv.s, na.action=na.omit) > > The CCA seems to run just fine, but when I attempt to do the posthoc tests > such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error > message: "Error in anova.ccabyterm(object, step = step, ...) : number of > rows has changed: remove missing values?" What exactly is occurring here to > cause this error - I suspect it must be related to the fact that the > environmental data matrix has a lot of missing data. I don't quite > understand why it states that the number of rows has changed...changed from > what? The number of rows has changed from term to term. That is, you have different numbers of missing values in each term (= explanatory variable), and when rows with missing values are removed for the current model, the accepted observations change from term to term. I admit the error message is not the most obvious one. I must see where it comes from, and how to make it more informative. However, it does give a hint to "remove missing values", doesn't it? If you want to have a term-wise test with missing values in terms, you must refit your model for complete.cases. Use argument 'subset' to select a subset with no missing values. Currently I don't know any nice short cut to do this with the current mode, but the following may work (untested), although it is not nice: keep <- rep(TRUE, nrow(tooliken.s) keep[toolik250.cca$na.action] <- FALSE m2 <- update(toolik250.cca, subset = keep) anova(m2, by="terms", perm=999) > Is there any way to get around having missing data when running the > anovas as you can when running the CCA itself? > > One other question I have is when I try and run the CCA on all the data in > my environmental data matrix (toolikenv.s), not just a subset of variables > as I do above, using this command: > toolik250.cca <-cca(toolikotus250.ra~., toolikenv.s, na.action=na.omit) > I get the following error message. "Error in svd(Xbar, nu = 0, nv = 0) : a > dimension is zero" What might be causing this error message to be thrown? > What does sum(complete.cases(toolikenv.s)) give as a result? Does it give 0? I suspect you have so many holes that nothing is left when you remove rows with any missing values. The message is about an attempt to analyse zero-dimensional matrix. > Thank you so much for your help. Maybe I will just have to filter out the > samples with missing environmental data (or filter out some of the variables > themselves if they have too much missing data), but I was just hoping to > avoid having to do this. The functions can handle missing values, but they handle them by removing the observation. Do you want to lose a huge number of rows? We won't invent values to replace missing data in cca(). Some people have suggested ways to do that, and that is not difficult: just search for imputation in R (for instance, package mice). However, the real problem is how to compare and summarize the multivariate results after imputation. Further, if you have a lot of missing values, nothing may be very reliable. It could be possible to collect together and combine permutation test results after multiple imputation, but better consult a statistician before trying to do this. Cheers, Jari Oksanen -- Jari Oksanen, Dept Biology, Univ Oulu, 90014 Finland ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Question on GLMM
Dear Friends, I am new on R so I ask you to excuse me if this question sounds fool. I want to see if there is a significativa relationship between the mating (response variable) and several explanatory variables such as individual number (categorical), leg movemente (continous) and the reuse of individuals (categorical). My data looks like this Mating ID Replication Shaking 1 M1 R1 10 0 M1 R2 14 0 M2 R1 15 0 M2 R2 12 1 M3 R1 14 0 M3 R2 17 1 M4 R1 19 0 M4 R2 22 1 M5 R1 18 0 M5 R2 16 1 M6 R1 17 1 means mating happened, 0 means did not occur. I am trying to organize the data to apply a GLMM to the data, but have not been able to do it.I followed the model proposed by CRawley for binary response with pseudorreplication but does not wok. The script goes like this model2<-lmer(Mating~Replication+(1ID),family=binomial,method=PQL) Can somebody tell me what part is wrong and how could I fix it? If you know a better method, will be really welcomed! All the best! [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] anova.cca question / missing data in constraining matrix
Hello, I am using the cca function in Vegan to examine the relationship between microbial community structure and a (large) suite of environmental variables. My constraining/environmental data matrix as a lot of holes in it so I have been exploring using the na.action argument. This is the command I am currently using: toolik250.cca<-cca(toolikotus250.ra~julianday+logwindspd_max_1dayprior+lograin_3dayprior+sqrtwindspd_1dayprior+windspd_3dayprior+days_since_thaw+days_since_iceout+days_btw_iceoutandthaw+toolikepitemp_24h+logtemp+conductivity+pH, toolikenv.s, na.action=na.omit) The CCA seems to run just fine, but when I attempt to do the posthoc tests such as anova.cca (anova(toolik250.cca,by='terms',perm=999), I get an error message: "Error in anova.ccabyterm(object, step = step, ...) : number of rows has changed: remove missing values?" What exactly is occurring here to cause this error - I suspect it must be related to the fact that the environmental data matrix has a lot of missing data. I don't quite understand why it states that the number of rows has changed...changed from what? Is there any way to get around having missing data when running the anovas as you can when running the CCA itself? One other question I have is when I try and run the CCA on all the data in my environmental data matrix (toolikenv.s), not just a subset of variables as I do above, using this command: toolik250.cca <-cca(toolikotus250.ra~., toolikenv.s, na.action=na.omit) I get the following error message. "Error in svd(Xbar, nu = 0, nv = 0) : a dimension is zero" What might be causing this error message to be thrown? Thank you so much for your help. Maybe I will just have to filter out the samples with missing environmental data (or filter out some of the variables themselves if they have too much missing data), but I was just hoping to avoid having to do this. Colleen -- View this message in context: http://r-sig-ecology.471788.n2.nabble.com/anova-cca-question-missing-data-in-constraining-matrix-tp7578175.html Sent from the r-sig-ecology mailing list archive at Nabble.com. ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] Random Forest classification on species counts
I have also had some luck with using unsupervised random forest to get a proximity matrix. This proximity matrix can then be transformed into a distance matrix, and used as an input to PCO or NMDS. e.g. require(ecodist) urf = randomForest(FY09Bugs[, 2:146]) site_distances = as.dist(1 - urf$proximity) site_nmds = nmds(site_distances) site_nmin = nmds.min(site_nmds) plot(site_nmin, type="n") text(site_nmin[, 1], site_nmin[, 2], FY09Bugs[, 1]) Cheers, Jesse Ross On Fri, May 31, 2013 at 3:33 AM, Gonzalez-Mirelis Genoveva < genoveva.gonzalez-mire...@imr.no> wrote: > Hi there, > I think I can help a little bit with this: From the description of your > problem it would seem that you want to run random forest in unsupervised > mode; for that you have to omit the y in your formula (ie provide only your > matrix of species abundances). Otherwise you are using species composition > to predict the identity of each site, which of course is a very difficult > classification problem (possibly the reason why you're getting such high > error rate). Once you have done this, you should be able to get the > proximities between the sites, which you then can use further to classify > your sites into clusters of similar species compostion (as far as I know, > the random forest function does not deal with that part of the problem, but > you might want to seek more help there). > Hope this helps! > Geno > > -Original Message- > From: r-sig-ecology-boun...@r-project.org [mailto: > r-sig-ecology-boun...@r-project.org] On Behalf Of Hall, Kyle > Sent: May-30-13 22:37 > To: r-sig-ecology@r-project.org > Subject: [R-sig-eco] Random Forest classification on species counts > > First time poster, please forgive me for errors. > > I have a data set of 23 sites with 145 different species counts for > macroinvertebrate communities for a given year. each species is represented > at least once per site and there are a lot of 0's for some species. I have > been applying a variety of vegan functions to the data set to get a better > understanding of the structure and I would like to classify sites based on > species using randomForest. My thought is that this will give me a more > understandable classification based on species that I can use to cluster my > sites and also see which species are of more importance in classification. > > Question 1. Am I barking up the wrong...tree (pun intended) with > randomForest for this purpose? > > With PCA two sites are typically separated from the rest but the other 21 > sites show no discernible structure; spread like white noise over both axes. > When I perform NMDS I tend to get a shot gun look and there are not tight > groupings on these reduced axes. However with Wards clustering (in hclust) > I do see some clusters plotting heavily to one side of an axis or another > (albeit spread wide on the orthogonal axis). > > Question 2. Is it possible that my data set just doesn't have enough > structure to neatly classify Sites by species count or am I simply a newbie > that is applying randomForest incorrectly? > > Example of data structure: > Site ABLA.MAL ABLA.PAR ACEN.SPP ACRO.MEL > MC14A 1120 > MC17 4200 > MC22A 8000 > MC25 13 300 > MC27 0000 > MC29A1 1000 > MC30A 1000 > MC31A 4100 > MC31B 4000 > MC33 8000 > MC38 7000 > MC40A 12 300 > MC42 0000 > MC45 9000 > MC47A 0000 > MC49A 5000 > MC50 2001 > MC51 13 000 > MC66 4000 > MY11B 13 100 > MY13 0000 > MY7B 1000 > MY8 3201 > > > This is my call to randomForest: > > FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, > mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, > norm.votes=TRUE, keep.forest=TRUE, do.trace=100) > > I am following the iris data example with my formula but the print data on > FY09BUGS.rf returns 100% OOB error rate and the summary returns: > > summary(FY09BUGS.rF) > Length Class Mode > call 11 -none- call > type1 -none- character > predicted 23 factor numeric > err.rate12000 -none- numeric > confusion 552 -none- numeric > votes 529 matrix numeric > oob.times 23 -none- numeric > classes23 -none- character > importance 3625 -none- numeric > importanceSD 3480 -none- numeric > localImportance 0 -none- NULL > proximity 529 -none- numeric > ntree 1 -none- numeric > mtry1 -none- numeric > forest 14 -none- list > y 23 factor numeric > test0 -none- NULL > inbag 0 -none- NULL > terms 3 terms call > > One concern I have is that the iris example does not appear to give a > training data
Re: [R-sig-eco] Random Forest classification on species counts
Hi there, I think I can help a little bit with this: From the description of your problem it would seem that you want to run random forest in unsupervised mode; for that you have to omit the y in your formula (ie provide only your matrix of species abundances). Otherwise you are using species composition to predict the identity of each site, which of course is a very difficult classification problem (possibly the reason why you're getting such high error rate). Once you have done this, you should be able to get the proximities between the sites, which you then can use further to classify your sites into clusters of similar species compostion (as far as I know, the random forest function does not deal with that part of the problem, but you might want to seek more help there). Hope this helps! Geno -Original Message- From: r-sig-ecology-boun...@r-project.org [mailto:r-sig-ecology-boun...@r-project.org] On Behalf Of Hall, Kyle Sent: May-30-13 22:37 To: r-sig-ecology@r-project.org Subject: [R-sig-eco] Random Forest classification on species counts First time poster, please forgive me for errors. I have a data set of 23 sites with 145 different species counts for macroinvertebrate communities for a given year. each species is represented at least once per site and there are a lot of 0's for some species. I have been applying a variety of vegan functions to the data set to get a better understanding of the structure and I would like to classify sites based on species using randomForest. My thought is that this will give me a more understandable classification based on species that I can use to cluster my sites and also see which species are of more importance in classification. Question 1. Am I barking up the wrong...tree (pun intended) with randomForest for this purpose? With PCA two sites are typically separated from the rest but the other 21 sites show no discernible structure; spread like white noise over both axes. When I perform NMDS I tend to get a shot gun look and there are not tight groupings on these reduced axes. However with Wards clustering (in hclust) I do see some clusters plotting heavily to one side of an axis or another (albeit spread wide on the orthogonal axis). Question 2. Is it possible that my data set just doesn't have enough structure to neatly classify Sites by species count or am I simply a newbie that is applying randomForest incorrectly? Example of data structure: Site ABLA.MAL ABLA.PAR ACEN.SPP ACRO.MEL MC14A 1120 MC17 4200 MC22A 8000 MC25 13 300 MC27 0000 MC29A1 1000 MC30A 1000 MC31A 4100 MC31B 4000 MC33 8000 MC38 7000 MC40A 12 300 MC42 0000 MC45 9000 MC47A 0000 MC49A 5000 MC50 2001 MC51 13 000 MC66 4000 MY11B 13 100 MY13 0000 MY7B 1000 MY8 3201 This is my call to randomForest: FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, norm.votes=TRUE, keep.forest=TRUE, do.trace=100) I am following the iris data example with my formula but the print data on FY09BUGS.rf returns 100% OOB error rate and the summary returns: summary(FY09BUGS.rF) Length Class Mode call 11 -none- call type1 -none- character predicted 23 factor numeric err.rate12000 -none- numeric confusion 552 -none- numeric votes 529 matrix numeric oob.times 23 -none- numeric classes23 -none- character importance 3625 -none- numeric importanceSD 3480 -none- numeric localImportance 0 -none- NULL proximity 529 -none- numeric ntree 1 -none- numeric mtry1 -none- numeric forest 14 -none- list y 23 factor numeric test0 -none- NULL inbag 0 -none- NULL terms 3 terms call One concern I have is that the iris example does not appear to give a training data set and so I don't believe I have done that either. I feel like there is potential here but I can't seem to find the solution searching online so I put the questions to you! Thanks in advance for any assistance or constructive criticism. Kyle Kyle Hall . City of Charlotte Storm Water Services Water Quality Modeler 600 East Fourth Street Charlotte, NC 28202 704.336.4110 Fax: 704.353.0473 [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.eth