[R-sig-eco] Rarefying metagenome data table

2016-03-10 Thread Jacob Cram
Dear List,
I have a large data table (it's the TARA Oceans metagenomic data, which
can be found here: http://ocean-microbiome.embl.de/companion.html).
Essentially the columns of the data table are samples taken in
different parts of the ocean and the rows are different genes found in
those locations. The numbers in the body of the table are the number
of times an instrument detects that gene in a sample. Because the
machine returns more reads (number of observed genes) in some samples
than in others, the data need to be "rarefied", that is subsampled
such that there are the same number of genes assigned to each station.
I am trying to use the rrarefy() function in the vegan package to do
this, but I keep getting an error message:

## Get TARA oceans KEGG metagenomes
temp = tempfile()

download.file("http://ocean-microbiome.embl.de/data/TARA243.KO.profile.release.gz";,
temp)
taraKEGG = read.delim(gzfile(temp, "TARA243.KO.profile.release"))
unlink(temp)

## some processing
ids = taraKEGG[, 1]                        # KO (gene) identifiers
data = taraKEGG[, 2:dim(taraKEGG)[2]]      # counts: rows = genes, columns = samples
minsamples = min(colSums(data))            # smallest per-sample (column) total
mtx = as.matrix(data)
imtx = as.integer(as.matrix(data))         # note: this drops the matrix dimensions
data2 = sapply(round(data, 0), as.integer) # integer matrix for rrarefy()

## load library
library(vegan)

## rarefy samples
rare = rrarefy(data2, minsamples)

> Error in if (sum(x[i, ]) <= sample[i]) next : missing value where TRUE/FALSE needed
> In addition: Warning messages:
> 1: In rrarefy(data2, minsamples) : Some row sums < 'sample' and are not rarefied
> 2: In sum(x[i, ]) : integer overflow - use sum(as.numeric(.))

It looks like the problem is a line in rrarefy() where it builds a huge
vector with one element for every gene observation (read) in one of
the samples:

` row <- sample(rep(nm, times = x[i, ]), sample[i])`

Since there are sometimes millions to tens of millions of reads in a
sample, this ends up being a very large object, and the function gives
up before it exhausts my computer's memory.

So, I am trying to figure out how to proceed from here. Perhaps I am
using this function incorrectly and there is a different way to use it
that avoids this problem. Alternatively, perhaps I should be using a
different tool. I have heard of rarefaction programs in Python or, for
that matter, the bioinformatics universe of QIIME, but I'd rather
stay within R if at all possible. Does anyone have any suggestions?
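
One memory-friendlier possibility, offered here as an editorial sketch only
(rarefy_counts is a hypothetical helper, not part of vegan): rarefy each
sample by sequential hypergeometric draws, which never builds the expanded
per-read vector. Note also that in this table the samples are the columns,
whereas rrarefy() rarefies rows, which may explain the "Some row sums <
'sample'" warning above.

rarefy_counts <- function(counts, size) {
  counts <- as.numeric(counts)        # avoid integer overflow in sums
  total  <- sum(counts)
  if (total < size) return(counts)    # too shallow to rarefy; return unchanged
  out <- numeric(length(counts))
  remaining <- total
  for (j in seq_along(counts)) {
    if (size <= 0) break
    # how many of the remaining draws fall on gene j
    take <- rhyper(1, m = counts[j], n = remaining - counts[j], k = size)
    out[j] <- take
    size <- size - take
    remaining <- remaining - counts[j]
  }
  out
}

depth <- min(colSums(data2))                      # smallest sample (column) total
rare  <- apply(data2, 2, rarefy_counts, size = depth)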

Thank you for your time.
Sincerely,
Jacob Cram



[R-sig-eco] El Nino and GAMM Model of a Seasonal and Long Term Time Series

2014-04-01 Thread Jacob Cram
Hello Gavin and R-sig-ecology list,
 Based on Gavin Simpson's very helpful advice on this forum, I have
been modelling biotic and environmental parameters using a generalized
additive mixed model of the form

mod <- gamm(logit.Area ~ s(DoY, bs = "cc") + s(elapsed.days), data = lldata,
            correlation = corCAR1(form = ~ elapsed.days),
            knots = list(DoY = c(0, 366)))


where lldata (provided below) is the data for one taxonomic group,
logit.Area is the group's log-odds-transformed relative abundance,
DoY is the day of the year, and
elapsed.days is the number of days since the start of the study.


Example data for one parameter is provided below.
This modelling has identified a number of parameters that show seasonal
variability and another set of parameters that show long-term, inter-annual
variability. I would like to identify whether the long-term variation in
this variable is predictable from the multivariate El Niño/Southern
Oscillation index (MEI), also provided below. Unfortunately I am not sure
how I would test this (one possible approach is sketched after the example
data below). Any suggestions?
Sincerely,
Jacob

"RDate","elapsed.days","DoY","MEI","logit.Area"
2000-08-17,0,230,-0.146,-0.0504416284090222
2000-08-17,0,230,-0.146,-0.0504416284090222
2000-08-17,0,230,-0.146,-0.0504416284090222
2000-08-17,0,230,-0.146,-0.0504416284090222
2000-09-18,32,262,-0.249,-0.467074759836293
2000-09-18,32,262,-0.249,-0.467074759836293
2000-09-18,32,262,-0.249,-0.467074759836293
2000-10-09,53,283,-0.382,-0.348402739936071
2000-10-09,53,283,-0.382,-0.348402739936071
2000-10-09,53,283,-0.382,-0.348402739936071
2000-10-09,53,283,-0.382,-0.348402739936071
2000-12-08,113,343,-0.581,-0.15334106437543
2000-12-08,113,343,-0.581,-0.15334106437543
2000-12-08,113,343,-0.581,-0.15334106437543
2000-12-08,113,343,-0.581,-0.15334106437543
2001-01-16,152,16,-0.54,-0.673220475271774
2001-01-16,152,16,-0.54,-0.673220475271774
2001-01-16,152,16,-0.54,-0.673220475271774
2001-01-16,152,16,-0.54,-0.673220475271774
2001-02-05,172,36,-0.699,-0.61864178154517
2001-02-05,172,36,-0.699,-0.61864178154517
2001-02-05,172,36,-0.699,-0.61864178154517
2001-02-05,172,36,-0.699,-0.61864178154517
2001-03-19,214,78,-0.591,0.0323819413186415
2001-03-19,214,78,-0.591,0.0323819413186415
2001-03-19,214,78,-0.591,0.0323819413186415
2001-03-19,214,78,-0.591,0.0323819413186415
2001-04-02,228,92,-0.149,-0.412683387668689
2001-04-02,228,92,-0.149,-0.412683387668689
2001-04-02,228,92,-0.149,-0.412683387668689
2001-04-02,228,92,-0.149,-0.412683387668689
2001-05-09,265,129,0.151,-0.252047467102658
2001-05-09,265,129,0.151,-0.252047467102658
2001-05-09,265,129,0.151,-0.252047467102658
2001-05-09,265,129,0.151,-0.252047467102658
2001-06-23,310,174,-0.082,-0.278011323004986
2001-06-23,310,174,-0.082,-0.278011323004986
2001-06-23,310,174,-0.082,-0.278011323004986
2001-06-23,310,174,-0.082,-0.278011323004986
2001-07-27,344,208,0.223,-0.965949833146247
2001-07-27,344,208,0.223,-0.965949833146247
2001-07-27,344,208,0.223,-0.965949833146247
2001-07-27,344,208,0.223,-0.965949833146247
2001-08-24,372,236,0.357,-0.395916644785599
2001-08-24,372,236,0.357,-0.395916644785599
2001-08-24,372,236,0.357,-0.395916644785599
2001-08-24,372,236,0.357,-0.395916644785599
2001-09-28,407,271,-0.131,-0.568070396068749
2001-09-28,407,271,-0.131,-0.568070396068749
2001-09-28,407,271,-0.131,-0.568070396068749
2001-09-28,407,271,-0.131,-0.568070396068749
2001-10-29,438,302,-0.276,0.00433931080895083
2001-10-29,438,302,-0.276,0.00433931080895083
2001-10-29,438,302,-0.276,0.00433931080895083
2001-10-29,438,302,-0.276,0.00433931080895083
2001-11-19,459,323,-0.181,-0.596502420009713
2001-11-19,459,323,-0.181,-0.596502420009713
2001-11-19,459,323,-0.181,-0.596502420009713
2001-11-19,459,323,-0.181,-0.596502420009713
2001-12-17,487,351,0.003,-0.806445429137077
2001-12-17,487,351,0.003,-0.806445429137077
2001-12-17,487,351,0.003,-0.806445429137077
2001-12-17,487,351,0.003,-0.806445429137077
2002-01-14,515,14,-0.05,-1.04911451755897
2002-01-14,515,14,-0.05,-1.04911451755897
2002-01-14,515,14,-0.05,-1.04911451755897
2002-01-14,515,14,-0.05,-1.04911451755897
2002-02-20,552,51,-0.206,-0.00546905363182486
2002-02-20,552,51,-0.206,-0.00546905363182486
2002-02-20,552,51,-0.206,-0.00546905363182486
2002-02-20,552,51,-0.206,-0.00546905363182486
2002-03-15,575,74,-0.187,-0.543418349329509
2002-03-15,575,74,-0.187,-0.543418349329509
2002-03-15,575,74,-0.187,-0.543418349329509
2002-04-22,613,112,0.337,-0.200012640323267
2002-04-22,613,112,0.337,-0.200012640323267
2002-04-22,613,112,0.337,-0.200012640323267
2002-04-22,613,112,0.337,-0.200012640323267
2002-05-20,641,140,0.766,0.0464233195400769
2002-05-20,641,140,0.766,0.0464233195400769
2002-05-20,641,140,0.766,0.0464233195400769
2002-05-20,641,140,0.766,0.0464233195400769
2002-06-19,671,170,0.854,-0.647290280298141
2002-06-19,671,170,0.854,-0.647290280298141
2002-06-19,671,170,0.854,-0.647290280298141
2002-06-19,671,170,0.854,-0.647290280298141
2002-07-11,693,192,0.576,-0.422237525936998
2002-07-11,693,192,0.576,-0.422237525936998
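
One possible way to test this, sketched here for illustration and not part
of the original message (the model names mod0/mod1 and the s(MEI) basis size
are illustrative): add MEI as its own smooth term alongside the seasonal and
trend terms, then inspect that term.

library(mgcv)

## trend-only model, as stated above
mod0 <- gamm(logit.Area ~ s(DoY, bs = "cc") + s(elapsed.days),
             data = lldata, knots = list(DoY = c(0, 366)),
             correlation = corCAR1(form = ~ elapsed.days))

## same model plus a modest smooth of MEI
mod1 <- gamm(logit.Area ~ s(DoY, bs = "cc") + s(elapsed.days) + s(MEI, k = 4),
             data = lldata, knots = list(DoY = c(0, 366)),
             correlation = corCAR1(form = ~ elapsed.days))

summary(mod1$gam)         # approximate significance and shape of s(MEI)
AIC(mod0$lme, mod1$lme)   # rough comparison only; s(MEI) and s(elapsed.days)
                          # may compete to explain the same slow variation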

Re: [R-sig-eco] Cosinor with data that trend over time

2014-03-30 Thread Jacob Cram
Thanks Gavin,
Your suggestion seems promising.  I don't think I'll do the derivatives
analysis at this time so no hurry on those codes.
New question: I am wondering if there are additional considerations when
making multiple comparisons with a gamm model.
I have been testing a large number of parameters to see which of say 20
environmental variables are seasonal. I would expect a number of these
variables to be seasonal but based on this conversation
http://stats.stackexchange.com/questions/49052/are-splines-overfitting-the-data
I am nervous about the variety of possible spline fits somehow compounding
the problem of testing multiple variables for seasonality.  Any thoughts?
-Jacob
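
One way to guard against this, sketched here for illustration (fit_one and
vars are hypothetical stand-ins, not from the thread): collect the
approximate p-value for the seasonal smooth from each of the ~20 fits and
adjust them for multiplicity, for example with the Benjamini-Hochberg
procedure.

library(mgcv)

fit_one <- function(v, data) {
  # fit the seasonal + trend model for one response variable named in 'v'
  gamm(reformulate(c('s(DoY, bs = "cc")', "s(elapsed.days)"), response = v),
       data = data, knots = list(DoY = c(0, 366)),
       correlation = corCAR1(form = ~ elapsed.days))
}

vars  <- c("logit.Area")                  # extend to all ~20 response variables
fits  <- lapply(vars, fit_one, data = lldata)
pvals <- sapply(fits, function(m) summary(m$gam)$s.table["s(DoY)", "p-value"])
p.adjust(pvals, method = "BH")            # Benjamini-Hochberg adjusted p-values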

On Wed, Mar 26, 2014 at 2:34 PM, Gavin Simpson  wrote:

> Sorry about the errors (typos, not syntax errors) - I was forgetting
> that you'd need to use `gamm()` and hence access the `$gam` component
>
> I don't follow the point about a factor trending up or down. You
> shouldn't try to use the `$lme` part of the model for this.
> `summary(mod$gam)` should be sufficient, but as it relates to a
> spline, this is more a test of whether the spline is different from a
> horizontal, flat, null line. The problem with splines is that the
> trend need not just be "trending up or down". In the past, to convey
> where change in the trend occurs I have used the first derivative of
> the fitted spline and looked for where in time the 95% confidence
> interval on the first derivative of the spline doesn't include zero;
> that shows the regions in time where the trend is significantly
> increasing or decreasing. I cover how to do this in a blog post I
> wrote:
>
>
> http://www.fromthebottomoftheheap.net/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/
>
> the post contains links to the R code used for the derivatives etc,
> though it is a little more complex in the case of a model with a trend
> spline and seasonal spline.
>
> I'm supposed to have updated those codes and the post because several
> people have asked me how I do the analysis for models with multiple
> spline terms. If you can't get the code to work for your models, ping
> me back and I'll try to move that to the top of my TO DO list.
>
> Note the `Xs(time)Fx1` entries in the `summary(mod$lme)` table refer
> to the basis functions that represent the spline or at least to some
> part of those basis functions. You can't really make much practical
> use out of those values, as they relate specifically to the way the
> penalised regression spline model has been converted into an
> equivalent linear mixed effect form.
>
> HTH
>
> G
>
> On 26 March 2014 12:10, Jacob Cram  wrote:
> > Thanks again Gavin, this works.
> > gamm() also models the long term trend with a spline s(Time), which is
> > great. I would still like though, to be able to say whether the factor is
> > trending up or down over time.  Would it be fair to query
> > summary(mod$lme)$tTable
> > and to look at the p-value and "Value" corresponding to Xs(time)Fx1 to
> > identify such a trend?
> >
> > Also, here are a few syntax corrections on the code provided in the last
> > email:
> > 1) visual approach
> > plot(mod$gam, pages = 1)
> >
> > 2) quantitative approach
> > pred <- predict(mod$gam, newdata = newdat, type = "terms")
> > take <- which(pred[,"s(DoY)"] == max(pred[,"s(DoY)"]))
> > or
> > take <- as.numeric(which.max(pred[,"s(DoY)"]))
> >
> > Cheers,
> > -Jacob
> >
> >
> > On Wed, Mar 26, 2014 at 9:46 AM, Gavin Simpson 
> wrote:
> >>
> >> 1) Visually - unless it actually matters exactly on which day in the
> >> year the peak is observed? If visually is OK, just do `plot(mod, pages
> >> = 1)` to see the fitted splines on a single page. See `?plot.gam` for
> >> more details on the plot method.
> >>
> >> 2) You could generate some new data to predict upon as follows:
> >>
> >> newdat <- data.frame(DoY = seq_len(366), time = mean(foo$time))
> >>
> >> Then predict for those new data but collect the individual
> >> contributions of the spline terms to the predicted value rather than
> >> just the final prediction
> >>
> >> pred <- predict(mod, newdata = newdat, type = "terms")
> >>
> >> Then find the maximal value of the DoY contribution
> >>
> >> take <- which(pred$DoY == max(pred$DoY))
> >> newdat[take, , drop = FALSE]
> >>
> >> You could use
> >>
> >> 

Re: [R-sig-eco] Cosinor with data that trend over time

2014-03-26 Thread Jacob Cram
Thanks again Gavin, this works.
gamm() also models the long term trend with a spline s(Time), which is
great. I would still like though, to be able to say whether the factor is
trending up or down over time.  Would it be fair to query
summary(mod$lme)$tTable
and to look at the p-value and "Value" corresponding to Xs(time)Fx1 to
identify such a trend?

Also, here are a few syntax corrections on the code provided in the last
email:
1) visual approach
plot(mod$gam, pages = 1)

2) quantitative approach
pred <- predict(mod$gam, newdata = newdat, type = "terms")
take <- which(pred[,"s(DoY)"] == max(pred[,"s(DoY)"]))
or
take <- as.numeric(which.max(pred[,"s(DoY)"]))

Cheers,
-Jacob
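
Pulling those corrections together into one runnable piece, as an
illustrative sketch (it assumes mod is the gamm() fit described in this
thread, with DoY and elapsed.days in lldata):

newdat <- data.frame(DoY = seq_len(366),
                     elapsed.days = mean(lldata$elapsed.days))
pred <- predict(mod$gam, newdata = newdat, type = "terms")
take <- which.max(pred[, "s(DoY)"])   # row of newdat where the seasonal term peaks
newdat[take, , drop = FALSE]          # day of year with maximal s(DoY) contribution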


On Wed, Mar 26, 2014 at 9:46 AM, Gavin Simpson  wrote:

> 1) Visually - unless it actually matters exactly on which day in the
> year the peak is observed? If visually is OK, just do `plot(mod, pages
> = 1)` to see the fitted splines on a single page. See `?plot.gam` for
> more details on the plot method.
>
> 2) You could generate some new data to predict upon as follows:
>
> newdat <- data.frame(DoY = seq_len(366), time = mean(foo$time))
>
> Then predict for those new data but collect the individual
> contributions of the spline terms to the predicted value rather than
> just the final prediction
>
> pred <- predict(mod, newdata = newdat, type = "terms")
>
> Then find the maximal value of the DoY contribution
>
> take <- which(pred$DoY == max(pred$DoY))
> newdat[take, , drop = FALSE]
>
> You could use
>
> take <- which.max(pred$DoY)
>
> instead of the `which()` I used, but only if there is a single maximal
> value.
>
> This works because the spline terms in the additive model are just
> that: additive. Hence, because you haven't let the DoY and time splines
> interact (in the simple model I mentioned; it is more complex if you
> allow these to interact, as you then need to predict DoY for each year's
> worth of time points), you can separate DoY from the other terms.
>
> None of the above code has been tested (it was written off the top of
> my head), but it should work, or at least get you pretty close to
> something that works.
>
> HTH
>
> G
>
> On 26 March 2014 10:02, Jacob Cram  wrote:
> > Thanks Gavin,
> >  This seems like a promising approach and a first pass suggests it
> > works with this data. I can't quite figure out how I would go about
> > interrogating the fitted spline to determine when the peak value
> > happens with respect to DoY.  Any suggestions?
> > -Jacob
> >
> >
> > On Tue, Mar 25, 2014 at 9:06 PM, Gavin Simpson 
> wrote:
> >>
> >> I would probably attack this using a GAM modified to model the
> >> residuals as a stochastic time series process.
> >>
> >> For example
> >>
> >> require("mgcv")
> >> mod <- gamm(y ~ s(DoY, bs = "cc") + s(time), data = foo,
> >>  correlation = corCAR1(form = ~ time))
> >>
> >> where `foo` is your data frame, `DoY` is a variable in the data frame
> >> computed as `as.numeric(strftime(RDate, format = "%j"))` and `time` is
> >> a variable for the passage of time - you could do `as.numeric(RDate)`
> >> but the number of days is probably so large that we might encounter
> >> problems fitting the model. Instead you might do `as.numeric(RDate) /
> >> 1000` say to produce values on a more manageable scale. The `bs =
> >> "cc"` bit specifies a cyclic spline applicable to data measured
> >> throughout a year. You may want to fix the start and end knots to be
> >> days 1 and days 366 respectively, say via `knots = list(DoY =
> >> c(0,366))` as an argument to `gam()` [I think I have this right,
> >> specifying the boundary knots, but let me know if you get an error
> >> about the number of knots]. The residuals are said to follow a
> >> continuous-time AR(1), the irregularly spaced counterpart of the AR(1),
> >> plus random noise.
> >>
> >> There may be identifiability issues as the `s(time)` and `corCAR1()`
> >> compete to explain the fine-scale variation. If you hit such a case,
> >> you can make an educated guess as to the wiggliness (degrees of
> >> freedom) for the smooth terms based on a plot of the data and fix the
> >> splines at those values via argument `k = x` and `fx = TRUE`, where
> >> `x` in `k = x` is some integer value. Both these go in as arguments to
> >> the `s()` functions. If the trend is not very non-linear you can use a
> >>

Re: [R-sig-eco] Cosinor with data that trend over time

2014-03-26 Thread Jacob Cram
Thanks Gavin,
 This seems like a promising approach and a first pass suggests it
works with this data. I can't quite figure out how I would go about
interrogating the fitted spline to determine when the peak value happens
with respect to DoY.  Any suggestions?
-Jacob


On Tue, Mar 25, 2014 at 9:06 PM, Gavin Simpson  wrote:

> I would probably attack this using a GAM modified to model the
> residuals as a stochastic time series process.
>
> For example
>
> require("mgcv")
> mod <- gamm(y ~ s(DoY, bs = "cc") + s(time), data = foo,
>  correlation = corCAR1(form = ~ time))
>
> where `foo` is your data frame, `DoY` is a variable in the data frame
> computed as `as.numeric(strftime(RDate, format = "%j"))` and `time` is
> a variable for the passage of time - you could do `as.numeric(RDate)`
> but the number of days is probably so large that we might encounter
> problems fitting the model. Instead you might do `as.numeric(RDate) /
> 1000` say to produce values on a more manageable scale. The `bs =
> "cc"` bit specifies a cyclic spline applicable to data measured
> throughout a year. You may want to fix the start and end knots to be
> days 1 and days 366 respectively, say via `knots = list(DoY =
> c(0,366))` as an argument to `gam()` [I think I have this right,
> specifying the boundary knots, but let me know if you get an error
> about the number of knots]. The residuals are said to follow a
> continuous-time AR(1), the irregularly spaced counterpart of the AR(1),
> plus random noise.
>
> There may be identifiability issues as the `s(time)` and `corCAR1()`
> compete to explain the fine-scale variation. If you hit such a case,
> you can make an educated guess as to the wiggliness (degrees of
> freedom) for the smooth terms based on a plot of the data and fix the
> splines at those values via argument `k = x` and `fx = TRUE`, where
> `x` in `k = x` is some integer value. Both these go in as arguments to
> the `s()` functions. If the trend is not very non-linear you can use a
> low value 1-3 here for x and for the DoY term say 3-4 might be
> applicable.
>
> There are other ways to approach this problem of identifiability, but
> that would require more time/space here, which I can go into via a
> follow-up if needed.
>
> You can interrogate the fitted splines to see when the peak value of
> the `DoY` term is in the year.
>
> You can also allow the seasonal signal to vary in time with the trend
> by allowing the splines to "interact" in a 2d-tensor product spline.
> Using `te(DoY, time, bs = c("cc","cr"))` instead of the two `s()`
> terms (or using `ti()` terms for the two "marginal" splines and the
> 2-d spline). Again you can add `k = c(x, y), fx = TRUE` to the
> `te()` term, where `x` and `y` are the dfs for each dimension in the
> `te()` term. It is a bit more complex to do this for `ti()` terms.
>
> Part of the reason to prefer a spline for DoY for the seasonal term is
> that one might not expect the seasonal cycle to be a symmetric cycle
> as a cos/sin terms would imply.
>
> A recent ecological paper describing a similar approach (though using
> different package in R) is that of Claire Ferguson and colleagues in J
> Applied Ecology (2008) http://doi.org/10.1111/j.1365-2664.2007.01428.x
> (freely available).
>
> HTH
>
> G
>
> On 25 March 2014 19:14, Jacob Cram  wrote:
> > Hello all,
> >  I am thinking about applying season::cosinor() analysis to some
> > irregularly spaced time series data. The data are unevenly spaced, so
> > usual time series methods, as well as the nscosinor() function, are out.
> > My data do however trend over time, and I am wondering if I can feed
> > date as a variable into my cosinor analysis. In the example below, I'd
> > conclude that the abundances are seasonal, with maximal abundance in
> > mid-June and, furthermore, that they are generally decreasing over time.
> > Can I use both time variables together like this? If not, is there some
> > better approach I should take?
> > Thanks in advance,
> > -Jacob
> >
> > For context, lAbundance is log-odds-transformed abundance data of a
> > microbial species in a given location over time.  RDate is the date the
> > sample was collected, in the R Date format.
> >
> >
> >> res <- cosinor(lAbundance ~ RDate, date = "RDate", data = lldata)
> >
> >>summary(res)
> > Cosinor test
> > Number of observations = 62
> > Amplitude = 0.58
> > Phase: Month = June , day = 14
> > Low point: Month = December , day = 14
> > Significant seasonalit
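
An illustrative sketch of the te() variant suggested in the message above,
letting the seasonal cycle change with the long-term trend (y, foo, DoY, and
time are the placeholder names used there; the k values are guesses that
would need tuning):

library(mgcv)
mod.te <- gamm(y ~ te(DoY, time, bs = c("cc", "cr"), k = c(4, 3), fx = TRUE),
               data = foo, knots = list(DoY = c(0, 366)),
               correlation = corCAR1(form = ~ time))
plot(mod.te$gam, pages = 1)   # the fitted season-by-time smooth surface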

[R-sig-eco] Cosinor with data that trend over time

2014-03-25 Thread Jacob Cram
Hello all,
 I am thinking about applying season::cosinor() analysis to some
irregularly spaced time series data. The data are unevenly spaced, so
usual time series methods, as well as the nscosinor() function, are out. My
data do however trend over time, and I am wondering if I can feed date as a
variable into my cosinor analysis.  In the example below, I'd conclude
that the abundances are seasonal, with maximal abundance in mid-June and,
furthermore, that they are generally decreasing over time.
Can I use both time variables together like this? If not, is there some
better approach I should take?
Thanks in advance,
-Jacob

For context, lAbundance is log-odds-transformed abundance data of a
microbial species in a given location over time.  RDate is the date the
sample was collected, in the R Date format.


> res <- cosinor(lAbundance ~ RDate, date = "RDate", data = lldata)

>summary(res)
Cosinor test
Number of observations = 62
Amplitude = 0.58
Phase: Month = June , day = 14
Low point: Month = December , day = 14
Significant seasonality based on adjusted significance level of 0.025  =
TRUE

>summary(res$glm)


Call:
glm(formula = f, family = family, data = data, offset = offset)

Deviance Residuals:
Min   1Q   Median   3Q  Max
-2.3476  -0.6463   0.1519   0.6574   1.9618

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0115693  1.8963070   0.006  0.99515
RDate   -0.0003203  0.0001393  -2.299  0.02514 *
cosw-0.5516458  0.1837344  -3.002  0.00395 **
sinw 0.1762904  0.1700670   1.037  0.30423
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.9458339)

Null deviance: 70.759  on 61  degrees of freedom
Residual deviance: 54.858  on 58  degrees of freedom
AIC: 178.36

Number of Fisher Scoring iterations: 2


lldata as a CSV below:
"RDate","lAbundance"
2003-03-12,-3.3330699059335
2003-05-21,-3.04104625745886
2003-06-17,-3.04734680029566
2003-07-02,-4.18791034708572
2003-09-18,-3.04419201802053
2003-10-22,-3.13805060873929
2004-02-19,-3.80688269144794
2004-03-17,-4.50755507726145
2004-04-22,-4.38846502542992
2004-05-19,-3.06618649442674
2004-06-17,-5.20518774876304
2004-07-14,-3.75041853151097
2004-08-25,-3.67882486716196
2004-09-22,-5.22205827512234
2004-10-14,-3.99297508670535
2004-11-17,-4.68793287601157
2004-12-15,-4.31712380781011
2005-02-16,-4.30893550479904
2005-03-16,-4.05781773988454
2005-05-11,-3.94746237402035
2005-07-19,-4.91195185391358
2005-08-17,-4.93590576323119
2005-09-15,-4.85820800095518
2005-10-20,-5.22956391101343
2005-12-13,-5.12244047315448
2006-01-18,-3.04854660925046
2006-02-22,-6.77145858348375
2006-03-29,-4.33151493849021
2006-04-19,-3.36152357710535
2006-06-20,-3.09071584142593
2006-07-25,-3.31430484483825
2006-08-24,-3.09974933041469
2006-09-13,-3.33288992218458
2007-12-17,-4.19942661980677
2008-03-19,-3.86146499633625
2008-04-22,-3.36161599919095
2008-05-14,-4.30878307213324
2008-06-18,-3.74372448768828
2008-07-09,-4.65951429661651
2008-08-20,-5.35984647704619
2008-09-22,-4.78481898261137
2008-10-20,-3.58588161980965
2008-11-20,-3.10625125552057
2009-02-18,-6.90675477864855
2009-03-11,-3.43446932013368
2009-04-23,-3.82688066341466
2009-05-13,-4.44885332005661
2009-06-18,-3.97671552612412
2009-07-09,-3.40185954003936
2009-08-19,-3.44958231694091
2009-09-24,-3.86508161094726
2010-01-28,-4.95281587967569
2010-02-11,-3.78064756876257
2010-03-24,-3.5823501176064
2010-04-27,-4.3363571587
2010-05-17,-3.90545735473055
2010-07-21,-3.3147176517321
2010-08-11,-4.53218360860017
2010-10-21,-6.90675477864855
2010-11-23,-6.90675477864855
2010-12-16,-6.75158176094352
2011-01-11,-6.90675477864855




Re: [R-sig-eco] Mantel test with multiple imputation, P values

2013-08-18 Thread Jacob Cram
Krzysztof,
   Thank you for your reply.  It is promising that you think that Li's
method should work for imputed Mantel results.  That said, I am a bit over
my head with the math in the Li et al reference.  Could you (or anyone on
this list) provide an R-code example of how D would be calculated from the
output of several mantel tests?

Also, and forgive my ignorance if this question is coming from the wrong
direction: it is my understanding that Mantel P values are generally
calculated by permuting the rows of the similarity matrix and re-running
the Mantel test a given number of times (usually 1000). Accordingly, as far
as I can tell, there is no explicitly generated F value corresponding to a
Mantel p value. It seems like Li's method assumes I am generating P from an
F table. Would it be appropriate to back-calculate F from P, k, and m?

Thanks again,
Jacob
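
One rough possibility, sketched for illustration and not Li et al.'s exact D
statistic: pool the five imputed Mantel correlations with Rubin's rules,
treating the bootstrap limits that ecodist::mantel() reports as an
approximate within-imputation 95% interval (pool_mantel is a hypothetical
helper; mresults is the matrix from the example below). The resulting p
value is a Wald-type approximation, not a permutation p value.

pool_mantel <- function(r, lower, upper) {
  m    <- length(r)                     # number of imputations
  se   <- (upper - lower) / (2 * 1.96)  # approximate within-imputation SE
  qbar <- mean(r)                       # pooled estimate
  W    <- mean(se^2)                    # within-imputation variance
  B    <- var(r)                        # between-imputation variance
  Tvar <- W + (1 + 1/m) * B             # total variance (Rubin's rules)
  df   <- (m - 1) * (1 + W / ((1 + 1/m) * B))^2
  t    <- qbar / sqrt(Tvar)
  c(estimate = qbar, se = sqrt(Tvar), t = t, p = 2 * pt(-abs(t), df = df))
}

pool_mantel(r     = mresults[, "mantelr"],
            lower = mresults[, "llim.2.5%"],
            upper = mresults[, "ulim.97.5%"])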



On Sun, Aug 18, 2013 at 2:19 PM, Krzysztof Sakrejda <
krzysztof.sakre...@gmail.com> wrote:

> Hi Jacob, comments below.
>
> On Aug 18, 2013 2:31 PM, "Jacob Cram"  wrote:
> >
> > Dear List,
> >   I have an environmental data set with several missing values that I
> > am trying to relate to a community structure data set using a mantel
> > test.   One solution to the missing data problem seems to be multiple
> > imputation; I am using the Amelia package. This generates several (five
> in
> > this example) imputed data sets.  I can run mantel on each of these and
> > come up with five similar but not identical solutions.  I figure I can
> > average the mantel rho values.  However, I am not sure what to do about
> the
> > P values. From looking around online, it looks like I shouldn't take the
> > average of p values.  I found this reference <
> >
> http://missingdata.lshtm.ac.uk/index.php?option=com_content&view=article&id=164:combining-p-values-from-multiple-imputations&catid=57:multiple-imputation&Itemid=98
> >
> > that seems to have promising suggestions, but I can't seem to figure out
> > how I'd implement any of these in R.
>
> So following that link and reading the Li et al. reference it looks as
> though the procedure is well described at the top of page 71. You get your
> parameter estimate from the usual procedure. The test statistic, written as
> "D", is the distance between the null value and the estimated value with
> some scaling specified in eq. 1.17. They use the F distribution and k and m
> (the number of imputations) degrees of freedom. I don't think you need to
> reinvent some inferior ecologists-only procedure for this.
>
> Krzysztof
>
> > I was hoping somebody might have a
> > suggestion for how I could combine my p values. One option, I think,
> > would be to take the highest (worst) p value (in the example below,
> > this would be p = 0.012). However, for large numbers of imputations, I
> > believe that this method might be too conservative. Another option
> > might be to take the p value corresponding to the median rho score (in
> > the example below this would be p = 0.008). Thoughts?
> > -Jacob
> >
> >
> > ##Example Code Below
> > require(Amelia)
> > require(vegan)
> > require(ecodist)
> >
> > ##Species data matrix with environmental data that are missing some
> values.
> > data(varespec)
> > data(varechem)
> > varechem.missings <- varechem[,c("N", "P", "K")]
> > varechem.missings[c(1,5, 7, 15, 20),1] <- NA
> > varechem.missings[c(1,2, 9, 21), 2] <- NA
> >
> > #I multiply impute the missing values with the Amelia package
> > imps <- 5
> > #imps <- 25
> > varechem.amelia <- amelia(varechem.missings, m = imps)
> >
> >
> > #for each imputation of the environmental data I run a mantel test and
> save
> > #the results to mresults
> > mresults <- NULL
> >
> > for(i in 1:imps){
> > varespec.dist <- vegdist(varespec)
> > varechem.am.dist <- dist(varechem.amelia$imputations[[i]])
> > mresults <- rbind(mresults,
> > (ecodist::mantel(varespec.dist~varechem.am.dist)))
> > }
> >
> > mresults
> >
> > ##mantelr pval1 pval2 pval3 llim.2.5% ulim.97.5%
> > ## [1,] 0.2137656 0.008 0.993 0.008 0.1015176  0.3389979
> > ## [2,] 0.2162388 0.011 0.990 0.011 0.1207528  0.3346554
> > ## [3,] 0.2149556 0.012 0.989 0.016 0.1319943  0.3279028
> > ## [4,] 0.2101820 0.009 0.992 0.012 0.1217293  0.3288272
> > ## [5,] 0.2135279 0.006 0.995 0.006 0.1130386  0.3359864
> >
> > #based on these results what would be a reasonable p value to report for
> > the environmental parameters relating to the community structure?
> >
> > ##end example
> >




[R-sig-eco] Mantel test with multiple imputation, P values

2013-08-18 Thread Jacob Cram
Dear List,
  I have an environmental data set with several missing values that I
am trying to relate to a community structure data set using a mantel
test.   One solution to the missing data problem seems to be multiple
imputation; I am using the Amelia package. This generates several (five in
this example) imputed data sets.  I can run mantel on each of these and
come up with five similar but not identical solutions.  I figure I can
average the mantel rho values.  However, I am not sure what to do about the
P values. From looking around online, it looks like I shouldn't take the
average of p values.  I found this reference <
http://missingdata.lshtm.ac.uk/index.php?option=com_content&view=article&id=164:combining-p-values-from-multiple-imputations&catid=57:multiple-imputation&Itemid=98>
that seems to have promising suggestions, but I can't seem to figure out
how I'd implement any of these in R.  I was hoping somebody might have a
suggestion for how I could combine my p values. One option, I think, would
be to take the highest (worst) p value (in the example below, this would be
p = 0.012). However, for large numbers of imputations, I believe that this
method might be too conservative. Another option might be to take the
p value corresponding to the median rho score (in the example below this
would be p = 0.008). Thoughts?
-Jacob


##Example Code Below
require(Amelia)
require(vegan)
require(ecodist)

##Species data matrix with environmental data that are missing some values.
data(varespec)
data(varechem)
varechem.missings <- varechem[,c("N", "P", "K")]
varechem.missings[c(1,5, 7, 15, 20),1] <- NA
varechem.missings[c(1,2, 9, 21), 2] <- NA

#I multiply impute the missing values with the Amelia package
imps <- 5
#imps <- 25
varechem.amelia <- amelia(varechem.missings, m = imps)


#for each imputation of the environmental data I run a mantel test and save
#the results to mresults
mresults <- NULL

for(i in 1:imps){
varespec.dist <- vegdist(varespec)
varechem.am.dist <- dist(varechem.amelia$imputations[[i]])
mresults <- rbind(mresults,
(ecodist::mantel(varespec.dist~varechem.am.dist)))
}

mresults

##mantelr pval1 pval2 pval3 llim.2.5% ulim.97.5%
## [1,] 0.2137656 0.008 0.993 0.008 0.1015176  0.3389979
## [2,] 0.2162388 0.011 0.990 0.011 0.1207528  0.3346554
## [3,] 0.2149556 0.012 0.989 0.016 0.1319943  0.3279028
## [4,] 0.2101820 0.009 0.992 0.012 0.1217293  0.3288272
## [5,] 0.2135279 0.006 0.995 0.006 0.1130386  0.3359864

# based on these results, what would be a reasonable p value to report for
# the environmental parameters relating to the community structure?

##end example




[R-sig-eco] cocorresp to determine best of several predictor communities

2013-01-23 Thread Jacob Cram
Dear List Members,

I am using the 'cocorresp' package and trying to compare several pairs of
communities. My goal would be to see which of several communities best
predicts a final community. Say, for instance, that I have four communities
A, B, C and D. I want to know which community best predicts D. To further
complicate things, I have more data points for some communities than
others. For instance, communities A, B, and D have 100 sites associated
with them, while C was only measured at 75 sites (let's imagine it wasn't
measured at the last 25 sites). So I run:

A.pred <- coca(x = A, y = D, method = "pred", reg.method = "eigen",
               n.axes = 10)
B.pred <- coca(x = B, y = D, method = "pred", reg.method = "eigen",
               n.axes = 10)
C.pred <- coca(x = C, y = D[1:75, ], method = "pred", reg.method = "eigen",
               n.axes = 10)

I then apply cross-validation, as in ter Braak & Schaffers (2004).

A.crossval <- crossval(x = A.pred, n.axes = 10)
A.cvax <- which.max(A.crossval$CVfit)
A.cvmax <- max(A.crossval$CVfit)

B.crossval <- crossval(x = B.pred, n.axes = 10)
B.cvax <- which.max(B.crossval$CVfit)
B.cvmax <- max(B.crossval$CVfit)

C.crossval <- crossval(x = C.pred, n.axes = 10)
C.cvax <- which.max(C.crossval$CVfit)
C.cvmax <- max(C.crossval$CVfit)

Thus X.cvax is the number of axes that gives the best cross-validation
score for community X predicting community D, and X.cvmax is that best
cross-validation score. I think that if X.cvmax is positive, that suggests
the model can predict some fraction of the structure of community D from
community X (where X is the community, A, B, or C, in question), as
suggested in Schaffers et al. (2008).

Now, I want to know whether A, B, or C best predicts D. Because C has
only 75 data points, I expect A.cvmax and B.cvmax to be inflated relative
to C.cvmax, so I shouldn't be able to use these values to compare
predictive power directly.

One approach that I think might work is to run permutest to generate
percent fits for each axis that crossvalidation suggests I should include
in the model.

A.perm <- permutest(A.pred, n.axes = A.cvax)
B.perm <- permutest(B.pred, n.axes = B.cvax)
C.perm <- permutest(C.pred, n.axes = C.cvax)

I can then ask what percentage of community X predicts what percentage of
community Y, using the pcent.fit value from the permutest objects. I think
this value is independent of sample size, but am not sure. Does anybody
else know? I could either do this for the first axis, or for the sum of all
axes in the model. I'm not sure if the second option is legitimate.

Option 1
A.ax1 <- A.perm$pcent.fit[1]
B.ax1 <- B.perm$pcent.fit[1]
C.ax1 <- C.perm$pcent.fit[1]

Option 2
A.all <- sum(A.perm$pcent.fit[1:A.cvax])
B.all <- sum(B.perm$pcent.fit[1:B.cvax])
C.all <- sum(C.perm$pcent.fit[1:C.cvax])

So my first question is: when comparing communities with different but
overlapping sample sizes (all 75 sample sites in C are the same sample
sites as the first 75 in A and B), if

A.all > B.all > C.all

or if A.ax1 > B.ax1 > C.ax1,

can I then say that A is the best predictor community for community D?

More specifically, is this approach a legitimate way of comparing
communities with different sample sizes in which the sample locations
overlap? If not, is there a better way of making such a comparison? Perhaps
the co-correspondence method is not appropriate for making this kind of
comparison. I notice that most literature to date uses co-correspondence in
conjunction with cca to determine whether a different community or
environmental parameters best predict a given community. If
co-correspondence is not appropriate, could somebody explain why and
suggest a better method?
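
One possible way to put the three comparisons on an equal footing, sketched
here for illustration only: refit A and B on the 75 sites they share with C,
so every cross-validation score is computed from the same samples (the
indexing assumes, as stated above, that C's sites are the first 75 of A, B,
and D).

common <- 1:75
A75.pred <- coca(x = A[common, ], y = D[common, ], method = "pred",
                 reg.method = "eigen", n.axes = 10)
B75.pred <- coca(x = B[common, ], y = D[common, ], method = "pred",
                 reg.method = "eigen", n.axes = 10)
A75.cvmax <- max(crossval(x = A75.pred, n.axes = 10)$CVfit)
B75.cvmax <- max(crossval(x = B75.pred, n.axes = 10)$CVfit)
C.cvmax   <- max(crossval(x = C.pred,   n.axes = 10)$CVfit)
## A75.cvmax, B75.cvmax, and C.cvmax are now based on the same 75 sites and
## can be compared more directly.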

References:

 Braak CJF ter, Schaffers AP. (2004). Co-Correspondence Analysis: A New
Ordination Method to Relate Two Community Compositions. Ecology 85:834–846.

Schaffers AP, Raemakers IP, Sýkora KV, Braak CJF ter. (2008). Arthropod
Assemblages Are Best Predicted by Plant Species Composition. Ecology
89:782–794.

 Thanks everyone for your consideration in this matter.

Sincerely,

Jacob Cram

Graduate Student
Department of Biological Sciences
University of Southern California

