Re: [R-sig-eco] Manual Rarefaction and CI's

2014-05-28 Thread Jari Oksanen
Nicholas,

Your approach looks perfectly OK. However, rarefied sampling is normally done 
*without* replacement. 

It looks to me that a similar analysis could be done with vegan::specaccum 
function. However, that uses sampling without replacement.

About your specific questions: you can use quantile() function in base R to get 
desired confidence intervals. The CIs should be given as argument probs to 
quantile(). For quantile function you should return a vector of richness values 
instead of a table. You can get similar with table output as well, but you may 
need to write that yourself.

It is excepted that the CI gets narrower as the subsample size increases. In 
classic rarefaction sampling without replacement, there is only one "subsample" 
of the original sample size and therefore there is no variance but you get the 
observed data. As the subsampled proportion approaches 1, the number of 
possible samples decreases and so does the variance of subsamples. With 
replacement this may not happen as some units are there twice or more and some 
are missing. Subsample of original size with replacement should contain 63.2% 
of your original units. This also means that because you duplicate (or 
multiplicate) some sampling units and omit some, you will systematically 
underestimate richness when you subsample with replacement. This is why we 
normally use subsampling without replacement. However, the larger subsamples 
tend to become more similar, in particular with data set like you described. 
This is necessary and correct. 

This is the most annoying thing in CIs in rarefaction to me. The rarefaction 
variation *only* describes the randomness of subsampling from a fixed sample. 
It does not describe the variance of that fixed sample, or the real variance of 
your observed richness. With real variability I mean the variance you observe 
if you replicate your sampling in Nature and get completely new, independent 
samples from the same real community. I am afraid people use those rarefaction 
CIs and variances as measures of sample variability which is really misleading 
-- they actually describe only the artifactual variance of subsampling from 
fixed samples, when these fixed samples are regarded as error-free and to have 
zero variance of species richness. Using these variances to infer differences 
in species richness in Nature is wrong and misleading. I have tried to spell 
out that warning in vegan manual pages, but people seem to ignore those parts 
of the text. 

cheers, Jari Oksanen

On 28/05/2014, at 20:05 PM, Nicholas J. Negovetich wrote:

> I have a question related to rarefaction of our samples.  Unlike all of the 
> examples of which I'm aware, I'm not working with 'sites', per se.  Instead, 
> I'm working with the parasites of snails.  Snails are infected with 1 
> parasite species, and each pond/stream can hold several parasite species.  
> The difficulty comes when comparing parasite richness between sites (Pond A 
> vs Pond B) and sampling effort (# snails collected) differs.  Rarefaction 
> provides a means to standardize effort on the lowest sample size so that we 
> can test if parasite richness differs between sites.
> 
> I'm familiar with the vegan package and how it performs rarefaction.  But, I 
> can't perform this type of analysis because (1.) 'uninfected' snails would be 
> considered empty patches, and (2.) max richness will be 1 (only one parasite 
> species can (usually) infect a snail).  To counter this problem, I've tried 
> to rarefy my samples using randomization methods.  The script is below.  In 
> summary, I sampled (with replacement) my dataset and calculated the number of 
> parasite species from each sample.  I repeated this 100 times and calculated 
> the mean richness for a given sample size.  I did this for sample sizes 1-50 
> (smallest sample size is 50 individuals).
> 
> I have a few questions related this this analysis.
> 1.  Does this appear correct?
> 
> 2.  How can I generate CI using this method?  I normally calculate the 2.5% 
> and 97.5% quantiles when performing randomization, but the nature of our data 
> does not lend itself well to quantile calculations.  Could I assume that my 
> bootstrapped means follow a normal distribution?  If not, then what would be 
> the best method for generating confidence intervals?
> 
> 3.  This last one is the primary reason for this post.  When performing this 
> analysis on my real datasets, I've noticed a peculiar thing.  In one sample, 
> my variance of the bootstrapped samples converges to zero as sample size 
> approaches the true sample size (this is for my sample of the smallest sample 
> size). Relatedly, I've noticed some (but not all) of my CI narrowing as my 
> sample size approaches the actual size.  This suggests that I'm not doing 
> something correctly, but I'm not sure how to correct it (or if I even need to 
> correct it).  The alternative for me is to scrap the rarefaction and perform 
> an alternat

Re: [R-sig-eco] Manual Rarefaction and CI's

2014-05-28 Thread Novack-Gottshall, Philip M.
Hi Nicholas,

What you're doing seems proper to me (but see below regarding calculation of 
the SE).

Rarefaction error windows should coalesce around the endpoint as you get to 
larger randomized samples. And you only have limited numbers of richness values 
to draw from. You can confirm this behavior artificially with:
rA <- rrfy(A,n=2000)

A major benefit of resampling techniques is that they are much less sensitive 
to distributional shape.  So I don't see any reason to assume normality when 
calculating the confidence interval using the 1.96*SE.

And note that when using bootstrap methods [see Efron and Tibshirani 1997], the 
SD of the resampling distribution is approximately equal to the standard error. 
So you should *not* divide by the sqrt of sample size in the following lines:
 low[i] <- mean(rs)-1.96*(sd(rs)/sqrt(length(rs)))
 high[i] <- mean(rs)+1.96*(sd(rs)/sqrt(length(rs)))

Instead, I would recommend using the naive 2.5% and 97.5% quantiles to 
calculate the confidence interval [or just use mean +- 1.96*sd(rs)]:
 low[i] <- quantile(rs, type=3, prob=0.025)
 high[i] <- quantile(rs, type=3, prob=0.975)

Because your data are discrete integers, the use of these quantiles with the 
discrete median as your sample statistic might actually be preferable:
 s[i] <- median(rs)

But with discrete data and relatively few richness values, increasing your reps 
to 2000 or more is probably warranted. (The function alarm() is useful for 
signalling the end of a routine.)

Although focused more on fossils, here's a potentially useful reference, with a 
link to an appendix with R code and examples:
Kowalewski, M., and P. Novack-Gottshall. 2010. Resampling methods in 
paleontology. Pp. 19-54. In J. Alroy, and G. Hunt, eds. Quantitative Methods in 
Paleobiology. Short Courses in Paleontology 16. Paleontological Society and 
Paleontological Research Institute, Ithaca, NY.

Cheers,
Phil





On 5/28/2014 12:05 PM, Nicholas J. Negovetich wrote:

I have a question related to rarefaction of our samples.  Unlike all of
the examples of which I'm aware, I'm not working with 'sites', per se.
Instead, I'm working with the parasites of snails.  Snails are infected
with 1 parasite species, and each pond/stream can hold several parasite
species.  The difficulty comes when comparing parasite richness between
sites (Pond A vs Pond B) and sampling effort (# snails collected)
differs.  Rarefaction provides a means to standardize effort on the
lowest sample size so that we can test if parasite richness differs
between sites.

I'm familiar with the vegan package and how it performs rarefaction.
But, I can't perform this type of analysis because (1.) 'uninfected'
snails would be considered empty patches, and (2.) max richness will be
1 (only one parasite species can (usually) infect a snail).  To counter
this problem, I've tried to rarefy my samples using randomization
methods.  The script is below.  In summary, I sampled (with replacement)
my dataset and calculated the number of parasite species from each
sample.  I repeated this 100 times and calculated the mean richness for
a given sample size.  I did this for sample sizes 1-50 (smallest sample
size is 50 individuals).

I have a few questions related this this analysis.
1.  Does this appear correct?

2.  How can I generate CI using this method?  I normally calculate the
2.5% and 97.5% quantiles when performing randomization, but the nature
of our data does not lend itself well to quantile calculations.  Could I
assume that my bootstrapped means follow a normal distribution?  If not,
then what would be the best method for generating confidence intervals?

3.  This last one is the primary reason for this post.  When performing
this analysis on my real datasets, I've noticed a peculiar thing.  In
one sample, my variance of the bootstrapped samples converges to zero as
sample size approaches the true sample size (this is for my sample of
the smallest sample size). Relatedly, I've noticed some (but not all) of
my CI narrowing as my sample size approaches the actual size.  This
suggests that I'm not doing something correctly, but I'm not sure how to
correct it (or if I even need to correct it).  The alternative for me is
to scrap the rarefaction and perform an alternative analysis, such as
randomizing the sites and comparing the actual difference in richness
between sites to a null distribution of randomized differences.  Thoughts?

NN


#Script below
#Two sites: Site A = 100 snails; Site B = 50 snails
#0 = uninfected; 1-4 = species 1-4

A <- c(rep(0,56),rep(1,25),rep(2,15),rep(3,3),rep(4,1))
B <- c(rep(0,29),rep(1,12),rep(2,7),rep(3,2))

#Function to generate a random sample with replacement and calculate
species richness
bsS <- function(x,n=100){
   rs <- sample(x,size=n,replace=T)
   S <- sum(table(rs)[-1]>0)
   return(S)
}

#Function to rarefy my dataset; utilized bsS function
#For sample sizes of 1:n, generate 'reps' number of bootstrapped
richness values (bsS

[R-sig-eco] Manual Rarefaction and CI's

2014-05-28 Thread Nicholas J. Negovetich
I have a question related to rarefaction of our samples.  Unlike all of 
the examples of which I'm aware, I'm not working with 'sites', per se.  
Instead, I'm working with the parasites of snails.  Snails are infected 
with 1 parasite species, and each pond/stream can hold several parasite 
species.  The difficulty comes when comparing parasite richness between 
sites (Pond A vs Pond B) and sampling effort (# snails collected) 
differs.  Rarefaction provides a means to standardize effort on the 
lowest sample size so that we can test if parasite richness differs 
between sites.


I'm familiar with the vegan package and how it performs rarefaction.  
But, I can't perform this type of analysis because (1.) 'uninfected' 
snails would be considered empty patches, and (2.) max richness will be 
1 (only one parasite species can (usually) infect a snail).  To counter 
this problem, I've tried to rarefy my samples using randomization 
methods.  The script is below.  In summary, I sampled (with replacement) 
my dataset and calculated the number of parasite species from each 
sample.  I repeated this 100 times and calculated the mean richness for 
a given sample size.  I did this for sample sizes 1-50 (smallest sample 
size is 50 individuals).


I have a few questions related this this analysis.
1.  Does this appear correct?

2.  How can I generate CI using this method?  I normally calculate the 
2.5% and 97.5% quantiles when performing randomization, but the nature 
of our data does not lend itself well to quantile calculations.  Could I 
assume that my bootstrapped means follow a normal distribution?  If not, 
then what would be the best method for generating confidence intervals?


3.  This last one is the primary reason for this post.  When performing 
this analysis on my real datasets, I've noticed a peculiar thing.  In 
one sample, my variance of the bootstrapped samples converges to zero as 
sample size approaches the true sample size (this is for my sample of 
the smallest sample size). Relatedly, I've noticed some (but not all) of 
my CI narrowing as my sample size approaches the actual size.  This 
suggests that I'm not doing something correctly, but I'm not sure how to 
correct it (or if I even need to correct it).  The alternative for me is 
to scrap the rarefaction and perform an alternative analysis, such as 
randomizing the sites and comparing the actual difference in richness 
between sites to a null distribution of randomized differences.  Thoughts?


NN


#Script below
#Two sites: Site A = 100 snails; Site B = 50 snails
#0 = uninfected; 1-4 = species 1-4

A <- c(rep(0,56),rep(1,25),rep(2,15),rep(3,3),rep(4,1))
B <- c(rep(0,29),rep(1,12),rep(2,7),rep(3,2))

#Function to generate a random sample with replacement and calculate 
species richness

bsS <- function(x,n=100){
  rs <- sample(x,size=n,replace=T)
  S <- sum(table(rs)[-1]>0)
  return(S)
}

#Function to rarefy my dataset; utilized bsS function
#For sample sizes of 1:n, generate 'reps' number of bootstrapped 
richness values (bsS)

#and save the mean richness for each sample.
rrfy <- function(x,n=10,reps=100){
  smax <- sum(table(x)[-1]>0)
  s <- c()
  std <- c()
  low <- c()
  high <- c()
  for (i in 1:n){
rs <- c()
for(j in 1:reps){
  rs[j] <- bsS(x,i)
}
s[i] <- mean(rs)
std[i] <- sd(rs)
low[i] <- mean(rs)-1.96*(sd(rs)/sqrt(length(rs))) #I don't think 
this is correct
high[i] <- mean(rs)+1.96*(sd(rs)/sqrt(length(rs))) #I don't think 
this is correct

  }
  plot(s~c(1:n),type='l',ylim=c(0,smax+1))
  lines(low~c(1:n),lty=2,col='red')
  lines(high~c(1:n),lty=2,col='red')
  return(data.frame(x=c(1:n),s,stdev=std))
}
rA <- rrfy(A,n=50)
rB <- rrfy(B,n=50)

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology