Dear R users,
I would like to summarize the answers I got to the following question:
I am interested in correctly testing effects of continuous
environmental variables and ordered factors on bacterial abundance.
Bacterial abundance is derived from counts and expressed as percentage.
My problem is that the abundance data contain many zero values:
Bacteria <-
c(2.23,0,0.03,0.71,2.34,0,0.2,0.2,0.02,2.07,0.85,0.12,0,0.59,0.02,2.3,
0.29,0.39,1.32,0.07,0.52,1.2,0,0.85,1.09,0,0.5,1.4,0.08,0.11,0.05,0.17,
0.31,0,0.12,0,0.99,1.11,1.78,0,0,0,2.33,0.07,0.66,1.03,0.15,0.15,0.59,0,
0.03,0.16,2.86,0.2,1.66,0.12,0.09,0.01,0,0.82,0.31,0.2,0.48,0.15)
First I tried transforming the data (e.g., logit) but because of the
zeros I was not satisfied. Next I converted the percentages into
integer values by round(Bacteria*10) or ceiling(Bacteria*10) and
calculated a glm with a Poisson error structure; however, I am not very
happy with this approach because it changes the original percentage
data substantially (e.g., 0.03 becomes either 0 or 1). The same is true
for converting the percentages into factors and calculating a
multinomial or proportional-odds model (anyway, I do not know if this
would be a meaningful approach).
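To make the distortion concrete, here is a small sketch using a few values taken from the data above. The object and column names are only illustrative.

```r
# A few of the percentages from the Bacteria vector above
b <- c(0.03, 0.07, 0.12, 0.85, 2.23)

# Compare the two integer conversions with the exact scaled value
conv <- data.frame(
  percent = b,
  scaled  = b * 10,
  rounded = round(b * 10),
  ceiled  = ceiling(b * 10)
)

# Relative error introduced by each conversion
conv$rel_err_round <- (conv$rounded - conv$scaled) / conv$scaled
conv$rel_err_ceil  <- (conv$ceiled  - conv$scaled) / conv$scaled
print(conv)
```

The small percentages are the ones most affected: 0.03 is mapped to 0 by round() and to 1 by ceiling().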
I was searching the web and the best answer I could get was
http://www.biostat.wustl.edu/archives/html/s-news/1998-12/msg00010.html
in which several persons suggested quasi-likelihood. Would it be
reasonable to use a glm with quasipoisson? If yes, how can I find the
appropriate variance function? Any other suggestions?
If you know the totals from which these percentages were derived,
then transform your Bacteria back to original observations and fit a
quasi-Poisson model with log(total) as an offset. That is:
BCount <- round(tot * Bacteria)
glm(BCount ~ x1 + x2 + offset(log(tot)), family=quasipoisson)
cheers, jari oksanen
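A minimal runnable sketch of the offset approach, on simulated data. Everything here (the totals in tot, the predictors x1 and x2, the coefficients) is made up for illustration and is not from the original data. If Bacteria is stored in percent rather than as a proportion, the back-transformation would presumably need a division by 100.

```r
set.seed(42)
n   <- 64
tot <- round(runif(n, 500, 5000))   # hypothetical totals per sample
x1  <- rnorm(n)                     # hypothetical continuous predictors
x2  <- rnorm(n)

# Simulate overdispersed counts whose rate per unit total depends on x1 only
mu     <- tot * exp(-6 + 0.5 * x1)
BCount <- rnbinom(n, mu = mu, size = 2)

# Quasi-Poisson model for the rate: log(tot) enters with a fixed coefficient of 1
fit <- glm(BCount ~ x1 + x2 + offset(log(tot)), family = quasipoisson)

print(coef(fit))
print(summary(fit)$dispersion)      # values well above 1 indicate overdispersion
```

With the offset, the coefficients describe effects on the rate (count per unit total) rather than on the raw count.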
I have developed an R library specifically for dealing with positive
continuous data with exact zeros. For example, rainfall: No rain
means exactly zero is recorded, but when rain falls, a continuous
amount is recorded (after suitable rounding).
This library--available on CRAN--is called tweedie. The distributions
used are Tweedie models, which belong to the EDM family and so
can be used in generalized linear models. The Tweedie models have
a variance function V(mu) = mu^p, for p not in the range (0, 1).
For various values of p, we have:
Value of p    Distribution
p < 0         Defined over whole real line
p = 0         Normal distribution
0 < p < 1     No distributions exist
p = 1         Poisson distribution (with phi=1)
1 < p < 2     Continuous over positive Y, with positive mass at Y=0
p = 2         Gamma distribution
p > 2         Continuous for positive Y
p = 3         Inverse Gaussian distribution
Of particular interest are the distributions such that 1 < p < 2,
which can be seen as a Poisson sum of gamma random variables. They are
continuous for Y > 0 with a positive probability that Y=0 exactly. For
this reason, the Tweedie densities with 1 < p < 2 have been called the
compound Poisson, compound gamma and the Poisson-gamma distribution.
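The "Poisson sum of gammas" construction can be sketched in base R. This is only an illustration: the function name r_pois_gamma is hypothetical, and the parameter mapping used is the standard one that gives mean mu and variance phi * mu^p for 1 < p < 2.

```r
# Simulate n draws with mean mu, dispersion phi and power p (1 < p < 2)
r_pois_gamma <- function(n, mu, phi, p) {
  stopifnot(p > 1, p < 2)
  lambda <- mu^(2 - p) / (phi * (2 - p))   # Poisson mean: number of gamma terms
  shape  <- (2 - p) / (p - 1)              # shape of each gamma term
  scale  <- phi * (p - 1) * mu^(p - 1)     # scale of each gamma term
  N <- rpois(n, lambda)
  y <- numeric(n)                          # N = 0 leaves an exact zero
  pos <- N > 0
  # The sum of N iid gammas is again gamma, with shape N * shape
  y[pos] <- rgamma(sum(pos), shape = N[pos] * shape, scale = scale)
  y
}

set.seed(1)
y <- r_pois_gamma(1e5, mu = 0.5, phi = 1, p = 1.5)
print(mean(y))        # should be close to mu
print(var(y))         # should be close to phi * mu^p
print(mean(y == 0))   # share of exact zeros, close to exp(-lambda)
```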
In your case, percentages with exact zeros may not exactly fall into
this category because of the upper limit of 100%. But provided there are
very few values near 100%, the Tweedie models might be worth a try.
(The data above seem to indicate few values near 100%.)
Get the tweedie package from CRAN, or from
http://www.sci.usq.edu.au/staff/dunn/twhtml/home.html
You will also need the statmod package, also available on CRAN.
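A hedged sketch of what a Tweedie glm fit might look like, using the tweedie() family from statmod. The data are simulated, and the predictor x, the coefficients, and the fixed power of 1.5 are assumptions for illustration. In practice p would be estimated, for example by profile likelihood, which the tweedie package is meant to support.

```r
library(statmod)  # provides the tweedie() family for glm()

set.seed(123)
n   <- 1000
x   <- runif(n)                  # hypothetical predictor
mu  <- exp(-1 + 1.5 * x)         # true mean under a log link
p   <- 1.5                       # assumed variance power, V(mu) = mu^p
phi <- 2                         # assumed dispersion

# Simulate Poisson-gamma data: exact zeros plus positive continuous values
N <- rpois(n, mu^(2 - p) / (phi * (2 - p)))
y <- numeric(n)
pos <- N > 0
y[pos] <- rgamma(sum(pos),
                 shape = N[pos] * (2 - p) / (p - 1),
                 scale = (phi * (p - 1) * mu^(p - 1))[pos])

# link.power = 0 requests the log link; var.power is the Tweedie power p
fit <- glm(y ~ x, family = tweedie(var.power = 1.5, link.power = 0))
print(coef(fit))
print(mean(y == 0))              # proportion of exact zeros in the sample
```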
All the best.
P.
--
Dr Peter Dunn (USQ CRICOS No. 00244B)
Web:http://www.sci.usq.edu.au/staff/dunn
Email: dunn @ usq.edu.au
Opinions expressed are mine, not those of USQ. Obviously...
You might try a ZIP, i.e. zero-inflated Poisson, model. I have not
used it myself, but I have such data to work on. So if anyone
can point out how to do this in R, please do. There is also a class of ZINB
models, i.e. zero-inflated negative binomial models.
Actually I just went on the web and found a book by Simonoff, Analyzing
Categorical Data, and there are some examples in it for ZIP et al.
Look at the examples for sections 4.5 and 5.4:
http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/analcatdata.s
http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/functions.s
--
Lep pozdrav / With regards,
Gregor GORJANC
The ZIP model can be fitted with Jim Lindsey's function fmr
from his gnlm library, see:
http://popgen0146uns50.unimaas.nl/~jlindsey/rcode.html
Bendix Carstensen
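For readers who want to see what a ZIP fit involves, here is a minimal base-R sketch that maximizes the zero-inflated Poisson likelihood directly with optim(). It is not the fmr interface mentioned above. The data, the predictor x, and the function name zip_negloglik are made up for illustration. Note that a ZIP model applies to integer counts, not to percentages.

```r
set.seed(7)
n      <- 500
x      <- rnorm(n)                        # hypothetical predictor
pi0    <- 0.3                             # true probability of a structural zero
lambda <- exp(0.5 + 0.4 * x)              # true Poisson mean
y      <- ifelse(rbinom(n, 1, pi0) == 1, 0L, rpois(n, lambda))

# Negative log-likelihood; par = (logit of pi0, intercept, slope)
zip_negloglik <- function(par, y, x) {
  pi0    <- plogis(par[1])
  lambda <- exp(par[2] + par[3] * x)
  ll <- ifelse(y == 0,
               log(pi0 + (1 - pi0) * exp(-lambda)),          # zero from either source
               log(1 - pi0) + dpois(y, lambda, log = TRUE))  # positive count
  -sum(ll)
}

fit <- optim(c(0, 0, 0), zip_negloglik, y = y, x = x, method = "BFGS")
est <- c(pi0 = plogis(fit$par[1]), intercept = fit$par[2], slope = fit$par[3])
print(est)
```

Here the zero-inflation probability is held constant across observations; letting it depend on covariates would mean adding a second linear predictor inside plogis().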
It turned out that the percentage data were calculated from
concentrations resulting in positive continuous data with exact zeros.
The Tweedie models did a fine job.
Many thanks, Christian Kamenik
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help