Re: [R] Comparing multiple distributions

2007-05-31 Thread jiho
Nobody answered my first request. I am sorry if I did not explain my  
problem clearly. English is not my native language and statistical  
english is even more difficult. I'll try to summarize my issue in  
more appropriate statistical terms:


Each of my observations is not a single number but a vector of 5  
proportions (which add up to 1 for each observation). I want to  
compare the shape of those vectors between two treatments (i.e. how  
the quantities are distributed between the 5 values in treatment A  
with respect to treatment B).


I was pointed to Hotelling T-squared. Does it seem appropriate? Are  
there other possibilities (I read many discussions about hotelling  
vs. manova but I could not see how any of those related to my  
particular case)?


Thank you very much in advance for your insights. See below for my  
earlier, more detailed, e-mail.


On 2007-May-21  , at 19:26 , jiho wrote:
I am studying the vertical distribution of plankton and want to  
study its variations relatively to several factors (time of day,  
species, water column structure etc.). So my data is special in  
that, at each sampling site (each observation), I don't have *one*  
number, I have *several* numbers (abundance of organisms in each  
depth bin, I sample 5 depth bins) which describe a vertical  
distribution.


Then let say I want to compare speciesA with speciesB, I would end  
up trying to compare a group of several distributions with another  
group of several distributions (where a distribution is a vector  
of 5 numbers: an abundance for each depth bin). Does anyone know  
how I could do this (with R obviously ;) )?


Currently I kind of get around the problem and:
- compute mean abundance per depth bin within each group and  
compare the two mean distributions with a ks.test but this  
obviously diminishes the power of the test (I only compare 5*2  
observations)
- restrict the information at each sampling site to the mean depth  
weighted by the abundance of the species of interest. This way I  
have one observation per station but I reduce the information to  
the mean depths while the actual repartition is important also.


I know this is probably not directly R related but I have already  
searched around for solutions and solicited my local statistics  
expert... to no avail. So I hope that the stats' experts on this  
list will help me.


Thank you very much in advance.


JiHO
---
http://jo.irisson.free.fr/



--
Ce message a été vérifié par MailScanner
pour des virus ou des polluriels et rien de
suspect n'a été trouvé.
CRI UPVD http://www.univ-perp.fr

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Comparing multiple distributions

2007-05-31 Thread Ravi Varadhan
Your data is compositional data. The R package compositions might be
useful. You might also want to consult the book by J. Aitchison: statistical
analysis of compositional data.

Ravi.


---

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology 

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: [EMAIL PROTECTED]

Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

 




-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of jiho
Sent: Thursday, May 31, 2007 11:37 AM
To: R-help
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my  
problem clearly. English is not my native language and statistical  
english is even more difficult. I'll try to summarize my issue in  
more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5  
proportions (which add up to 1 for each observation). I want to  
compare the shape of those vectors between two treatments (i.e. how  
the quantities are distributed between the 5 values in treatment A  
with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are  
there other possibilities (I read many discussions about hotelling  
vs. manova but I could not see how any of those related to my  
particular case)?

Thank you very much in advance for your insights. See below for my  
earlier, more detailed, e-mail.

On 2007-May-21  , at 19:26 , jiho wrote:
 I am studying the vertical distribution of plankton and want to  
 study its variations relatively to several factors (time of day,  
 species, water column structure etc.). So my data is special in  
 that, at each sampling site (each observation), I don't have *one*  
 number, I have *several* numbers (abundance of organisms in each  
 depth bin, I sample 5 depth bins) which describe a vertical  
 distribution.

 Then let say I want to compare speciesA with speciesB, I would end  
 up trying to compare a group of several distributions with another  
 group of several distributions (where a distribution is a vector  
 of 5 numbers: an abundance for each depth bin). Does anyone know  
 how I could do this (with R obviously ;) )?

 Currently I kind of get around the problem and:
 - compute mean abundance per depth bin within each group and  
 compare the two mean distributions with a ks.test but this  
 obviously diminishes the power of the test (I only compare 5*2  
 observations)
 - restrict the information at each sampling site to the mean depth  
 weighted by the abundance of the species of interest. This way I  
 have one observation per station but I reduce the information to  
 the mean depths while the actual repartition is important also.

 I know this is probably not directly R related but I have already  
 searched around for solutions and solicited my local statistics  
 expert... to no avail. So I hope that the stats' experts on this  
 list will help me.

 Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/



-- 
Ce message a iti virifii par MailScanner
pour des virus ou des polluriels et rien de
suspect n'a iti trouvi.
CRI UPVD http://www.univ-perp.fr

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Comparing multiple distributions

2007-05-31 Thread Bert Gunter
While Ravi's suggestion of the compositions package is certainly
appropriate, I suspect that the complex and extensive statistical homework
you would need to do to use it might be overwhelming (the geometry of
compositions is a simplex, and this makes things hard). As a simple and
perhaps useful alternative, use pairs() or splom() to plot your 5-D data,
distinguishing the different treatments via color and/or symbol.

In addition, it might be useful to do the same sort of plot on the first two
principal components (?prcomp) of the first 4 dimensions of your 5 component
vectors (since the 5th is determined by the first 4). Because of the
simplicial geometry, this PCA approach is not right, but it may nevertheless
be revealing. The same plotting ideas are in the compositions package done
properly (in the correct geometry),so if you are motivated to do so, you can
do these things there. Even if you don't dig into the details, using the
compositions package version of the plots may be realtively easy to
do,interpretable, and revealing -- more so than my simple but wrong
suggestions. You can decide.

I would not trust inference using ad hoc approaches in the untransformed
data. That's what the package is for. But plotting the data should always be
at least the first thing you do anyway. I often find it to be sufficient,
too.


Bert Gunter
Genentech Nonclinical Statistics


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of jiho
Sent: Thursday, May 31, 2007 8:37 AM
To: R-help
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my  
problem clearly. English is not my native language and statistical  
english is even more difficult. I'll try to summarize my issue in  
more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5  
proportions (which add up to 1 for each observation). I want to  
compare the shape of those vectors between two treatments (i.e. how  
the quantities are distributed between the 5 values in treatment A  
with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are  
there other possibilities (I read many discussions about hotelling  
vs. manova but I could not see how any of those related to my  
particular case)?

Thank you very much in advance for your insights. See below for my  
earlier, more detailed, e-mail.

On 2007-May-21  , at 19:26 , jiho wrote:
 I am studying the vertical distribution of plankton and want to  
 study its variations relatively to several factors (time of day,  
 species, water column structure etc.). So my data is special in  
 that, at each sampling site (each observation), I don't have *one*  
 number, I have *several* numbers (abundance of organisms in each  
 depth bin, I sample 5 depth bins) which describe a vertical  
 distribution.

 Then let say I want to compare speciesA with speciesB, I would end  
 up trying to compare a group of several distributions with another  
 group of several distributions (where a distribution is a vector  
 of 5 numbers: an abundance for each depth bin). Does anyone know  
 how I could do this (with R obviously ;) )?

 Currently I kind of get around the problem and:
 - compute mean abundance per depth bin within each group and  
 compare the two mean distributions with a ks.test but this  
 obviously diminishes the power of the test (I only compare 5*2  
 observations)
 - restrict the information at each sampling site to the mean depth  
 weighted by the abundance of the species of interest. This way I  
 have one observation per station but I reduce the information to  
 the mean depths while the actual repartition is important also.

 I know this is probably not directly R related but I have already  
 searched around for solutions and solicited my local statistics  
 expert... to no avail. So I hope that the stats' experts on this  
 list will help me.

 Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/



-- 
Ce message a iti virifii par MailScanner
pour des virus ou des polluriels et rien de
suspect n'a iti trouvi.
CRI UPVD http://www.univ-perp.fr

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Comparing multiple distributions

2007-05-31 Thread jiho


On 2007-May-31  , at 18:56 , Bert Gunter wrote:

While Ravi's suggestion of the compositions package is certainly
appropriate, I suspect that the complex and extensive statistical  
homework

you would need to do to use it might be overwhelming (the geometry of
compositions is a simplex, and this makes things hard).


Yes I am reading the documentation now, which is well written but  
huge indeed...



As a simple and
perhaps useful alternative, use pairs() or splom() to plot your 5-D  
data,

distinguishing the different treatments via color and/or symbol.

In addition, it might be useful to do the same sort of plot on the  
first two
principal components (?prcomp) of the first 4 dimensions of your 5  
component

vectors (since the 5th is determined by the first 4). Because of the
simplicial geometry, this PCA approach is not right, but it may  
nevertheless
be revealing. The same plotting ideas are in the compositions  
package done
properly (in the correct geometry),so if you are motivated to do  
so, you can
do these things there. Even if you don't dig into the details,  
using the

compositions package version of the plots may be realtively easy to
do,interpretable, and revealing -- more so than my simple but wrong
suggestions. You can decide.

I would not trust inference using ad hoc approaches in the  
untransformed
data. That's what the package is for. But plotting the data should  
always be
at least the first thing you do anyway. I often find it to be  
sufficient,

too.


Thank you for your suggestions on plotting, I will look into it. I  
was using histograms of mean proportions + SE until now because it  
was what seemed the most straightforward given my specific questions.  
If we come back to my original data (abandoning the statistical  
language for a while ;) ) I have proportions of fishes caught 1. near  
the surface, 2. a bit below,  5. near the bottom. The questions I  
want to ask are for example: does the vertical distribution of  
species A and species B differ? So I can plot the mean proportion at  
each depth for both species and obtain a visual representation of the  
vertical distribution of each.
At this stage differences between fishes that accumulate near the  
surface or near the bottom are quite obvious. If I add error bars I  
can get an idea of the variability of those distributions. The issue  
arise when I want to *test* for a difference between the  
distributions of species A and B. If I use a basic KS test I can only  
compare the mean proportions for species A (5 points) to the mean  
proportions of species B (5 points) and this has low power + does not  
take in account the variability around those means. In addition I may  
also want to know wether there is a difference within species A, B  
and C and pairwise KS tests would increase alpha error risk. Am I  
explaining things correctly? Does this seem logical to you too?

As for the PCA I must admit I don't really understand what you mean.

Thank you very much again.


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of jiho
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my
problem clearly. English is not my native language and statistical
english is even more difficult. I'll try to summarize my issue in
more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5
proportions (which add up to 1 for each observation). I want to
compare the shape of those vectors between two treatments (i.e. how
the quantities are distributed between the 5 values in treatment A
with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are
there other possibilities (I read many discussions about hotelling
vs. manova but I could not see how any of those related to my
particular case)?

Thank you very much in advance for your insights. See below for my
earlier, more detailed, e-mail.

On 2007-May-21  , at 19:26 , jiho wrote:

I am studying the vertical distribution of plankton and want to
study its variations relatively to several factors (time of day,
species, water column structure etc.). So my data is special in
that, at each sampling site (each observation), I don't have *one*
number, I have *several* numbers (abundance of organisms in each
depth bin, I sample 5 depth bins) which describe a vertical
distribution.

Then let say I want to compare speciesA with speciesB, I would end
up trying to compare a group of several distributions with another
group of several distributions (where a distribution is a vector
of 5 numbers: an abundance for each depth bin). Does anyone know
how I could do this (with R obviously ;) )?

Currently I kind of get around the problem and:
- compute mean abundance per depth bin within each group and
compare the two mean distributions with a ks.test but this
obviously diminishes the power of the test (I only compare 5