Re: [R] Comparing multiple distributions
On 2007-May-31, at 18:56, Bert Gunter wrote:

> While Ravi's suggestion of the "compositions" package is certainly appropriate, I suspect that the complex and extensive statistical "homework" you would need to do to use it might be overwhelming (the geometry of compositions is a simplex, and this makes things hard).

Yes, I am reading the documentation now. It is well written but huge indeed...

> As a simple and perhaps useful alternative, use pairs() or splom() to plot your 5-D data, distinguishing the different treatments via color and/or symbol. In addition, it might be useful to do the same sort of plot on the first two principal components (?prcomp) of the first 4 dimensions of your 5-component vectors (since the 5th is determined by the first 4). Because of the simplicial geometry, this PCA approach is not right, but it may nevertheless be revealing.
>
> The same plotting ideas are in the compositions package done properly (in the correct geometry), so if you are motivated to do so, you can do these things there. Even if you don't dig into the details, using the compositions package version of the plots may be relatively easy to do, interpretable, and revealing -- more so than my "simple but wrong" suggestions. You can decide.
>
> I would not trust inference using ad hoc approaches in the untransformed data. That's what the package is for. But plotting the data should always be at least the first thing you do anyway. I often find it to be sufficient, too.

Thank you for your suggestions on plotting, I will look into them. I was using histograms of mean proportions + SE until now because that seemed the most straightforward approach given my specific questions.

If we come back to my original data (abandoning the statistical language for a while ;) ), I have proportions of fish caught 1. near the surface, 2. a bit below, ..., 5. near the bottom. The questions I want to ask are, for example: does the vertical distribution of species A differ from that of species B?
So I can plot the mean proportion at each depth for both species and obtain a visual representation of the vertical distribution of each. At this stage, differences between fish that accumulate near the surface and fish that accumulate near the bottom are quite obvious. If I add error bars I can get an idea of the variability of those distributions.

The issue arises when I want to *test* for a difference between the distributions of species A and B. If I use a basic KS test I can only compare the mean proportions for species A (5 points) to the mean proportions of species B (5 points); this has low power and does not take into account the variability around those means. In addition, I may also want to know whether there is a difference among species A, B and C, and pairwise KS tests would increase the risk of alpha (type I) error.

Am I explaining things correctly? Does this seem logical to you too?

As for the PCA, I must admit I don't really understand what you mean.

Thank you very much again.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of jiho
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my problem clearly. English is not my native language and statistical English is even more difficult. I'll try to summarize my issue in more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5 proportions (which add up to 1 for each observation). I want to compare the "shape" of those vectors between two treatments (i.e. how the quantities are distributed between the 5 values in treatment A with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are there other possibilities (I read many discussions about Hotelling vs. MANOVA but I could not see how any of those related to my particular case)?

Thank you very much in advance for your insights. See below for my earlier, more detailed, e-mail.
On 2007-May-21, at 19:26, jiho wrote:

I am studying the vertical distribution of plankton and want to study its variations relative to several factors (time of day, species, water column structure etc.). So my data is special in that, at each sampling site (each observation), I don't have *one* number, I have *several* numbers (abundance of organisms in each depth bin, I sample 5 depth bins) which describe a vertical distribution.

Then let's say I want to compare speciesA with speciesB, I would end up trying to compare a group of several distributions with another group of several distributions (where a "distribution" is a vector of 5 numbers: an abundance for each depth bin). Does anyone know how I could do this (with R obviously ;) )?

Currently I kind of get around the problem and:
- compute mean abundance per depth bin within each group and compare the two mean distribu
Re: [R] Comparing multiple distributions
While Ravi's suggestion of the "compositions" package is certainly appropriate, I suspect that the complex and extensive statistical "homework" you would need to do to use it might be overwhelming (the geometry of compositions is a simplex, and this makes things hard).

As a simple and perhaps useful alternative, use pairs() or splom() to plot your 5-D data, distinguishing the different treatments via color and/or symbol. In addition, it might be useful to do the same sort of plot on the first two principal components (?prcomp) of the first 4 dimensions of your 5-component vectors (since the 5th is determined by the first 4). Because of the simplicial geometry, this PCA approach is not right, but it may nevertheless be revealing.

The same plotting ideas are in the compositions package done properly (in the correct geometry), so if you are motivated to do so, you can do these things there. Even if you don't dig into the details, using the compositions package version of the plots may be relatively easy to do, interpretable, and revealing -- more so than my "simple but wrong" suggestions. You can decide.

I would not trust inference using ad hoc approaches in the untransformed data. That's what the package is for. But plotting the data should always be at least the first thing you do anyway. I often find it to be sufficient, too.

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of jiho
Sent: Thursday, May 31, 2007 8:37 AM
To: R-help
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my problem clearly. English is not my native language and statistical English is even more difficult. I'll try to summarize my issue in more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5 proportions (which add up to 1 for each observation).
I want to compare the "shape" of those vectors between two treatments (i.e. how the quantities are distributed between the 5 values in treatment A with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are there other possibilities (I read many discussions about Hotelling vs. MANOVA but I could not see how any of those related to my particular case)?

Thank you very much in advance for your insights. See below for my earlier, more detailed, e-mail.

On 2007-May-21, at 19:26, jiho wrote:

> I am studying the vertical distribution of plankton and want to
> study its variations relative to several factors (time of day,
> species, water column structure etc.). So my data is special in
> that, at each sampling site (each observation), I don't have *one*
> number, I have *several* numbers (abundance of organisms in each
> depth bin, I sample 5 depth bins) which describe a vertical
> distribution.
>
> Then let's say I want to compare speciesA with speciesB, I would end
> up trying to compare a group of several distributions with another
> group of several distributions (where a "distribution" is a vector
> of 5 numbers: an abundance for each depth bin). Does anyone know
> how I could do this (with R obviously ;) )?
>
> Currently I kind of get around the problem and:
> - compute mean abundance per depth bin within each group and
> compare the two mean distributions with a ks.test but this
> obviously diminishes the power of the test (I only compare 5*2
> "observations")
> - restrict the information at each sampling site to the mean depth
> weighted by the abundance of the species of interest. This way I
> have one observation per station but I reduce the information to
> the mean depths while the actual distribution across depths is
> also important.
>
> I know this is probably not directly R related but I have already
> searched around for solutions and solicited my local statistics
> expert... to no avail. So I hope that the statistics experts on this
> list will help me.
>
> Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/

--
This message has been checked by MailScanner for viruses and spam, and nothing suspicious was found.
CRI UPVD http://www.univ-perp.fr

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
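The plotting suggestion in the reply above (pairs() on the 5-D data, then the first two principal components of the first 4 dimensions) could be sketched in base R roughly as follows. The data are simulated stand-ins, and the names sim_props, props and treatment are made up for illustration; they are not from the poster's data set. As the reply itself says, PCA on raw proportions ignores the simplex geometry, so this is for exploration only.

```r
# Simulated stand-in data (an assumption, not the poster's data):
# 5 depth-bin proportions per observation, two treatments "A" and "B".
set.seed(1)
sim_props <- function(n, weights) {
  # Draw positive abundances column by column, then close each row to sum to 1.
  x <- matrix(rgamma(n * 5, shape = rep(weights, each = n)), nrow = n)
  x / rowSums(x)
}
props <- rbind(sim_props(20, c(5, 4, 3, 2, 1)),   # surface-heavy profiles
               sim_props(20, c(1, 2, 3, 4, 5)))   # bottom-heavy profiles
colnames(props) <- paste0("bin", 1:5)
treatment <- factor(rep(c("A", "B"), each = 20))

# One colour and one symbol per treatment (indexing by factor uses its codes).
cols <- c("blue", "red")[treatment]
syms <- c(1, 17)[treatment]

# Scatterplot matrix of the 5-D data.
pairs(props, col = cols, pch = syms)

# PCA on the first 4 bins only, since the 5th is 1 minus the sum of the others.
pca <- prcomp(props[, 1:4])
plot(pca$x[, 1:2], col = cols, pch = syms, xlab = "PC1", ylab = "PC2")
```

lattice::splom() would take the place of pairs() for a lattice version of the first plot.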
Re: [R] Comparing multiple distributions
Your data is "compositional data". The R package "compositions" might be useful. You might also want to consult the book by J. Aitchison, "The Statistical Analysis of Compositional Data".

Ravi.

---
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: [EMAIL PROTECTED]
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of jiho
Sent: Thursday, May 31, 2007 11:37 AM
To: R-help
Subject: Re: [R] Comparing multiple distributions

Nobody answered my first request. I am sorry if I did not explain my problem clearly. English is not my native language and statistical English is even more difficult. I'll try to summarize my issue in more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5 proportions (which add up to 1 for each observation). I want to compare the "shape" of those vectors between two treatments (i.e. how the quantities are distributed between the 5 values in treatment A with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are there other possibilities (I read many discussions about Hotelling vs. MANOVA but I could not see how any of those related to my particular case)?

Thank you very much in advance for your insights. See below for my earlier, more detailed, e-mail.

On 2007-May-21, at 19:26, jiho wrote:

> I am studying the vertical distribution of plankton and want to
> study its variations relative to several factors (time of day,
> species, water column structure etc.). So my data is special in
> that, at each sampling site (each observation), I don't have *one*
> number, I have *several* numbers (abundance of organisms in each
> depth bin, I sample 5 depth bins) which describe a vertical
> distribution.
>
> Then let's say I want to compare speciesA with speciesB, I would end
> up trying to compare a group of several distributions with another
> group of several distributions (where a "distribution" is a vector
> of 5 numbers: an abundance for each depth bin). Does anyone know
> how I could do this (with R obviously ;) )?
>
> Currently I kind of get around the problem and:
> - compute mean abundance per depth bin within each group and
> compare the two mean distributions with a ks.test but this
> obviously diminishes the power of the test (I only compare 5*2
> "observations")
> - restrict the information at each sampling site to the mean depth
> weighted by the abundance of the species of interest. This way I
> have one observation per station but I reduce the information to
> the mean depths while the actual distribution across depths is
> also important.
>
> I know this is probably not directly R related but I have already
> searched around for solutions and solicited my local statistics
> expert... to no avail. So I hope that the statistics experts on this
> list will help me.
>
> Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/
Re: [R] Comparing multiple distributions
Nobody answered my first request. I am sorry if I did not explain my problem clearly. English is not my native language and statistical English is even more difficult. I'll try to summarize my issue in more appropriate statistical terms:

Each of my observations is not a single number but a vector of 5 proportions (which add up to 1 for each observation). I want to compare the "shape" of those vectors between two treatments (i.e. how the quantities are distributed between the 5 values in treatment A with respect to treatment B).

I was pointed to Hotelling T-squared. Does it seem appropriate? Are there other possibilities (I read many discussions about Hotelling vs. MANOVA but I could not see how any of those related to my particular case)?

Thank you very much in advance for your insights. See below for my earlier, more detailed, e-mail.

On 2007-May-21, at 19:26, jiho wrote:

I am studying the vertical distribution of plankton and want to study its variations relative to several factors (time of day, species, water column structure etc.). So my data is special in that, at each sampling site (each observation), I don't have *one* number, I have *several* numbers (abundance of organisms in each depth bin, I sample 5 depth bins) which describe a vertical distribution.

Then let's say I want to compare speciesA with speciesB, I would end up trying to compare a group of several distributions with another group of several distributions (where a "distribution" is a vector of 5 numbers: an abundance for each depth bin). Does anyone know how I could do this (with R obviously ;) )?

Currently I kind of get around the problem and:
- compute mean abundance per depth bin within each group and compare the two mean distributions with a ks.test but this obviously diminishes the power of the test (I only compare 5*2 "observations")
- restrict the information at each sampling site to the mean depth weighted by the abundance of the species of interest. This way I have one observation per station but I reduce the information to the mean depths while the actual distribution across depths is also important.

I know this is probably not directly R related but I have already searched around for solutions and solicited my local statistics expert... to no avail. So I hope that the statistics experts on this list will help me.

Thank you very much in advance.

JiHO
---
http://jo.irisson.free.fr/
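On the Hotelling T-squared question for vectors of 5 proportions: one route in Aitchison's log-ratio framework for compositional data is to transform the proportions first and test on the transformed values. The sketch below is only an illustration under stated assumptions: the data are simulated, the additive log-ratio is computed by hand in base R rather than with the compositions package, and no proportion is exactly zero (the log-ratio is undefined otherwise). With two groups, the MANOVA test is equivalent to a two-sample Hotelling T-squared test on the transformed data.

```r
# Simulated stand-in data (an assumption): 20 observations per treatment,
# each a vector of 5 strictly positive proportions summing to 1.
set.seed(2)
sim_props <- function(n, weights) {
  x <- matrix(rgamma(n * 5, shape = rep(weights, each = n)), nrow = n)
  x / rowSums(x)
}
props <- rbind(sim_props(20, c(5, 4, 3, 2, 1)),
               sim_props(20, c(1, 2, 3, 4, 5)))
treatment <- factor(rep(c("A", "B"), each = 20))

# Additive log-ratio: log of bins 1-4 relative to bin 5.
# Dividing a matrix by a vector recycles down columns, so each row is
# divided by its own bin-5 proportion. This maps the 5-part composition
# to 4 unconstrained real coordinates.
alr <- log(props[, 1:4] / props[, 5])

# One-way MANOVA on the transformed coordinates.
fit <- manova(alr ~ treatment)
res <- summary(fit, test = "Hotelling-Lawley")
print(res)

# Approximate p-value for the treatment effect.
p_value <- res$stats["treatment", "Pr(>F)"]
```

With more than two groups (species A, B and C, say), the same manova() call applies with a three-level factor, which avoids running a separate test for every pair.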