Re: [R-sig-eco] Package 'compositions': Interpreting dist() output
Rich,

Distances can be interpreted as degrees of dissimilarity: the smaller the distance, the more similar two observations are. In your dataset, this means that the observation in 2004 is more similar to 2011 than to 2013. You can visualize distances using distance-based clustering (http://www.statmethods.net/advstats/cluster.html) or multidimensional scaling (http://www.statmethods.net/advstats/mds.html).

--
Essi

From: Rich Shepard
Sent: Monday, 13 October 2014 17:34
To: r-sig-ecology@r-project.org

On Mon, 13 Oct 2014, Rich Shepard wrote:

          2004      2005      2006      2011      2012
2005 0.5917687
2006 0.7084941 1.1382195
2011 0.5796871 0.3503394 0.9175847
2012 1.3615670 0.8098764 1.7682454 0.9206943
2013 1.4955697 1.2024123 1.6751463 1.0146711 1.2160550

Rich

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
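For anyone reading along, the printed lower triangle can be fed straight back into base R's clustering and ordination tools. A minimal sketch, using only the stats package; the values are copied from the matrix above, and the linkage method is illustrative:

```r
# Rebuild the printed lower-triangular Aitchison distances as a 'dist'
# object, then cluster and ordinate them (base R / stats only).
yrs <- c("2004", "2005", "2006", "2011", "2012", "2013")
m <- matrix(0, 6, 6, dimnames = list(yrs, yrs))
m[lower.tri(m)] <- c(0.5917687, 0.7084941, 0.5796871, 1.3615670, 1.4955697,
                     1.1382195, 0.3503394, 0.8098764, 1.2024123,
                     0.9175847, 1.7682454, 1.6751463,
                     0.9206943, 1.0146711,
                     1.2160550)           # filled column by column
d <- as.dist(m)                           # keeps only the lower triangle
hc <- hclust(d, method = "average")       # hierarchical clustering
plot(hc)                                  # dendrogram of the six years
mds <- cmdscale(d, k = 2)                 # classical (metric) MDS
plot(mds, type = "n"); text(mds, labels = yrs)
```

The dendrogram shows 2004, 2005 and 2011 joining first, matching the small distances among them in the table.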
Re: [R-sig-eco] CoDA: Clustering Multiple Data Sets
Hi Rich,

It is not clear whether you need a supervised or an unsupervised model. Clustering is unsupervised: it will classify compositions into hierarchical groups regardless of the label (countries, regions). If this is what you intend, you might compute the clustering (hclust) on a Euclidean distance matrix (vegdist) of the clr- or ilr-transformed data (both transformations return the same distances).

If you mean a supervised approach, you might want to explain how groups differ, and/or predict to which group a composition belongs. To explain, discriminant analysis (packages MASS or ade4) is (arguably) often a good choice. To predict a category, you might look at machine-learning techniques (see the caret package, among many others).

Regards,
Essi

From: Rich Shepard
Sent: Thursday, 9 October 2014 15:13
To: r-sig-ecology@r-project.org

The documentation for the packages compositions and robCompositions describes distance measures and (in the former package) clustering. However, all the examples, and the function syntax, apply to a single data set. This works well with geochemical and official statistical data when the goal is to examine relationships among the components within the data set.

I find no examples of clustering multiple compositional data sets; for example, if the expenditures (or expendituresEU) data sets in robCompositions included data from multiple countries and the analytical goal were to cluster the countries based on each one's compositional data set.

The AnimalVegetation data set in the compositions package compares areal compositions by abundance of vegetation and animals for 50 plots in each of regions A and B, and appears to be similar to my data: macroinvertebrate compositions by functional feeding group over multiple (and variable numbers of) years in each of 6 stream networks; each stream network is a separate data set. I want to cluster the streams based on these data sets.
Unfortunately, I do not see an example in the compositions package that uses the AnimalVegetation data for clustering. The hclust() function in the stats and compositions packages (perhaps the latter calls the function in the former) appears to be limited to a single data set. What package and function will allow me to calculate a distance matrix for these 6 compositional data sets, then use those distances for hierarchical clustering?

Rich
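One way to do what Essi suggests is to stack the six data sets with a stream label, transform, average each stream in log-ratio space, and cluster the stream means. A minimal base-R sketch with simulated counts standing in for the real data; the stream labels, group sizes and linkage method are illustrative assumptions, and the hand-rolled clr is equivalent to clr() in the compositions package:

```r
# Cluster several compositional data sets: stack them with a group label,
# clr-transform, average each group in clr space (the compositional mean),
# then run hierarchical clustering on Euclidean (= Aitchison) distances.
set.seed(1)
streams <- rep(paste0("stream", 1:6), times = sample(3:8, 6, replace = TRUE))
n <- length(streams)                             # one row per stream-year
counts <- matrix(rgamma(n * 5, shape = 2), nrow = n,
                 dimnames = list(NULL, c("Filterer", "Gatherer", "Grazer",
                                         "Predator", "Shredder")))
comp <- counts / rowSums(counts)                 # close to proportions
clr  <- log(comp) - rowMeans(log(comp))          # centred log-ratios
centers <- apply(clr, 2, tapply, streams, mean)  # per-stream clr mean
hc <- hclust(dist(centers), method = "ward.D2")  # cluster the 6 streams
plot(hc)                                         # dendrogram of streams
```

Averaging in clr space and clustering the centroids is one defensible choice; another is to keep all stream-years and cluster them individually, letting the streams emerge (or not) as groups.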
Re: [R-sig-eco] Package 'compositions'; Function dnorm.acomp()
Hi Rich,

Filzmoser et al. (2009) wrote that "some measures like the standard deviation (or the variance) make no statistical sense with closed data [...]". They also wrote that "if Euclidean geometry is not valid, the arithmetic mean is quite likely to be a poor estimate of the data center." See Filzmoser et al. (2009), http://www.statistik.tuwien.ac.at/forschung/SM/SM-2009-2.pdf

As Euclidean geometry is not valid for compositions, you have to compute the mean in the ilr or clr space (both are Euclidean; alr is not). The mean.acomp function computes the mean in Euclidean space, then back-transforms the result into the compositional space.

library(compositions)

# Data
comp = matrix(c(0.0667, 0.0612061206120612, 0.0435, 0.044, 0.05, 0.0161,
                0.6, 0.571457145714572, 0.6232, 0.5934, 0.4333, 0.629,
                0.0667, 0.0612061206120612, 0.1014, 0.0659, 0.0667, 0.0323,
                0.2444, 0.265326532653265, 0.2174, 0.2637, 0.3667, 0.2903,
                0.0222, 0.0408040804080408, 0.0145, 0.033, 0.0833, 0.0323),
              ncol=5)

# Mean
colMeans(comp)      ## arithmetic mean: biased for compositions
mean(acomp(comp))   ## unbiased mean; dispatches to mean.acomp

sbp = matrix(c( 1, 1, 1,-1,-1,   ## a dummy sequential binary partition
                1,-1,-1, 0, 0,
                0, 1,-1, 0, 0,
                0, 0, 0, 1,-1), ncol=5, byrow=TRUE)
psi = gsi.buildilrBase(t(sbp))   ## the orthonormal basis matrix
balances = ilr(comp, V=psi)      ## compute the orthonormal balances
bal_mean = colMeans(balances)    ## means of the balances
ilrInv(bal_mean, V=psi)          ## back-transform the mean to the simplex

You will see that the back-transformed mean is equal to mean.acomp(comp).

The total variance estimator is computed using eq. 10 in Filzmoser et al. (2009). This is what mvar does:

# Variance
sum1 = 0
for (i in 1:(ncol(comp)-1)) {
  sum2 = 0
  for (j in (i+1):ncol(comp)) {
    sum2 = sum2 + var(log(comp[,i]/comp[,j]))
  }
  sum1 = sum1 + sum2
}
tot_var = sum1/ncol(comp)
tot_var
mvar(acomp(comp))

The variance-covariance matrix of compositions should be computed in a log-ratio space.
So var, sd, confidence intervals and p-values should be computed on your transformed data. Although confidence intervals on compositions are widely seen in the literature, they can be misleading. I prefer to compute the variance in the ilr space and put the confidence intervals in a CoDaDendrogram, then report only the means of the compositions in a table below the dendrogram, as in Figure 5 of Parent et al. (2012). I'll send you the plot function if you want it:
http://www.frontiersin.org/files/Articles/63683/fpls-04-00449-HTML/image_m/fpls-04-00449-g005.JPG

Regards,
Serge-Étienne

From: Rich Shepard
Sent: Tuesday, 30 September 2014 12:45
To: r-sig-ecology@r-project.org

For a data set of count proportions, testing for fit to a multivariate normal distribution is done with the function dnorm.acomp() in package 'compositions'. The function's calling parameters are the data set, mean, and variance. Example data set:

dput(win.acomp)
structure(c(0.0667, 0.0612061206120612, 0.0435, 0.044, 0.05,
0.0161, 0.6, 0.571457145714572, 0.6232, 0.5934, 0.4333, 0.629,
0.0667, 0.0612061206120612, 0.1014, 0.0659, 0.0667, 0.0323,
0.2444, 0.265326532653265, 0.2174, 0.2637, 0.3667, 0.2903,
0.0222, 0.0408040804080408, 0.0145, 0.033, 0.0833, 0.0323),
.Dim = c(6L, 5L), .Dimnames = list(NULL, c("Filterer", "Gatherer",
"Grazer", "Predator", "Shredder")), class = "acomp")

The mean() function returns the mean value for each column:

mean(win.acomp)
  Filterer   Gatherer     Grazer   Predator   Shredder
0.04386630 0.58270151 0.06366245 0.27664502 0.03312472

and the metric variance function, mvar(), returns a single value:

mvar(win.acomp)
[1] 0.6309852

The dnorm.acomp() syntax, according to ?dnorm.acomp, has a single value for the mean: dnorm.acomp(x, mean, var), which raises the question of which mean value do I use for a data set?
TIA,
Rich
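As a cross-check of the eq. 10 estimator quoted above, the pairwise log-ratio form of the total variance can be verified against the variances of the clr coordinates in base R. The data are the (rounded) values from Rich's dput output; the identity holds for any positive data matrix:

```r
# Numerical check: the total variance from eq. 10 of Filzmoser et al.
# (2009), built from all pairwise log-ratio variances, equals the sum of
# the variances of the clr coordinates (trace of the clr covariance).
x <- matrix(c(0.0667, 0.0612, 0.0435, 0.0440, 0.0500, 0.0161,
              0.6000, 0.5715, 0.6232, 0.5934, 0.4333, 0.6290,
              0.0667, 0.0612, 0.1014, 0.0659, 0.0667, 0.0323,
              0.2444, 0.2653, 0.2174, 0.2637, 0.3667, 0.2903,
              0.0222, 0.0408, 0.0145, 0.0330, 0.0833, 0.0323), ncol = 5)
D <- ncol(x)
tv1 <- 0                                   # eq. 10: pairwise log-ratios
for (i in 1:(D - 1))
  for (j in (i + 1):D)
    tv1 <- tv1 + var(log(x[, i] / x[, j]))
tv1 <- tv1 / D
clr <- log(x) - rowMeans(log(x))           # centred log-ratio transform
tv2 <- sum(apply(clr, 2, var))             # trace of the clr covariance
all.equal(tv1, tv2)                        # TRUE
```

This is why the estimate has to be computed in log-ratio space: the two formulations agree there, while column variances of the raw closed data do not.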
Re: [R-sig-eco] Measurement distance for proportion data
I would also suggest giving the Aitchison distance a try. To do so, you can use the 'compositions' package. You transform the proportions to centred log-ratios or isometric log-ratios (the clr and ilr functions, respectively), then compute the Euclidean distance on the transformed data; both transformations return the same distances.

library(compositions)
library(vegan)
data(AnimalVegetation)
region = factor(ifelse(AnimalVegetation[,5]==1, "A", "B")) # region label
comp = acomp(AnimalVegetation[,1:4])          # proportions closed between 0 and 1
# comp[region=="A",] = acomp(comp[region=="A",]) + c(1,1,2,1) # perturbation on region A for testing purposes
bal = ilr(comp)                               # isometric log-ratios
dis = vegdist(bal, method="euclidean")        # Aitchison dissimilarity matrix
mod = betadisper(dis, region)
mod
plot(mod)
adonis(dis ~ region)

Cheers,
Essi Parent

From: Jari Oksanen
Sent: Tuesday, 13 May 2014 11:21
To: Zbigniew Ziembik
Cc: r-sig-ecology@r-project.org

Typical dissimilarity indices are of the form difference/adjustment, where the adjustment takes care of forcing the index into the range 0..1 and handles varying total abundances/richnesses. If you have proportional data, you may not need the adjustment at all, but can just use any index. That is, it does not matter so awfully much which index you use, and for many practical purposes it does not matter whether the data are proportional. Actually, several indices may be equal to each other with proportional data. For instance, the Manhattan, Bray-Curtis and Kulczynski indices are all identical. All you need to decide is which name you use for your index -- the numbers do not change.

The analysis of proportional data usually covers very different classes of models than ANOSIM and friends; dissimilarities are not usually involved in these models. One aspect of proportional data is that only M-1 of M variables really are independent. However, this really needs to be taken into account only if M is low.
I have no idea how that is in your case.

Cheers,
Jari Oksanen

On 13/05/2014, at 15:32, Zbigniew Ziembik wrote:

I am not sure, but it seems that your problem is related to compositional data analysis. You can probably use the Aitchison distance to estimate the separation between proportions. Take a (free) look at:

http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf
http://dugi-doc.udg.edu/bitstream/10256/297/1/CoDa-book.pdf

or (commercial):

Aitchison, J. 2003. The Statistical Analysis of Compositional Data. The Blackburn Press.

Best regards,
ZZ

On 2014-05-12 at 16:37, Javier Lenzi wrote:

Dear all,

I'm doing data exploration on seabird trophic-ecology data and I am using ANOSIM to evaluate possible differences in diet between the breeding and non-breeding seasons. As a starting point I am using some classical indices such as %FO (relative frequency of occurrence), N (number of prey counted in the pooled sample of pellets), %N (N as a percentage of the total number of prey of all food types in the pooled sample), V (total volume of all prey in the pooled sample), and IRI (index of relative importance).

I am unsure which similarity measure I should use in ANOSIM for those indices that are proportions. I am aware that, for instance, Bray-Curtis is used for count data (e.g. N) and Jaccard is used for presence-absence data (which I don't have); however, I did not find a proper distance measure for proportion data. Please, could you help me find a proper distance measure for these proportion data?

Thank you very much in advance.
Regards,
Javier Lenzi
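Essi's remark that clr- and ilr-based Euclidean distances coincide can be checked without the compositions package. The sketch below builds an ilr basis from normalized Helmert contrasts (one orthonormal basis among many; any choice gives the same distances) and applies it to simulated proportions:

```r
# Check in base R that Euclidean distances on clr- and ilr-transformed
# compositions coincide, so either transform yields the Aitchison distance.
set.seed(42)
x <- matrix(rgamma(8 * 4, shape = 2), nrow = 8)  # 8 compositions, 4 parts
x <- x / rowSums(x)                              # close to proportions
clr <- log(x) - rowMeans(log(x))                 # centred log-ratios
H <- contr.helmert(ncol(x))                      # D x (D-1) contrast matrix
V <- apply(H, 2, function(v) v / sqrt(sum(v^2))) # orthonormal columns
ilr <- clr %*% V                                 # isometric log-ratios
all.equal(c(dist(clr)), c(dist(ilr)))            # TRUE
```

The equality holds because clr rows live in the hyperplane orthogonal to the all-ones vector, and V is an orthonormal basis of exactly that hyperplane, so the mapping preserves all pairwise Euclidean distances.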