Re: [R-sig-phylo] asymmetric transitions
Hi,

Thanks for the Allman Rhodes paper, it is very nice. For me at least it confirms my suspicions, but it also made me realise that claims of asymmetric transition rates are only suspicious if you are unprepared to make some (strong?) assumptions. If anyone disagrees with what I have written below, then please tell me and I will try again to understand this stuff:

Identifiability is achieved because the pdf for the root state is the stationary distribution (denoted by sigma in Allman Rhodes: see example 1). This is, I believe, the default in newer versions of Mesquite, although in older versions the distribution is 0.5/0.5 as in ace. If the pdf of the root state is defined by an additional parameter, this leaves a single parameter to describe the rate of transitions, and asymmetrical transition rates are non-identifiable.

It seems to me there is a choice to be made between a) assuming that the same processes that operate after the root also held before the root, and talking about asymmetric transition rates, or b) not making this assumption, and then admitting that the rates of transition from 0-1 and 1-0 are not separable. I don't think the data can be used to distinguish between these viewpoints, and so it's a matter of personal choice which interpretation/model is used.

Cheers,
Jarrod

Quoting Mark Holder mthol...@ku.edu on Thu, 16 Aug 2012 23:41:45 -0500:

Hi,

I agree that model testing between ARD vs MK models is going to be misleading when the process is really described by a threshold model (and sorry for ignoring that set of simulations by Jarrod previously; somehow I misfiled that email and didn't see it).

The threshold model has nice ways of dealing with correlations among characters. However, when it is applied as the underlying model for a single binary character (as in Jarrod's sims), the threshold model is similar to the single-site version of the covarion model (Tuffley and Steel's version). I don't think the models are identical, but they are quite similar.
I suspect that if you generated a data set under one of the models, it would be quite hard to determine which was the generating model. Instead of just having an on and an off state (as in the covarion model), the threshold model has a continuum (the further the underlying continuous trait is from the boundary, the more "off" the observable binary trait is).

Allman and Rhodes (2009, ref below) proved some results on the identifiability of generalizations of covarion processes. They considered models with more hidden rate categories (not just a rate of zero and a rate of evolution when in the on state). I believe that their results were that the number of hidden rate categories that you can identify cannot exceed the number of observable states. So it may be hard to get much richer than the Tuffley+Steel covarion when you have a binary character.

Which is a long way of saying that it might be worth looking at the covarion model variants for the types of data that Jarrod is interested in. Implementations of the covarion model for two states are quite fast and tractable. Testing Mk+covarion vs ARD+covarion may indeed be a more robust way of detecting asymmetry in rates of character transitions compared to Mk vs ARD.

Thanks for pointing out the Boettiger et al paper, Matt.

all the best,
Mark

[1] E. S. Allman and J. A. Rhodes, “The Identifiability of Covarion Models in Phylogenetics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 76–88, Jan. 2009.

On Aug 16, 2012, at 10:07 PM, Matt Pennell wrote:

correction: the last sentence should have read "I wonder how that would work in this case. I think these are important questions going forward."

On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell mwpenn...@gmail.com wrote:

Hey all,

This has been a really fantastic discussion. Mark, you make some really excellent points in response to my earlier comments. I think you are correct in this.
The question that arises out of Jarrod and Dan's simulations (which I have just run) is whether a model selection criterion would be able to distinguish MK from the threshold model that Felsenstein (and Wright before him) put forth. And how do we best assess model adequacy? Carl Boettiger and company (2012: Evolution) suggested a Phylogenetic Monte Carlo approach for continuous characters. I wonder how that would before I think these are important questions going forward.

thanks again,
matt

On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky drabo...@umich.edu wrote:

Hi all-

A couple of points. I am actually less concerned about the Type I error rates I gave in that previous message for the equal rates Markov process, even though I think they are real (e.g., I can corroborate them using Diversitree). I don't think it is an issue of ascertainment bias, but I think Mark may be right about the LRT being
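Mark's description of the two-state covarion model (a hidden on/off switch, with character change possible only while "on") can be written down as a 4x4 rate matrix. The sketch below is only an illustration of that idea, not code from the thread or from any package: the function name `covarion_q`, the state ordering, and the parameter names `s_on` and `s_off` are all assumptions made here.

```python
def covarion_q(q01, q10, s_on, s_off):
    """Rate matrix for a binary character with a hidden on/off switch.

    State order (an assumption of this sketch):
        (0, off), (1, off), (0, on), (1, on)
    q01, q10   : rates of 0->1 and 1->0 change while "on" (equal under Mk)
    s_on, s_off: rates of switching off->on and on->off
    """
    q = [
        # 0,off  1,off  0,on   1,on
        [0.0,    0.0,   s_on,  0.0],   # (0, off): can only switch on
        [0.0,    0.0,   0.0,   s_on],  # (1, off): can only switch on
        [s_off,  0.0,   0.0,   q01],   # (0, on): switch off, or change to 1
        [0.0,    s_off, q10,   0.0],   # (1, on): switch off, or change to 0
    ]
    for i, row in enumerate(q):
        row[i] = -sum(row)  # diagonal entry makes each row sum to zero
    return q


# Mk+covarion: symmetric change while "on" (3 free parameters in this sketch).
mk_cov = covarion_q(q01=1.0, q10=1.0, s_on=0.2, s_off=0.5)
# ARD+covarion: asymmetric change while "on" (one extra parameter).
ard_cov = covarion_q(q01=1.0, q10=3.0, s_on=0.2, s_off=0.5)
```

Under this parameterization the two models differ by a single rate, which is why a comparison of Mk+covarion against ARD+covarion isolates the asymmetry question from the rate-heterogeneity question.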
Re: [R-sig-phylo] asymmetric transitions
Hi folks-

I still think there is a difference between (i) parameter identifiability, which may or may not be a problem here, and (ii) strong support for the wrong model, which clearly appears to be occurring here (e.g., Type I error rates 0.75). I don't think non-identifiability of a parameter implies that you'll get massively inflated Type I error rates for models that include the parameter.

Also, the root state (and the assumptions regarding it) doesn't seem to drive the pattern: you can set the root to 0 under the threshold model (i.e., both states equiprobable at the root) and you recover the same strong bias. (Jarrod's threshold example set the root to -1; you can verify that the problem holds with equiprobable root states.)

At this point I guess I'd be more concerned about model misspecification and its diagnosis (when modeling discrete characters) than about identifiability per se, though perhaps I am not thinking about this correctly. Distinguishing MK/ARD from threshold may be difficult, but there is clearly signal of these processes in the data: when you fit MK/ARD to data simulated under a threshold model, you get strong support for ARD - but not when you fit those same models to data generated under MK. So there is clearly something in the data that is sufficiently informative about the processes such that we observe high error rates.

Mark, thanks for pointing out the relationship between the threshold model and the single-site covarion model.

Cheers,
~Dan
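For readers who want to see how a Type I error rate of this kind is tallied, here is a minimal sketch of the Mk vs ARD likelihood-ratio test bookkeeping. It assumes the usual chi-square approximation with one degree of freedom (ARD has one more rate than Mk); the function names and the p-values in the example are made up for illustration and are not results from Dan's or Jarrod's simulations.

```python
import math


def lrt_mk_vs_ard(lnl_mk, lnl_ard):
    """Likelihood-ratio statistic and p-value for Mk (null) vs ARD.

    Assumes one extra parameter in ARD, so the statistic is compared
    with a chi-square distribution with 1 degree of freedom.
    """
    stat = max(0.0, 2.0 * (lnl_ard - lnl_mk))
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def type_i_error_rate(p_values, alpha=0.05):
    """Fraction of null-simulated data sets in which Mk is rejected."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical log-likelihoods from fitting both models to one data set.
stat, p = lrt_mk_vs_ard(lnl_mk=-50.0, lnl_ard=-48.0)

# Hypothetical p-values from four data sets simulated without asymmetry.
rate = type_i_error_rate([0.01, 0.20, 0.60, 0.04])
```

If the data are simulated under a model with no asymmetry and the test is well calibrated, `rate` should sit near `alpha`; values far above it are the "strong support for the wrong model" that Dan distinguishes from non-identifiability.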
Re: [R-sig-phylo] asymmetric transitions
Hi all,

I brought up the non-identifiability of the rich forms of the covarion model only because that is the source for my intuition that it will be really hard to distinguish the 1-binary-character threshold from the covarion.

I agree with Dan that the non-identifiability is not causing high type I error rates here. I think that Mk, ARD, Mk+covarion, and ARD+covarion are all distinguishable from each other. I think that Mk, ARD, 1-binary-character threshold, and an Asymmetric-threshold (if such a model has ever been described) are all distinguishable from each other.

My understanding is that Jarrod and Dan have shown that if you:
1. simulate a binary character under the threshold model (which has no 0-1 or 1-0 bias), and then
2. test Mk vs ARD,
you often strongly prefer ARD (despite the fact that the true model has no asymmetry).

I think that this result is correct (not the result of computer glitches) and worrying for folks interested in detecting asymmetric transition rates. I think testing Mk+covarion vs ARD+covarion in step 2 might lower your type I error rate (but that is a conjecture).

In many ways the message here is similar to Wayne Maddison's warning that if you want to know about evolutionary asymmetries in character change, you have to consider whether the character states affect diversification rates (his observation that eventually led him to work on BiSSE). Here the message is: if you want to know about evolutionary asymmetries in character change, you also have to consider patterns of rate heterogeneity across the tree that could confound your model testing.

all the best,
Mark

PS: below is my long-winded, loosey-goosey explanation for my intuition that going to Mk+covarion vs ARD+covarion would help:

The simulated data will often have large clades that are fixed for a single state, and the observed state frequencies can be far from 50:50 because of these large expanses of the tree with the same state.
ARD vs Mk = The Mk model can only explain a strong deviation from 50:50 in the observed state frequencies by saying that:
A. the character evolves slowly so it has not had time to equilibrate, OR
B. there have been lots of changes, but by coincidence we always end up in the same state in that clade.
On trees with small total length, explanation A is very reasonable and you can get very weak support for ARD. If you have trees with lots of leaves and lots of changes overall, explanation A becomes untenable, and Mk has to rely on the coincidence explanation (explanation B), which results in a very low likelihood. ARD can explain a strong deviation from 50:50 in the observed state frequencies by asymmetry in transition rates. So its likelihood does not tank as quickly as you add large, monomorphic clades to the tree.

Mk+covarion vs ARD = You should be able to prefer Mk+covarion on these data sets over ARD (or at least avoid a strong preference for ARD; you'd have to use AIC or BIC for this test). The Mk+covarion has access to an explanation that the Mk does not:
C. the character was stuck in the off state, so you can explain a large paraphyletic assemblage with the same state by a single transition to that state and a subsequent cessation of the substitution process.
Mk+covarion predicts no significant deviations from 50:50 in the types of transitions (after you use the covarion process to help you get a better understanding of the relative opportunity for each type of transition: the proportion of time spent in the on hidden state for state 0 and 1 on the tree). ARD expects to see the same asymmetry in changes in the parts of the tree that happen to have high rates of change and in the parts of the tree that have low rates of change (since ARD assumes that there is a constant rate of change, and that all apparent differences in rate are sampling error).
The simulated data should show close to unbiased transitions in the fast-changing parts of the tree, so Mk+covarion should do a better job than ARD.

Mk+covarion vs ARD+covarion = Both models should do a good job of not getting distracted by large expanses of the tree that are fixed (they won't treat this as strong evidence in favor of that character state). If there does appear to be a strong bias in the parts of the tree with lots of changes, then ARD+covarion will win. Mk+covarion has fewer parameters, so it should win unless there is strong evidence for asymmetry in the fast parts of the tree.
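The contrast between explanations A and B in Mark's PS can be checked with a toy calculation. The sketch below is a deliberate simplification and not anything run in the thread: it uses the closed-form transition probabilities of a two-state chain and a star tree whose root is fixed at state 0, and asks how likely it is that every tip is still in state 0. The function names and the rate values are assumptions chosen for illustration.

```python
import math


def two_state_p(q01, q10, t):
    """Transition probabilities of a two-state Markov chain.

    Returns P where P[i][j] = Pr(state j at time t | state i at time 0).
    """
    total = q01 + q10
    decay = math.exp(-total * t)
    pi0, pi1 = q10 / total, q01 / total  # stationary frequencies
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],
        [pi0 - pi0 * decay, pi1 + pi0 * decay],
    ]


def loglik_all_zero(q01, q10, t, n_tips):
    """Log-likelihood that all tips of a star tree are in state 0.

    Toy setting: root fixed at state 0, every branch has length t,
    so the tips are independent given the root.
    """
    return n_tips * math.log(two_state_p(q01, q10, t)[0][0])


n = 20
mk_slow = loglik_all_zero(0.01, 0.01, 1.0, n)  # explanation A: little change
mk_fast = loglik_all_zero(5.0, 5.0, 1.0, n)    # explanation B: coincidence
ard_fast = loglik_all_zero(2.0, 8.0, 1.0, n)   # asymmetry favouring state 0
```

With these numbers a slow Mk explains the monomorphic clade almost for free, a fast Mk pays roughly `n * log(0.5)`, and a fast ARD pays much less, which is the sense in which the ARD likelihood "does not tank as quickly".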
Re: [R-sig-phylo] asymmetric transitions
Hi,

I see the problem: the threshold model is symmetric but NOT in the sense used in the ARD model. In the threshold model it is natural to think about evolution of the probability of being in one state versus the other. If the probability at the root was 0.2 and evolution was very slow so that the probability at the tips was ~0.2, then this would be equivalent to sampling tip states from a Bernoulli with Pr=0.2 (like my first example). From an ARD perspective, the high degree of variation within taxa suggests transitions happen frequently, and so the only way that frequencies close to 0.2/0.8 can occur is if there's a 4:1 asymmetry in the transitions 0-1 to 1-0.

Imagine a Bernoulli sequence with Pr 0.8, and you have just moved to state 1: Pr(0|1)=0.2. If you then move to state 0 (even though it is unlikely): Pr(1|0)=0.8. There are asymmetric evolutionary transitions, but the underlying probability of being in a particular state is constant. Which interpretation should a researcher take? Imagine my character is "does the species name begin with A-T or U-Z"!?

Cheers,
Jarrod
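Jarrod's Bernoulli argument comes down to a line of arithmetic: in a two-state Markov model the stationary frequency of state 1 is q01 / (q01 + q10), so a fixed frequency implies a fixed ratio of rates. The sketch below assumes Pr(state 1) = 0.8, as in his Bernoulli sentence; the function names are made up here for illustration.

```python
def stationary(q01, q10):
    """Stationary frequencies (Pr(state 0), Pr(state 1)) of a two-state chain."""
    total = q01 + q10
    return q10 / total, q01 / total


def implied_rate_ratio(p1):
    """Ratio q01 : q10 needed for state 1 to have stationary frequency p1."""
    return p1 / (1.0 - p1)


# Independent Bernoulli draws with Pr(1) = 0.8, read as "transitions":
p1 = 0.8
pr_1_given_0 = p1        # chance the next draw is 1 after a 0
pr_0_given_1 = 1.0 - p1  # chance the next draw is 0 after a 1

ratio = implied_rate_ratio(p1)  # 0->1 versus 1->0
pi0, pi1 = stationary(q01=4.0, q10=1.0)
```

Here `ratio` comes out at about 4 and `stationary(4.0, 1.0)` returns frequencies of 0.2 and 0.8, so the same data can be read either as "constant probability of state 1" or as "4:1 asymmetric transitions", which is the ambiguity Jarrod is pointing at.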
[R-sig-phylo] Literature about multigene phylogenetic
Hello.

I'm new to phylogenetics and I find it hard to understand how to build a phylogenetic tree with more than one gene (sequence).

For example, I was reading "A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians" (Pyron and Wiens 2011), and they used many genes. They used genes that are from the mitochondria, like 16S and 12S, and others that are not from mitochondria, like TYR and RAG1, among many others. So there is information from different sequences, from different genes. How is this treated?

I am still playing with the package ape to figure things out, but could someone point me to some basic literature, books or articles, to understand how to use more than one gene to produce distances and a phylogenetic tree? Examples using R to replicate analyses from articles are really instructive for me, if someone could guide me to one. I have not found many useful things via Google search; I guess I could be using the wrong search terms.

Thanks for the attention.

--
Gratefully,
Augusto C. A. Ribas
Personal site: http://augustoribas.heliohost.org
Lattes: http://lattes.cnpq.br/7355685961127056

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
[R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?
Hi all,

I am testing a correlation between two explanatory variables and a response variable using PGLS. All of the variables are continuous. My model is Log female body size ~ Log egg size * Log clutch size. However, there is a significant negative correlation between egg size and clutch size. Can PGLS cope with collinearity between explanatory variables? Is there any way that I can apply something like principal component analysis to PGLS models?

Thanks,
Xu
Re: [R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?
The issue of collinearity of independent variables is neither better nor worse with PGLS as opposed to OLS.

Statistical significance per se of a correlation between X variables is not really the issue. How strong is the correlation? Most sources suggest that it needs to be greater than 0.7-0.8 in magnitude to cause serious problems.

Cheers,
Ted

Theodore Garland, Jr.
Professor
Department of Biology
University of California, Riverside
Riverside, CA 92521
Office Phone: (951) 827-3524
Facsimile: (951) 827-4286 = Dept. office (not confidential)
Email: tgarl...@ucr.edu
http://www.biology.ucr.edu/people/faculty/Garland.html

Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments. 2009. Edited by Theodore Garland, Jr. and Michael R. Rose
http://www.ucpress.edu/book.php?isbn=9780520261808
(PDFs of chapters are available from me or from the individual authors)
Re: [R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?
Thanks Dr. Garland,

The correlation between egg size and clutch size is 0.3, and the variance inflation factors for both egg size and clutch size are smaller than 2. There shouldn't be a big problem of collinearity. Thanks for your clarification.

Best,
Xu
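The link between the correlation Xu reports and the variance inflation factor can be checked by hand. The sketch below is a simplification made for illustration: it assumes exactly two predictors and ignores the interaction term in Xu's model, in which case VIF = 1 / (1 - r^2). The function name is made up here.

```python
def vif_two_predictors(r):
    """Variance inflation factor for a model with exactly two predictors.

    r is the correlation between the two predictors; its sign does not
    matter because only r squared enters the formula.
    """
    return 1.0 / (1.0 - r * r)


vif_reported = vif_two_predictors(0.3)   # magnitude reported by Xu
vif_at_0_7 = vif_two_predictors(0.7)     # lower end of Ted's rule of thumb
vif_at_0_8 = vif_two_predictors(0.8)     # upper end of Ted's rule of thumb
```

Under these assumptions a correlation of magnitude 0.3 gives a VIF of about 1.1, well below the values reached at 0.7 to 0.8, which agrees with Xu's reading that collinearity is not a big problem here.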