Re: [R-sig-phylo] asymmetric transitions

2012-08-17 Thread Jarrod Hadfield

Hi,

Thanks for the Allman & Rhodes paper; it is very nice. For me at least
it confirms my suspicions, but it also made me realise that claims of
asymmetric transition rates are only suspicious if you are unprepared
to make some (strong?) assumptions. If anyone disagrees with what I
have written below, please tell me and I will try again to
understand this stuff:


Identifiability is achieved because the pdf for the root state is the
stationary distribution (denoted by sigma in Allman & Rhodes: see
example 1).  This is, I believe, the default in newer versions of
Mesquite, although in older versions the distribution is 0.5/0.5, as in
ace.


If the pdf of the root state is defined by an additional parameter,
this leaves a single parameter to describe the rate of transitions,
and asymmetrical transition rates are non-identifiable.  It seems to
me there is a choice to be made between a) assuming that the same
process that operated after the root also held before the root, and
talking about asymmetric transition rates, or b) not making this
assumption, and then admitting that the rates of transition 0->1 and
1->0 are not separable. I don't think the data can be used to
distinguish between these viewpoints, and so it's a matter of personal
choice which interpretation/model is used.
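
As a rough illustration of the root-state argument above, here is a
minimal Python sketch (not from the thread) of a two-state Markov
process. The function names and the example rates are assumptions made
for illustration. It shows that when the root is drawn from the
stationary distribution, the expected tip frequencies are fixed by the
ratio of the two rates, whereas a 0.5/0.5 root gives different tip
frequencies on a finite branch.

```python
import math


def stationary(q01, q10):
    """Stationary distribution (pi0, pi1) for rates q01 (0->1) and q10 (1->0)."""
    total = q01 + q10
    return (q10 / total, q01 / total)


def transition_probs(q01, q10, t):
    """P[i][j] = Pr(state j after time t | state i at time 0)."""
    pi0, pi1 = stationary(q01, q10)
    decay = math.exp(-(q01 + q10) * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


def tip_distribution(root, q01, q10, t):
    """Distribution at a tip, given a root distribution and one branch of length t."""
    P = transition_probs(q01, q10, t)
    return (
        root[0] * P[0][0] + root[1] * P[1][0],
        root[0] * P[0][1] + root[1] * P[1][1],
    )


# Illustrative rates only: 1->0 is four times faster than 0->1.
pi = stationary(1.0, 4.0)
print("stationary root:", tip_distribution(pi, 1.0, 4.0, 0.1))
print("0.5/0.5 root:   ", tip_distribution((0.5, 0.5), 1.0, 4.0, 0.1))
```

This covers only a single branch; a full tree likelihood would need the
usual pruning algorithm, which is omitted here.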


Cheers,

Jarrod








Quoting Mark Holder mthol...@ku.edu on Thu, 16 Aug 2012 23:41:45 -0500:


Hi,

I agree that model testing between ARD vs MK models is going to be  
misleading when the process is really described by a threshold model  
(and sorry for ignoring that set of simulations by Jarrod  
previously; somehow I misfiled that email and didn't see it).


The threshold model has nice ways of dealing with correlations among  
characters.  However, when it is applied as the underlying model for  
a single binary character (as in Jarrod's sims), the threshold model  
is similar to the single-site version of the covarion model (Tuffley  
and Steel's version).


I don't think the models are identical, but they are quite similar.
I suspect that if you generated a data set under one of the models,
it would be quite hard to determine which was the generating model.
Instead of just having an "on" and an "off" state (as in the covarion
model), the threshold model has a continuum (the further the
underlying continuous trait is from the boundary, the more "off" the
observable binary trait is).  Allman and Rhodes (2009, ref below)
proved some results on the identifiability of generalizations of
covarion processes. They considered models with more hidden rate
categories (not just a rate of zero and a rate of evolution when in
the "on" state). I believe that their results were that the number
of hidden rate categories that you can identify cannot exceed the
number of observable states. So it may be hard to get much richer
than the Tuffley+Steel covarion when you have a binary character.


Which is a long way of saying that it might be worth looking at the
covarion model variants for the types of data that Jarrod is
interested in.  Implementations of the covarion model for two states
are quite fast and tractable. Testing Mk+covarion vs ARD+covarion may
indeed be a more robust way of detecting asymmetry in rates of
character transitions compared to Mk vs ARD.
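
To make the two-state covarion idea concrete, here is a small Python
sketch (not from the thread) of a four-state rate matrix with hidden
"on"/"off" states. The parameter names s_on and s_off and the state
ordering are assumptions made for illustration; Tuffley and Steel's own
parameterisation may differ in detail. The observable state can change
only while the hidden state is "on".

```python
def covarion_rate_matrix(q01, q10, s_on, s_off):
    """Build a 4x4 rate matrix for a binary character with a hidden on/off switch.

    q01, q10 : rates of 0->1 and 1->0 while "on" (set q01 == q10 for Mk+covarion)
    s_on     : rate of switching off -> on (hypothetical name)
    s_off    : rate of switching on -> off (hypothetical name)
    """
    states = [(0, "off"), (0, "on"), (1, "off"), (1, "on")]
    n = len(states)
    Q = [[0.0] * n for _ in range(n)]
    for i, (char_i, hidden_i) in enumerate(states):
        for j, (char_j, hidden_j) in enumerate(states):
            if i == j:
                continue
            if char_i == char_j and hidden_i != hidden_j:
                # Hidden switch flips; observable state is unchanged.
                Q[i][j] = s_on if hidden_i == "off" else s_off
            elif char_i != char_j and hidden_i == "on" and hidden_j == "on":
                # Observable change, allowed only in the "on" state.
                Q[i][j] = q01 if char_i == 0 else q10
        # Diagonal makes each row sum to zero.
        Q[i][i] = -sum(Q[i])
    return states, Q


states, Q = covarion_rate_matrix(q01=1.0, q10=1.0, s_on=0.2, s_off=0.5)
for state, row in zip(states, Q):
    print(state, row)
```

With q01 equal to q10 this corresponds to the Mk+covarion case; letting
them differ gives the ARD+covarion case discussed above.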


Thanks for pointing out the Boettiger et al paper, Matt.


all the best,
Mark


[1]	E. S. Allman and J. A. Rhodes, “The Identifiability of Covarion  
Models in Phylogenetics,” IEEE/ACM Transactions on Computational  
Biology and Bioinformatics, vol. 6, no. 1, pp. 76–88, Jan. 2009.





On Aug 16, 2012, at 10:07 PM, Matt Pennell wrote:


correction: the last sentence should have read

I wonder how that would work in this case. I think these are  
important questions going forward.


On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell mwpenn...@gmail.com wrote:
Hey all,

This has been a really fantastic discussion. Mark, you make some  
really excellent points in response to my earlier comments. I think  
you are correct in this.


The question that arises out of Jarrod and Dan's simulations (which
I have just run) is whether a model selection criterion would be
able to distinguish MK from the threshold model that Felsenstein
(and Wright before him) put forth. And how do we best assess model
adequacy? Carl Boettiger and company (2012: Evolution) suggested a  
Phylogenetic Monte Carlo approach for continuous characters. I  
wonder how that would before  I think these are important questions  
going forward.


thanks again,
matt



On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky drabo...@umich.edu wrote:

Hi all-

A couple of points. I am actually less concerned about the Type I  
error rates I gave in that previous message for the equal rates  
markov process, even though I think they are real (e.g., I can  
corroborate them using Diversitree). I don't think it is an issue  
of ascertainment bias, but I think Mark may be right about the LRT  
being 

Re: [R-sig-phylo] asymmetric transitions

2012-08-17 Thread Dan Rabosky

Hi folks-

I still think there is a difference between (i) parameter identifiability,
which may or may not be a problem here, and (ii) strong support for the wrong
model, which clearly appears to be occurring here (e.g., Type I error rates
> 0.75). I don't think non-identifiability of a parameter implies that you'll get
massively inflated Type I error rates for models that include the parameter.

Also, the root state (and assumptions regarding it) doesn't seem to drive the
pattern: you can set the root to 0 under the threshold model (i.e., both
states equiprobable at the root) and you recover the same strong bias. (Jarrod's
threshold example set the root to -1; you can verify that the problem holds
with equiprobable root states.) At this point I guess I'd be more concerned
about model misspecification and its diagnosis (when modeling discrete
characters) than about identifiability per se, though perhaps I am not thinking
about this correctly. Distinguishing MK/ARD from threshold may be difficult,
but there is clearly signal of these processes in the data: when you fit MK/ARD
to data simulated under a threshold model, you get strong support for ARD, but
not when you fit those same models to data generated under MK. So there is
clearly something in the data that is sufficiently informative about the
processes such that we observe high error rates.
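
For readers who want to see how a Type I error rate of this kind is
tallied, here is a minimal Python sketch (not from the thread). It
assumes the usual likelihood-ratio test of MK against ARD with one
degree of freedom and a chi-square reference distribution. The
log-likelihood pairs are made-up numbers for illustration only, not
results from Dan's or Jarrod's simulations.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Likelihood-ratio statistic and p-value, chi-square with 1 df.

    For 1 df the chi-square survival function is erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def type_one_error_rate(replicates, alpha=0.05):
    """Fraction of replicates in which the null (MK) is rejected.

    replicates: iterable of (lnL under MK, lnL under ARD) pairs, one per
    data set simulated under a process with no transition asymmetry.
    """
    replicates = list(replicates)
    rejections = sum(
        1 for lnl_mk, lnl_ard in replicates
        if lrt_p_value(lnl_mk, lnl_ard)[1] < alpha
    )
    return rejections / len(replicates)


# Hypothetical log-likelihood pairs, for illustration only.
fits = [(-50.0, -49.0), (-50.0, -45.0), (-50.0, -47.0), (-50.0, -49.5)]
print(type_one_error_rate(fits))
```

In practice the pairs would come from fitting both models to each
simulated data set with whatever software is used for the fits.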

Mark, thanks for pointing out the relationship between the threshold model and 
the single-site covarion model. 

Cheers,
~Dan







Re: [R-sig-phylo] asymmetric transitions

2012-08-17 Thread Mark Holder
Hi all,

I brought up the non-identifiability of the rich forms of the covarion model 
only because that is the source for my intuition that it will be really hard to 
distinguish the 1-binary-character threshold from the covarion. I agree with 
Dan, that the non-identifiability is not causing high type I error rates here.

I think that Mk, ARD, Mk+covarion, and ARD+covarion are all distinguishable 
from each other.

I think that Mk, ARD, 1-binary-character threshold, and an
asymmetric-threshold model (if such a model has ever been described) are all
distinguishable from each other.

My understanding is that Jarrod and Dan have shown that if you:
1. simulate a binary character under the threshold model (which has no
0->1 or 1->0 bias), and then
2. test Mk vs ARD,
you often strongly prefer ARD (despite the fact that the true model has no
asymmetry).
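
Step 1 above can be sketched in a few lines of Python. This is not the
simulation code Jarrod or Dan used (that code is not in this thread);
the balanced tree, equal branch lengths, and parameter values are all
assumptions made for illustration. A continuous liability evolves by
Brownian motion, and the observed state is 1 when the liability is
above the threshold.

```python
import random


def simulate_threshold_tips(depth, branch_length, root_liability, rng,
                            threshold=0.0, sigma2=1.0):
    """Simulate binary tip states on a balanced binary tree.

    depth          : number of bifurcation levels (2**depth tips)
    branch_length  : length of every branch (assumed equal)
    root_liability : liability at the root (0 gives equiprobable states)
    rng            : a random.Random instance
    """
    sd = (sigma2 * branch_length) ** 0.5

    def evolve(liability, levels_left):
        if levels_left == 0:
            # Observed state depends only on which side of the threshold we are.
            return [1 if liability > threshold else 0]
        tips = []
        for _ in range(2):
            # Brownian increment along the branch to each daughter.
            child = liability + rng.gauss(0.0, sd)
            tips.extend(evolve(child, levels_left - 1))
        return tips

    return evolve(root_liability, depth)


tips = simulate_threshold_tips(depth=6, branch_length=0.5,
                               root_liability=-1.0, rng=random.Random(1))
print(len(tips), "tips; frequency of state 1:", sum(tips) / len(tips))
```

Step 2 (fitting Mk and ARD to the simulated tips) would be done with
existing model-fitting software and is not reproduced here.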

I think that this result is correct (not the result of computer glitches) and 
worrying for folks interested in detecting asymmetric transition rates.  

I think testing Mk+covarion vs ARD+covarion in step 2 might lower your type
I error rate (but that is a conjecture).

In many ways the message here is similar to Wayne Maddison's warning that if 
you want to know about evolutionary asymmetries in character change, you have 
to consider whether the character states affect diversification rates (his 
observation that eventually led him to work on BiSSE).  Here the message is if 
you want to know about evolutionary asymmetries in character change, you also 
have to consider patterns of rate heterogeneity across the tree that could 
confound your model testing.


all the best,
Mark






PS: below is my long-winded, loosey-goosey explanation for my intuition that
going to Mk+covarion vs ARD+covarion would help:


The simulated data will often have large clades that are fixed for a single 
state, and the observed state frequencies can be far from 50:50 because of 
these large expanses of the tree with the same state.

ARD vs Mk
=========

The Mk model can only explain a strong deviation from 50:50 in the observed
state frequencies by saying that:
A. the character evolves slowly, so it has not had time to
equilibrate, OR
B. there have been lots of changes, but by coincidence we always end up
in the same state in that clade.

On trees with small total length, explanation A is very reasonable and you can
get very weak support for ARD.

If you have trees with lots of leaves and lots of changes overall, explanation
A becomes untenable, and Mk has to rely on the coincidence explanation
(explanation B), which results in a very low likelihood.

The ARD model can explain a strong deviation from 50:50 in the observed state
frequencies by asymmetry in transition rates, so its likelihood does not tank
as quickly as you add large, monomorphic clades to the tree.
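
A limiting case makes the likelihood gap easy to see. The Python sketch
below (not from the thread) assumes branches so long that tip states
are effectively independent draws from the stationary distribution.
That is an assumption chosen for illustration, not the general tree
likelihood. In this limit Mk must assign probability 0.5 to each state,
while ARD can match the observed frequencies.

```python
import math


def saturated_log_likelihoods(n0, n1):
    """Log-likelihoods of tip counts when tips are independent draws.

    Mk  : each tip is 0 or 1 with probability 0.5.
    ARD : each tip is 1 with probability n1 / (n0 + n1), the value that
          maximises the likelihood in this limiting case.
    """
    n = n0 + n1
    lnl_mk = n * math.log(0.5)
    lnl_ard = 0.0
    for count in (n0, n1):
        if count > 0:  # a zero count contributes nothing
            lnl_ard += count * math.log(count / n)
    return lnl_mk, lnl_ard


for counts in [(50, 50), (60, 40), (80, 20)]:
    lnl_mk, lnl_ard = saturated_log_likelihoods(*counts)
    print(counts, "lnL(ARD) - lnL(Mk) =", round(lnl_ard - lnl_mk, 3))
```

The gap grows with both the deviation from 50:50 and the number of
tips, which is the direction of the argument above. Real trees with
large monomorphic clades are not independent draws, so the actual
numbers will differ.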

Mk+covarion vs ARD
==================
You should be able to prefer Mk+covarion over ARD on these data sets (or at
least be able to avoid a strong preference for ARD). These models are not
nested, so you'd have to use AIC or BIC for this test.
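
For reference, the two criteria mentioned above can be computed as in
this short Python sketch (not from the thread). The parameter counts
and log-likelihoods in the example are placeholders chosen for
illustration, and using the number of tips as the sample size for BIC
is an assumption.

```python
import math


def aic(lnl, k):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * k - 2.0 * lnl


def bic(lnl, k, n):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n) - 2.0 * lnl


# Placeholder fits: (model name, lnL, number of free parameters).
# The counts assume one rate for Mk, two for ARD, and two extra
# switching rates for the covarion variant.
fits = [("ARD", -48.0, 2), ("Mk+covarion", -47.5, 3)]
n_tips = 100  # assumed sample size for BIC
for name, lnl, k in fits:
    print(name, "AIC =", aic(lnl, k), "BIC =", round(bic(lnl, k, n_tips), 3))
```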

Mk+covarion has access to an explanation that Mk does not:
C. the character was stuck in the "off" state, so you can explain a large
paraphyletic assemblage with the same state by a single transition to that state
and a subsequent cessation of the substitution process.

Mk+covarion predicts no significant deviations from 50:50 in the types of
transitions (after you use the covarion process to help you get a better
understanding of the relative opportunity for each type of transition, i.e. the
proportion of time spent in the "on" hidden state for states 0 and 1 on the
tree).

ARD expects to see the same asymmetry in changes in the parts of the tree that
happen to have high rates of change and in the parts of the tree that have low
rates of change (since ARD assumes that there is a constant rate of change,
and that all apparent differences in rate are sampling error).

The simulated data should show close to unbiased transitions in the
fast-changing parts of the tree, so Mk+covarion should do a better job than ARD.


Mk+covarion vs ARD+covarion
===========================
Both models should do a good job of not getting distracted by large expanses of
the tree that are fixed (they won't treat this as strong evidence of a bias in
favor of that character state).  If there does appear to be a strong bias in the
parts of the tree with lots of changes, then ARD+covarion will win. Mk+covarion
has fewer parameters, so it should win unless there is strong evidence for
asymmetry in the fast parts of the tree.






Re: [R-sig-phylo] asymmetric transitions

2012-08-17 Thread Jarrod Hadfield

Hi,

I see the problem: the threshold model is symmetric, but NOT in the
sense used in the ARD model.  In the threshold model it is natural to
think about evolution of the probability of being in one state versus
the other. If the probability at the root was 0.2 and evolution was
very slow, so that the probability at the tips was ~0.2, then this
would be equivalent to sampling tip states from a Bernoulli with
Pr=0.2 (like my first example).  From an ARD perspective, the high
degree of variation within taxa suggests transitions happen
frequently, and so the only way that frequencies close to 0.2/0.8 can
occur is if there's a 4:1 asymmetry in the transitions 0->1 to 1->0.


Imagine a Bernoulli sequence with Pr=0.8, and you have just moved to
state 1: Pr(0|1)=0.2. If you then move to state 0 (even though it is
unlikely), Pr(1|0)=0.8. There are asymmetric evolutionary transitions,
but the underlying probability of being in a particular state is
constant. Which interpretation should a researcher take? Imagine my
character is "does the species name begin with A-T or U-Z"!?
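
The Bernoulli example can be checked numerically. The Python sketch
below (not from the thread; function name and sample size are
arbitrary choices) draws independent states with Pr(1)=0.8 and then
tabulates how often a 0 is followed by a 1 and a 1 by a 0. The draws
have no memory, yet the tabulated "transitions" look strongly
asymmetric.

```python
import random


def transition_frequencies(p_one, n_draws, rng):
    """Estimate Pr(next=1 | current=0) and Pr(next=0 | current=1)
    from a sequence of independent Bernoulli draws with Pr(1) = p_one."""
    draws = [1 if rng.random() < p_one else 0 for _ in range(n_draws)]
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, nxt in zip(draws, draws[1:]):
        counts[(prev, nxt)] += 1
    from0 = counts[(0, 0)] + counts[(0, 1)]
    from1 = counts[(1, 0)] + counts[(1, 1)]
    # Guard against a state that never occurs in a short sequence.
    p01 = counts[(0, 1)] / from0 if from0 else float("nan")
    p10 = counts[(1, 0)] / from1 if from1 else float("nan")
    return p01, p10


p01, p10 = transition_frequencies(0.8, 200000, random.Random(42))
print("Pr(1|0) ~", round(p01, 3), " Pr(0|1) ~", round(p10, 3))
```

The two estimates should come out near 0.8 and 0.2, matching the
conditional probabilities in the paragraph above even though the
underlying probability of each state never changes.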


Cheers,

Jarrod








[R-sig-phylo] Literature about multigene phylogenetic

2012-08-17 Thread Augusto Ribas
Hello.
I'm new to phylogenetics and I find it hard to understand how to build a
phylogenetic tree with more than one gene (sequence).

For example, I was reading "A large-scale phylogeny of Amphibia
including over 2800 species, and a revised classification of extant
frogs, salamanders, and caecilians" (Pyron and Wiens 2011). They
used many genes: some from the mitochondria, like 16S and 12S, and
others that are not mitochondrial, like TYR and RAG1, among many
others.
So there is information from different sequences, from different
genes. How is this treated?

I'm still playing with the ape package to figure things out, but could
someone point me to some basic literature, books or articles, to
understand how to use more than one gene to produce distances and a
phylogenetic tree?
Examples using R to replicate analyses from articles are really
instructive for me, if someone could guide me to one. I have not found
many useful things via Google search; I guess I could be using the
wrong search terms.


Thanks for the attention.

--
Grato
Augusto C. A. Ribas

Site Pessoal: http://augustoribas.heliohost.org
Lattes: http://lattes.cnpq.br/7355685961127056

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


[R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?

2012-08-17 Thread Xu Han

Hi all,
I am testing a correlation between two explanatory variables and a response 
variable using PGLS. All of the variables are continuous. My model is Log 
female body size ~ Log egg size * Log clutch size. However, there is a 
significant negative correlation between egg size and clutch size. Can PGLS 
cope with collinearity between explanatory variables? Is there any way that I 
can apply something like principal component analysis to PGLS models?
Thanks,
Xu


Re: [R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?

2012-08-17 Thread Theodore Garland Jr
The issue of collinearity of independent variables is neither better nor worse 
with PGLS as opposed to OLS.  Statistical significance per se of a correlation 
between X variables is not really the issue.  How strong is the correlation?  
Most sources suggest that it needs to be greater than 0.7-0.8 in magnitude to 
cause serious problems.

Cheers,
Ted
 
Theodore Garland, Jr.
Professor
Department of Biology
University of California, Riverside
Riverside, CA 92521
Office Phone:  (951) 827-3524
Facsimile:  (951) 827-4286 = Dept. office (not confidential)
Email:  tgarl...@ucr.edu
http://www.biology.ucr.edu/people/faculty/Garland.html

Experimental Evolution: Concepts, Methods, and Applications of Selection 
Experiments. 2009.
Edited by Theodore Garland, Jr. and Michael R. Rose
http://www.ucpress.edu/book.php?isbn=9780520261808
(PDFs of chapters are available from me or from the individual authors)




Re: [R-sig-phylo] Can PGLS cope with collinearity between explanatory variables?

2012-08-17 Thread Xu Han

Thanks Dr. Garland,

The correlation between egg size and clutch size is 0.3, and the variance
inflation factors for both egg size and clutch size are smaller than 2. There
shouldn't be a big problem of collinearity. Thanks for your clarification.

Best,
Xu
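
As a quick check on the figures above, here is a small Python sketch
(not from the thread). It uses the standard result that, with exactly
two predictors, the variance inflation factor of each is 1 / (1 - r^2),
where r is their correlation. It is the ordinary (non-phylogenetic) VIF
and ignores the interaction term in the model, so it is only a rough
guide.

```python
def vif_two_predictors(r):
    """Variance inflation factor when there are exactly two predictors
    with correlation r (ordinary least-squares definition)."""
    return 1.0 / (1.0 - r * r)


# r = 0.3 is the magnitude reported above; 0.7 and 0.8 are the
# rule-of-thumb magnitudes mentioned in Ted's reply.
for r in (0.3, 0.7, 0.8):
    print("r =", r, "VIF =", round(vif_two_predictors(r), 2))
```

For r = 0.3 this gives a VIF of about 1.10, consistent with the
reported values being below 2.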