Re: [R-sig-phylo] identifying phylogenetically equivalent nodes

2021-02-18 Thread Eliot Miller
Assuming trees are rooted and fully bifurcating, I think the following will
work. I tried a few tests and it seemed to work. I traded a few emails back
and forth with Jake to make sure I understood what he wanted. Guess what I
was supposed to be working on wasn't holding my attention...

Best,
Eliot

library(geiger)

first <- sim.bdtree(n=50)
second <- sim.bdtree(n=1000)
third <- sim.bdtree(n=2000)

tableSetup <- function(tree1, tree2)
{
  output <-
data.frame(matrix(nrow=min(c(length(tree1$tip.label),length(tree2$tip.label)))-1,
ncol=2))
  names(output) <- c("tree1", "tree2")
  output
}

#least inclusive=if there are multiple nodes in tree 2 which could
potentially be mapped to
#a node in tree 1, do you want to keep the smallest or the largest one? for
example:
#((A,B),Z) vs (((A,B),Z),C), where C doesn't occur in tree 1. Do you want
the matched node
#from tree 2 to be A,B,Z or A,B,Z,C.
equivCut <- function(tree1, tree2, results, least.inclusive)
{
  #identify nodes that appear more than once
  dups <- unique(results[,2][duplicated(results[,2])])

  #if there aren't any, return the original results
  if(length(dups)==0)
  {
return(results)
  }

  #loop through, add details on size of clade to data frame
  for(i in 1:length(dups))
  {
#set aside all the results that pertain to that node
setAside <- results[results[,2]==dups[i],]

#go through each node and figure out how many taxa descend from that
node
temp <- data.frame(matrix(nrow=dim(setAside)[1], ncol=2))
names(temp) <- c("node","no.taxa")
for(j in 1:dim(setAside)[1])
{
  temp[j,"node"] <- setAside[j,"tree1"]
  temp[j,"no.taxa"] <- length(extract.clade(tree1,
setAside[j,"tree1"])$tip.label)
}

#now you can decide which of these to keep
if(least.inclusive)
{
  keep <- temp$node[temp$no.taxa==min(temp$no.taxa)]
  toDrop <- setdiff(temp$node, keep)
  results <- results[!(results$tree1 %in% toDrop),]
}

else
{
  keep <- temp$node[temp$no.taxa==max(temp$no.taxa)]
  toDrop <- setdiff(temp$node, keep)
  results <- results[!(results$tree1 %in% toDrop),]
}
  }

  results
}

idMatches <- function(tree1, tree2, least.inclusive=TRUE)
{
  #set the results table up
  results <- tableSetup(tree1, tree2)

  #define tree1 nodes here
  nodes1 <- (length(tree1$tip.label)+1):max(tree1$edge)

  for(i in 1:length(nodes1))
  {
#set the correct row in the tree1 column to the node in question
results[i,1] <- nodes1[i]

#find the descendants of this node
tips1 <- tips(tree1, nodes1[i])

#drop these to just tips that are in tree 2
tips1 <- tips1[tips1 %in% tree2$tip.label]

#if there's nothing left, set to no match and move on
if(length(tips1)==0)
{
  results[i,2] <- "no.match"
  next()
}

#find the MRCA of those tips in tree 2
#if there's only a single taxon left, pull the node it descends from
(getMRCA will fail)
if(length(tips1==1))
{
  edge2 <- which.edge(tree2, tips1)
  mrca2 <- tree2$edge[edge2,][1]
}
else
{
  mrca2 <- getMRCA(tree2, tips1)
}

#find the descendants of that node
tips2 <- tips(tree2, mrca2)

#drop to just taxa that are in tree1
tips2 <- tips2[tips2 %in% tree1$tip.label]

#if these tips are equivalent, the nodes match
if(setequal(tips1, tips2))
{
  results[i,2] <- mrca2
}

else
{
  results[i,2] <- "no.match"
}
  }

  #consider what you want to do next. do you want to, e.g., add a row for
  #every not-yet-mentioned tree2 node and add "no.match" to the first
column,
  #or do you want to return as is, or do you want to only return matches?
for
  #now, drop to only matches and re-class both as integer so can demonstrate
  #equality no matter which tree is first or second
  results <- results[results[,2] != "no.match",]
  results[,2] <- as.integer(results[,2])

  #run the equivCut function
  results <- equivCut(tree1, tree2, results, least.inclusive)
  results
}

#always returns equivalent no matter which tree is first if you set it to
least inclusive = TRUE
system.time(test <- idMatches(first,second))
system.time(test2 <- idMatches(second,first))
setequal(test[,1], test2[,2])
setequal(test[,2], test2[,1])

#always returns equivalent no matter which tree is first if you set it to
least inclusive = TRUE
system.time(test3 <- idMatches(second, third))
system.time(test4 <- idMatches(third, second))
setequal(test3[,1], test4[,2])
setequal(test3[,2], test4[,1])

#here's how you might compare a set of trees to a single reference one
ref <- sim.bdtree(n=100)
target1 <- sim.bdtree(n=78)
target2 <- sim.bdtree(n=86)

vs1 <- idMatches(ref,target1)
vs2 <- idMatches(ref,target2)
names(vs1) <- c("ref","target1")
names(vs2) <- c("ref","target2")

final <- data.frame(ref=unique(c(vs1$ref, vs2$ref)))
final <- merge(final, vs1, all.x=TRUE)
final <- merge(final, vs2, all.x=TRUE)
final

On Thu, Feb 18, 2021 at 8:56 AM Emmanuel Paradis 
wrote:

> Hi 

Re: [R-sig-phylo] identifying phylogenetically equivalent nodes

2021-02-18 Thread Emmanuel Paradis
Hi Jacob,

ape::makeNodeLabel(phy, method = "md5sum") returns 'phy' with node labels that 
depend on the tips descendant from each node. For instance:

tr3 <- makeNodeLabel(rtree(3), m = "m")
tr4 <- makeNodeLabel(rtree(4), m = "m")
any(tr3$node.label %in% tr4$node.label)

If you repeat these 3 commands several times, you should have ~20% of TRUE. In 
your case, match() should make more sense.

Also, I suppose your trees are rooted. If they are unrooted, you should 
consider using splits (or root them).

Best,

Emmanuel

- Le 18 Fév 21, à 0:59, Jacob Berv jakeberv.r.sig.ph...@gmail.com a écrit :
> Dear R-sig-phylo,
> 
> Over the weekend, I asked Liam Revell if he had a solution to use matchNodes 
> for
> a particular problem I’m trying to solve—finding all phylogenetically
> equivalent nodes when comparing trees that have uneven taxon samples and
> different topologies. Liam was kind enough to take some time to write a blog
> post about this, and got me started with some code
> 
> http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html
> 
> On it’s face this seems like a simple problem, but I’m running into some 
> issues
> and thought I would reach out to the broader group. The code linked above 
> seems
> to work, but only for comparing trees that start out as topologically
> identical. For my purposes, I’m trying to match nodes from a given a 
> reference,
> to nodes in and across several hundred gene trees that differ in topology and
> taxon sample relative to the reference.
> 
> Here is a function definition based on Liam’s example
> 
> #function to match nodes from consensus
> #to individual gene trees with uneven sampling
> #derived from Liam Revell's example-- need to
> testmatch_phylo_nodes<-function(t1, t2){
>  ## step one drop tips
>  t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
>  t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))
>  
>  ## step two match nodes "descendants"
>  M<-matchNodes(t1p,t2p)
>  
>  ## step two match nodes "distances"
>  M1<-matchNodes(t1,t1p,"distances")
>  M2<-matchNodes(t2,t2p,"distances")
>  
>  ## final step, reconcile
>  MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))
>  
>  for(i in 1:nrow(MM)){
>MM[i,1]<-M1[i,1]
>nn<-M[which(M[,1]==M1[i,2]),2]
>if(length(nn)>0){
>MM[i,2]<-M2[which(M2[,2]==nn),1]
>}
>  }
>  return(MM)
> }
> 
> 
> When t1 and t2 are trees that have topological conflicts, this function 
> returns
> an error:
> 
> Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] :
>  replacement has length zero
> 
> I think(?) this happens because a particular node doesn’t exist in one or the
> other trees, and it returns integer(0) at that line — but I’m not sure I 
> really
> understand what is going on here.
> 
> 
> I modified Liam’s code slightly to get it to run without error in the 
> described
> case, by making it conditional on that particular line:
> 
> 
> Modified version
> 
> #function to match nodes from consensus
> #to individual gene trees with uneven sampling
> #derived from Liam Revell's example-- need to test
> match_phylo_nodes<-function(t1, t2){
>   ## step one drop tips
>   t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
>   t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))
> 
>   ## step two match nodes "descendants"
>   M<-matchNodes(t1p,t2p)
> 
>   ## step two match nodes "distances"
>   M1<-matchNodes(t1,t1p,"distances")
>   M2<-matchNodes(t2,t2p,"distances")
> 
>   ## final step, reconcile
>   MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))
> 
>   for(i in 1:nrow(MM)){
>   MM[i,1]<-M1[i,1]
>   nn<-M[which(M[,1]==M1[i,2]),2]
>if(length(nn)>0){
>   if(length(which(M2[,2]==nn))>0){
>   MM[i,2]<-M2[which(M2[,2]==nn),1]
>   }
>} else {
>}
> }
> return(MM)
> }
> 
> 
> I’ve been experimenting with this and some downstream code for the last few
> days, but I’ve run into some weird inconsistent results (not easily 
> summarized)
> that make me think that this function is not working as intended.
> 
> I was wondering — have any of you dealt with a similar problem? In principle
> this seems like it should be similar to concordance analysis, but I care less
> about identifying the proportion of nodes that exist in gene trees given a
> reference, and instead I need the actual node numbers in a given gene tree 
> that
> are phylogenetically equivalent to particular nodes in a reference. Happy to
> try to hack away at something…
> 
> 
> Best,
> Jake Berv
> 
> 
> 
> 
> 
>   [[alternative HTML version deleted]]
> 
> ___
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz

[R-sig-phylo] identifying phylogenetically equivalent nodes

2021-02-17 Thread Jacob Berv
Dear R-sig-phylo,

Over the weekend, I asked Liam Revell if he had a solution to use matchNodes 
for a particular problem I’m trying to solve—finding all phylogenetically 
equivalent nodes when comparing trees that have uneven taxon samples and 
different topologies. Liam was kind enough to take some time to write a blog 
post about this, and got me started with some code

http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html

On it’s face this seems like a simple problem, but I’m running into some issues 
and thought I would reach out to the broader group. The code linked above seems 
to work, but only for comparing trees that start out as topologically 
identical. For my purposes, I’m trying to match nodes from a given a reference, 
to nodes in and across several hundred gene trees that differ in topology and 
taxon sample relative to the reference.

Here is a function definition based on Liam’s example

#function to match nodes from consensus 
#to individual gene trees with uneven sampling
#derived from Liam Revell's example-- need to 
testmatch_phylo_nodes<-function(t1, t2){
  ## step one drop tips
  t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
  t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))
  
  ## step two match nodes "descendants"
  M<-matchNodes(t1p,t2p)
  
  ## step two match nodes "distances"
  M1<-matchNodes(t1,t1p,"distances")
  M2<-matchNodes(t2,t2p,"distances")
  
  ## final step, reconcile
  MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))
  
  for(i in 1:nrow(MM)){
MM[i,1]<-M1[i,1]
nn<-M[which(M[,1]==M1[i,2]),2]
if(length(nn)>0){   
MM[i,2]<-M2[which(M2[,2]==nn),1]
}   
  }
  return(MM)
}


When t1 and t2 are trees that have topological conflicts, this function returns 
an error: 

Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] : 
  replacement has length zero

I think(?) this happens because a particular node doesn’t exist in one or the 
other trees, and it returns integer(0) at that line — but I’m not sure I really 
understand what is going on here.


I modified Liam’s code slightly to get it to run without error in the described 
case, by making it conditional on that particular line:


Modified version

#function to match nodes from consensus 
#to individual gene trees with uneven sampling
#derived from Liam Revell's example-- need to test
match_phylo_nodes<-function(t1, t2){
## step one drop tips
t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))

## step two match nodes "descendants"
M<-matchNodes(t1p,t2p)

## step two match nodes "distances"
M1<-matchNodes(t1,t1p,"distances")
M2<-matchNodes(t2,t2p,"distances")

## final step, reconcile
MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))

for(i in 1:nrow(MM)){
MM[i,1]<-M1[i,1]
nn<-M[which(M[,1]==M1[i,2]),2]
if(length(nn)>0){   
if(length(which(M2[,2]==nn))>0){
MM[i,2]<-M2[which(M2[,2]==nn),1]
}
} else {
}   
}
return(MM)  
}


I’ve been experimenting with this and some downstream code for the last few 
days, but I’ve run into some weird inconsistent results (not easily summarized) 
that make me think that this function is not working as intended.

I was wondering — have any of you dealt with a similar problem? In principle 
this seems like it should be similar to concordance analysis, but I care less 
about identifying the proportion of nodes that exist in gene trees given a 
reference, and instead I need the actual node numbers in a given gene tree that 
are phylogenetically equivalent to particular nodes in a reference. Happy to 
try to hack away at something… 


Best,
Jake Berv





[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/