Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Emmanuel Paradis

Joseph,

Really cool. I updated the code in ape.

About (byte-)compiling, this is usually done during installing the package:

http://ape-package.ird.fr/ape_installation.html#linux

I guess Win and Mac versions from CRAN are byte-compiled. To check if a 
function is compiled, just print it in R, it will show the bytecode at 
the end of the display:


R> rtree




Besides, the NEWS for R 3.4.0 says:

"The JIT (‘Just In Time’) byte-code compiler is now enabled by default 
at its level 3. This means functions will be compiled on first or second 
use and top-level loops will be compiled and then run. (Thanks to Tomas 
Kalibera for extensive work to make this possible.)"


This seems to be effective:

R> foo <- function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x


Best,

Emmanuel

Le 10/06/2017 à 14:29, Joseph W. Brown a écrit :

Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized using 
it in finding the match in the first lineage is unnecessary. Instead, we 
can capitalize on the fact that shallower nodes have smaller nodeids. So 
I threw out the vector index `mrcaind` altogether:


getMRCA_new<- function(phy, tip) {
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars<- integer(phy$Nnode) # worst case assignment, usually far too long
tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars<- pars[1:k] # delete the rest
mrcand <- pars[1]; # keep track of actual node rather than index in pars


## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE
cnd <- tnd[i]
while (!done) {
cpar<- pvec[cnd] # get immediate parent
if (cpar %in% pars) {
if (cpar == rootnd) return(rootnd) # early exit
mrcand <- min(mrcand,cpar); # shallowest node has lowest nodeid
done <- TRUE
}
cnd <- cpar # keep going!
}
}
mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp<- cmpfun(getMRCA_em);
getMRCA_new_cmp<- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). For 
my original example (100K tips, 500 tips in MRCA query) there is not a 
whole lot of difference:


microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), 
getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));

Unit: milliseconds
expr min lq mean median uq max neval
   getMRCA_em(phy, tips) 
23.9987929.8196433.6743732.3914935.61888136.34521100
  getMRCA_new(phy, tips) 
23.7694128.0250031.8428531.3700134.8964044.66380100
   getMRCA_em_cmp(phy, tips) 
20.8192423.6011828.2395728.2520230.1853559.72102100
  getMRCA_new_cmp(phy, tips) 
20.1658823.1957127.1961026.6578329.8214942.61702100


However, for the worst-case scenario that Klaus devised the results are 
very impressive:


microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), 
getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));

Unit: milliseconds
expr min lq mean median uq max neval
   getMRCA_em(tree, c(1, 2)) 
124.79040140.44473146.98292145.39224150.96044286.78335100
  getMRCA_new(tree, c(1, 2)) 
123.39924139.41609145.21311143.92062148.34205297.73846100
   getMRCA_em_cmp(tree, c(1, 2)) 
23.7210526.8494429.9495029.8793532.2649041.99264100
  getMRCA_new_cmp(tree, c(1, 2)) 
23.1522325.7468629.2152629.2537030.8091569.68522100


So it might be worth compiling.

HTH.
JWB

Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu 



On 10 Jun, 2017, at 05:26, Emmanuel Paradis > wrote:


Joseph, Klaus,

This is great. I slightly edited the code and renamed the argument 
'tips' to 'tip' (as in the original definition). I run on a bunch of 
random trees and the results are exactly identical with the current 
version of getMRCA. I'm changing the code in ape now.


Cheers,

Emmanuel

getMRCA <- function(phy, tip)
{
   if (!inherits(phy, "phylo"))
   stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

   pars <- integer(phy$Nnode) # worst case assignment, usually far too 
long
   mrcaind <- 1L # index in pars of the mrca node. will return highest 
one traversed by other lineages

   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Emmanuel Paradis

Joseph, Klaus,

This is great. I slightly edited the code and renamed the argument 
'tips' to 'tip' (as in the original definition). I run on a bunch of 
random trees and the results are exactly identical with the current 
version of getMRCA. I'm changing the code in ape now.


Cheers,

Emmanuel

getMRCA <- function(phy, tip)
{
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars <- integer(phy$Nnode) # worst case assignment, usually far too 
long
mrcaind <- 1L # index in pars of the mrca node. will return highest 
one traversed by other lineages

tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars <- pars[1:k] # delete the rest

## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE
cnd <- tnd[i]
while (!done) {
cpar <- pvec[cnd] # get immediate parent
if (cpar %in% pars) {
## early exit! if the mrca of any set of taxa is the 
root then we are done

if (cpar == rootnd) return(rootnd)
idx <- which(pars == cpar)
if (idx > mrcaind)  mrcaind <- idx
done <- TRUE
} else cnd <- cpar # keep going!
}
}
pars[mrcaind]
}


Le 10/06/2017 à 00:28, Klaus Schliep a écrit :

Hi Joseph

I guess I figured out where some room for improvement was.
You grow the vector pars <- c(pars, nd)
I preallocated pars, but you could even just grow pars using pars[k] <- nd
without pre-allocation. This operation is now (with R >= 3.4) far less
costly as mentioned in the NEWS:
"Assigning to an element of a vector beyond the current length now
over-allocates by a small fraction. The new vector is marked internally as
growable, and the true length of the new vector is stored in the truelength
field. This makes building up a vector result by assigning to the next
element beyond the current length more efficient, though pre-allocating is
still preferred. The implementation is subject to change and not intended
to be used in packages at this time."

One of the many little improvements recently making R quite a bit faster
and easier.

So the worst case behavior was pretty bad

tree <- stree(10, "right")
system.time(get_mrca_fast(tree, c(1,2)))
user  system elapsed
  13.424   0.012  13.440
  mrca
[1] 19

with the changes it is much better:

system.time(mrca <- get_mrca_fast(tree, c(1,2)))
user  system elapsed
   0.012   0.000   0.012

mrca

[1] 19

And here is the code:

get_mrca_fast <- function (phy, tips) {
 rootnd <- length(phy$tip.label) + 1;
 pars <- integer(phy$Nnode);  # worst case assignment, usually far too
long
 mrcaind <- 1;  # index in pars of the mrca node. will return
highest one traversed by other lineages
 tnd <- NULL;   # node ids of tips
 if ("character" %in% class(tips)) {
 tnd <- match(tips, phy$tip.label);
 } else {
 tnd <- tips;
 }

 # build a lookup table to get parents faster
 pvec <- integer(max(phy$edge));
 pvec[phy$edge[,2]] <- phy$edge[,1];

 # get entire lineage for first tip
 nd <- tnd[1];
 for(k in 1:phy$Nnode){
 nd <- pvec[nd]
 pars[k] <- nd
 if (nd == rootnd) break
 }
 pars <- pars[1:k] #delete the rest

 # traverse lineages for remaining tips, stop if hit common ancestor
 for (i in 2:length(tnd)) {
 done <- FALSE;
 cnd <- tnd[i];
 while (!done) {
 cpar <- pvec[cnd]; # get immediate parent
 if (cpar %in% pars) {
 if (cpar == rootnd) {
 # early exit! if the mrca of any set of taxa is the
root then we are done
 return(rootnd);
 }
 idx <- which(pars == cpar);
 if (idx > mrcaind) {
 mrcaind <- idx;
 }
 done <- TRUE;
 } else {
 # keep going!
 cnd <- cpar;
 }
 }
 }
 return(pars[mrcaind]);
}


Have a nice weekend,
Klaus

On Fri, Jun 9, 2017 at 3:59 PM, Joseph W. Brown  wrote:


Liam,

I put the (updated) code up as a gist . Feel free to use/improve/whatever.
Emmanuel sees room for improvement but frankly taking 0.3 seconds to find
the MRCA of 5000 taxa on a million-tip tree is good enough for my purposes.

JWB


Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Joseph W. Brown
Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized using it in 
finding the match in the first lineage is unnecessary. Instead, we can 
capitalize on the fact that shallower nodes have smaller nodeids. So I threw 
out the vector index `mrcaind` altogether:

getMRCA_new <- function(phy, tip) {
   if (!inherits(phy, "phylo"))
   stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

   pars <- integer(phy$Nnode) # worst case assignment, usually far too long
   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- integer(Ntip + phy$Nnode)
   pvec[phy$edge[, 2]] <- phy$edge[, 1]

   ## get entire lineage for first tip
   nd <- tnd[1]
   for (k in 1:phy$Nnode) {
   nd <- pvec[nd]
   pars[k] <- nd
   if (nd == rootnd) break
   }
   pars <- pars[1:k] # delete the rest
   mrcand <- pars[1]; # keep track of actual node rather than index in pars
   
   ## traverse lineages for remaining tips, stop if hit common ancestor
   for (i in 2:length(tnd)) {
   done <- FALSE
   cnd <- tnd[i]
   while (!done) {
   cpar <- pvec[cnd] # get immediate parent
   if (cpar %in% pars) {
   if (cpar == rootnd) return(rootnd) # early exit
   mrcand <- min(mrcand, cpar); # shallowest node has lowest nodeid
   done <- TRUE
   }
   cnd <- cpar # keep going!
   }
   }
   mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp <- cmpfun(getMRCA_em);
getMRCA_new_cmp <- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). For my 
original example (100K tips, 500 tips in MRCA query) there is not a whole lot 
of difference:

microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), 
getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));
Unit: milliseconds
   expr  min   lq mean   median   uq   
max neval
  getMRCA_em(phy, tips) 23.99879 29.81964 33.67437 32.39149 35.61888 
136.34521   100
 getMRCA_new(phy, tips) 23.76941 28.02500 31.84285 31.37001 34.89640  
44.66380   100
  getMRCA_em_cmp(phy, tips) 20.81924 23.60118 28.23957 28.25202 30.18535  
59.72102   100
 getMRCA_new_cmp(phy, tips) 20.16588 23.19571 27.19610 26.65783 29.82149  
42.61702   100

However, for the worst-case scenario that Klaus devised the results are very 
impressive:

microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), 
getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));
Unit: milliseconds
   expr   minlq  meanmedian
uq   max neval
  getMRCA_em(tree, c(1, 2)) 124.79040 140.44473 146.98292 145.39224 
150.96044 286.78335   100
 getMRCA_new(tree, c(1, 2)) 123.39924 139.41609 145.21311 143.92062 
148.34205 297.73846   100
  getMRCA_em_cmp(tree, c(1, 2))  23.72105  26.84944  29.94950  29.87935  
32.26490  41.99264   100
 getMRCA_new_cmp(tree, c(1, 2))  23.15223  25.74686  29.21526  29.25370  
30.80915  69.68522   100

So it might be worth compiling.

HTH.
JWB

Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



> On 10 Jun, 2017, at 05:26, Emmanuel Paradis  wrote:
> 
> Joseph, Klaus,
> 
> This is great. I slightly edited the code and renamed the argument 'tips' to 
> 'tip' (as in the original definition). I run on a bunch of random trees and 
> the results are exactly identical with the current version of getMRCA. I'm 
> changing the code in ape now.
> 
> Cheers,
> 
> Emmanuel
> 
> getMRCA <- function(phy, tip)
> {
>if (!inherits(phy, "phylo"))
>stop('object "phy" is not of class "phylo"')
>if (length(tip) < 2) return(NULL)
>Ntip <- length(phy$tip.label)
>rootnd <- Ntip + 1L
> 
>pars <- integer(phy$Nnode) # worst case assignment, usually far too long
>mrcaind <- 1L # index in pars of the mrca node. will return highest one 
> traversed by other lineages
>tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip
> 
>## build a lookup table to get parents faster
>pvec <- integer(Ntip + phy$Nnode)
>pvec[phy$edge[, 2]] <- phy$edge[, 1]
> 
>## get entire lineage for first tip
>nd <- tnd[1]
>for (k in 1:phy$Nnode) {
>nd <- pvec[nd]
>pars[k] <- nd
>if (nd == rootnd) break
>}
>pars <- pars[1:k] # delete the rest
> 
>## traverse lineages for remaining tips, stop if hit common ancestor
>for (i in 2:length(tnd)) {
>done <- FALSE
>cnd <- tnd[i]
>while (!done) {
>cpar <- pvec[cnd] # get immediate parent
>if 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Klaus Schliep
Hi Joseph & Co,
don't trust that shallowest node has lowest node id. That is a byproduct by
functions like rtree() or read.tree(), but must not be generally true. E.g.
after tree rearrangements.
There is nothing wrong with this tree:

library(ape)
tree <- structure(list(edge = structure(c(6L, 6L, 8L, 7L, 7L, 8L, 6L,
  1L, 8L, 7L, 2L, 3L, 4L, 5L), .Dim = c(7L, 2L)), tip.label = c("t1", "t2",
"t3", "t4", "t5"), Nnode = 3L), .Names = c("edge", "tip.label", "Nnode"),
class = "phylo", order = "cladewise")
plot(tree)
nodelabels()
getMRCA_new(tree, c(2:4))
[1] 7

So here is my last version:

getMRCA <- function(phy, tip) {
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars <- integer(phy$Nnode) # worst case assignment, usually far too long
tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

done_v <- logical(max(phy$edge))

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars <- pars[1:k] # delete the rest
mrcind <- integer(max(pars))
mrcind[pars] <- 1:k

mrcand <- pars[1]

## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
cnd <- tnd[i]
done <- done_v[cnd]
while(!done){
done_v[cnd] <- TRUE
cpar <- pvec[cnd] # get immediate parent
done <- done_v[cpar] # early exit if TRUE
if (cpar %in% pars) {
if (cpar == rootnd) return(rootnd) # early exit
if(mrcind[cpar] > mrcind[mrcand]) mrcand <- cpar
done_v[cpar] <- TRUE
done <- TRUE
}
cnd <- cpar # keep going!
}
}
mrcand
}


It exists also early if an internal node was traversed before hand, so the
following code is a bit faster now.

library(ape)
tree <- stree(2^17, "balanced")
system.time(getMRCA(tree, 1:2^16))

Cheers,
Klaus




On Sat, Jun 10, 2017 at 1:01 PM, Emmanuel Paradis 
wrote:

> Joseph,
>
> Really cool. I updated the code in ape.
>
> About (byte-)compiling, this is usually done during installing the package:
>
> http://ape-package.ird.fr/ape_installation.html#linux
>
> I guess Win and Mac versions from CRAN are byte-compiled. To check if a
> function is compiled, just print it in R, it will show the bytecode at the
> end of the display:
>
> R> rtree
> 
> 
> 
>
> Besides, the NEWS for R 3.4.0 says:
>
> "The JIT (‘Just In Time’) byte-code compiler is now enabled by default at
> its level 3. This means functions will be compiled on first or second use
> and top-level loops will be compiled and then run. (Thanks to Tomas
> Kalibera for extensive work to make this possible.)"

This seems to be effective:
>
> R> foo <- function(x) x
> R> foo(1)
> [1] 1
> R> foo
> function(x) x
> R> foo(1)
> [1] 1
> R> foo
> function(x) x
> 
>
> Best,
>
> Emmanuel
>
>
> Le 10/06/2017 à 14:29, Joseph W. Brown a écrit :
>
>> Emmanuel and Klaus,
>>
>> After Klaus reminded me that `which` was so inefficient I realized using
>> it in finding the match in the first lineage is unnecessary. Instead, we
>> can capitalize on the fact that shallower nodes have smaller nodeids. So I
>> threw out the vector index `mrcaind` altogether:
>>
>> getMRCA_new<- function(phy, tip) {
>> if (!inherits(phy, "phylo"))
>> stop('object "phy" is not of class "phylo"')
>> if (length(tip) < 2) return(NULL)
>> Ntip <- length(phy$tip.label)
>> rootnd <- Ntip + 1L
>>
>> pars<- integer(phy$Nnode) # worst case assignment, usually far too long
>> tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip
>>
>> ## build a lookup table to get parents faster
>> pvec <- integer(Ntip + phy$Nnode)
>> pvec[phy$edge[, 2]] <- phy$edge[, 1]
>>
>> ## get entire lineage for first tip
>> nd <- tnd[1]
>> for (k in 1:phy$Nnode) {
>> nd <- pvec[nd]
>> pars[k] <- nd
>> if (nd == rootnd) break
>> }
>> pars<- pars[1:k] # delete the rest
>> mrcand <- pars[1]; # keep track of actual node rather than index in pars
>>
>>
>> ## traverse lineages for remaining tips, stop if hit common ancestor
>> for (i in 2:length(tnd)) {
>> done <- FALSE
>> cnd <- tnd[i]
>> while (!done) {
>> cpar<- pvec[cnd] # get immediate parent
>> if (cpar %in% pars) {
>> if (cpar == rootnd) return(rootnd) # early exit
>> mrcand <- min(mrcand,cpar); # shallowest node has lowest nodeid
>>
>> done <- TRUE
>> }
>> cnd <- cpar # keep going!
>> }
>> }
>> mrcand
>> }
>>
>> I also played around with compiling the function:
>>
>> library(compiler);
>> getMRCA_em_cmp<- cmpfun(getMRCA_em);
>> getMRCA_new_cmp<- cmpfun(getMRCA_new);
>>
>> Here are some timings (*_em