Re: [R-sig-phylo] HKY GTR distances
Dear Andreas, you could play around with dist.ml() in phangorn. It does ML distances optimisation and you can supply a rate matrix and base frequencies for all the distances. However this is of course slower than the closed form solutions, but it can handle ambiguous states nicely. I can make a little change to export (log-)likelihood values and one could write some code to iterative optimize edge length and transition parameters. You might even use a modelTest like approach and estimate the model parameter on a tree (the tree does not need to be perfect) this may be scaling better than pairwise distances if you have many sequences. Klaus On Fri, Feb 3, 2017 at 3:52 PM, Joe Felsenstein wrote: > Emmanuel wrote: > > > > > There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et > > al. developed a procedure to calculate a distance (also in Yang's 2006 > > book). An example is given below with the woodmouse: > > > > matlog <- function(x) { > > tmp <- eigen(X) > > V <- tmp$vectors > > U <- diag(log(tmp$values)) > > V %*% U %*% solve(V) > > } > > > > tr <- function(x) sum(diag(x)) > > > > data(woodmouse) > > > > PI <- diag(base.freq(woodmouse[1:2, ])) > > Ft <- Ftab(woodmouse[1:2, ]) > > F <- Ft/sum(Ft) > > X <- solve(PI) %*% F > > -tr(PI %*% matlog(X)) > > > > You have to call this code for each pair of sequences (or wrap it in a > > function). > > > > For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol). > > > > Strictly speaking there *is* a formula for GTR, which I think may be > equivalent to the above. The formula is given in my book "Inferring > Phylogenies" on page 209 as equation 13.24. Note that for both of > these, the equilibrium frequencies of the bases are estimated from the > empirical frequencies in the two sequences. That means that for each > distance, the equilibrium frequencies are somewhat different, as > different sequences are being used to estimate the base frequencies. > We are not, in the use of these formulas, estimating one overall set > of equilibrium frequencies and then using that for all the the > distances in our data set. > > The Rzhetsky-Nei 1994 distance formula is, I believe, an > approximation, though probably a very good one. It is not quite the > same as the estimate you would get by maximizing the likelihood. > > Distance formulas that involve empirically estimated base frequencies, > which all of these do, have in common the problem that one either > estimates those separately for each pair of sequences, or jointly > estimates them from the whole dataset, without using a tree in the > process. > > J.F. > > Joe Felsenstein j...@gs.washington.edu > Department of Genome Sciences and Department of Biology, > University of Washington, Box 355065, Seattle, WA 98195-5065 USA > > ___ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at http://www.mail-archive.com/r- > sig-ph...@r-project.org/ > -- Klaus Schliep Postdoctoral Fellow Revell Lab, University of Massachusetts Boston http://www.phangorn.org/ [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
Re: [R-sig-phylo] HKY GTR distances
Emmanuel wrote: > > There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et > al. developed a procedure to calculate a distance (also in Yang's 2006 > book). An example is given below with the woodmouse: > > matlog <- function(x) { > tmp <- eigen(X) > V <- tmp$vectors > U <- diag(log(tmp$values)) > V %*% U %*% solve(V) > } > > tr <- function(x) sum(diag(x)) > > data(woodmouse) > > PI <- diag(base.freq(woodmouse[1:2, ])) > Ft <- Ftab(woodmouse[1:2, ]) > F <- Ft/sum(Ft) > X <- solve(PI) %*% F > -tr(PI %*% matlog(X)) > > You have to call this code for each pair of sequences (or wrap it in a > function). > > For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol). > Strictly speaking there *is* a formula for GTR, which I think may be equivalent to the above. The formula is given in my book "Inferring Phylogenies" on page 209 as equation 13.24. Note that for both of these, the equilibrium frequencies of the bases are estimated from the empirical frequencies in the two sequences. That means that for each distance, the equilibrium frequencies are somewhat different, as different sequences are being used to estimate the base frequencies. We are not, in the use of these formulas, estimating one overall set of equilibrium frequencies and then using that for all the the distances in our data set. The Rzhetsky-Nei 1994 distance formula is, I believe, an approximation, though probably a very good one. It is not quite the same as the estimate you would get by maximizing the likelihood. Distance formulas that involve empirically estimated base frequencies, which all of these do, have in common the problem that one either estimates those separately for each pair of sequences, or jointly estimates them from the whole dataset, without using a tree in the process. J.F. Joe Felsenstein j...@gs.washington.edu Department of Genome Sciences and Department of Biology, University of Washington, Box 355065, Seattle, WA 98195-5065 USA ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
Re: [R-sig-phylo] HKY GTR distances
Dear Andreas, There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et al. developed a procedure to calculate a distance (also in Yang's 2006 book). An example is given below with the woodmouse: matlog <- function(x) { tmp <- eigen(X) V <- tmp$vectors U <- diag(log(tmp$values)) V %*% U %*% solve(V) } tr <- function(x) sum(diag(x)) data(woodmouse) PI <- diag(base.freq(woodmouse[1:2, ])) Ft <- Ftab(woodmouse[1:2, ]) F <- Ft/sum(Ft) X <- solve(PI) %*% F -tr(PI %*% matlog(X)) You have to call this code for each pair of sequences (or wrap it in a function). For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol). Best, Emmanuel Rodrı́guez F., Oliver J. L., Marı́n A. & Medina J. R. 1990. The general stochastic model of nucleotide substitution. Journal of Theoretical Biology 142: 485–501. Le 01/02/2017 à 23:52, kolte...@rub.de a écrit : Dear Phylothusiasts, I need to compare multiple substitution models side-by-side (species clustering stuff by distances only). Unfortunately, I am not aware of an implementation of HKY and GTR distance computations using R. Maybe there is some Github code or something else that I have been missing? I do not want to build a phylogenetic tree. I can not use PAUP (>500 big fasta files). Any ideas are greatly appreciated. Best wishes, Andreas ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
Re: [R-sig-phylo] HKY GTR distances
Can you do this in MEGA? Jake > On Feb 1, 2017, at 5:52 PM, kolte...@rub.de > wrote: > > Dear Phylothusiasts, > I need to compare multiple substitution models side-by-side (species > clustering stuff by distances only). Unfortunately, I am not aware of an > implementation of HKY and GTR distance computations using R. Maybe there is > some Github code or something else that I have been missing? > I do not want to build a phylogenetic tree. I can not use PAUP (>500 big > fasta files). > Any ideas are greatly appreciated. > Best wishes, > Andreas > > ___ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] HKY GTR distances
Dear Phylothusiasts, I need to compare multiple substitution models side-by-side (species clustering stuff by distances only). Unfortunately, I am not aware of an implementation of HKY and GTR distance computations using R. Maybe there is some Github code or something else that I have been missing? I do not want to build a phylogenetic tree. I can not use PAUP (>500 big fasta files). Any ideas are greatly appreciated. Best wishes, Andreas ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/