Re: [R-sig-phylo] phylip file compression ratios
Indeed - this is a population genomic dataset with very few site patterns relative to the size of the full dataset. Cool!

Jake

> On Apr 18, 2017, at 2:18 PM, Joe Felsenstein wrote:
>
> I would guess that the compressibility of interleaved sequences would
> be highest when the sequences are closely related.

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
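For readers unfamiliar with the term, a "site pattern" here is taken to mean a distinct alignment column, that is, the tuple of bases across all taxa at one site. A minimal sketch of counting them follows; the function name `site_patterns` and the toy sequences are made up for illustration and are not from the dataset discussed in this thread.

```python
from collections import Counter


def site_patterns(seqs):
    """Count how often each distinct alignment column (site pattern) occurs."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    return Counter(zip(*seqs))


# Toy alignment (hypothetical data): 3 taxa, 8 sites.
toy = ["ACGTACGT",
       "ACGTACGA",
       "ACGTACGT"]

patterns = site_patterns(toy)

if __name__ == "__main__":
    print(f"{len(patterns)} distinct patterns across {len(toy[0])} sites")
```

When the number of distinct patterns is small relative to the number of sites, the same columns recur many times, which is the kind of redundancy a compressor can exploit.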
Re: [R-sig-phylo] phylip file compression ratios
Jacob Berv noted:

> I noticed today that the compression ratio for an interleaved phylip file
> (zip compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed),
> whereas the compression ratio for the same data non-interleaved was a much
> worse 3.4:1 (390 MB uncompressed -> 113.9 MB). Not knowing much about how zip
> compression actually works - I thought this might be an interesting
> observation for the group…

Interleaved sequences have blocks of (say) 50 bases. Successive lines may repeat a whole block or nearly repeat it. I wonder whether that makes the interleaved format easier to compress.

I would guess that the compressibility of interleaved sequences would be highest when the sequences are closely related. In that case there would be 50-base blocks of nearly identical sequences. With less closely related sequences the compressibility should be much lower.

Joe

Joe Felsenstein j...@gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA
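The guess above can be checked on simulated data. The sketch below is only an illustration under stated assumptions: it uses Python's `zlib` (DEFLATE, which is assumed here to be the method the zip tool applied), writes a simplified PHYLIP-like layout rather than a validated PHYLIP file, and simulates closely related sequences as independent light mutations of one random ancestor. All function names and parameter values are invented for the example.

```python
import random
import zlib


def make_alignment(n_taxa=10, n_sites=60_000, mut_rate=0.001, seed=1):
    """Simulate closely related sequences: one random ancestor, rare point changes."""
    rng = random.Random(seed)
    ancestor = [rng.choice("ACGT") for _ in range(n_sites)]
    seqs = []
    for _ in range(n_taxa):
        copy = ancestor[:]
        for i in range(n_sites):
            if rng.random() < mut_rate:
                copy[i] = rng.choice("ACGT")  # may redraw the same base
        seqs.append("".join(copy))
    names = [f"taxon{i}" for i in range(n_taxa)]
    return names, seqs


def sequential(names, seqs, width=50):
    """Each taxon's full sequence in turn, wrapped at `width` bases per line."""
    lines = [f" {len(seqs)} {len(seqs[0])}"]
    for name, s in zip(names, seqs):
        lines.append(name.ljust(10) + s[:width])
        lines.extend(s[i:i + width] for i in range(width, len(s), width))
    return "\n".join(lines) + "\n"


def interleaved(names, seqs, width=50):
    """Blocks of `width` columns; within a block, one line per taxon."""
    lines = [f" {len(seqs)} {len(seqs[0])}"]
    for start in range(0, len(seqs[0]), width):
        for name, s in zip(names, seqs):
            prefix = name.ljust(10) if start == 0 else ""  # names in first block only
            lines.append(prefix + s[start:start + width])
        lines.append("")  # blank line between blocks
    return "\n".join(lines) + "\n"


def ratio(text):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


def compare():
    """Return (interleaved ratio, sequential ratio) for one simulated alignment."""
    names, seqs = make_alignment()
    return ratio(interleaved(names, seqs)), ratio(sequential(names, seqs))


if __name__ == "__main__":
    inter, seq = compare()
    print(f"interleaved {inter:.1f}:1, sequential {seq:.1f}:1")
```

The sequence length is deliberately set above 32 KiB. DEFLATE only looks for repeats within a 32 KiB sliding window, so in the sequential layout the matching stretch of the previous taxon is already out of reach, while in the interleaved layout the near-identical 50-base line sits directly above. Raising `mut_rate` in the sketch is one way to explore the expectation that less closely related sequences compress less well.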
[R-sig-phylo] phylip file compression ratios
I noticed today that the compression ratio for an interleaved phylip file (zip compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed), whereas the compression ratio for the same data non-interleaved was a much worse 3.4:1 (390 MB uncompressed -> 113.9 MB). Not knowing much about how zip compression actually works - I thought this might be an interesting observation for the group…

Best,
Jake Berv