Jacob Berv noted:
>
> I noticed today that the compression ratio for an interleaved phylip file
> (zip compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed),
> whereas the compression ratio for the same data non-interleaved was a much
> worse 3.4:1 (390 MB uncompressed -> 113.9 MB). Not knowing much about how zip
> compression actually works, I thought this might be an interesting
> observation for the group…

In the interleaved format each sequence is broken into blocks of (say)
50 bases, and the corresponding blocks from all the sequences are
written on successive lines. A line may therefore repeat the line above
it exactly, or nearly so. I wonder whether that is what makes the
interleaved format easier to compress.

I would guess that interleaved files compress best when the sequences
are closely related. In that case each group of 50-base lines would be
nearly identical. With less closely related sequences the
compressibility should be much lower.
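
The guess above can be checked on simulated data. What follows is a minimal
sketch, not a reconstruction of the original 390 MB file: the function names,
the mutation scheme, and the parameter values are all assumptions made for
illustration, and taxon names and the PHYLIP header line are left out. It
uses Python's zlib module, which implements DEFLATE, the method zip uses by
default. DEFLATE only finds repeats within a 32 KB sliding window, so a
repeat one line back (interleaved) is visible to it, while a repeat one whole
sequence back (non-interleaved) is not once sequences are longer than 32 KB.

```python
import random
import zlib


def simulate_alignment(n_taxa, n_sites, divergence, seed=0):
    """Return n_taxa sequences, each a mutated copy of one random ancestor."""
    rng = random.Random(seed)
    ancestor = [rng.choice("ACGT") for _ in range(n_sites)]
    seqs = []
    for _ in range(n_taxa):
        # Each site is redrawn with probability `divergence`; a redraw can
        # pick the original base again, so the realized difference is lower.
        seq = [rng.choice("ACGT") if rng.random() < divergence else base
               for base in ancestor]
        seqs.append("".join(seq))
    return seqs


def sequential_layout(seqs):
    """Non-interleaved: one full sequence per line."""
    return "\n".join(seqs) + "\n"


def interleaved_layout(seqs, block=50):
    """Interleaved: for each block of columns, one line per sequence."""
    lines = []
    for start in range(0, len(seqs[0]), block):
        for seq in seqs:
            lines.append(seq[start:start + block])
    return "\n".join(lines) + "\n"


def deflate_ratio(text):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(zlib.compress(raw))


if __name__ == "__main__":
    # 40000 sites per sequence, so homologous sites in the sequential
    # layout lie more than 32 KB apart.
    for divergence in (0.01, 0.5):
        seqs = simulate_alignment(10, 40000, divergence)
        print("divergence", divergence,
              "interleaved", round(deflate_ratio(interleaved_layout(seqs)), 1),
              "sequential", round(deflate_ratio(sequential_layout(seqs)), 1))
```

If the reasoning holds, the interleaved ratio should be well above the
sequential one at low divergence, and the gap should shrink as divergence
rises.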
Joe
Joe Felsenstein j...@gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA