Re: [R-sig-phylo] phylip file compression ratios

2017-04-18 Thread Joe Felsenstein
Jacob Berv noted:

>
> I noticed today that the compression ratio for an interleaved phylip file 
> (zip compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed), 
> whereas the compression ratio for the same data non-interleaved was a much 
> worse 3.4:1 (390 MB uncompressed -> 113.9 MB compressed). Not knowing much 
> about how zip compression actually works, I thought this might be an 
> interesting observation for the group…


Interleaved sequences have blocks of (say) 50 bases.  Successive lines
may repeat a whole block or nearly repeat it.  I wonder whether that
makes the interleaved format easier to compress.

I would guess that the compressibility of interleaved sequences would
be highest when the sequences are closely related.  In that case there
would be 50-base blocks of nearly identical sequences.  With less
closely related sequences the compressibility should be much lower.
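
One way to check this guess is a small simulation. The sketch below is
only illustrative and makes several assumptions that are not in the
original report: sequences are simulated as independent mutated copies
of one random ancestor (a star phylogeny with uniform substitutions),
taxon names are made up, and Python's zlib module stands in for zip
(both normally use DEFLATE, whose back-reference window is 32 KiB).
The alignment is made longer than that window on purpose, so that in
the sequential layout the homologous stretch of the previous taxon is
out of reach, while in the interleaved layout it is on the line just
above.

```python
import random
import zlib


def simulate_alignment(n_taxa, n_sites, mut_rate, seed=1):
    """Hypothetical data: mutated copies of one random ancestral sequence."""
    rng = random.Random(seed)
    ancestor = [rng.choice("ACGT") for _ in range(n_sites)]
    seqs = []
    for _ in range(n_taxa):
        seq = ancestor[:]
        # Redraw the base at a fixed fraction of randomly chosen sites.
        for pos in rng.sample(range(n_sites), int(mut_rate * n_sites)):
            seq[pos] = rng.choice("ACGT")
        seqs.append("".join(seq))
    return seqs


def sequential(seqs):
    """Each taxon on one long line: 10-character name, then the sequence."""
    lines = [f" {len(seqs)} {len(seqs[0])}"]
    for i, s in enumerate(seqs):
        lines.append(f"taxon{i}".ljust(10) + s)
    return "\n".join(lines) + "\n"


def interleaved(seqs, width=50):
    """Blocks of `width` bases per taxon; names appear in the first block only."""
    n_sites = len(seqs[0])
    lines = [f" {len(seqs)} {n_sites}"]
    for start in range(0, n_sites, width):
        for i, s in enumerate(seqs):
            prefix = f"taxon{i}".ljust(10) if start == 0 else ""
            lines.append(prefix + s[start:start + width])
        lines.append("")  # blank line between blocks
    return "\n".join(lines) + "\n"


def ratio(text):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


results = {}
for rate in (0.01, 0.30):  # closely related vs. divergent (assumed values)
    seqs = simulate_alignment(n_taxa=20, n_sites=60000, mut_rate=rate)
    results[rate] = {
        "sequential": ratio(sequential(seqs)),
        "interleaved": ratio(interleaved(seqs)),
    }
    print(rate, results[rate])
```

The exact ratios depend on the number of taxa, the alignment length
and the block width, so the printed numbers should be read only as a
qualitative comparison between the two layouts and the two levels of
divergence, not as a reproduction of the 84:1 and 3.4:1 figures.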

Joe

Joe Felsenstein j...@gs.washington.edu
 Department of Genome Sciences and Department of Biology,
 University of Washington, Box 355065, Seattle, WA 98195-5065 USA

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

[R-sig-phylo] phylip file compression ratios

2017-04-18 Thread Jacob Berv
I noticed today that the compression ratio for an interleaved phylip file (zip 
compressed) was about 84:1 (390 MB uncompressed -> 4.6 MB compressed), whereas 
the compression ratio for the same data non-interleaved was a much worse 3.4:1 
(390 MB uncompressed -> 113.9 MB compressed). Not knowing much about how zip 
compression actually works, I thought this might be an interesting observation 
for the group…

Best,
Jake Berv