The current compression algorithm does lazy parsing of matches, backed
by a binary tree match-finder with one-byte hashing. Performance-wise,
this approach is poor for several reasons:
(1) One-byte hashing can distinguish at most 256 hash buckets, so it
causes a lot of hash collisions, which slow down searches (see the
sketch after this list).
(2) With
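
To illustrate point (1), here is a minimal sketch in C (hypothetical
names; this is not code from the driver) contrasting a one-byte hash
with the kind of multi-byte multiplicative hash that LZ match-finders
typically use. The one-byte variant can never produce more than 256
distinct buckets:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define HASH_BITS 15

    /* One-byte hash: at most 256 buckets, so every position whose
     * first byte matches lands in the same bucket and must be
     * examined during a search. */
    static uint32_t hash_1byte(const uint8_t *p)
    {
        return p[0];
    }

    /* Three-byte multiplicative hash: positions are spread across
     * 2^HASH_BITS buckets, so a search visits far fewer unrelated
     * positions. */
    static uint32_t hash_3bytes(const uint8_t *p)
    {
        uint32_t v = p[0] | ((uint32_t)p[1] << 8) |
                     ((uint32_t)p[2] << 16);
        return (v * 0x9E3779B1u) >> (32 - HASH_BITS);
    }

    int main(void)
    {
        const uint8_t data[] = "abcabdabe";
        for (size_t i = 0; i + 3 <= sizeof(data) - 1; i++)
            printf("pos %zu: 1-byte %3u, 3-byte %5u\n",
                   i, hash_1byte(&data[i]), hash_3bytes(&data[i]));
        return 0;
    }

The constant 0x9E3779B1 is the standard golden-ratio multiplier used
by many LZ compressors for exactly this purpose.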
Also, here are results from tests I did copying a file into a
compressed directory on an NTFS-3g mount, with the elapsed time and
compressed size shown:
silesia_corpus.tar (211,957,760 bytes)
  Current   43.318s   111,230,976 bytes
  Proposed  12.903s   111,751,168 bytes
canterbury_corpus.tar