On Tue, 8 Dec 2009, Edward Ned Harvey wrote:
If you're backing up some large data set, you're probably not going across a
slow network link. You're probably going locally disk to disk (or disk to
tape), and you probably want the rate of compression to keep pace with the
hardware I/O.
In my backups, I found that with compression enabled, jobs run slower: at
minimum 9% longer (via lzop), typically 16% longer (via gzip --fast) or 37%
longer (via gzip), and in the worst case, 238% longer (via bzip2).
One thing about bzip2 is that it operates on one block at a time, and if you
are piping data into it, the input will block while it is doing the
compression. This can slow things down significantly if the producer does not
have a large enough (900K+) buffer on its output and ends up stopping to
wait. I have had cases where this comes close to doubling the time of a
task. Creating a small app that just buffers the data from the input and
feeds it to bzip2 as fast as bzip2 can take it can speed things up
significantly.
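For illustration, a minimal buffering shim along these lines might look like
the following (a hypothetical sketch, not a tested tool; the existing mbuffer
utility does the same job properly):

    #!/usr/bin/env python3
    # pipebuffer.py -- hypothetical sketch: soak stdin into a bounded
    # in-memory queue so the upstream producer never stalls while the
    # downstream consumer (e.g. bzip2) is busy compressing a block.
    # Usage: tar cf - /some/dir | pipebuffer.py | bzip2 -c > out.tar.bz2
    import queue
    import sys
    import threading

    CHUNK = 1 << 20          # read 1 MiB at a time
    DEPTH = 64               # hold at most 64 MiB before the reader blocks

    q = queue.Queue(DEPTH)

    def reader():
        while True:
            data = sys.stdin.buffer.read(CHUNK)
            q.put(data)              # only blocks once the queue is full
            if not data:             # b'' means EOF
                return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        data = q.get()
        if not data:
            break
        sys.stdout.buffer.write(data)

The reader thread keeps draining the pipe even while bzip2 has the writer
blocked, so tar never sits idle waiting for a compression block to finish.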
I have also seen some people working on versions of bzip2 that can use
multiple cores to do the compression; besides just pushing more data
through, these versions tend to buffer the input better.
Also, there is a very definite trade-off between compression speed, final
size, and decompression speed. I've had many cases where a longer
compression time results in an end result that's enough smaller that the
time taken to decompress it is less than the time it would take to read
the uncompressed version from disk.
In my experience gzip usually falls into this category and bzip2 does not
(and I haven't used LZMA compression enough yet to get a feel for it, but
the claims are that it is similar to gzip in being a win in this scenario).
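To put illustrative numbers on that break-even (hypothetical rates, not
measurements from this thread): a 900MB file on a disk that streams 100MB/s
takes 9s to read raw. gzip'd 3:1 it is 300MB, i.e. 3s of disk time; if
gunzip can emit output at 200MB/s, that is 4.5s of CPU, and since the read
and the inflate overlap in a pipeline the whole thing costs max(3, 4.5) =
4.5s, half the raw read. The same file bzip2'd to 250MB at perhaps 35MB/s
of decompressed output costs about 26s, so the smaller file loses badly.
Roughly: compression wins on reads whenever the decompressor's output rate
beats the disk's raw streaming rate.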
by the way, is anyone aware of a version of the split utility that can
compress the output files?
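Recent GNU coreutils split has a --filter option that does exactly this,
e.g. split -b 100M --filter='gzip > $FILE.gz' - backup. Failing that, a
stand-in is small enough to sketch (hypothetical script, names and defaults
invented for illustration):

    #!/usr/bin/env python3
    # gzsplit.py -- hypothetical "split that compresses": read stdin and
    # write fixed-size chunks, gzipping each piece as it is cut.
    # Usage: tar cf - /some/dir | gzsplit.py backup. 104857600
    import gzip
    import sys

    prefix = sys.argv[1]              # output file name prefix
    chunk_bytes = int(sys.argv[2])    # uncompressed bytes per piece

    n = 0
    while True:
        data = sys.stdin.buffer.read(chunk_bytes)
        if not data:
            break
        with gzip.open('%s%04d.gz' % (prefix, n), 'wb') as out:
            out.write(data)
        n += 1

Each piece is an independent gzip file, so any single chunk can be restored
without the others; cat'ing the gunzip'd pieces back together reproduces
the original stream.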
David Lang
The root cause is piping the data stream through a single process that
compresses serially on one core. So I created threadzip.
http://code.google.com/p/threadzip/
The project is in its infancy, but it is a stable 1.0 release. It's tiny
and simple, and for those reasons there's not much room for mistakes in
the code yet.
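The trick is the same one pbzip2 and pigz use: cut the stream into blocks,
compress the blocks on several cores at once, and write the results back
out in their original order. A stripped-down illustration of the idea,
which is not the actual threadzip source:

    #!/usr/bin/env python3
    # Parallel stream compression in miniature -- NOT the threadzip code,
    # just the shape of the technique.  Each 1 MiB block of stdin is
    # deflated by a worker process; imap() returns results in input
    # order, so output is an ordered concatenation of zlib streams.
    import multiprocessing
    import sys
    import zlib

    CHUNK = 1 << 20

    def compress(block):
        return zlib.compress(block, 1)   # level 1, roughly gzip --fast

    def blocks(src):
        while True:
            data = src.read(CHUNK)
            if not data:
                return
            yield data

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)   # like threadzip's -t 4
        for out in pool.imap(compress, blocks(sys.stdin.buffer)):
            sys.stdout.buffer.write(out)

A real tool needs a matching decompressor that knows where the block
boundaries are (or walks the concatenated zlib streams), which is why such
tools define their own framing around the compressed blocks.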
Based on the results below, the clear winners are:
If you care about speed: threadzip
If you care about size: pbzip2
time   size    command
164s   930MB   tar cf - /usr/share | cat > /dev/null
167s   377MB   tar cf - /usr/share | threadzip.py -t 4 --fast > /dev/null
179s   433MB   tar cf - /usr/share | lzop -c > /dev/null
190s   378MB   tar cf - /usr/share | gzip --fast > /dev/null
200s   301MB   tar cf - /usr/share | pbzip2 -c > /dev/null
225s   345MB   tar cf - /usr/share | gzip > /dev/null
391s   300MB   tar cf - /usr/share | bzip2 -c > /dev/null
_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/