On Tue, 8 Dec 2009, Edward Ned Harvey wrote:

If you're backing up some large data set, you're probably not going across a
slow network link.  You're probably going locally disk to disk (or disk to
tape), and you probably want the rate of compression to keep pace with the
hardware I/O.



In my backups, I found that with compression enabled, jobs run slower: at
minimum 9% longer (via lzop), typically 16% longer (via gzip --fast) or 37%
longer (via gzip), and in the worst case 238% longer (via bzip2).

One thing with bzip2 is that it operates on one block at a time, and if you are piping data into it, the input will block while it is doing the compression. This can slow things down significantly if the producer does not have a large enough (900K+) buffer on its output and ends up stopping to wait. I have had cases where this comes close to doubling the time of a task. Creating a small app that just buffers the data from the input and feeds it to bzip2 as fast as it can take it could speed things up significantly.
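The "small app to just buffer the data" idea could be sketched roughly like this in Python (the function name and buffer sizes are illustrative, not an existing tool): a reader thread drains the producer into a queue, so the producer is never stalled by bzip2's per-block compression pauses.

```python
import bz2
import queue
import threading

def buffered_bzip2(src, dst, chunk=64 * 1024, depth=256):
    """Copy src -> bzip2 -> dst, with a thread decoupling producer from compressor."""
    q = queue.Queue(maxsize=depth)   # up to depth*chunk bytes of slack

    def reader():
        # Drain the producer as fast as it can write, independent of
        # how long each bzip2 block takes to compress.
        while True:
            data = src.read(chunk)
            q.put(data)
            if not data:             # empty read = EOF sentinel
                break

    threading.Thread(target=reader, daemon=True).start()
    comp = bz2.BZ2Compressor(9)      # 900K blocks, as in the stock tool
    while True:
        data = q.get()
        if not data:
            break
        dst.write(comp.compress(data))
    dst.write(comp.flush())
```

In a real pipeline, `src` would be `sys.stdin.buffer` and `dst` would be `sys.stdout.buffer`; the queue depth bounds memory while still giving the producer plenty of room to run ahead.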

I have also seen some people working on versions of bzip2 that can use multiple cores to do the compression; besides just processing more data, these versions tend to buffer the input better.

Also, there is a very definite trade-off between compression speed, final size, and decompression speed. I've had many cases where spending more time compressing produces a result that's enough smaller that the time taken to decompress it is less than the time it would take to read the uncompressed version from disk.

In my experience gzip usually falls in this category and bzip2 does not (I haven't used LZMA compression enough yet to get a feel for it, but the claim is that it is similar to gzip in being a win in this scenario).
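The break-even point described above is simple arithmetic: reading compressed data wins when the bytes saved on I/O outweigh the decompression time. A back-of-envelope sketch, using the sizes from the tar measurements below and a hypothetical disk throughput and decompression time:

```python
# Reading compressed pays off when: size_c / disk_bw + t_decompress < size_u / disk_bw
disk_bw = 100e6          # hypothetical disk throughput, bytes/s
size_u = 930e6           # uncompressed tar size (from the results below)
size_c = 378e6           # gzip --fast size (from the results below)
t_decompress = 4.0       # hypothetical decompression time, seconds

t_raw = size_u / disk_bw                 # 9.3 s to read the raw tar
t_gz = size_c / disk_bw + t_decompress   # 3.78 s + 4.0 s = 7.78 s
# With these numbers, gzip is a win; a slow decompressor (bzip2-like)
# would push t_decompress past the 5.52 s of saved read time and lose.
```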

By the way, is anyone aware of a version of the split utility that can compress the output files?
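(A sketch of one way to get this without a modified split, as a small Python filter; the function name and file-naming scheme are made up for illustration:)

```python
import gzip

def gzsplit(src, prefix, chunk_bytes=100 * 1024 * 1024):
    """Read src and write it out as prefix.000.gz, prefix.001.gz, ...
    Each piece is independently gzipped, so pieces can be restored
    individually or concatenated after decompression."""
    n = 0
    while True:
        data = src.read(chunk_bytes)
        if not data:
            break
        with gzip.open("%s.%03d.gz" % (prefix, n), "wb") as out:
            out.write(data)
        n += 1
    return n
```

Wired to `sys.stdin.buffer`, this drops into a pipeline the same way split does, e.g. after `tar cf - /usr/share`. (GNU coreutils split later grew a `--filter` option that does this natively: `split -b 100M --filter='gzip > $FILE.gz'`.)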

David Lang



The root cause is piping the data stream through a single processor to
compress serially.  So I created threadzip.

http://code.google.com/p/threadzip/
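A minimal sketch of the parallel-chunk idea (this is not threadzip's actual code; names and parameters are illustrative): carve the stream into chunks and compress them on worker threads. CPython's zlib releases the GIL while compressing, so plain threads give real parallelism here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def parallel_compress(src, chunk=4 * 1024 * 1024, threads=4):
    """Yield independently deflated chunks of src, in input order."""
    def read_chunks():
        while True:
            data = src.read(chunk)
            if not data:
                return
            yield data

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map preserves input order while chunks compress concurrently.
        # (Executor.map reads ahead eagerly; a real tool would bound
        # its read-ahead to cap memory use.)
        for blob in pool.map(lambda d: zlib.compress(d, 1), read_chunks()):
            yield blob
```

Each chunk is an independent deflate stream, so decompression just runs `zlib.decompress` over the chunks in order and concatenates the results.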



This project is in its infancy, but it is a stable 1.0 release. It's tiny
and simple, and for these reasons there's not much room for mistakes in the
code yet.



Based on the results below, the clear winners are:

If you care about speed:  threadzip

If you care about size:  pbzip2



164s   930MB   tar cf - /usr/share | cat > /dev/null
167s   377MB   tar cf - /usr/share | threadzip.py -t 4 --fast > /dev/null
179s   433MB   tar cf - /usr/share | lzop -c > /dev/null
190s   378MB   tar cf - /usr/share | gzip --fast > /dev/null
200s   301MB   tar cf - /usr/share | pbzip2 -c > /dev/null
225s   345MB   tar cf - /usr/share | gzip > /dev/null
391s   300MB   tar cf - /usr/share | bzip2 -c > /dev/null

_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
