I run gzip in production, mostly because we have no requirement for random 
access, and the improved compression ratio is a big win for our application.

The other day, I ran some tests comparing gzip and LZO to get some numbers on 
what performance we might or might not be missing.  What I found is that 
performance is generally a wash, so I'm happy to stick with the better 
compression ratio and the simpler build of the gzip native libs.

Details of the test:

I took 2 months of records from our test environment (13,499,320 to be exact, 
about 1 kB each) and copied them over to new tables. One table was compressed 
with gzip, the other with LZO.  Then I compacted each table.  The gzip table 
wound up using 11 regions to store the data, while the LZO table used 15 
regions (region size and block size settings are all default).  I then ran a 
simple MapReduce job against each table.  As you would expect, the job against 
the gzip table finished more quickly because it had fewer maps to run (this is 
a 2-node test cluster, so it had to run the maps in series).
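
To make the setup concrete, the table creation and compaction step looked 
roughly like the sketch below. This is written from memory against the 
0.90-era Java client API, and the table names ("records_gz", "records_lzo") 
and family name ("d") are placeholders, not our real schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateCompressedTables {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // one table per codec, identical except for COMPRESSION
            HTableDescriptor gz = new HTableDescriptor("records_gz");
            HColumnDescriptor gzFam = new HColumnDescriptor("d");
            gzFam.setCompressionType(Compression.Algorithm.GZ);
            gz.addFamily(gzFam);
            admin.createTable(gz);

            HTableDescriptor lzo = new HTableDescriptor("records_lzo");
            HColumnDescriptor lzoFam = new HColumnDescriptor("d");
            // LZO requires the codec jar + native lib on every node
            lzoFam.setCompressionType(Compression.Algorithm.LZO);
            lzo.addFamily(lzoFam);
            admin.createTable(lzo);

            // (load the data here, then...)
            // force a major compaction so every store file is
            // rewritten with the column family's codec
            admin.majorCompact("records_gz");
            admin.majorCompact("records_lzo");
        }
    }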

Best times of several trials were:

LZO: 2mins, 24sec
Gzip: 2mins, 21sec
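
For reference, the MapReduce job was nothing fancy; it was roughly the shape 
sketched below. This is a reconstruction rather than the exact code, and the 
class name is made up. One map task runs per region, which is why the gzip 
table (11 regions) needed fewer maps than the LZO table (15 regions):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanTimer {
        // one map task per region; just touch every row and count it
        static class CountMapper extends TableMapper<NullWritable, NullWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value,
                    Context ctx) throws IOException, InterruptedException {
                ctx.getCounter("scan", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "scan-timer-" + args[0]);
            job.setJarByClass(ScanTimer.class);

            Scan scan = new Scan();
            scan.setCaching(1000);      // bigger scanner batches for a full scan
            scan.setCacheBlocks(false); // don't churn the block cache

            TableMapReduceUtil.initTableMapperJob(args[0], scan,
                    CountMapper.class, NullWritable.class, NullWritable.class,
                    job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }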

Obviously, this is a tiny dataset on a tiny cluster, but I don't see any reason 
why the results wouldn't hold up in a real-world setting, all else being equal 
(at least for my workload).  I'll probably continue to use gzip (with the 
native libs - falling back to the pure-Java gzip implementation is terrible) 
and maybe look at Snappy in the future if our requirements change.
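
One tip related to the native libs: you can verify up front whether the native 
hadoop library actually loaded, since gzip silently falls back to the slow 
pure-Java java.util.zip path when it doesn't. A minimal check, assuming only 
the stock Hadoop jars on the classpath:

    import org.apache.hadoop.util.NativeCodeLoader;

    public class CheckNative {
        public static void main(String[] args) {
            // true only if libhadoop.so was found on java.library.path,
            // which is what enables the fast native zlib/gzip codec
            System.out.println("native hadoop loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
        }
    }

The task logs also note whether the native library was picked up, but a check 
like this is easier to script into node builds.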

Sandy


-----Original Message-----
From: Wayne [mailto:[email protected]] 
Sent: Wednesday, September 14, 2011 5:34 AM
To: [email protected]
Subject: Compression

I wanted to do a poll on what compression libraries people are using and why. 
We currently use LZO but are considering alternatives for various reasons. We 
would like to move to CDH3, but adding LZO ourselves is a hassle we are not 
looking to take on; it kind of defeats the purpose of using CDH3 to begin 
with. We currently run 0.20-append.

I know there are a lot of variables that affect the best decision, but we are 
looking for general trends in the community.

Is LZO still the most recommended? Is there benefit in using the LZO 
Professional library, and does anyone use it?
Is Snappy just as good as LZO and a lot easier to deal with in terms of node 
builds/releases?
Does zlib/gzip have any traction?

Compression ratios are important, but as always, performance/speed is our 
biggest requirement. What are people using and why? Where is the momentum 
going? Compression is a huge benefit of Hadoop/HBase, and having high 
compression ratios with solid performance is a major win.

Any recommendations would be appreciated.

Thanks.
