I run gzip in production, mostly because we have no requirement for random access, and the improved compression ratio is a big win for our application.
The other day, I ran some tests comparing gzip and LZO to get some numbers on what performance we might or might not be missing. What I found is that performance is generally a wash, so I'm happy to continue with the better compression ratio and simpler build of the gzip native libs.

Details of the test: I took 2 months of records from our test environment (13,499,320 to be exact, about 1kB each) and copied them over to new tables. One table was compressed with gzip, the other with LZO. Then I compacted each table. The gzip table wound up using 11 regions to store the data, while the LZO table used 15 regions (region size and block size settings are all default).

I then ran a simple MapReduce job against each table. As you would expect, the job against the gzip table finished more quickly because it had fewer maps to do (this is a 2-node test cluster, so it had to run the maps in series). Best times of several trials were:

LZO:  2 min, 24 sec
Gzip: 2 min, 21 sec

Obviously, this is a tiny dataset on a tiny cluster, but I don't see any reason why the results wouldn't hold up in a real-world setting, all else being equal (at least for my workload). I'll probably continue to use gzip (with native libs - falling back to Java for gzip is terrible) and maybe look at snappy in the future if our requirements change.

Sandy

-----Original Message-----
From: Wayne [mailto:[email protected]]
Sent: Wednesday, September 14, 2011 5:34 AM
To: [email protected]
Subject: Compression

I wanted to do a poll on what compression libraries people are using and why. We currently use lzo but are considering other alternatives for various reasons. We would like to move to CDH3, but adding lzo ourselves is a hassle we are not looking to take on. It kind of defeats the purpose of using CDH3 to begin with. We currently run 0.20-append.

I know there are a lot of variables that affect the best decision, but we are looking for general trends in the community. Is lzo still the most recommended?
Is there benefit in using the lzo professional library, and does anyone use this? Is snappy just as good as lzo and a lot easier to deal with in terms of node builds/releases? Does zlib/gzip have any traction? Compression ratios are important, but as always, performance/speed is our biggest requirement. What are people using and why? Where is the momentum going? Compression is a huge benefit of hadoop/hbase, and having high compression ratios with solid performance is a major benefit. Any recommendations would be appreciated. Thanks.
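For anyone wanting to reproduce a comparison like Sandy's, the per-table compression setup can be sketched in the HBase shell. This is a minimal sketch, not the exact commands used in the test; the table and column-family names ('records_gz', 'records_lzo', 'd') are hypothetical, and LZO requires the native codec to be installed on every node first.

```shell
# Run inside the HBase shell (bin/hbase shell).
# Compression is set per column family via the COMPRESSION attribute.
create 'records_gz',  {NAME => 'd', COMPRESSION => 'GZ'}
create 'records_lzo', {NAME => 'd', COMPRESSION => 'LZO'}

# After copying the data in, force a major compaction so the stored
# files are fully rewritten with the chosen codec:
major_compact 'records_gz'
major_compact 'records_lzo'
```

Region counts per table (11 vs 15 in the test above) can then be compared from the HBase master web UI or the shell's status output.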
