You're definitely going to want to use the native libraries for zlib and gzip.
http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.
You can put a symlink to hadoop/lib/native in hbase/lib and you're done.
When HBase falls back to Java for GZ and zlib, it will definitely be a bad
thing =/

Sandy

> -----Original Message-----
> From: BlueDavy Lin [mailto:[email protected]]
> Sent: Wednesday, August 17, 2011 19:07
> To: [email protected]
> Subject: Re: GZ better than LZO?
>
> We tested gz also, but when we used gz, it seemed to cause us to run
> out of memory.
>
> It may be because gz does not use Deflater/Inflater correctly (it does
> not call the end() method explicitly).
>
> 2011/8/18 Sandy Pratt <[email protected]>:
> > I also switched from LZO to GZ a while back. I didn't do any
> > micro-benchmarks, but I did note that the overall time of some MR jobs
> > on our small cluster (~2B records at the time IIRC) went down slightly
> > after the change.
> >
> > The primary reason I switched was not due to performance, however, but
> > due to compression ratio and licensing/build issues. AFAIK, the GZ code
> > is branched, tested and released along with Hadoop, whereas LZO wasn't
> > when I last used it (not an academic concern, it turned out).
> >
> > One speculation about where the discrepancy between micro-benchmarks
> > and actual use may arise: do benchmarks include the cost of marshaling
> > the data (say a 64MB-before-compression region) from disk? If the
> > benchmark starts with the data in memory (and how do you know whether
> > it does, given the layers of cache between you and the platters), then
> > it might not reflect real-world HBase scenarios. GZ may need to read
> > only 20MB while LZO might need to read 32MB. Does that difference
> > dominate the computational cost of decompression?
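BlueDavy's leak hypothesis above turns on the fact that Deflater and Inflater hold native zlib memory outside the Java heap, which is released promptly only by an explicit end() call. A minimal JDK-only sketch of the safe pattern (the class and helper names here are illustrative, not the actual Hadoop codec code):

```java
import java.util.zip.Deflater;

public class DeflaterLifecycle {

    // Compress a byte[] with zlib and release the native state explicitly.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            // Worst-case zlib output is only slightly larger than the input.
            byte[] buf = new byte[input.length + 64];
            int n = deflater.deflate(buf);
            byte[] out = new byte[n];
            System.arraycopy(buf, 0, out, 0, n);
            return out;
        } finally {
            // Without this, the native zlib buffers are freed only when the
            // GC eventually finalizes the Deflater, which can be arbitrarily
            // late and looks like a memory leak under load.
            deflater.end();
        }
    }
}
```

Relying on finalization instead of the finally block is exactly what can look like running out of memory under heavy load, since the native buffers are invisible to Java heap accounting.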
> >
> > Sandy
> >
> >> -----Original Message-----
> >> From: lars hofhansl [mailto:[email protected]]
> >> Sent: Friday, July 29, 2011 08:44
> >> To: [email protected]
> >> Subject: Re: GZ better than LZO?
> >>
> >> For what it's worth, I had similar observations.
> >>
> >> I simulated heavy write load and found that NO compression was the
> >> fastest, followed by GZ, followed by LZO.
> >> After the tests I did a major_compact of the tables, and I included
> >> that time in the total.
> >> Also, these tests were done with a single region server, in order to
> >> isolate compression performance better.
> >>
> >> So at least you're not the only one seeing this :) However, it seems
> >> that this heavily depends on the details of your setup (relative CPU
> >> vs. IO performance, for example).
> >>
> >> ----- Original Message -----
> >> From: Steinmaurer Thomas <[email protected]>
> >> To: [email protected]
> >> Cc:
> >> Sent: Thursday, July 28, 2011 11:27 PM
> >> Subject: RE: GZ better than LZO?
> >>
> >> Hello,
> >>
> >> we simulated realistic-looking data (as in our expected production
> >> system) with respect to row key, column families, etc.
> >>
> >> The test client (TDG) basically implements a three-part row key:
> >>
> >> vehicle-device-reversedtimestamp
> >>
> >> vehicle: 16 characters, left-padded with "0"
> >> device: 16 characters, left-padded with "0"
> >> reversedtimestamp: YYYYMMDDhhmmss
> >>
> >> There are four column families, although currently only one, called
> >> "data_details", is filled by the TDG. The others are reserved for
> >> later use. Replication (REPLICATION_SCOPE = 1) is enabled for all
> >> column families.
> >>
> >> The qualifiers for "data_details" are basically based on an enum with
> >> 25 members, and each member has three occurrences, defined by adding
> >> a different suffix to the qualifier name.
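The three-part row key described above could be sketched as follows. The exact meaning of "reversed" isn't spelled out in the thread, so the nines-complement of the YYYYMMDDhhmmss value used here (a common trick to make newer rows sort first lexicographically) is an assumption, as are the class and method names:

```java
public class RowKeyBuilder {

    // Sketch of the three-part row key: vehicle-device-reversedtimestamp.
    // The nines-complement "reversal" below is an assumption, not taken
    // from the thread.
    static String rowKey(String vehicle, String device, long yyyymmddhhmmss) {
        // Left-pad both identifiers to 16 characters with "0".
        String v = String.format("%16s", vehicle).replace(' ', '0');
        String d = String.format("%16s", device).replace(' ', '0');
        // Newer timestamps produce smaller values, so they sort first.
        long reversed = 99999999999999L - yyyymmddhhmmss;
        return v + "-" + d + "-" + String.format("%014d", reversed);
    }
}
```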
> >>
> >> Let's say there is an enum member called "temperature1"; then the
> >> following qualifiers are used:
> >>
> >> temperature1_value
> >> temperature1_unit
> >> temperature1_validity
> >>
> >> So we end up with 25 * 3 = 75 qualifiers per row, each filled with a
> >> random value in the range [0, 65535].
> >>
> >> TDG basically allows defining the number of simulated clients (one
> >> thread per client) and running them in multi-threaded or
> >> single-threaded mode. Data volume is defined by the number of
> >> iterations of the set of simulated clients, the number of iterations
> >> per client, the number of devices per client and the number of rows
> >> per device.
> >>
> >> After the test finished, 1,008,000 rows were inserted and
> >> successfully replicated to our backup test cluster.
> >>
> >> Any further ideas?
> >>
> >> PS: We are currently running a test with ~4 million rows following
> >> the pattern above.
> >>
> >> Thanks,
> >> Thomas
> >>
> >> -----Original Message-----
> >> From: Chiku [mailto:[email protected]]
> >> Sent: Donnerstag, 28. Juli 2011 15:35
> >> To: [email protected]
> >> Subject: Re: GZ better than LZO?
> >>
> >> Are you getting these results because of the nature of the test data
> >> generated?
> >>
> >> Would you mind sharing some details about the test client and the
> >> data it generates?
> >>
> >> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> >> [email protected]> wrote:
> >>
> >> > Hello,
> >> >
> >> > we ran a test client generating data into a GZ and an LZO
> >> > compressed table with equal data sets (number of rows: 1008000 and
> >> > the same table schema), ~7.78 GB disk space uncompressed in HDFS.
> >> > LZO is ~887 MB whereas GZ is ~444 MB, so basically half of LZO.
> >> >
> >> > Execution time of the data generating client was 1373 seconds into
> >> > the uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ.
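The 25-members-times-3-suffixes qualifier scheme described in the thread works out as below. The class and method names are illustrative, not the actual TDG code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class QualifierGenerator {

    // The three suffixes per measurement, as described in the thread.
    static final String[] SUFFIXES = {"_value", "_unit", "_validity"};

    // Build the 25 x 3 = 75 qualifier->value pairs for one row, each with
    // a random value in [0, 65535].
    static Map<String, Integer> rowColumns(String[] members, Random rnd) {
        Map<String, Integer> cols = new LinkedHashMap<>();
        for (String member : members) {
            for (String suffix : SUFFIXES) {
                cols.put(member + suffix, rnd.nextInt(65536));
            }
        }
        return cols;
    }
}
```

Uniform random values in [0, 65535] are close to incompressible per cell, which is worth keeping in mind when reading the compression ratios reported below.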
> >> > The data generation client is based on HTablePool and uses batch
> >> > operations.
> >> >
> >> > So in our (simple) test, GZ beats LZO in both disk usage and
> >> > execution time of the client. We haven't tried reads yet.
> >> >
> >> > Is this an expected result? I thought LZO was the recommended
> >> > compression algorithm? Or does LZO outperform GZ with a growing
> >> > amount of data, or in read scenarios?
> >> >
> >> > Regards,
> >> >
> >> > Thomas
> >> >
> >>
> >
>
> --
> =============================
> |        BlueDavy           |
> |  http://www.bluedavy.com  |
> =============================
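Since the thread's question is ultimately a speed-versus-ratio tradeoff, it's worth noting that zlib itself exposes that knob through its compression level. A small JDK-only sketch for measuring compressed size at different levels on your own data (the class name is made up; the resulting sizes depend entirely on the input):

```java
import java.util.zip.Deflater;

public class LevelComparison {

    // Compressed size of the input at a given zlib level (1 = fastest,
    // 9 = best ratio); the Deflater is always released via end().
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        try {
            d.setInput(input);
            d.finish();
            // Worst-case zlib output is only slightly larger than the input.
            byte[] buf = new byte[input.length + 64];
            return d.deflate(buf);
        } finally {
            d.end();
        }
    }
}
```

Running this over a sample of your actual cell data (rather than synthetic input) gives a rough idea of how much ratio a faster level would cost before committing to a codec for the whole table.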
