We tested GZ as well, but when we use it, it seems to run out of memory.
It may be because GZ does not use Deflater/Inflater correctly (it does not call the end() method explicitly).
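To show the kind of usage we mean, here is a minimal sketch with plain java.util.zip (not the actual Hadoop/HBase codec code; the class and method names are made up for the example):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch only: compress a byte[] and always release the native zlib
// memory with end(), instead of waiting for finalization/GC.
public class GzEndExample {
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64 * 1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // frees the native buffers immediately
        }
    }
}

If end() is never called, the native memory held by zlib is typically only released when the Deflater object is finalized, so under heavy write load it can look like a memory leak.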
2011/8/18 Sandy Pratt <[email protected]>:
> I also switched from LZO to GZ a while back. I didn't do any
> micro-benchmarks, but I did note that the overall time of some MR jobs on our
> small cluster (~2B records at the time IIRC) went down slightly after the
> change.
>
> The primary reason I switched was not due to performance, however, but due to
> compression ratio and licensing/build issues. AFAIK, the GZ code is
> branched, tested and released along with Hadoop, whereas LZO wasn't when I
> last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and
> actual use may arise: do benchmarks include the cost of marshaling the data
> (say, a 64MB-before-compression region) from disk? If the benchmark starts with
> the data in memory (and how do you know whether it does, given the layers
> of cache between you and the platters), then it might not reflect real-world
> HBase scenarios. GZ may need to read only 20MB while LZO might need to read
> 32MB. Does that difference dominate the computational cost of decompression?
>
> Sandy
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:[email protected]]
>> Sent: Friday, July 29, 2011 08:44
>> To: [email protected]
>> Subject: Re: GZ better than LZO?
>>
>> For what it's worth, I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included that time
>> in the total.
>> Also, these tests were done with a single region server, in order to isolate
>> compression performance better.
>>
>> So at least you're not the only one seeing this :) However, it seems that this
>> heavily depends on the details of your setup (relative CPU vs. IO
>> performance, for example).
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated realistic-looking data (as in our expected production system) with
>> respect to row key, column families ...
>>
>> The test client (TDG) basically implements a three-part row key:
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
>>
>> There are four column families, although currently only one, called
>> "data_details", is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are basically based on an enum with 25
>> members, and each member has three occurrences, defined by adding a
>> different suffix to the qualifier name.
>>
>> Let's say there is an enum member called "temperature1"; then the
>> following qualifiers are used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So we end up with 25 * 3 = 75 qualifiers per row, each filled with a random
>> value in the range [0, 65535].
>>
>> TDG basically allows you to define the number of simulated clients (one thread
>> per client) and to run them in multi-threaded or single-threaded mode.
>> Data volume is defined by the number of iterations of the set of
>> simulated clients, the number of iterations per client, the number of devices
>> per client, and the number of rows per device.
>>
>> After the test finished, 1,008,000 rows had been inserted and successfully
>> replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~4 million rows following the pattern
>> above.
>>
>> Thanks,
>> Thomas
>>
>> -----Original Message-----
>> From: Chiku [mailto:[email protected]]
>> Sent: Thursday, July 28, 2011 15:35
>> To: [email protected]
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting these results because of the nature of the test data generated?
>>
>> Would you mind sharing some details about the test client and the data it
>> generates?
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas
>> <[email protected]> wrote:
>>
>> > Hello,
>> >
>> > we ran a test client generating data into a GZ- and an LZO-compressed table.
>> > Equal data sets (number of rows: 1,008,000) and the same table schema. ~
>> > 7.78 GB disk space uncompressed in HDFS. LZO is ~887 MB whereas GZ is
>> > ~444 MB, so basically half of LZO.
>> >
>> > Execution time of the data-generating client was 1373 seconds into the
>> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>> > generation client is based on HTablePool and uses batch operations.
>> >
>> > So in our (simple) test, GZ beats LZO in both disk usage and
>> > execution time of the client. We haven't tried reads yet.
>> >
>> > Is this an expected result? I thought LZO was the recommended
>> > compression algorithm? Or does LZO outperform GZ with a growing
>> > amount of data, or in read scenarios?
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >

--
=============================
|  BlueDavy                  |
|  http://www.bluedavy.com   |
=============================
