We tested GZ as well, but when we use it, it seems to run out of memory.
It may be because GZ does not use Deflater/Inflater correctly (it does not call the end() method explicitly).
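To show the kind of usage we mean, here is a minimal sketch with plain java.util.zip (not the actual Hadoop/HBase codec code; the class and method names are made up for the example):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch only: compress a byte[] and always release the native zlib
// memory with end(), instead of waiting for finalization/GC.
public class GzEndExample {
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64 * 1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // frees the native buffers immediately
        }
    }
}

If end() is never called, the native memory held by zlib is typically only released when the Deflater object is finalized, so under heavy write load it can look like a memory leak.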
2011/8/18 Sandy Pratt <[email protected]>:
> I also switched from LZO to GZ a while back. I didn't do any
> micro-benchmarks, but I did note that the overall time of some MR jobs on our
> small cluster (~2B records at the time IIRC) went down slightly after the
> change.
>
> The primary reason I switched was not due to performance, however, but due to
> compression ratio and licensing/build issues. AFAIK, the GZ code is
> branched, tested and released along with Hadoop, whereas LZO wasn't when I
> last used it (not an academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and
> actual use may arise: do benchmarks include the cost of marshaling the data
> (say, a 64MB-before-compression region) from disk? If the benchmark starts with
> the data in memory (and how do you know whether it does, given the layers
> of cache between you and the platters), then it might not reflect real-world
> HBase scenarios. GZ may need to read only 20MB while LZO might need to read
> 32MB. Does that difference dominate the computational cost of decompression?
>
> Sandy
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:[email protected]]
>> Sent: Friday, July 29, 2011 08:44
>> To: [email protected]
>> Subject: Re: GZ better than LZO?
>>
>> For what it's worth, I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included that time
>> in the total.
>> Also, these tests were done with a single region server, in order to isolate
>> compression performance better.
>>
>> So at least you're not the only one seeing this :) However, it seems that this
>> heavily depends on the details of your setup (relative CPU vs. IO
>> performance, for example).
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated realistic-looking data (as in our expected production system) with
>> respect to row key, column families ...
>>
>> The test client (TDG) basically implements a three-part row key:
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
>>
>> There are four column families, although currently only one, called
>> "data_details", is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are basically based on an enum with 25
>> members, and each member has three occurrences, defined by adding a
>> different suffix to the qualifier name.
>>
>> Let's say there is an enum member called "temperature1"; then the
>> following qualifiers are used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So we end up with 25 * 3 = 75 qualifiers per row, each filled with a random
>> value in the range [0, 65535].
>>
>> TDG basically allows you to define the number of simulated clients (one thread
>> per client) and to run them in multi-threaded or single-threaded mode.
>> Data volume is defined by the number of iterations of the set of
>> simulated clients, the number of iterations per client, the number of devices
>> per client, and the number of rows per device.
>>
>> After the test finished, 1,008,000 rows had been inserted and successfully
>> replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~4 million rows following the pattern
>> above.
>>
>> Thanks,
>> Thomas
>>
>> -----Original Message-----
>> From: Chiku [mailto:[email protected]]
>> Sent: Thursday, July 28, 2011 15:35
>> To: [email protected]
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting these results because of the nature of the test data generated?
>>
>> Would you mind sharing some details about the test client and the data it
>> generates?
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas
>> <[email protected]> wrote:
>>
>> > Hello,
>> >
>> > we ran a test client generating data into a GZ- and an LZO-compressed table.
>> > Equal data sets (number of rows: 1,008,000) and the same table schema. ~
>> > 7.78 GB disk space uncompressed in HDFS. LZO is ~887 MB whereas GZ is
>> > ~444 MB, so basically half of LZO.
>> >
>> > Execution time of the data-generating client was 1373 seconds into the
>> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>> > generation client is based on HTablePool and uses batch operations.
>> >
>> > So in our (simple) test, GZ beats LZO in both disk usage and
>> > execution time of the client. We haven't tried reads yet.
>> >
>> > Is this an expected result? I thought LZO was the recommended
>> > compression algorithm? Or does LZO outperform GZ with a growing
>> > amount of data, or in read scenarios?
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >

--
=============================
|  BlueDavy                  |
|  http://www.bluedavy.com   |
=============================
