I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks, 
but I did note that the overall time of some MR jobs on our small cluster (~2B 
records at the time IIRC) went down slightly after the change.

The primary reason I switched was not performance, however, but compression 
ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and 
released along with Hadoop, whereas LZO wasn't when I last used it (not an 
academic concern, it turned out).

One speculation about where the discrepancy between micro-benchmarks and actual 
use may arise: do the benchmarks include the cost of marshaling the data (say, 
a 64MB region before compression) from disk?  If a benchmark starts with the 
data in memory (and how do you know whether it does, given the layers of cache 
between you and the platters), it might not reflect real-world HBase scenarios.  
GZ may need to read only 20MB where LZO needs to read 32MB.  
Does that difference dominate the computational cost of decompression?
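
Back-of-envelope with made-up numbers, just to show the shape of the trade-off: 
at ~100 MB/s of sequential read, LZO's 32MB costs ~320ms and GZ's 20MB costs 
~200ms, so GZ saves ~120ms of IO.  If LZO decompresses the 64MB region at, say, 
~500 MB/s (~130ms) and GZ at ~100 MB/s (~640ms), GZ pays ~510ms of extra CPU 
and loses.  But if the data is already in page cache the IO term vanishes, and 
on a CPU-starved region server the decompression term balloons; the winner 
flips with the hardware, which would explain the conflicting reports.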


Sandy


> -----Original Message-----
> From: lars hofhansl [mailto:lhofha...@yahoo.com]
> Sent: Friday, July 29, 2011 08:44
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> For what it's worth, I had similar observations.
> 
> I simulated heavy write load and I found that NO compression was the
> fastest, followed by GZ, followed by LZO.
> After the tests I did a major_compact of the tables, and I included that time
> in the total.
> Also, these tests were done with a single region server, in order to isolate
> compression performance better.
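> 
> (If you want to reproduce: a minimal sketch of the table setup via the Java
> admin API; the table/family names and the 0.90-era calls are my assumptions:)
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.HColumnDescriptor;
>   import org.apache.hadoop.hbase.HTableDescriptor;
>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>   import org.apache.hadoop.hbase.io.hfile.Compression;
> 
>   Configuration conf = HBaseConfiguration.create();
>   HBaseAdmin admin = new HBaseAdmin(conf);
> 
>   // one table per codec, otherwise identical schema
>   HTableDescriptor desc = new HTableDescriptor("t_gz");
>   HColumnDescriptor cf = new HColumnDescriptor("f1");
>   cf.setCompressionType(Compression.Algorithm.GZ); // or LZO / NONE
>   desc.addFamily(cf);
>   admin.createTable(desc);
> 
>   // after the write load: kick off the major compaction
>   // (asynchronous; time it by watching the region server)
>   admin.majorCompact("t_gz");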
> 
> 
> So at least you're not the only one seeing this :) However, it seems that this
> heavily depends on the details of your setup (relative CPU vs IO
> performance, for example).
> 
> 
> ----- Original Message -----
> From: Steinmaurer Thomas <thomas.steinmau...@scch.at>
> To: user@hbase.apache.org
> Cc:
> Sent: Thursday, July 28, 2011 11:27 PM
> Subject: RE: GZ better than LZO?
> 
> Hello,
> 
> we simulated realistic-looking data (as in our expected production system)
> with respect to row key, column families, etc.
> 
> The test client (TDG) basically implements a three-part row key.
> 
> vehicle-device-reversedtimestamp
> 
> vehicle: 16 characters, left-padded with "0"
> device: 16 characters, left-padded with "0"
> reversedtimestamp: YYYYMMDDhhmmss
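> 
> In Java, the key construction is along these lines (a simplified sketch,
> not the literal TDG code; sample IDs, and the padding helper is from
> commons-lang):
> 
>   import org.apache.commons.lang.StringUtils;
>   import java.text.SimpleDateFormat;
>   import java.util.Date;
> 
>   String vehicleId = "4711";  // sample IDs, for illustration only
>   String deviceId  = "42";
>   String vehicle = StringUtils.leftPad(vehicleId, 16, '0');
>   String device  = StringUtils.leftPad(deviceId, 16, '0');
>   // timestamp part, YYYYMMDDhhmmss (the actual reversal scheme omitted here)
>   String ts = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
>   String rowKey = vehicle + "-" + device + "-" + ts;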
> 
> There are four column families, although currently only one called
> "data_details" is filled by the TDG. The others are reserved for later use.
> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
> 
> The qualifiers for "data_details" are based on an enum with 25 members,
> and each member has three occurrences, defined by adding a different
> suffix to the qualifier name.
> 
> Say there is an enum member called "temperature1"; then the following
> qualifiers are used:
> 
> temperature1_value
> temperature1_unit
> temperature1_validity
> 
> So we end up with 25 * 3 = 75 qualifiers per row, each filled with a
> random value in the range [0, 65535].
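> 
> Per row, the fill loop looks roughly like this (again a sketch; the
> "Member" enum stands in for our real 25-member enum):
> 
>   import java.util.Random;
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.util.Bytes;
> 
>   String[] suffixes = { "_value", "_unit", "_validity" };
>   Random rnd = new Random();
>   Put put = new Put(Bytes.toBytes(rowKey));
>   for (Member m : Member.values()) {      // the 25 enum members
>       for (String suffix : suffixes) {
>           int v = rnd.nextInt(65536);     // random value in [0, 65535]
>           put.add(Bytes.toBytes("data_details"),
>                   Bytes.toBytes(m.name() + suffix),
>                   Bytes.toBytes(Integer.toString(v)));
>       }
>   }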
> 
> TDG basically allows defining the number of simulated clients (one thread
> per client) and can run them in multi-threaded or single-threaded mode.
> Data volume is defined by the number of iterations of the set of simulated
> clients, the number of iterations per client, the number of devices per
> client and the number of rows per device.
> 
> After the test finished, 1,008,000 rows had been inserted and successfully
> replicated to our backup test cluster.
> 
> Any further ideas?
> 
> PS: We are currently running a test with ~4 million rows following the
> pattern above.
> 
> Thanks,
> Thomas
> 
> 
> 
> -----Original Message-----
> From: Chiku [mailto:hakise...@gmail.com]
> Sent: Thursday, July 28, 2011 15:35
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> Are you getting these results because of the nature of the test data generated?
> 
> Would you mind sharing some details about the test client and the data it
> generates?
> 
> 
> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> thomas.steinmau...@scch.at> wrote:
> 
> > Hello,
> >
> >
> >
> > we ran a test client generating data into a GZ and an LZO compressed
> > table. Equal data sets (1,008,000 rows and the same table schema),
> > ~7.78 GB of disk space uncompressed in HDFS. LZO is ~887 MB whereas GZ
> > is ~444 MB, so basically half of LZO.
> >
> >
> >
> > Execution time of the data generating client was 1373 seconds into the
> > uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> > generation client is based on HTablePool and uses batch operations.
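> >
> > (A sketch of that write path, simplified; the pool size and table name
> > are placeholders:)
> >
> >   import java.util.ArrayList;
> >   import java.util.List;
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.client.HTableInterface;
> >   import org.apache.hadoop.hbase.client.HTablePool;
> >   import org.apache.hadoop.hbase.client.Put;
> >
> >   Configuration conf = HBaseConfiguration.create();
> >   HTablePool pool = new HTablePool(conf, 10);  // 10 = max pooled tables
> >   HTableInterface table = pool.getTable("testtable");
> >   List<Put> batch = new ArrayList<Put>();
> >   // ... fill batch with Puts as above ...
> >   table.put(batch);      // submit the whole batch in one call
> >   pool.putTable(table);  // return the table to the pool (older API)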
> >
> >
> >
> > So in our (simple) test, GZ beats LZO in both disk usage and client
> > execution time. We haven't tried reads yet.
> >
> >
> >
> > Is this an expected result? I thought LZO was the recommended
> > compression algorithm? Or does LZO outperform GZ with a growing
> > amount of data, or in read scenarios?
> >
> >
> >
> > Regards,
> >
> > Thomas
> >
> >
> >
> >
