Hi J-D

We have 8 drives per machine (~500 GB per drive, 4 TB total).

The metrics from my run indicate that for writes I achieve around
1 row (5 KB) in 2 ms => 500 rows (5 KB each) in 1 sec => 2.5 MB/sec

and from your observation at StumbleUpon

200k rows/sec (presuming 100 bytes per row) => 20 MB/sec
Wow!! That's about an order of magnitude of difference.
I am sure disabling WAL during the writes is giving you a significant boost.
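
To double-check the arithmetic above, here is a rough sketch in Python. The 100 bytes per row is only my presumption (you said your rows are a few bytes), and I am using decimal units (1 MB = 1,000,000 bytes) to match the rounded figures; the function name is just for illustration.

```python
# Back-of-the-envelope write throughput from the figures in this thread.
# Assumption: decimal units (1 KB = 1000 bytes, 1 MB = 1,000,000 bytes).
KB = 1000
MB = 1000 * KB


def throughput_mb_per_sec(rows_per_sec, bytes_per_row):
    """Convert a row rate and a row size into MB/sec."""
    return rows_per_sec * bytes_per_row / MB


# My run: one 5 KB row every 2 ms, i.e. 1000 ms / 2 ms = 500 rows/sec.
my_rows_per_sec = 1000 / 2
my_mb = throughput_mb_per_sec(my_rows_per_sec, 5 * KB)

# StumbleUpon: 200k rows/sec, with row size PRESUMED to be 100 bytes.
su_rows_per_sec = 200_000
su_mb = throughput_mb_per_sec(su_rows_per_sec, 100)

byte_ratio = su_mb / my_mb
row_ratio = su_rows_per_sec / my_rows_per_sec

print(f"mine: {my_mb} MB/sec, StumbleUpon: {su_mb} MB/sec")
print(f"gap in bytes/sec: {byte_ratio}x, gap in rows/sec: {row_ratio}x")
```

Measured in bytes the gap is 8x, but measured in rows it is 400x, so the comparison depends heavily on the actual row size.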

Are you reading the data at the same time as you are writing?

Thx
Jacob

On Fri, May 28, 2010 at 9:04 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> What I wanted out of this discussion was to find out whether I am in the
>> ballpark of what I can juice out of HBase or I am way off the mark.
>>
>
> I understand... but this is a distributed system we're talking about.
> Unless I have the same code, hbase/hadoop version, configuration,
> number of nodes, cpu, RAM, # of HDDs, OS, network equipment, data set,
> etc... it's really hard to assess right? For starters, I don't think
> you specified the number of drives you have per machine, and HBase is
> mostly IO-bound.
>
> FWIW, here's our experience. At StumbleUpon, we uploaded our main data
> set consisting of 13B*2 rows on 20 machines (2xi7, 24GB (8 for HBase),
> 4x 1TB JBOD) with MapReduce (using 8 maps per machine) pulling from a
> MySQL cluster (we were selecting large ranges in batches), inserting
> at an average rate of 150-200k rows per second, peaks at 1M. Our rows
> are a few bytes, mostly integers and some text. We did it in the time
> with HBase 0.20.3 + the parallel-put patch we wrote here (available in
> trunk) with the configuration I pasted previously. For that upload the
> WAL was disabled and ALL our tables are LZOed (can't stress enough the
> importance of compressing your tables!) and 1GB max file size.
>
> My guess is yes you can juice it out more, first by using LZO ;)
>
> Also, are your machines even stressed during the test? Do you monitor?
> Could you increase the number of clients?
>
> Sorry I can't give you a very clear answer, but without using a common
> benchmark to compare numbers we're pretty much all in the dark. YCSB
> is one, but IIRC it needs some patches to work efficiently (Todd
> Lipcon from Cloudera has them in his github).
>
> J-D
>
