I have a related question: I tried a simple load experiment too, using HBase's
Java API. (The nodes do only loading, nothing else. The client programs
generate random data on the fly to load, so there are no reads of input data.)

120M rows, 15 KB each, 2 column families.
5 region servers; around 4 or 5 clients per node, running on the same 5 nodes
as the region servers.

2 MB block size, 2 GB region size, WAL disabled, auto-flush disabled, 2 MB
write buffer, major compactions disabled.
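
A rough sketch of table-side settings along those lines, assuming the
0.20-era admin API (the table and family names are just placeholders):

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateLoadTable {
    public static void main(String[] args) throws IOException {
      HTableDescriptor desc = new HTableDescriptor("loadtest");   // placeholder table name
      desc.setMaxFileSize(2L * 1024 * 1024 * 1024);               // split regions at ~2 GB

      for (String family : new String[] { "cf1", "cf2" }) {       // 2 column families
        HColumnDescriptor cf = new HColumnDescriptor(family);
        cf.setBlocksize(2 * 1024 * 1024);                         // 2 MB HFile block size
        desc.addFamily(cf);
      }

      new HBaseAdmin(new HBaseConfiguration()).createTable(desc);
      // Time-based major compactions are switched off on the region servers,
      // e.g. hbase.hregion.majorcompaction = 0 in hbase-site.xml.
    }
  }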

The other configs are quite similar to the ones discussed in this thread.

And I get a throughput of around 1.5 MB per second per node
(500 rows per second for the entire cluster). Do these values seem reasonable?
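
In case it helps, the client side of that setup roughly corresponds to the
sketch below (again against the 0.20-era client API; the table name, family,
and fixed-size value here are placeholders, the real run used random data):

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class LoadClient {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "loadtest"); // placeholder table
      table.setAutoFlush(false);                   // auto flush disabled
      table.setWriteBufferSize(2L * 1024 * 1024);  // 2 MB write buffer

      byte[] value = new byte[15 * 1024];          // ~15 KB per row
      for (long i = 0; i < 1000; i++) {
        Put put = new Put(Bytes.toBytes(i));
        put.setWriteToWAL(false);                  // WAL disabled (faster, but data loss on a crash)
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("q"), value);
        table.put(put);                            // buffered client-side until the write buffer fills
      }
      table.flushCommits();                        // push whatever is still buffered
    }
  }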

Thanks
Vidhya

On 5/29/10 6:36 PM, "Jacob Isaac" <ja...@ebrary.com> wrote:

Hi J-D

We have 8 drives (~500 GB per drive, ~4 TB total) per machine.

The metrics from my run indicate that for writes I achieve around
1 row (5 KB) in 2 ms => 500 rows (5 KB each) per second => ~2.5 MB/sec

and from your observation at StumbleUpon:

200k rows/sec (presuming 100 bytes per row) => ~20 MB/sec
Wow!! That's an order of magnitude difference.
I am sure disabling the WAL during the writes is giving you a significant boost.

Are you reading the data at the same time as you are writing?

Thx
Jacob

On Fri, May 28, 2010 at 9:04 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> What I wanted out of this discussion was to find out whether I am in the
>> ballpark of what I can juice out of HBase or I am way off the mark.
>>
>
> I understand... but this is a distributed system we're talking about.
> Unless I have the same code, hbase/hadoop version, configuration,
> number of nodes, cpu, RAM, # of HDDs, OS, network equipment, data set,
> etc... it's really hard to assess right? For starters, I don't think
> you specified the number of drives you have per machine, and HBase is
> mostly IO-bound.
>
> FWIW, here's our experience. At StumbleUpon, we uploaded our main data
> set consisting of 13B*2 rows on 20 machines (2xi7, 24GB (8 for HBase),
> 4x 1TB JBOD) with MapReduce (using 8 maps per machine) pulling from a
> MySQL cluster (we were selecting large ranges in batches), inserting
> at an average rate of 150-200k rows per second, peaks at 1M. Our rows
> are a few bytes, mostly integers and some text. We did it with HBase
> 0.20.3 plus the parallel-put patch we wrote (available in trunk), using
> the configuration I pasted previously. For that upload the WAL was
> disabled, and ALL our tables are LZOed (can't stress enough the
> importance of compressing your tables!) with a 1GB max file size.
>
> My guess is yes you can juice it out more, first by using LZO ;)
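>
> For illustration, creating a table that way could look roughly like the
> sketch below (0.20-era API; the table and family names are made up, and
> LZO has to be installed separately since it isn't shipped with HBase):
>
>   import java.io.IOException;
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.HColumnDescriptor;
>   import org.apache.hadoop.hbase.HTableDescriptor;
>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>   import org.apache.hadoop.hbase.io.hfile.Compression;
>
>   public class CreateLzoTable {
>     public static void main(String[] args) throws IOException {
>       HTableDescriptor desc = new HTableDescriptor("mytable");  // placeholder name
>       desc.setMaxFileSize(1024L * 1024 * 1024);                 // 1GB max file size
>       HColumnDescriptor cf = new HColumnDescriptor("cf");       // placeholder family
>       cf.setCompressionType(Compression.Algorithm.LZO);         // LZO-compressed family
>       desc.addFamily(cf);
>       new HBaseAdmin(new HBaseConfiguration()).createTable(desc);
>     }
>   }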
>
> Also, are your machines even stressed during the test? Do you monitor?
> Could you increase the number of clients?
>
> Sorry I can't give you a very clear answer, but without using a common
> benchmark to compare numbers we're pretty much all in the dark. YCSB
> is one, but IIRC it needs some patches to work efficiently (Todd
> Lipcon from Cloudera has them in his github).
>
> J-D
>
