Region Splitting for moderate amount of daily data - Improve MapReduce Performance

Himanish Kushary Thu, 14 Apr 2011 10:42:08 -0700

Hi,

We are running a small-scale implementation on HBase. We receive around
200-300 MB of data each day for processing. Some of our number crunching
and processing is based on a single day's data.


The problem we are facing is that, because the data is so small, a single
day's data resides in at most 2-5 regions (our HBase split size is set to
64 MB). As a result, when our MapReduce job runs, only about 2-5 tasks do
the effective work, so performance falls short of our expectations. It would
be ideal to have more tasks working on this 200-300 MB of data.

One way to increase the number of tasks is to lower the split size further,
but then other jobs that process 30 or 60 days of data would be split into
a very large number of tasks.
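
To put rough numbers on this trade-off, here is a minimal sketch in Python.
It assumes one map task per region and that each region fills up to the
split size, so it is only an estimate. The 16 MB value is a hypothetical
lowered split size chosen for illustration, not something we have tried.

```python
import math

def estimated_tasks(data_mb, split_mb):
    """Rough count of regions (and so map tasks) needed to hold data_mb.

    Assumes one map task per region and regions filled up to split_mb.
    """
    return math.ceil(data_mb / split_mb)

DAILY_MB = 300  # upper end of the 200-300 MB daily volume

for split_mb in (64, 16):  # 64 MB is our current setting; 16 MB is hypothetical
    for days in (1, 30, 60):
        tasks = estimated_tasks(DAILY_MB * days, split_mb)
        print(f"split={split_mb} MB, days={days}: ~{tasks} tasks")
```

With these assumptions, a smaller split size helps the single-day job but
multiplies the task count for the 30-day and 60-day jobs by the same factor.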

Is lowering the HBase file split size further recommended?

Could anyone please suggest another option for handling this scenario?

Is it possible, through configuration or code, to split the 200-300 MB of
daily data (perhaps while it is being inserted into HBase) into multiple
regions while still keeping the HBase split size (64/128/256 MB, whatever)?

----------
Thanks
Himanish
