Could it be that your row key is not distributing the data well enough?
That is, if your key is primarily based on the current date, it will
only put the data into a small number of regions.
Dave Schnepper
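[Editor's note] A minimal sketch of the salting idea Dave describes: prefixing the date-based key with a small hash bucket spreads one day's writes across several regions. The bucket count, class name, and key layout here are assumptions for illustration, not from the thread.

```java
import java.nio.charset.StandardCharsets;

public class SaltedKeySketch {
    static final int BUCKETS = 8; // assumed number of salt buckets

    // Prefix a date-based key with a bucket derived from a stable field,
    // so one day's rows land in several regions instead of one or two.
    static byte[] saltedKey(String date, String id) {
        int bucket = Math.floorMod(id.hashCode(), BUCKETS);
        return String.format("%d-%s-%s", bucket, date, id)
                     .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(saltedKey("2011-04-14", "order123"),
                                      StandardCharsets.UTF_8));
    }
}
```

The trade-off: reading one day back now takes BUCKETS scans, one per prefix, instead of a single contiguous scan.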
On 14/Apr/2011 11:19, Jean-Daniel Cryans wrote:
Trying to tune for small data in MapReduce isn't really the situation
you want to be in, because that's not what MR is meant for.
I would suggest instead that you use a single process with good
scanner caching to process that data. Since there's no overhead from
the MR framework it might be even faster.
J-D
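[Editor's note] Roughly, scanner caching of N rows turns a full scan into about totalRows/N client RPCs instead of one per row, which is why J-D's single-process approach can beat MR here. A sketch of the arithmetic; the 0.90-era client calls in the comments need a running cluster and are shown only as a hedged illustration.

```java
public class ScanCachingSketch {
    // Estimated RPC round trips for a scan: one per batch of 'caching' rows.
    static long estimatedRpcs(long totalRows, int caching) {
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        // With the default caching of 1, a 1M-row scan costs ~1,000,000 RPCs;
        // with caching = 1000 it drops to ~1000.
        System.out.println(estimatedRpcs(1000000, 1));
        System.out.println(estimatedRpcs(1000000, 1000));
        // The client-side setting (0.90-era API, cluster required):
        //   HTable table = new HTable(conf, "mytable");
        //   Scan scan = new Scan(startRow, stopRow);
        //   scan.setCaching(1000);          // rows fetched per RPC
        //   ResultScanner rs = table.getScanner(scan);
        //   for (Result r : rs) { /* process */ }
        //   rs.close();
    }
}
```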
On Thu, Apr 14, 2011 at 10:30 AM, Himanish Kushary<[email protected]> wrote:
Hi,
We are executing a small-scale implementation using HBase. We receive around
200 - 300 MB of data each day for processing. Some of our number crunching
and processing is based on this single day's data.
The problem we are facing is that, because of this low data size, a single
day's data resides in at most 2-5 regions (our HBase split size is set to 64
MB).
Due to this, when our MapReduce job runs there are only about 2-5 tasks doing
the effective work, so performance falls short of our expectations. It would be
ideal to have more tasks working on this 200-300 MB of data.
One way we could increase the number of tasks is by further lowering the
split size, but in that case other jobs which process 30 or 60 days of
data would be split into lots of tasks.
Is further lowering the HBase file split size recommended?
Could anyone please suggest any other option to handle this scenario?
Is it possible, through some configuration or code, to split the 200 - 300 MB
of daily data into multiple regions (maybe while it gets inserted into HBase)
while still sticking with the HBase split size (64/128/256 MB, whatever it may be)?
----------
Thanks
Himanish
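[Editor's note] One common answer to the question above (a sketch, not from the thread): create the table pre-split on salt-bucket prefixes, so each day's 200-300 MB is spread over several regions from the start, regardless of the file split size. The bucket count and key layout are assumptions matching the salting sketch earlier; the createTable(desc, splitKeys) call in the comment is the 0.90-era admin API and needs a running cluster.

```java
public class PresplitSketch {
    // Split boundaries "1".."buckets-1" for row keys prefixed with a
    // single-digit salt bucket (assumed key layout: "<bucket>-<date>-<id>").
    static byte[][] splitKeys(int buckets) {
        byte[][] splits = new byte[buckets - 1][];
        for (int i = 1; i < buckets; i++) {
            splits[i - 1] = Integer.toString(i).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        // For 8 buckets we get 7 boundaries, i.e. 8 initial regions.
        System.out.println(splitKeys(8).length);
        // Creating the pre-split table (0.90-era API, cluster required):
        //   HBaseAdmin admin = new HBaseAdmin(conf);
        //   HTableDescriptor desc = new HTableDescriptor("daily_data");
        //   desc.addFamily(new HColumnDescriptor("d"));
        //   admin.createTable(desc, splitKeys(8));
    }
}
```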