Trying to tune for small data in MapReduce isn't really the situation
you want to be in, because that's not what MR is meant for.

I would suggest instead that you use a single process with good
scanner caching to process that data. Since there's no overhead from
the MR framework, it might even be faster.
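For reference, scanner caching can be raised per-scan via Scan.setCaching(int) in the client code, or globally through the client-side configuration. A minimal hbase-site.xml fragment; the value 1000 is only an illustrative starting point, tune it to your row size:

```xml
<property>
  <name>hbase.client.scanner.caching</name>
  <!-- rows fetched per RPC during scans; the default is very low,
       which makes full scans chatty -->
  <value>1000</value>
</property>
```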

J-D

On Thu, Apr 14, 2011 at 10:30 AM, Himanish Kushary <[email protected]> wrote:
> Hi,
>
> We are executing a small-scale implementation using HBase. We receive around
> 200-300 MB of data each day for processing. Some of our number crunching
> and processing is based on this single day's data.
>
> The problem we are facing is that, because of this small data size, a single
> day's data resides in at most 2-5 regions (our HBase split size is set to 64
> MB).
> Because of this, when our MapReduce job runs there are only about 2-5 tasks
> doing the effective work, so performance falls short of our expectations. It
> would be ideal to have more tasks working on this 200-300 MB of data.
>
> One way we could increase the number of tasks is by further lowering the
> split size, but in that case other jobs which process 30 or 60 days of
> data would be split into far too many tasks.
>
> Is further lowering the HBase file split size recommended?
>
> Could anyone please suggest another option to handle this scenario?
>
> Is it possible, through some configuration or code, to split the 200-300 MB
> of daily data into multiple regions (maybe while it gets inserted into HBase)
> while still sticking with the HBase split size (64/128/256 MB, whatever it
> may be)?
>
> ----------
> Thanks
> Himanish
>
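On the last question above: one common way to spread a single day's writes across more regions is to salt the row key with a small hashed bucket prefix and pre-split the table on the bucket boundaries. A sketch of just the key transformation; the bucket count and key format here are hypothetical, not from the thread:

```java
// Sketch: derive a bucket prefix from a hash of the natural row key so
// that one day's rows distribute across BUCKETS pre-split regions.
// BUCKETS and the "NN-key" format are illustrative assumptions.
public class SaltedKey {
    static final int BUCKETS = 8;

    static String salt(String rowKey) {
        // hashCode() % BUCKETS is in (-BUCKETS, BUCKETS), so abs is safe
        int bucket = Math.abs(rowKey.hashCode() % BUCKETS);
        return String.format("%02d-%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        System.out.println(salt("20110414-event-12345"));
    }
}
```

The trade-off is that any later range scan over the natural key has to fan out across all buckets, so this fits best when the daily job scans everything anyway.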
