Trying to tune for small data in MapReduce isn't really the situation you want to be in, because that's not what MR is meant for.
I would suggest instead that you use a single process with good scanner
caching to process that data. Since there's no overhead from the MR
framework it might even be faster.

J-D

On Thu, Apr 14, 2011 at 10:30 AM, Himanish Kushary <[email protected]> wrote:
> Hi,
>
> We are executing a small-scale implementation using HBase. We receive
> around 200-300 MB of data each day for processing. Some of our number
> crunching and processing is based on this single day's data.
>
> The problem we are facing is that, because of this low data volume, a
> single day's data resides in at most 2-5 regions (our HBase split size
> is set to 64 MB).
> Due to this, when our map-reduce job runs there are only about 2-5
> tasks doing the effective work, so performance falls short of our
> expectations. It would be ideal to have more tasks working on this
> 200-300 MB of data.
>
> One way we could increase the number of tasks is by further lowering
> the split size, but in that case other jobs which process 30 or 60
> days of data would be split into a very large number of tasks.
>
> Is further lowering the HBase file split size recommended?
>
> Could anyone please suggest any other option to handle this scenario?
>
> Is it possible, through some configuration or code, to split the
> 200-300 MB of daily data (maybe while it gets inserted into HBase)
> into multiple regions while still sticking with the HBase split size
> (64/128/256 MB, whatever it may be)?
>
> ----------
> Thanks
> Himanish
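
(Editor's note: the single-process, scanner-caching approach J-D suggests could be sketched roughly as below. This is a minimal illustration, not code from the thread: the table name "daily_data", the date-prefixed row-key range, and the caching value of 1000 are all assumptions, and it needs a running HBase cluster plus the HBase client jars on the classpath.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table; substitute your own table name.
    HTable table = new HTable(conf, "daily_data");

    // Hypothetical row-key range covering one day's rows,
    // assuming keys are prefixed with the date.
    Scan scan = new Scan(Bytes.toBytes("2011-04-14"),
                         Bytes.toBytes("2011-04-15"));
    // Fetch 1000 rows per RPC instead of the default 1 --
    // this is the "good scanner caching" part.
    scan.setCaching(1000);
    // A full scan shouldn't evict hot data from the block cache.
    scan.setCacheBlocks(false);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // ... per-row number crunching goes here ...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
```

A single client scanning 200-300 MB this way avoids MapReduce task-startup overhead entirely, which is why it can beat a 2-5 task MR job on data this small.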
