What have you set hbase.hstore.blockingStoreFiles and hbase.hregion.memstore.block.multiplier to? Both of these block updates once their limit is hit. Try increasing them to, say, 20 and 4 (from the defaults of 7 and 2) and see if it helps.
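For reference, a minimal sketch of what that change might look like, assuming the settings go in hbase-site.xml on each regionserver, inside the existing <configuration> element. The values 20 and 4 are only the starting points suggested above, not tuned recommendations for your workload.

```xml
<!-- Sketch only: add inside <configuration> in hbase-site.xml on each regionserver. -->

<!-- Updates to a region are blocked while any of its stores has more than
     this many store files, until a compaction brings the count down. -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>20</value>
</property>

<!-- Updates to a region are blocked once its memstore grows to this
     multiple of hbase.hregion.memstore.flush.size. -->
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
```

As far as I know these are read when the regionserver starts, so the regionservers would likely need a restart to pick them up. Raising both limits trades fewer write stalls for more store files per region and more heap held in memstores, so keep an eye on compaction backlog and GC after the change.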
If this still doesn't help, see if you can set up Ganglia to get better insight into where the bottleneck is.

--Suraj

On Thu, Oct 11, 2012 at 11:47 PM, Pankaj Misra <[email protected]> wrote:
> OK, looks like I missed reading that part in your original mail. Did you
> try some of the compaction tweaks and configurations as explained in the
> following link for your data?
> http://hbase.apache.org/book/regions.arch.html#compaction
>
> Also, how much data are you putting into the regions, and how big is one
> region at the end of data ingestion?
>
> Thanks and Regards
> Pankaj Misra
>
> -----Original Message-----
> From: Jonathan Bishop [mailto:[email protected]]
> Sent: Friday, October 12, 2012 12:04 PM
> To: [email protected]
> Subject: RE: more regionservers does not improve performance
>
> Pankaj,
>
> Thanks for the reply.
>
> Actually, I am using MD5 hashing to evenly spread the keys among the splits,
> so I don’t believe there is any hotspot. In fact, when I monitor the web UI
> for HBase I see a very even load on all the regionservers.
>
> Jon
>
> *From:* Pankaj Misra <[email protected]>
> *Sent:* Thursday, October 11, 2012 8:24:32 PM
> *To:* [email protected]
> *Subject:* RE: more regionservers does not improve performance
>
> Hi Jonathan,
>
> It seems to me that, while doing the split across all 40 mappers, the
> keys are not randomized enough to leverage multiple regions and the pre-split
> strategy. This may be happening because all 40 mappers may be trying to
> write to a single region for some time, making it a HOT region, until the
> key falls into another region, and then that region becomes the HOT region.
> Hence you may be seeing a high impact of compaction cycles reducing your
> throughput.
>
> Are the keys incremental? Are the keys randomized enough across the splits?
>
> Ideally when all 40 mappers are running you should see all the regions being
> filled up in parallel for maximum throughput. Hope it helps.
>
> Thanks and Regards
> Pankaj Misra
>
> ________________________________________
> From: Jonathan Bishop [[email protected]]
> Sent: Friday, October 12, 2012 5:38 AM
> To: [email protected]
> Subject: more regionservers does not improve performance
>
> Hi,
>
> I am running a MR job with 40 simultaneous mappers, each of which does puts
> to HBase. I have ganged up the puts into groups of 1000 (this seems to help
> quite a bit) and also made sure that the table is pre-split into 100 regions,
> and that the row keys are randomized using MD5 hashing.
>
> My cluster size is 10, and I am allowing 4 mappers per tasktracker.
>
> In my MR job I know that the mappers are able to generate puts much faster
> than the puts can be handled in HBase. In other words, if I let the mappers
> run without doing HBase puts then everything scales as you would expect with
> the number of mappers created. It is the HBase puts which seem to be the
> bottleneck.
>
> What is strange is that I do not get much run time improvement by increasing
> the number of regionservers beyond about 4. Indeed, it seems that the system
> runs slower with 8 regionservers than with 4.
>
> I have added the following in hbase-env.sh hoping this would help... (from
> the book HBase in Action)
>
> export HBASE_OPTS="-Xmx8g"
> export HBASE_REGIONSERVER_OPTS="-Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
>
> # Uncomment below to enable java garbage collection logging in the .out file.
> export HBASE_OPTS="${HBASE_OPTS} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:${HBASE_HOME}/logs/gc-hbase.log"
>
> Monitoring HBase through the web UI I see that there are pauses for flushing,
> which seems to run pretty quickly, and for compacting, which seems to take
> somewhat longer.
>
> Any advice for making this run faster would be greatly appreciated.
> Currently I am looking into installing Ganglia to better monitor my cluster,
> but have yet to get that running.
>
> I suspect an I/O issue as the regionservers do not seem terribly loaded.
>
> Thanks,
>
> Jon
>
> ________________________________
>
> Impetus Ranked in the Top 50 India’s Best Companies to Work For 2012.
>
> Impetus webcast ‘Designing a Test Automation Framework for Multi-vendor
> Interoperable Systems’ available at http://lf1.me/0E/.
>
>
> NOTE: This message may contain information that is confidential, proprietary,
> privileged or otherwise protected by law. The message is intended solely for
> the named addressee. If received in error, please destroy and notify the
> sender. Any use of this email is prohibited when received in error. Impetus
> does not represent, warrant and/or guarantee, that the integrity of this
> communication has been maintained nor that the communication is free of
> errors, virus, interception or interference.
