Doug, St.Ack,

We changed our production setup to CDH3 to resolve the issue mentioned below. I noticed that even though the servers were running JDK 1.6u25 (as per JAVA_HOME in hbase-env.sh), I still ran into reads taking more than a minute. So, I have added -XX:+UseMembar and it seems to be okay after that.
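For reference, a minimal sketch of what the change amounts to, assuming the flag is appended to HBASE_OPTS in conf/hbase-env.sh on the region servers (the exact file and any existing options will differ per setup):

  # conf/hbase-env.sh: append the flag to whatever HBASE_OPTS already contains
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseMembar"

The JVMs have to be restarted for the flag to take effect.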
We have a parallel system that is still running CDH2. I added -XX:+UseMembar there too, as it had the same issue, only worse. It too seems to be a lot more stable now.

Regards,
Srikanth

-----Original Message-----
From: Doug Meil [mailto:[email protected]]
Sent: Friday, July 15, 2011 9:06 PM
To: [email protected]
Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments

Glad to hear things are better Srikanth.  I'll add that to the
Troubleshooting chapter too to make it a little more obvious.

On 7/15/11 11:30 AM, "Srikanth P. Shreenivas"
<[email protected]> wrote:

>Hi St.Ack,
>
>I stumbled upon http://hbase.apache.org/book.html#d730e4957 in one of the
>other mail threads on the HBase user mailing list.
>
>We realized that we were running JVM 1.6.0_20-b02, and hence we tried
>adding -XX:+UseMembar as suggested in the above-mentioned FAQ.
>This seems to have resolved the issue. I ran the test app for 20 minutes
>with no read timeouts.
>
>Thanks for all the help.
>
>Regards,
>Srikanth
>
>
>-----Original Message-----
>From: Srikanth P. Shreenivas
>Sent: Sunday, July 10, 2011 5:20 PM
>To: [email protected]
>Subject: RE: HBase Read and Write Issues in Mutlithreaded Environments
>
>Hi St.Ack,
>
>I noticed that one of the region server machines had its clock running
>one day in the future.
>I corrected the date. I ran into some issues after restarting: I was
>getting errors with respect to .META. and such, which I did not understand
>much. Also, the status command in the hbase shell was displaying
>"3 servers, 1 dead" whereas I had only 3 region servers.
>
>So, I cleaned out "/hbase" (to get back to the real problem) and restarted
>the HBase nodes.
>
>After starting all 3 HBase nodes, I ran the test app again and observed
>the log files of all 3 region servers.
>I noticed that when the test app seemed hung, the web app's thread that
>was serving the request had gone to sleep at the code below. I think it
>stayed like that for around 10 minutes before Tomcat probably interrupted
>it.
>
>Thread-#8 - Thread t@29
>   java.lang.Thread.State: TIMED_WAITING
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:791)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
>        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
>        at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:514)
>        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:133)
>        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:648)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:702)
>        - locked java.lang.Object@75826e08
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
>        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>        <.. app specific trace removed ...>
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
>
>============================================================================
>After 10 minutes, the web app log showed:
>2011-07-10 16:50:28,804 [Thread-#8] ERROR [persistence.handler.HBaseHandler] - Exception occurred in searchData:
>java.io.IOException: Giving up trying to get region server: thread is interrupted.
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>============================================================================
>I did not see anything happening on the region server either; the log had
>only occasional entries like these:
>
>2011-07-10 16:43:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>2011-07-10 16:48:53,649 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>2011-07-10 16:53:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>
>Regards,
>Srikanth
>
>
>-----Original Message-----
>From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>Sent: Saturday, July 09, 2011 9:41 PM
>To: [email protected]
>Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>
>You read the requirements section in our docs and you have upped the
>ulimits, nprocs, etc.? http://hbase.apache.org/book/os.html
>
>If you know the row, can you deduce the regionserver it's talking to?
>(Below is the client failure -- we need to figure out what's up on the
>server side.) Once you've done that, can you check its logs? See if
>you can figure out anything on why the hang?
>
>Thanks,
>St.Ack
>
>On Sat, Jul 9, 2011 at 6:14 AM, Srikanth P. Shreenivas
><[email protected]> wrote:
>> Hi St.Ack,
>>
>> We upgraded to CDH3 (hadoop-0.20-0.20.2+923.21-1.noarch.rpm,
>> hadoop-hbase-0.90.1+15.18-1.noarch.rpm,
>> hadoop-zookeeper-3.3.3+12.1-1.noarch.rpm).
>>
>> I ran the same test which I was running against the app when it was
>> running on CDH2. The test app posts a request to the web app every
>> 100 ms, and the web app reads an HBase record, performs some logic, and
>> saves an audit trail by writing another HBase record.
>>
>> When our app was running on CDH2, I observed the issue below once every
>> 10 to 15 requests.
>> With CDH3, that issue is not happening at all. So it seems the situation
>> has improved a lot, and our app seems to be a lot more stable.
>>
>> However, I am still seeing one issue. There are still requests
>> (around 1%) which are not able to read the record from HBase, and
>> the get call hangs for almost 10 minutes. This is what I see in the
>> application log:
>>
>> 2011-07-09 18:27:25,537 [gridgain-#6%authGrid%] ERROR [my.app.HBaseHandler] - Exception occurred in searchData:
>> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>>
>>         <...app specific trace removed...>
>>
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>         at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>> I am running the test on the same record, so all my "get" calls are for
>> the same row id.
>>
>> It will be of immense help if you can provide some inputs on whether we
>> are missing some configuration settings, or whether there is a way to
>> get around this.
>>
>> Thanks,
>> Srikanth
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> Sent: Wednesday, June 29, 2011 7:48 PM
>> To: [email protected]
>> Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>>
>> Go to CDH3 if you can.  CDH2 is also old.
>> St.Ack
>>
>> On Wed, Jun 29, 2011 at 7:15 AM, Srikanth P. Shreenivas
>> <[email protected]> wrote:
>>> Thanks, St.Ack, for the inputs.
>>>
>>> Will upgrading to CDH3 help, or is there a version within CDH2 that you
>>> recommend we should upgrade to?
>>>
>>> Regards,
>>> Srikanth
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>>> Sent: Wednesday, June 29, 2011 11:16 AM
>>> To: [email protected]
>>> Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>>>
>>> Can you upgrade?  That release is > 18 months old.  A bunch has
>>> happened in the meantime.
>>>
>>> For retries exhausted, check what's going on on the remote regionserver
>>> that you are trying to write to.  It's probably struggling and that's
>>> why requests are not going through -- or the client missed the fact
>>> that the region moved (all stuff that should be working better in the
>>> latest hbase).
>>>
>>> St.Ack
>>>
>>> On Tue, Jun 28, 2011 at 9:51 PM, Srikanth P. Shreenivas
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We are using an HBase 0.20.3 (hbase-0.20-0.20.3-1.cloudera.noarch.rpm)
>>>> cluster in distributed mode with Hadoop 0.20.2
>>>> (hadoop-0.20-0.20.2+320-1.noarch).
>>>> We are using pretty much the default configuration; the only thing we
>>>> have customized is that we have allocated 4GB of RAM in
>>>> /etc/hbase-0.20/conf/hbase-env.sh
>>>>
>>>> In our setup, we have a web application that reads a record from
>>>> HBase and writes a record as part of each web request. The
>>>> application is hosted in Apache Tomcat 7 and is a stateless web
>>>> application providing a REST-like web service API.
>>>>
>>>> We are observing that our reads and writes time out once in a
>>>> while. This happens more for writes.
>>>> We see the exceptions below in our application logs:
>>>>
>>>>
>>>> Exception Type 1 - During Get:
>>>> ---------------------------------------
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.36:60020 for region employeedata,be8784ac8b57c45625a03d52be981b88097c2fdc,1308657957879, row 'd51b74eb05e07f96cee0ec556f5d8d161e3281f3', but failed after 10 attempts.
>>>> Exceptions:
>>>> java.io.IOException: Call to /10.1.68.36:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>>
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
>>>>         <snip>
>>>>
>>>> Exception Type 2 - During Put:
>>>> ---------------------------------------------
>>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.34:60020 for region audittable,,1309183872019, row '2a012017120f80a801b28f5f66a83dc2a8882d1b', but failed after 10 attempts.
>>>> Exceptions:
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>>
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$3.doCall(HConnectionManager.java:1239)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1161)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>>>>         <snip>
>>>>
>>>> Any inputs on why this is happening, or how to rectify it, will be of
>>>> immense help.
>>>>
>>>> Thanks,
>>>> Srikanth
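For context, the per-request access pattern described in the original mail (one Get for the record, one Put for the audit trail) looks roughly like the sketch below against the 0.90-era client API. It is only an illustration: the column family and qualifier names are made up, the table names are borrowed from the exceptions above, and an HTablePool is used because a single HTable instance is not safe for concurrent use from multiple request threads.

// Illustrative sketch only; "d", "value", and "detail" are hypothetical names.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRequestHandler {
    private final Configuration conf = HBaseConfiguration.create();
    // One pool shared by all request threads; HTable itself is not thread-safe.
    private final HTablePool pool = new HTablePool(conf, 50);

    public byte[] handleRequest(String rowKey, String auditRowKey) throws IOException {
        HTableInterface data = pool.getTable("employeedata");
        HTableInterface audit = pool.getTable("audittable");
        try {
            // Read the record for this request.
            Result r = data.get(new Get(Bytes.toBytes(rowKey)));
            byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"));

            // ... application logic ...

            // Write the audit-trail record.
            Put p = new Put(Bytes.toBytes(auditRowKey));
            p.add(Bytes.toBytes("d"), Bytes.toBytes("detail"), Bytes.toBytes("processed " + rowKey));
            audit.put(p);
            return value;
        } finally {
            // Return the tables to the pool (0.90-era HTablePool API).
            pool.putTable(data);
            pool.putTable(audit);
        }
    }
}

The ClosedByInterruptException entries in the traces above are what this kind of client code reports when the calling thread is interrupted (here, apparently by the container or grid framework timing the request out) while a get or put is still blocked in retries.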
