Hi,

Going with the assumption that our client threads may be getting interrupted and that this may not be an HBase issue, we rebuilt our client application without GridGain. Earlier, our code was executed by GridGain's thread pool; now the app runs in plain Tomcat.
I am very glad to say that we are no longer seeing any "java.nio.channels.ClosedByInterruptException". My client app is working just great, performing its HBase reads/writes as expected.

Thanks a lot for all the help.

Regards,
Srikanth

PS: I must say that the HBase community is really great. I really appreciate all the inputs and suggestions.

-----Original Message-----
From: Srikanth P. Shreenivas
Sent: Monday, August 22, 2011 3:43 PM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Yes, DC1AuthDFSC1D3 hosts the root region. It is also region server 3. DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are the 4 region servers in our cluster.

******************************************

I checked with the Data Centre team; they confirmed that there is no firewall in the network where the HBase servers and client applications are running.

******************************************

Regarding the client and server running different versions: they are running the same version. If there were a version mismatch, I guess we would be seeing the issue for all the reads. Here we see the issue only for a few reads; about one in 10-15 reads fails this way. We do use the same hbase, zookeeper and hadoop jars as found in the HBase distribution.

Strangely enough, I saw the below for the first time today, and it has occurred only once so far. 10.3.48.61 is the IP address where our client app is running.

2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3
2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3
2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3
2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3
2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3
2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3
2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3
2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3

******************************************

I enabled debug logging level for all classes today. Here is the exception associated with "null" messages.

*** Do you think that some thread in client is doing interrupt() resulting in "java.nio.channels.ClosedByInterruptException" below?
***

2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - Connecting to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - closing ipc connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy41.getClosestRowBefore(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
    at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
    at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findEntities(HBaseHandler.java:271)
    at in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findObject(HBaseHandler.java:156)
    at in.gov.uidai.platform.impl.persistence.provider.AbstractPersistenceProvider.findObject(AbstractPersistenceProvider.java:116)
    at in.gov.uidai.platform.impl.persistence.PersistenceManagerProvider.findObject(PersistenceManagerProvider.java:270)
    at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetailEntity(ResidentDetailsDAOImpl.java:69)
    at in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetails(ResidentDetailsDAOImpl.java:48)
    at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.findResident(ResidentDetailsReader.java:176)
    at in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.doPerform(ResidentDetailsReader.java:63)
    at in.gov.uidai.authcommon.core.ProcessingStep.perform(ProcessingStep.java:36)
    at in.gov.uidai.authcommon.core.impl.Authenticator.performAndReturnContext(Authenticator.java:40)
    at in.gov.uidai.authserver.grid.AuthenticationGridJob.execute(AuthenticationGridJob.java:27)
    at org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:406)
    at org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hadoop.ipc.HBaseClient] - IPC Client (47) connection to DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 from an unknown user: closed
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=1 of 10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
...
...
...

And above pattern keeps repeating.

******************************************

Regards,
Srikanth

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
Sent: Monday, August 22, 2011 2:32 AM
To: [email protected]
Subject: Re: Query regarding HTable.get and timeouts

Yeah that null message isn't really helpful :)

So one thing that might be helpful would be to know who DC1AuthDFSC1D3 is, since you identified the logs as "Region server n". Then look at the master's web UI and see where -ROOT- is assigned. Is it also DC1AuthDFSC1D3?

If so, then I would proceed by checking if there's a firewall in between the client and the cluster. I would also make sure that the client is running the same version as the server.

J-D

On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas <[email protected]> wrote:
> Further in this investigation, we enabled the debug logs on the client side.
>
> We are observing that the client is trying to locate the root region, and is
> continuously failing to do so. The logs are filled with entries like this:
>
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG
> [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root
> region location,
> connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3;
> hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG
> [hbase.client.HConnectionManager$HConnectionImplementation] -
> locateRegionInMeta parentTable=-ROOT-, metaLocation=address:
> DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of
> 10 failed; retrying after sleep of 1000
> because: null
>
> The client keeps retrying until the retries are exhausted.
>
>
> Complete logs are available here: https://gist.github.com/1159064 including
> logs of master, zookeeper and region servers.
>
>
> If you can please look at the logs and provide some inputs on this issue,
> then it will be really helpful.
> We are really not sure why the client is failing to get the root region from
> the server. Any guidance will be greatly appreciated.
>
>
> Thanks a lot,
> Srikanth
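
For anyone who wants to reproduce the symptom outside HBase: the following is a minimal sketch, not taken from the thread, of how a pending interrupt turns a blocking NIO connect into the ClosedByInterruptException seen in the stack trace above. The class name InterruptedConnectDemo and the method connectWhileInterrupted are made up for illustration. It assumes Java 7 or later and connects to a throwaway listener on the loopback interface rather than to a region server.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class; nothing here comes from HBase or GridGain.
public class InterruptedConnectDemo {

    /**
     * Sets the calling thread's interrupt flag, then attempts a blocking
     * connect on an interruptible channel and reports what happened.
     */
    public static String connectWhileInterrupted() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 lets the OS pick any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open()) {
                // Simulate another thread having called interrupt() on this one.
                Thread.currentThread().interrupt();
                try {
                    client.connect(server.getLocalAddress());
                    return "connected";
                } catch (ClosedByInterruptException e) {
                    // The channel is closed and the connect fails.
                    return "ClosedByInterruptException, message=" + e.getMessage();
                } finally {
                    // Clear the flag so the caller's thread is left clean.
                    Thread.interrupted();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(connectWhileInterrupted());
    }
}
```

In a real application the interrupt would come from some other thread, for example a pool cancelling or timing out a job. That would fit the observation that the errors stopped once the code no longer ran on GridGain's worker threads, though the thread does not confirm which component issued the interrupt. This exception is normally constructed without a detail message, which is presumably why the client log says only "because: null".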
