Yes, DC1AuthDFSC1D3 hosts the root region.  It is also region server 3.    
DC1AuthDFSC1D1, DC1AuthDFSC1D2, DC1AuthDFSC1D3 and DC1AuthDFSC1D4 are 4 region 
servers in our cluster.

******************************************

I checked with the Data Centre team. They confirmed that there is no firewall in 
the network where the HBase servers and client applications are running.

******************************************

Regarding the client and server running different versions: they are running 
the same version.  If there were a version mismatch, I guess we would see the 
issue for all reads.  Here we see it for only a few reads; about one in 10-15 
reads fails this way.  We use the same hbase, zookeeper and hadoop jars as found 
in the HBase distribution.

Strangely enough, I saw the below for the first time today, and it has occurred 
only once so far.  10.3.48.61 is the IP address where our client app is running.
2011-08-22 11:46:55,905 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7625 got version 6 expected version 3
2011-08-22 11:46:57,542 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7626 got version 6 expected version 3
2011-08-22 11:46:58,483 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7627 got version 6 expected version 3
2011-08-22 11:46:59,335 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7628 got version 6 expected version 3
2011-08-22 11:47:00,164 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7629 got version 6 expected version 3
2011-08-22 11:47:00,972 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7630 got version 6 expected version 3
2011-08-22 11:47:01,768 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7631 got version 6 expected version 3
2011-08-22 11:47:02,648 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect 
header or version mismatch from 10.3.48.61:7632 got version 6 expected version 3

******************************************

I enabled debug-level logging for all classes today.  Here is the exception 
associated with the "null" messages.

*** Do you think that some thread in the client is calling interrupt(), 
resulting in the "java.nio.channels.ClosedByInterruptException" below? ***


2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hbase.client.HConnectionManager$HConnectionImplementation]  - 
locateRegionInMeta parentTable=-ROOT-, metaLocation=address: 
DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 
10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:29,663 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root 
region location, 
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d;
 hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root 
region location, 
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d;
 hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hadoop.ipc.HBaseClient]  - Connecting to 
DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020
2011-08-22 11:51:30,665 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hadoop.ipc.HBaseClient]  - closing ipc connection to 
DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020: null
java.nio.channels.ClosedByInterruptException
        at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at 
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy41.getClosestRowBefore(Unknown Source)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:719)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:558)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:687)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
        at 
org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
        at 
in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findEntities(HBaseHandler.java:271)
        at 
in.gov.uidai.platform.impl.persistence.handler.HBaseHandler.findObject(HBaseHandler.java:156)
        at 
in.gov.uidai.platform.impl.persistence.provider.AbstractPersistenceProvider.findObject(AbstractPersistenceProvider.java:116)
        at 
in.gov.uidai.platform.impl.persistence.PersistenceManagerProvider.findObject(PersistenceManagerProvider.java:270)
        at 
in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetailEntity(ResidentDetailsDAOImpl.java:69)
        at 
in.gov.uidai.authcommon.dao.impl.hbase.ResidentDetailsDAOImpl.findResidentDetails(ResidentDetailsDAOImpl.java:48)
        at 
in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.findResident(ResidentDetailsReader.java:176)
        at 
in.gov.uidai.authcommon.core.impl.steps.ResidentDetailsReader.doPerform(ResidentDetailsReader.java:63)
        at 
in.gov.uidai.authcommon.core.ProcessingStep.perform(ProcessingStep.java:36)
        at 
in.gov.uidai.authcommon.core.impl.Authenticator.performAndReturnContext(Authenticator.java:40)
        at 
in.gov.uidai.authserver.grid.AuthenticationGridJob.execute(AuthenticationGridJob.java:27)
        at 
org.gridgain.grid.kernal.processors.job.GridJobWorker.body(GridJobWorker.java:406)
        at 
org.gridgain.grid.util.runnable.GridRunnable$1.run(GridRunnable.java:142)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hadoop.ipc.HBaseClient]  - IPC Client (47) connection to 
DC1AuthDFSC1D3.cidr.gov.in/10.3.48.69:6020 from an unknown user: closed
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hbase.client.HConnectionManager$HConnectionImplementation]  - 
locateRegionInMeta parentTable=-ROOT-, metaLocation=address: 
DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=1 of 
10 failed; retrying after sleep of 1000 because: null
2011-08-22 11:51:30,666 [gridgain-#6%authGrid%:grid-job-worker] DEBUG 
[hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root 
region location, 
connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@211c7f8d;
 hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
...
...
...
The above pattern keeps repeating.
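
A minimal sketch of how the interrupt theory could be checked in isolation 
(this is not code from our application; the class and method names are 
hypothetical, and it assumes Java 7 or later).  It sets the interrupt status 
on the current thread and then calls SocketChannel.connect(), the same NIO 
call that appears in the trace above.  The channel javadoc says a thread whose 
interrupt status is already set gets ClosedByInterruptException immediately 
on a blocking channel operation.  That exception carries no detail message, 
which would match the "because: null" entries.

```java
import java.net.InetSocketAddress;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, not part of HBase or of our application.
final class InterruptedConnectDemo {

    // Tries a blocking connect while the calling thread's interrupt
    // status is set, and reports what happened as a string.
    static String connectWithPendingInterrupt() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Listen on an ephemeral loopback port so the target is reachable.
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            InetSocketAddress addr = (InetSocketAddress) server.getLocalAddress();

            // Simulate some other code having interrupted this thread earlier.
            Thread.currentThread().interrupt();

            try (SocketChannel channel = SocketChannel.open()) {
                channel.connect(addr);
                return "connected";
            } catch (ClosedByInterruptException e) {
                // getMessage() is null because the exception has no detail message.
                return e.getClass().getSimpleName() + ": " + e.getMessage();
            } finally {
                // Clear the interrupt status so the caller is not affected.
                Thread.interrupted();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(connectWithPendingInterrupt());
    }
}
```

If this reports ClosedByInterruptException with a null message, then a 
worker thread whose interrupt status was left set (for example by the grid 
framework cancelling a job, which is only a guess at this point) would fail 
every connect attempt in the same way until the status is cleared.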

******************************************



Regards,
Srikanth


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel 
Cryans
Sent: Monday, August 22, 2011 2:32 AM
To: [email protected]
Subject: Re: Query regarding HTable.get and timeouts

Yeah that null message isn't really helpful :)

So one thing that might be helpful would be to know who DC1AuthDFSC1D3
is, since you identified the logs as "Region server n".

Then look at the master's web UI and see where -ROOT- is assigned. Is
it also DC1AuthDFSC1D3?

If so, then I would proceed by checking if there's a firewall in
between the client and the cluster, also I would make sure that the
client is running the same version as the server.

J-D

On Sat, Aug 20, 2011 at 5:56 AM, Srikanth P. Shreenivas
<[email protected]> wrote:
> Further in this investigation, we enabled the debug logs on client side.
>
> We are observing that the client is trying to locate the root region, and is 
> continuously failing to do so.  The logs are filled with entries like this:
>
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG 
> [hbase.client.HConnectionManager$HConnectionImplementation]  - Lookedup root 
> region location, 
> connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3;
>  hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
> 2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG 
> [hbase.client.HConnectionManager$HConnectionImplementation]  - 
> locateRegionInMeta parentTable=-ROOT-, metaLocation=address: 
> DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 
> 10 failed; retrying after sleep of 1000
> because: null
>
> Client keeps retrying and retries get exhausted.
>
>
> Complete logs are available here: https://gist.github.com/1159064  including 
> logs of master, zookeeper and region servers.
>
>
> If you can please look at the logs and provide some inputs on this issue, 
> then it will be really helpful.
> We are really not sure why client is failing to get root regions from the 
> server.  Any guidance will be greatly appreciated.
>
>
> Thanks a lot,
> Srikanth
