Hey Lads:

Any chance of some pointers debugging a session TIMED OUT?

Client is hosted inside an hbase regionserver.  Usually session
timeout is because of some fat GC pause that is longer than session
timeout but thats not the case here.  It seems to be a connectivity
problem.  Let me post a few logs.  You fellas might have seen it
before or at least can give me pointers or where to go look next.

Alls well for hours -- I see pings on the problem session reported at
some regularity with 1ms latency usually -- and then:

2010-02-21 18:23:50,158 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and....

2010-02-21 18:23:54,240 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@33bcd1d5
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

There is one more of the above and then:

2010-02-21 18:23:55,095 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown
input
java.net.SocketException: Transport endpoint is not connected
        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
        at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)

Then another of the above follows immediately.

Then I see:

2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

The server 14.u.XXX.com was likely under load at the time.  The
ensemble member does not have a dedicated disk (this is a test
cluster).  Could that have been it?  I have logs from under the
14.u.XXX.com zk ensemble member but nothing in there, no exceptions at
least.

Here is a zk dump:

  Quorum Server Statistics:
    - 14.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         /X.X.X.139:49535[1](queued=0,recved=194,sent=0)
         /X.X.X.141:48336[1](queued=0,recved=0,sent=0)

        Latency min/avg/max: 2/125/345
        Received: 194
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: follower
        Node count: 11
    - 13.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         
/fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:41308[1](queued=0,recved=10,sent=0)
         /X.X.X.141:39900[1](queued=0,recved=3773,sent=0)
         /X.X.X.141:36583[1](queued=0,recved=0,sent=0)
         /X.X.X.139:59343[1](queued=0,recved=3773,sent=0)

        Latency min/avg/max: 0/7/81
        Received: 7556
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: leader
        Node count: 11
    - 15.u.XXX.com:2181
        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
        Clients:
         
/fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:50573[1](queued=0,recved=3773,sent=0)
         /X.X.X.141:57536[1](queued=0,recved=0,sent=0)

        Latency min/avg/max: 1/1/1
        Received: 3773
        Sent: 0
        Outstanding: 0
        Zxid: 0x300000004
        Mode: follower
        Node count: 11

Thanks for any pointers on how to debug.

St.Ack

Reply via email to