Hey Lads: Any chance of some pointers debugging a session TIMED OUT?
Client is hosted inside an hbase regionserver. Usually session timeout is because of some fat GC pause that is longer than session timeout but thats not the case here. It seems to be a connectivity problem. Let me post a few logs. You fellas might have seen it before or at least can give me pointers or where to go look next. Alls well for hours -- I see pings on the problem session reported at some regularity with 1ms latency usually -- and then: 2010-02-21 18:23:50,158 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) and.... 2010-02-21 18:23:54,240 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@33bcd1d5 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) There is one more of the above and then: 2010-02-21 18:23:55,095 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input java.net.SocketException: Transport endpoint is not connected at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) Then another of the above follows immediately. Then I see: 2010-02-21 18:23:55,583 [main-SendThread] INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 14.u.XXX.com/X.X.X.141:2181 2010-02-21 18:24:00,423 [regionserver/208.76.44.140:60020.compactor-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) The server 14.u.XXX.com was likely under load at the time. The ensemble member does not have a dedicated disk (this is a test cluster). Could that have been it? I have logs from under the 14.u.XXX.com zk ensemble member but nothing in there, no exceptions at least. Here is a zk dump: Quorum Server Statistics: - 14.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /X.X.X.139:49535[1](queued=0,recved=194,sent=0) /X.X.X.141:48336[1](queued=0,recved=0,sent=0) Latency min/avg/max: 2/125/345 Received: 194 Sent: 0 Outstanding: 0 Zxid: 0x300000004 Mode: follower Node count: 11 - 13.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:41308[1](queued=0,recved=10,sent=0) /X.X.X.141:39900[1](queued=0,recved=3773,sent=0) /X.X.X.141:36583[1](queued=0,recved=0,sent=0) /X.X.X.139:59343[1](queued=0,recved=3773,sent=0) Latency min/avg/max: 0/7/81 Received: 7556 Sent: 0 Outstanding: 0 Zxid: 0x300000004 Mode: leader Node count: 11 - 15.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:50573[1](queued=0,recved=3773,sent=0) /X.X.X.141:57536[1](queued=0,recved=0,sent=0) Latency min/avg/max: 1/1/1 Received: 3773 Sent: 0 Outstanding: 0 Zxid: 0x300000004 Mode: follower Node count: 11 Thanks for any pointers on how to debug. St.Ack