Re: Bit of help debugging a TIMED OUT session please

2010-02-23 Thread Patrick Hunt

Stack you might look at the following:

1) why does server 14 have such a low recv count?

Received: 194

while the other servers are at 3.7k + received. Did server 14 fail at 
some point? Or it's network? This may have caused the timeout seen by 
the client:


--snippet-
2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
java.io.IOException: TIMED OUT
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

---

2) connection timeout is different from session timeout. connection 
timeout is the amount of time we allow for connection establishment 
(socket open) until the server accepts the connection, this value is the 
session timeout (as requested by the client) divided by the number of 
hosts in the host list. This could account for why the timeout (above 
snippet) occurred after 5 seconds. What timeout value is this client 
using? 15 seconds?


Patrick

Stack wrote:

Dang.  Didn't save the log.  Pardon me.

I pasted exceptions only and thought it all about 0x26ed968d880001
session but now I see that what I posted above has TIMED_OUT on
another session altogether.  Above I skipped pasting exceptions
thinking them on the same session but now it seems they probably were
not.

I'm trying to track a case where zk seems of a sudden, client-side, to
give up the ghost w/ exceptions like those pasted above --
connectivity probs.  There has been pollution in here where long gc
pauses that are  session timeout would trigger TIMED_OUT but those
have been tamed.

I'll be back if I get another instance on my hook.

Meantime, thanks for the comments.

St.Ack

On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar maha...@yahoo-inc.com wrote:

HI stack,
 the other interesting part is with the session:
0x26ed968d880001

Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
see any of these messages: Attempting connection to server in the logs
before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
   at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
   at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and


From the cient 0x26ed968d880001?

Thanks
mahadev


On 2/22/10 11:42 AM, Stack st...@duboce.net wrote:


The thing that seems odd to me is that the connectivity complaints are
out of the zk client, right?, why is it failing getting to member 14
and why not move to another ensemble member if issue w/ 14?, and if
there were a general connectivity issue, I'd think that the running
hbase cluster would be complaining at about the same time (its talking
to datanodes and masters at this time).

(Thanks for the input lads)

St.Ack


On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar maha...@yahoo-inc.com wrote:

I also looked at the logs. Ted might have a point. It does look like that
zookeeper server's are doing fine (though as ted mentions the skew is a
little concerning, though that might be due to very few packets served by
the first server). Other than that the latencies of 300 ms at max should not
cause any timeouts.
Also, the number of packets received is pretty low - meaning that it wasn't
serving huge traffic. Is there anyway we can check if the network connection
from the client to the server is not flaky?

Thanks
mahadev


On 2/22/10 10:40 AM, Ted Dunning ted.dunn...@gmail.com wrote:


Not sure this helps at all, but these times are remarkably asymmetrical.  I
would expect members of a ZK  cluster to have very comparable times.

Additionally, 345 ms is nowhere near large enough to cause a session to
expire.  My take is that ZK doesn't think it caused the timeout.

On Mon, Feb 22, 2010 at 10:18 AM, Stack st...@duboce.net wrote:


   Latency min/avg/max: 2/125/345
...
   Latency min/avg/max: 0/7/81
...
   Latency min/avg/max: 1/1/1

Thanks for any pointers on how to debug.







Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Ted Dunning
Not sure this helps at all, but these times are remarkably asymmetrical.  I
would expect members of a ZK  cluster to have very comparable times.

Additionally, 345 ms is nowhere near large enough to cause a session to
expire.  My take is that ZK doesn't think it caused the timeout.

On Mon, Feb 22, 2010 at 10:18 AM, Stack st...@duboce.net wrote:

Latency min/avg/max: 2/125/345
 ...
Latency min/avg/max: 0/7/81
 ...
Latency min/avg/max: 1/1/1

 Thanks for any pointers on how to debug.



Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Mahadev Konar
HI stack,
 the other interesting part is with the session:
0x26ed968d880001

Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
see any of these messages: Attempting connection to server in the logs
before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
at 
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and


From the cient 0x26ed968d880001?

Thanks
mahadev


On 2/22/10 11:42 AM, Stack st...@duboce.net wrote:

 The thing that seems odd to me is that the connectivity complaints are
 out of the zk client, right?, why is it failing getting to member 14
 and why not move to another ensemble member if issue w/ 14?, and if
 there were a general connectivity issue, I'd think that the running
 hbase cluster would be complaining at about the same time (its talking
 to datanodes and masters at this time).
 
 (Thanks for the input lads)
 
 St.Ack
 
 
 On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar maha...@yahoo-inc.com wrote:
 I also looked at the logs. Ted might have a point. It does look like that
 zookeeper server's are doing fine (though as ted mentions the skew is a
 little concerning, though that might be due to very few packets served by
 the first server). Other than that the latencies of 300 ms at max should not
 cause any timeouts.
 Also, the number of packets received is pretty low - meaning that it wasn't
 serving huge traffic. Is there anyway we can check if the network connection
 from the client to the server is not flaky?
 
 Thanks
 mahadev
 
 
 On 2/22/10 10:40 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Not sure this helps at all, but these times are remarkably asymmetrical.  I
 would expect members of a ZK  cluster to have very comparable times.
 
 Additionally, 345 ms is nowhere near large enough to cause a session to
 expire.  My take is that ZK doesn't think it caused the timeout.
 
 On Mon, Feb 22, 2010 at 10:18 AM, Stack st...@duboce.net wrote:
 
        Latency min/avg/max: 2/125/345
 ...
        Latency min/avg/max: 0/7/81
 ...
        Latency min/avg/max: 1/1/1
 
 Thanks for any pointers on how to debug.