Re: Bit of help debugging a TIMED OUT session please

2010-02-23 Thread Patrick Hunt
Hard to say based on the bits/pieces of the log we have access to. I'd 
have to see the full log, preferably from both the server and client, to 
gain more insight.


re low numbers, this is the received count for the server, this should 
always increase never decrease. The fact that it is so low either 
indicates that the server recently restarted, or clients are not 
attaching to it. Seems like it should be near the other servers but 
again, hard to tell based on the small aperture we have via mail.


Patrick

Stack wrote:

Thanks Patrick.  See below.

On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt  wrote:

Stack you might look at the following:

1) why does server 14 have such a low recv count?

   Received: 194

while the other servers are at 3.7k + received. Did server 14 fail at some
point? Or it's network? This may have caused the timeout seen by the client:



Ok.  Will check into this the next time.  I did take the dump after
the observed TIMED_OUT, a good while after.  Could this be why the
numbers are low?


--snippet-
2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
java.io.IOException: TIMED OUT
   at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
---

2) connection timeout is different from session timeout. connection timeout
is the amount of time we allow for connection establishment (socket open)
until the server accepts the connection, this value is the session timeout
(as requested by the client) divided by the number of hosts in the host
list. This could account for why the timeout (above snippet) occurred after
5 seconds. What timeout value is this client using? 15 seconds?


We ask for a session timeout of 60 seconds -- the hbase default -- and
our ticktime is 3 seconds.

You are not troubled at all by the exceptions closing sessions above?
 Are these just noise?

Thanks for the input,
St.Ack


Re: Bit of help debugging a TIMED OUT session please

2010-02-23 Thread Stack
Thanks Patrick.  See below.

On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt  wrote:
> Stack you might look at the following:
>
> 1) why does server 14 have such a low recv count?
>
>        Received: 194
>
> while the other servers are at 3.7k + received. Did server 14 fail at some
> point? Or it's network? This may have caused the timeout seen by the client:
>

Ok.  Will check into this the next time.  I did take the dump after
the observed TIMED_OUT, a good while after.  Could this be why the
numbers are low?

> --snippet-
> 2010-02-21 18:23:55,583 [main-SendThread] INFO
> org.apache.zookeeper.ClientCnxn: Attempting connection to server
> 14.u.XXX.com/X.X.X.141:2181
> 2010-02-21 18:24:00,423
> [regionserver/208.76.44.140:60020.compactor-SendThread] WARN
> org.apache.zookeeper.ClientCnxn: Exception closing session
> 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
> java.io.IOException: TIMED OUT
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> ---
>
> 2) connection timeout is different from session timeout. connection timeout
> is the amount of time we allow for connection establishment (socket open)
> until the server accepts the connection, this value is the session timeout
> (as requested by the client) divided by the number of hosts in the host
> list. This could account for why the timeout (above snippet) occurred after
> 5 seconds. What timeout value is this client using? 15 seconds?
>
We ask for a session timeout of 60 seconds -- the hbase default -- and
our ticktime is 3 seconds.

You are not troubled at all by the exceptions closing sessions above?
 Are these just noise?

Thanks for the input,
St.Ack


Re: Bit of help debugging a TIMED OUT session please

2010-02-23 Thread Patrick Hunt

Stack you might look at the following:

1) why does server 14 have such a low recv count?

Received: 194

while the other servers are at 3.7k + received. Did server 14 fail at 
some point? Or it's network? This may have caused the timeout seen by 
the client:


--snippet-
2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
java.io.IOException: TIMED OUT
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

---

2) connection timeout is different from session timeout. connection 
timeout is the amount of time we allow for connection establishment 
(socket open) until the server accepts the connection, this value is the 
session timeout (as requested by the client) divided by the number of 
hosts in the host list. This could account for why the timeout (above 
snippet) occurred after 5 seconds. What timeout value is this client 
using? 15 seconds?


Patrick

Stack wrote:

Dang.  Didn't save the log.  Pardon me.

I pasted exceptions only and thought it all about 0x26ed968d880001
session but now I see that what I posted above has TIMED_OUT on
another session altogether.  Above I skipped pasting exceptions
thinking them on the same session but now it seems they probably were
not.

I'm trying to track a case where zk seems of a sudden, client-side, to
give up the ghost w/ exceptions like those pasted above --
connectivity probs.  There has been pollution in here where long gc
pauses that are > session timeout would trigger TIMED_OUT but those
have been tamed.

I'll be back if I get another instance on my hook.

Meantime, thanks for the comments.

St.Ack

On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar  wrote:

HI stack,
 the other interesting part is with the session:
0x26ed968d880001

Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
see any of these messages: "Attempting connection to server" in the logs
before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
   at
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
   at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and


From the cient 0x26ed968d880001?

Thanks
mahadev


On 2/22/10 11:42 AM, "Stack"  wrote:


The thing that seems odd to me is that the connectivity complaints are
out of the zk client, right?, why is it failing getting to member 14
and why not move to another ensemble member if issue w/ 14?, and if
there were a general connectivity issue, I'd think that the running
hbase cluster would be complaining at about the same time (its talking
to datanodes and masters at this time).

(Thanks for the input lads)

St.Ack


On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar  wrote:

I also looked at the logs. Ted might have a point. It does look like that
zookeeper server's are doing fine (though as ted mentions the skew is a
little concerning, though that might be due to very few packets served by
the first server). Other than that the latencies of 300 ms at max should not
cause any timeouts.
Also, the number of packets received is pretty low - meaning that it wasn't
serving huge traffic. Is there anyway we can check if the network connection
from the client to the server is not flaky?

Thanks
mahadev


On 2/22/10 10:40 AM, "Ted Dunning"  wrote:


Not sure this helps at all, but these times are remarkably asymmetrical.  I
would expect members of a ZK  cluster to have very comparable times.

Additionally, 345 ms is nowhere near large enough to cause a session to
expire.  My take is that ZK doesn't think it caused the timeout.

On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:


   Latency min/avg/max: 2/125/345
...
   Latency min/avg/max: 0/7/81
...
   Latency min/avg/max: 1/1/1

Thanks for any pointers on how to debug.







Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Stack
Dang.  Didn't save the log.  Pardon me.

I pasted exceptions only and thought it all about 0x26ed968d880001
session but now I see that what I posted above has TIMED_OUT on
another session altogether.  Above I skipped pasting exceptions
thinking them on the same session but now it seems they probably were
not.

I'm trying to track a case where zk seems of a sudden, client-side, to
give up the ghost w/ exceptions like those pasted above --
connectivity probs.  There has been pollution in here where long gc
pauses that are > session timeout would trigger TIMED_OUT but those
have been tamed.

I'll be back if I get another instance on my hook.

Meantime, thanks for the comments.

St.Ack

On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar  wrote:
> HI stack,
>  the other interesting part is with the session:
> 0x26ed968d880001
>
> Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
> see any of these messages: "Attempting connection to server" in the logs
> before you see all the consecutive
>
> org.apache.zookeeper.ClientCnxn: Exception closing session
> 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
> java.io.IOException: Read error rc = -1
> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>        at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>
> and
>
>
> From the cient 0x26ed968d880001?
>
> Thanks
> mahadev
>
>
> On 2/22/10 11:42 AM, "Stack"  wrote:
>
>> The thing that seems odd to me is that the connectivity complaints are
>> out of the zk client, right?, why is it failing getting to member 14
>> and why not move to another ensemble member if issue w/ 14?, and if
>> there were a general connectivity issue, I'd think that the running
>> hbase cluster would be complaining at about the same time (its talking
>> to datanodes and masters at this time).
>>
>> (Thanks for the input lads)
>>
>> St.Ack
>>
>>
>> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar  
>> wrote:
>>> I also looked at the logs. Ted might have a point. It does look like that
>>> zookeeper server's are doing fine (though as ted mentions the skew is a
>>> little concerning, though that might be due to very few packets served by
>>> the first server). Other than that the latencies of 300 ms at max should not
>>> cause any timeouts.
>>> Also, the number of packets received is pretty low - meaning that it wasn't
>>> serving huge traffic. Is there anyway we can check if the network connection
>>> from the client to the server is not flaky?
>>>
>>> Thanks
>>> mahadev
>>>
>>>
>>> On 2/22/10 10:40 AM, "Ted Dunning"  wrote:
>>>
 Not sure this helps at all, but these times are remarkably asymmetrical.  I
 would expect members of a ZK  cluster to have very comparable times.

 Additionally, 345 ms is nowhere near large enough to cause a session to
 expire.  My take is that ZK doesn't think it caused the timeout.

 On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:

>        Latency min/avg/max: 2/125/345
> ...
>        Latency min/avg/max: 0/7/81
> ...
>        Latency min/avg/max: 1/1/1
>
> Thanks for any pointers on how to debug.
>
>>>
>>>
>
>


Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Mahadev Konar
HI stack,
 the other interesting part is with the session:
0x26ed968d880001

Looks like it gets disconnected from one of the servers (TIMEOUT). DO you
see any of these messages: "Attempting connection to server" in the logs
before you see all the consecutive

org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
at 
org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and


>From the cient 0x26ed968d880001?

Thanks
mahadev


On 2/22/10 11:42 AM, "Stack"  wrote:

> The thing that seems odd to me is that the connectivity complaints are
> out of the zk client, right?, why is it failing getting to member 14
> and why not move to another ensemble member if issue w/ 14?, and if
> there were a general connectivity issue, I'd think that the running
> hbase cluster would be complaining at about the same time (its talking
> to datanodes and masters at this time).
> 
> (Thanks for the input lads)
> 
> St.Ack
> 
> 
> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar  wrote:
>> I also looked at the logs. Ted might have a point. It does look like that
>> zookeeper server's are doing fine (though as ted mentions the skew is a
>> little concerning, though that might be due to very few packets served by
>> the first server). Other than that the latencies of 300 ms at max should not
>> cause any timeouts.
>> Also, the number of packets received is pretty low - meaning that it wasn't
>> serving huge traffic. Is there anyway we can check if the network connection
>> from the client to the server is not flaky?
>> 
>> Thanks
>> mahadev
>> 
>> 
>> On 2/22/10 10:40 AM, "Ted Dunning"  wrote:
>> 
>>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>>> would expect members of a ZK  cluster to have very comparable times.
>>> 
>>> Additionally, 345 ms is nowhere near large enough to cause a session to
>>> expire.  My take is that ZK doesn't think it caused the timeout.
>>> 
>>> On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:
>>> 
        Latency min/avg/max: 2/125/345
 ...
        Latency min/avg/max: 0/7/81
 ...
        Latency min/avg/max: 1/1/1
 
 Thanks for any pointers on how to debug.
 
>> 
>> 



Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Stack
The thing that seems odd to me is that the connectivity complaints are
out of the zk client, right?, why is it failing getting to member 14
and why not move to another ensemble member if issue w/ 14?, and if
there were a general connectivity issue, I'd think that the running
hbase cluster would be complaining at about the same time (its talking
to datanodes and masters at this time).

(Thanks for the input lads)

St.Ack


On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar  wrote:
> I also looked at the logs. Ted might have a point. It does look like that
> zookeeper server's are doing fine (though as ted mentions the skew is a
> little concerning, though that might be due to very few packets served by
> the first server). Other than that the latencies of 300 ms at max should not
> cause any timeouts.
> Also, the number of packets received is pretty low - meaning that it wasn't
> serving huge traffic. Is there anyway we can check if the network connection
> from the client to the server is not flaky?
>
> Thanks
> mahadev
>
>
> On 2/22/10 10:40 AM, "Ted Dunning"  wrote:
>
>> Not sure this helps at all, but these times are remarkably asymmetrical.  I
>> would expect members of a ZK  cluster to have very comparable times.
>>
>> Additionally, 345 ms is nowhere near large enough to cause a session to
>> expire.  My take is that ZK doesn't think it caused the timeout.
>>
>> On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:
>>
>>>        Latency min/avg/max: 2/125/345
>>> ...
>>>        Latency min/avg/max: 0/7/81
>>> ...
>>>        Latency min/avg/max: 1/1/1
>>>
>>> Thanks for any pointers on how to debug.
>>>
>
>


Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Mahadev Konar
I also looked at the logs. Ted might have a point. It does look like that
zookeeper server's are doing fine (though as ted mentions the skew is a
little concerning, though that might be due to very few packets served by
the first server). Other than that the latencies of 300 ms at max should not
cause any timeouts.
Also, the number of packets received is pretty low - meaning that it wasn't
serving huge traffic. Is there anyway we can check if the network connection
from the client to the server is not flaky?

Thanks
mahadev


On 2/22/10 10:40 AM, "Ted Dunning"  wrote:

> Not sure this helps at all, but these times are remarkably asymmetrical.  I
> would expect members of a ZK  cluster to have very comparable times.
> 
> Additionally, 345 ms is nowhere near large enough to cause a session to
> expire.  My take is that ZK doesn't think it caused the timeout.
> 
> On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:
> 
>>Latency min/avg/max: 2/125/345
>> ...
>>Latency min/avg/max: 0/7/81
>> ...
>>Latency min/avg/max: 1/1/1
>> 
>> Thanks for any pointers on how to debug.
>> 



Re: Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Ted Dunning
Not sure this helps at all, but these times are remarkably asymmetrical.  I
would expect members of a ZK  cluster to have very comparable times.

Additionally, 345 ms is nowhere near large enough to cause a session to
expire.  My take is that ZK doesn't think it caused the timeout.

On Mon, Feb 22, 2010 at 10:18 AM, Stack  wrote:

>Latency min/avg/max: 2/125/345
> ...
>Latency min/avg/max: 0/7/81
> ...
>Latency min/avg/max: 1/1/1
>
> Thanks for any pointers on how to debug.
>


Bit of help debugging a TIMED OUT session please

2010-02-22 Thread Stack
Hey Lads:

Any chance of some pointers debugging a session TIMED OUT?

Client is hosted inside an hbase regionserver.  Usually session
timeout is because of some fat GC pause that is longer than session
timeout but thats not the case here.  It seems to be a connectivity
problem.  Let me post a few logs.  You fellas might have seen it
before or at least can give me pointers or where to go look next.

Alls well for hours -- I see pings on the problem session reported at
some regularity with 1ms latency usually -- and then:

2010-02-21 18:23:50,158 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

and

2010-02-21 18:23:54,240 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x26ed968d880001 to sun.nio.ch.selectionkeyi...@33bcd1d5
java.io.IOException: Read error rc = -1
java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)

There is one more of the above and then:

2010-02-21 18:23:55,095 [main-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown
input
java.net.SocketException: Transport endpoint is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at 
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
at 
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)

Then another of the above follows immediately.

Then I see:

2010-02-21 18:23:55,583 [main-SendThread] INFO
org.apache.zookeeper.ClientCnxn: Attempting connection to server
14.u.XXX.com/X.X.X.141:2181
2010-02-21 18:24:00,423
[regionserver/208.76.44.140:60020.compactor-SendThread] WARN
org.apache.zookeeper.ClientCnxn: Exception closing session
0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)

The server 14.u.XXX.com was likely under load at the time.  The
ensemble member does not have a dedicated disk (this is a test
cluster).  Could that have been it?  I have logs from under the
14.u.XXX.com zk ensemble member but nothing in there, no exceptions at
least.

Here is a zk dump:

  Quorum Server Statistics:
- 14.u.XXX.com:2181
Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
Clients:
 /X.X.X.139:49535[1](queued=0,recved=194,sent=0)
 /X.X.X.141:48336[1](queued=0,recved=0,sent=0)

Latency min/avg/max: 2/125/345
Received: 194
Sent: 0
Outstanding: 0
Zxid: 0x30004
Mode: follower
Node count: 11
- 13.u.XXX.com:2181
Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
Clients:
 
/fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:41308[1](queued=0,recved=10,sent=0)
 /X.X.X.141:39900[1](queued=0,recved=3773,sent=0)
 /X.X.X.141:36583[1](queued=0,recved=0,sent=0)
 /X.X.X.139:59343[1](queued=0,recved=3773,sent=0)

Latency min/avg/max: 0/7/81
Received: 7556
Sent: 0
Outstanding: 0
Zxid: 0x30004
Mode: leader
Node count: 11
- 15.u.XXX.com:2181
Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
Clients:
 
/fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:50573[1](queued=0,recved=3773,sent=0)
 /X.X.X.141:57536[1](queued=0,recved=0,sent=0)

Latency min/avg/max: 1/1/1
Received: 3773
Sent: 0
Outstanding: 0
Zxid: 0x30004
Mode: follower
Node count: 11

Thanks for any pointers on how to debug.

St.Ack