Re: Bit of help debugging a TIMED OUT session please
Hard to say based on the bits/pieces of the log we have access to. I'd have to see the full log, preferably from both the server and client, to gain more insight. re low numbers, this is the received count for the server, this should always increase never decrease. The fact that it is so low either indicates that the server recently restarted, or clients are not attaching to it. Seems like it should be near the other servers but again, hard to tell based on the small aperture we have via mail. Patrick Stack wrote: Thanks Patrick. See below. On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt wrote: Stack you might look at the following: 1) why does server 14 have such a low recv count? Received: 194 while the other servers are at 3.7k + received. Did server 14 fail at some point? Or it's network? This may have caused the timeout seen by the client: Ok. Will check into this the next time. I did take the dump after the observed TIMED_OUT, a good while after. Could this be why the numbers are low? --snippet- 2010-02-21 18:23:55,583 [main-SendThread] INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 14.u.XXX.com/X.X.X.141:2181 2010-02-21 18:24:00,423 [regionserver/208.76.44.140:60020.compactor-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) --- 2) connection timeout is different from session timeout. connection timeout is the amount of time we allow for connection establishment (socket open) until the server accepts the connection, this value is the session timeout (as requested by the client) divided by the number of hosts in the host list. This could account for why the timeout (above snippet) occurred after 5 seconds. What timeout value is this client using? 15 seconds? We ask for a session timeout of 60 seconds -- the hbase default -- and our ticktime is 3 seconds. You are not troubled at all by the exceptions closing sessions above? Are these just noise? Thanks for the input, St.Ack
Re: Bit of help debugging a TIMED OUT session please
Thanks Patrick. See below. On Tue, Feb 23, 2010 at 1:19 PM, Patrick Hunt wrote: > Stack you might look at the following: > > 1) why does server 14 have such a low recv count? > > Received: 194 > > while the other servers are at 3.7k + received. Did server 14 fail at some > point? Or it's network? This may have caused the timeout seen by the client: > Ok. Will check into this the next time. I did take the dump after the observed TIMED_OUT, a good while after. Could this be why the numbers are low? > --snippet- > 2010-02-21 18:23:55,583 [main-SendThread] INFO > org.apache.zookeeper.ClientCnxn: Attempting connection to server > 14.u.XXX.com/X.X.X.141:2181 > 2010-02-21 18:24:00,423 > [regionserver/208.76.44.140:60020.compactor-SendThread] WARN > org.apache.zookeeper.ClientCnxn: Exception closing session > 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3 > java.io.IOException: TIMED OUT > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) > --- > > 2) connection timeout is different from session timeout. connection timeout > is the amount of time we allow for connection establishment (socket open) > until the server accepts the connection, this value is the session timeout > (as requested by the client) divided by the number of hosts in the host > list. This could account for why the timeout (above snippet) occurred after > 5 seconds. What timeout value is this client using? 15 seconds? > We ask for a session timeout of 60 seconds -- the hbase default -- and our ticktime is 3 seconds. You are not troubled at all by the exceptions closing sessions above? Are these just noise? Thanks for the input, St.Ack
Re: Bit of help debugging a TIMED OUT session please
Stack you might look at the following: 1) why does server 14 have such a low recv count? Received: 194 while the other servers are at 3.7k + received. Did server 14 fail at some point? Or it's network? This may have caused the timeout seen by the client: --snippet- 2010-02-21 18:23:55,583 [main-SendThread] INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 14.u.XXX.com/X.X.X.141:2181 2010-02-21 18:24:00,423 [regionserver/208.76.44.140:60020.compactor-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) --- 2) connection timeout is different from session timeout. connection timeout is the amount of time we allow for connection establishment (socket open) until the server accepts the connection, this value is the session timeout (as requested by the client) divided by the number of hosts in the host list. This could account for why the timeout (above snippet) occurred after 5 seconds. What timeout value is this client using? 15 seconds? Patrick Stack wrote: Dang. Didn't save the log. Pardon me. I pasted exceptions only and thought it all about 0x26ed968d880001 session but now I see that what I posted above has TIMED_OUT on another session altogether. Above I skipped pasting exceptions thinking them on the same session but now it seems they probably were not. I'm trying to track a case where zk seems of a sudden, client-side, to give up the ghost w/ exceptions like those pasted above -- connectivity probs. There has been pollution in here where long gc pauses that are > session timeout would trigger TIMED_OUT but those have been tamed. I'll be back if I get another instance on my hook. Meantime, thanks for the comments. St.Ack On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar wrote: HI stack, the other interesting part is with the session: 0x26ed968d880001 Looks like it gets disconnected from one of the servers (TIMEOUT). DO you see any of these messages: "Attempting connection to server" in the logs before you see all the consecutive org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) and From the cient 0x26ed968d880001? Thanks mahadev On 2/22/10 11:42 AM, "Stack" wrote: The thing that seems odd to me is that the connectivity complaints are out of the zk client, right?, why is it failing getting to member 14 and why not move to another ensemble member if issue w/ 14?, and if there were a general connectivity issue, I'd think that the running hbase cluster would be complaining at about the same time (its talking to datanodes and masters at this time). (Thanks for the input lads) St.Ack On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar wrote: I also looked at the logs. Ted might have a point. It does look like that zookeeper server's are doing fine (though as ted mentions the skew is a little concerning, though that might be due to very few packets served by the first server). Other than that the latencies of 300 ms at max should not cause any timeouts. Also, the number of packets received is pretty low - meaning that it wasn't serving huge traffic. Is there anyway we can check if the network connection from the client to the server is not flaky? Thanks mahadev On 2/22/10 10:40 AM, "Ted Dunning" wrote: Not sure this helps at all, but these times are remarkably asymmetrical. I would expect members of a ZK cluster to have very comparable times. Additionally, 345 ms is nowhere near large enough to cause a session to expire. My take is that ZK doesn't think it caused the timeout. On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: Latency min/avg/max: 2/125/345 ... Latency min/avg/max: 0/7/81 ... Latency min/avg/max: 1/1/1 Thanks for any pointers on how to debug.
Re: Bit of help debugging a TIMED OUT session please
Dang. Didn't save the log. Pardon me. I pasted exceptions only and thought it all about 0x26ed968d880001 session but now I see that what I posted above has TIMED_OUT on another session altogether. Above I skipped pasting exceptions thinking them on the same session but now it seems they probably were not. I'm trying to track a case where zk seems of a sudden, client-side, to give up the ghost w/ exceptions like those pasted above -- connectivity probs. There has been pollution in here where long gc pauses that are > session timeout would trigger TIMED_OUT but those have been tamed. I'll be back if I get another instance on my hook. Meantime, thanks for the comments. St.Ack On Mon, Feb 22, 2010 at 6:43 PM, Mahadev Konar wrote: > HI stack, > the other interesting part is with the session: > 0x26ed968d880001 > > Looks like it gets disconnected from one of the servers (TIMEOUT). DO you > see any of these messages: "Attempting connection to server" in the logs > before you see all the consecutive > > org.apache.zookeeper.ClientCnxn: Exception closing session > 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 > java.io.IOException: Read error rc = -1 > java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] > at > org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) > > and > > > From the cient 0x26ed968d880001? > > Thanks > mahadev > > > On 2/22/10 11:42 AM, "Stack" wrote: > >> The thing that seems odd to me is that the connectivity complaints are >> out of the zk client, right?, why is it failing getting to member 14 >> and why not move to another ensemble member if issue w/ 14?, and if >> there were a general connectivity issue, I'd think that the running >> hbase cluster would be complaining at about the same time (its talking >> to datanodes and masters at this time). >> >> (Thanks for the input lads) >> >> St.Ack >> >> >> On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar >> wrote: >>> I also looked at the logs. Ted might have a point. It does look like that >>> zookeeper server's are doing fine (though as ted mentions the skew is a >>> little concerning, though that might be due to very few packets served by >>> the first server). Other than that the latencies of 300 ms at max should not >>> cause any timeouts. >>> Also, the number of packets received is pretty low - meaning that it wasn't >>> serving huge traffic. Is there anyway we can check if the network connection >>> from the client to the server is not flaky? >>> >>> Thanks >>> mahadev >>> >>> >>> On 2/22/10 10:40 AM, "Ted Dunning" wrote: >>> Not sure this helps at all, but these times are remarkably asymmetrical. I would expect members of a ZK cluster to have very comparable times. Additionally, 345 ms is nowhere near large enough to cause a session to expire. My take is that ZK doesn't think it caused the timeout. On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: > Latency min/avg/max: 2/125/345 > ... > Latency min/avg/max: 0/7/81 > ... > Latency min/avg/max: 1/1/1 > > Thanks for any pointers on how to debug. > >>> >>> > >
Re: Bit of help debugging a TIMED OUT session please
HI stack, the other interesting part is with the session: 0x26ed968d880001 Looks like it gets disconnected from one of the servers (TIMEOUT). DO you see any of these messages: "Attempting connection to server" in the logs before you see all the consecutive org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) and >From the cient 0x26ed968d880001? Thanks mahadev On 2/22/10 11:42 AM, "Stack" wrote: > The thing that seems odd to me is that the connectivity complaints are > out of the zk client, right?, why is it failing getting to member 14 > and why not move to another ensemble member if issue w/ 14?, and if > there were a general connectivity issue, I'd think that the running > hbase cluster would be complaining at about the same time (its talking > to datanodes and masters at this time). > > (Thanks for the input lads) > > St.Ack > > > On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar wrote: >> I also looked at the logs. Ted might have a point. It does look like that >> zookeeper server's are doing fine (though as ted mentions the skew is a >> little concerning, though that might be due to very few packets served by >> the first server). Other than that the latencies of 300 ms at max should not >> cause any timeouts. >> Also, the number of packets received is pretty low - meaning that it wasn't >> serving huge traffic. Is there anyway we can check if the network connection >> from the client to the server is not flaky? >> >> Thanks >> mahadev >> >> >> On 2/22/10 10:40 AM, "Ted Dunning" wrote: >> >>> Not sure this helps at all, but these times are remarkably asymmetrical. I >>> would expect members of a ZK cluster to have very comparable times. >>> >>> Additionally, 345 ms is nowhere near large enough to cause a session to >>> expire. My take is that ZK doesn't think it caused the timeout. >>> >>> On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: >>> Latency min/avg/max: 2/125/345 ... Latency min/avg/max: 0/7/81 ... Latency min/avg/max: 1/1/1 Thanks for any pointers on how to debug. >> >>
Re: Bit of help debugging a TIMED OUT session please
The thing that seems odd to me is that the connectivity complaints are out of the zk client, right?, why is it failing getting to member 14 and why not move to another ensemble member if issue w/ 14?, and if there were a general connectivity issue, I'd think that the running hbase cluster would be complaining at about the same time (its talking to datanodes and masters at this time). (Thanks for the input lads) St.Ack On Mon, Feb 22, 2010 at 11:26 AM, Mahadev Konar wrote: > I also looked at the logs. Ted might have a point. It does look like that > zookeeper server's are doing fine (though as ted mentions the skew is a > little concerning, though that might be due to very few packets served by > the first server). Other than that the latencies of 300 ms at max should not > cause any timeouts. > Also, the number of packets received is pretty low - meaning that it wasn't > serving huge traffic. Is there anyway we can check if the network connection > from the client to the server is not flaky? > > Thanks > mahadev > > > On 2/22/10 10:40 AM, "Ted Dunning" wrote: > >> Not sure this helps at all, but these times are remarkably asymmetrical. I >> would expect members of a ZK cluster to have very comparable times. >> >> Additionally, 345 ms is nowhere near large enough to cause a session to >> expire. My take is that ZK doesn't think it caused the timeout. >> >> On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: >> >>> Latency min/avg/max: 2/125/345 >>> ... >>> Latency min/avg/max: 0/7/81 >>> ... >>> Latency min/avg/max: 1/1/1 >>> >>> Thanks for any pointers on how to debug. >>> > >
Re: Bit of help debugging a TIMED OUT session please
I also looked at the logs. Ted might have a point. It does look like that zookeeper server's are doing fine (though as ted mentions the skew is a little concerning, though that might be due to very few packets served by the first server). Other than that the latencies of 300 ms at max should not cause any timeouts. Also, the number of packets received is pretty low - meaning that it wasn't serving huge traffic. Is there anyway we can check if the network connection from the client to the server is not flaky? Thanks mahadev On 2/22/10 10:40 AM, "Ted Dunning" wrote: > Not sure this helps at all, but these times are remarkably asymmetrical. I > would expect members of a ZK cluster to have very comparable times. > > Additionally, 345 ms is nowhere near large enough to cause a session to > expire. My take is that ZK doesn't think it caused the timeout. > > On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: > >>Latency min/avg/max: 2/125/345 >> ... >>Latency min/avg/max: 0/7/81 >> ... >>Latency min/avg/max: 1/1/1 >> >> Thanks for any pointers on how to debug. >>
Re: Bit of help debugging a TIMED OUT session please
Not sure this helps at all, but these times are remarkably asymmetrical. I would expect members of a ZK cluster to have very comparable times. Additionally, 345 ms is nowhere near large enough to cause a session to expire. My take is that ZK doesn't think it caused the timeout. On Mon, Feb 22, 2010 at 10:18 AM, Stack wrote: >Latency min/avg/max: 2/125/345 > ... >Latency min/avg/max: 0/7/81 > ... >Latency min/avg/max: 1/1/1 > > Thanks for any pointers on how to debug. >
Bit of help debugging a TIMED OUT session please
Hey Lads: Any chance of some pointers debugging a session TIMED OUT? Client is hosted inside an hbase regionserver. Usually session timeout is because of some fat GC pause that is longer than session timeout but thats not the case here. It seems to be a connectivity problem. Let me post a few logs. You fellas might have seen it before or at least can give me pointers or where to go look next. Alls well for hours -- I see pings on the problem session reported at some regularity with 1ms latency usually -- and then: 2010-02-21 18:23:50,158 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@788ab708 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) and 2010-02-21 18:23:54,240 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x26ed968d880001 to sun.nio.ch.selectionkeyi...@33bcd1d5 java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4] at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) There is one more of the above and then: 2010-02-21 18:23:55,095 [main-SendThread] WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input java.net.SocketException: Transport endpoint is not connected at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640) at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970) Then another of the above follows immediately. Then I see: 2010-02-21 18:23:55,583 [main-SendThread] INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 14.u.XXX.com/X.X.X.141:2181 2010-02-21 18:24:00,423 [regionserver/208.76.44.140:60020.compactor-SendThread] WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x226ed968a270003 to sun.nio.ch.selectionkeyi...@2a50e9a3 java.io.IOException: TIMED OUT at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) The server 14.u.XXX.com was likely under load at the time. The ensemble member does not have a dedicated disk (this is a test cluster). Could that have been it? I have logs from under the 14.u.XXX.com zk ensemble member but nothing in there, no exceptions at least. Here is a zk dump: Quorum Server Statistics: - 14.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /X.X.X.139:49535[1](queued=0,recved=194,sent=0) /X.X.X.141:48336[1](queued=0,recved=0,sent=0) Latency min/avg/max: 2/125/345 Received: 194 Sent: 0 Outstanding: 0 Zxid: 0x30004 Mode: follower Node count: 11 - 13.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:41308[1](queued=0,recved=10,sent=0) /X.X.X.141:39900[1](queued=0,recved=3773,sent=0) /X.X.X.141:36583[1](queued=0,recved=0,sent=0) /X.X.X.139:59343[1](queued=0,recved=3773,sent=0) Latency min/avg/max: 0/7/81 Received: 7556 Sent: 0 Outstanding: 0 Zxid: 0x30004 Mode: leader Node count: 11 - 15.u.XXX.com:2181 Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT Clients: /fdfb:c1e8:7640:2:2e0:81ff:fe41:d9ae:50573[1](queued=0,recved=3773,sent=0) /X.X.X.141:57536[1](queued=0,recved=0,sent=0) Latency min/avg/max: 1/1/1 Received: 3773 Sent: 0 Outstanding: 0 Zxid: 0x30004 Mode: follower Node count: 11 Thanks for any pointers on how to debug. St.Ack