Re: permanent ZSESSIONMOVED
Hello Mahadev,

2010/3/18 Mahadev Konar <maha...@yahoo-inc.com>:
> Hi Lukasz,
> I looked at the logs and haven't been able to determine how the session
> move can happen. You should go ahead and open a JIRA for this! Please
> attach the logs to that JIRA. Mark it for 3.3.0 until we can say that it
> isn't a problem as of now.

I created an issue for this: https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as 3.2.2, as that is the version we are using.

> Also, please go ahead and attach non-filtered logs to the JIRA for the
> same timeline as you had done earlier (by non-filtered I mean not grepping
> for the session id). Also, do attach the earlier logs you had emailed.

I attached the logs in the JIRA.

> Also, may I know what function you are using to print this statement:
>
>   Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
>   ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback
>   error='session moved to another server, so operation is ignored'
>
> (If you can cut and paste that code on the JIRA that you create, it would
> be helpful!)...
>
> Thanks
> mahadev

Info added to the JIRA.

On 3/17/10 2:56 AM, Łukasz Osipiuk <luk...@osipiuk.net> wrote:

Ok. I analysed some logs from the time the problem last occurred.

brokensession.txt - logs concerning the session which had the problem
leader.txt - logs concerning leader election (I grepped for "Leader" in zookeeper.log)

Some observations below:
- The network glitch which resulted in the problem occurred at about 22:08.
- From what I see, node2 was the leader since 17:48 and it did not change later yesterday.
- The client was connected to node2 since 17:50.
- At around 22:09 the client tried to connect to every node (1, 2, 3). Connections to node1 and node3 were closed with "Exception causing close of session 0x22767e1c963 due to java.io.IOException: Read error". The connection to node2 stayed alive.
- All subsequent operations were refused with the ZSESSIONMOVED error, visible both on the client and on the server side.

Some more clarifications concerning questions from different posts:
1. When I say we have huge packet drop, I mean huge for a local network. We are still investigating its source. From time to time we observe that machines lose contact with the others for a few seconds.
2. We do not use any fancy network tricks. All machines are in a single VLAN and use permanently assigned IP addresses.

Let me know if anything more can help.

PS. Should I already create a JIRA issue for this, or is it not concrete enough?

Regards, Łukasz

On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <ph...@apache.org> wrote:

Yea, that's great. (No complaints on my end, just wasn't sure what you meant, wanted to make sure I was clear.) Can you identify some sort of pattern? We're discussing on our end, but this is the first time we've heard of such an issue (outside the 3.2.0 bug) and we're looking for a direction to hunt... anything you could add would help. Is it possible for you to upgrade your client/server to 3.2.2 and report the next failure, including any server/client logs if they are available? You should create a JIRA for this in order to track the details as they emerge (incl. attaching the logs, etc...).

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <ph...@apache.org> wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you tell if the MOVED happens in some particular scenario - say you are connected to a follower and move to a leader, or perhaps you are connected to server A, get disconnected and reconnected to server A? Is there some pattern that could help us understand what's causing this?

Session gets broken after zookeeper library reconnects to cluster, although most of reconnection leave session if working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first part about the session being broken after reconnect; what's the second part saying?)

Sorry for my English. I meant that it only happens from time to time. We get reconnects quite often due to large packet loss in our network, but most of the time they do not break the session. Is it clear now?

Regards, Łukasz Osipiuk

Regards, Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <ph...@apache.org> wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <ph...@apache.org> wrote:

Can you verify that you are using 3.2.2 on all servers? You can do this by running the stat command against each of your servers and looking at the very top of the output (we include the version of the zk server there). http://bit.ly/dglVld

Thanks for the hint. We had the 3.2.2 jars installed on all servers - I checked it previously, but apparently our script which restarts the instance after installing the .deb does not work very well, and node2 and node3 were running version 3.2.1. Could using 3.2.1 be the source of the problem? I heard this story where customer
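As an aside, the session-id filtering discussed above (grepping the logs for one session before attaching them to the JIRA) is easy to script. A minimal Python sketch; the sample lines are shortened from the server logs quoted later in the thread, and the second session id is invented for contrast:

```python
# Sketch: keep only the log lines that mention one session id - the same
# filtering otherwise done with grep when preparing log attachments.

def filter_by_session(lines, session_id):
    """Return the log lines that mention the given session id (hex string)."""
    return [line for line in lines if session_id in line]

# Shortened lines based on the server logs in this thread;
# session id 0x1234567890abcd is invented for contrast.
sample = [
    "2010-03-16 14:21:43,510 - INFO - Renewing session 0x324dcc1ba580085",
    "2010-03-16 14:21:45,002 - INFO - Renewing session 0x1234567890abcd",
    "2010-03-16 14:21:49,443 - WARN - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error",
]

for line in filter_by_session(sample, "0x324dcc1ba580085"):
    print(line)
```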
Re: permanent ZSESSIONMOVED
We've identified the problem and are working on a fix. Full details here: https://issues.apache.org/jira/browse/ZOOKEEPER-710

Patrick
Re: permanent ZSESSIONMOVED
Hi!

On Thu, Mar 18, 2010 at 21:23, Patrick Hunt <ph...@apache.org> wrote:
> We've identified the problem and are working on a fix. Full details here:
> https://issues.apache.org/jira/browse/ZOOKEEPER-710

That is great. Thank you very much for your work!

All the best, Łukasz
Re: permanent ZSESSIONMOVED
JIRA is great because we can move the discussion off the list and track it explicitly along with possible fixes. Also, if you create the JIRA you'll get updates on any changes to the record (you could 'watch' the JIRA instead, that's fine too).

Can you include the session id in your client logs? That would be very helpful for debugging. (It's available from the ZooKeeper class.) Best if you tar/gz the logs (zip is fine too). Having the full server logs (from all servers) and the client logs for the time/incident in question, we can follow along, if you will, pretty closely with what's happening. It should really shed some light.

Thanks for your help on this, we'd really like to nail this one down.

Regards, Patrick
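On the session-id logging Patrick asks for: the id the client library hands back is a 64-bit integer, while the server logs print it in hex (e.g. 0x324dcc1ba580085 in the logs above). A minimal sketch of rendering it the same way on the client side, so both sets of logs can be grepped for one string; the helper name here is my own, not from the thread:

```python
# Sketch: the ZooKeeper server logs print session ids as "0x%x". Logging
# the client's numeric session id in the same form makes it easy to grep
# client and server logs for the same session.

def format_session_id(session_id: int) -> str:
    """Render a numeric session id the way the ZooKeeper server logs it."""
    return "0x%x" % session_id

# The session id from the server logs quoted in this thread:
print(format_session_id(0x324dcc1ba580085))  # -> 0x324dcc1ba580085
```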
Re: permanent ZSESSIONMOVED
do you ever use zookeeper_init() with the clientid field set to something other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:

Hi everyone!

I am writing to this group because recently we have been getting some strange errors with our production ZooKeeper setup. From time to time we observe that our client application (C++ based) disconnects from ZooKeeper (the session state changes to 1) and reconnects (state changes to 3). This by itself is not a problem - usually the application continues to run without problems after the reconnect. But from time to time, after the above happens, all subsequent operations start to return the ZSESSIONMOVED error. To make it work again we have to restart the application (which creates a new ZooKeeper session).

I noticed that 3.2.0 introduced a bug, http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using ZooKeeper v3.2.2. I just noticed that the app at compile time used the 3.2.0 library, but the patches fixing bug 449 did not touch the C client lib, so I believe our problems are not related to that.

In the ZooKeeper logs, at the moment which initiated the problem with the client application, I have:

node1:

2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]

node2:

2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init of 0x324dcc1ba580085 valid:true

node3:

2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]

and then lots of entries like this:

2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffe txntype:unknown /locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
    at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffe txntype:unknown /locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
    at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)

To work around the disconnections I am going to increase the session timeout from 5 to 15 seconds, but even if that helps it is just a workaround. Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk
Re: permanent ZSESSIONMOVED
nope. I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <br...@yahoo-inc.com> wrote:
> do you ever use zookeeper_init() with the clientid field set to something
> other than null?

--
Łukasz Osipiuk
mailto:luk...@osipiuk.net
Re: permanent ZSESSIONMOVED
weird, this does sound like a bug. do you have a reliable way of reproducing the problem? thanx ben On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote: nope. I always pass 0 as clientid. Łukasz On Tue, Mar 16, 2010 at 16:20, Benjamin Reedbr...@yahoo-inc.com wrote: do you ever use zookeeper_init() with the clientid field set to something other than null? ben On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote: Hi everyone! I am writing to this group because recently we are getting some strange errors with our production zookeeper setup. From time to time we are observing that our client application (C++ based) disconnects from zookeeper (session state is changed to 1) and reconnects (state changed to 3). This itself is not a problem - usually application continues to run without problems after reconnect. But from time to time after above happens all subsequent operations start to return ZSESSIONMOVED error. To make it work again we have to restart application (which creates new zookeeper session). I noticed that in 3.2.0 introduced a bug http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are using zookeeper v. 3.2.2. I just noticed that app at compile time used 3.2.0 library but patches fixing bug 449 did not touch C client lib so I believe that our problems are not related with that. 
In the ZooKeeper logs, at the moment the problem with the client application started, I have:

node1:
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]

and then lots of entries like this:

2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffe txntype:unknown /locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
 at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffe txntype:unknown /locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
 at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)

To work around the disconnections I am going to increase the session timeout from 5 to 15 seconds, but even if that helps it is only a workaround. Do you have any idea where the source of my problem is? Regards, Łukasz Osipiuk
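A side note for reading these logs. My assumption (from reading SessionTrackerImpl.initializeNextSession, not from documented behavior) is that the most significant byte of a ZooKeeper session id is the myid of the server that originally created the session, so you can recover it with shell arithmetic:

```shell
# Assumption: the high byte of a ZooKeeper session id is the myid of the
# server that minted the session (per SessionTrackerImpl.initializeNextSession).
session_id=0x324dcc1ba580085

# Shift out the low 56 bits, leaving only the creator's server id.
creator=$(( session_id >> 56 ))

echo "session $session_id was created by server with myid=$creator"
```

For the session id in the logs above this gives myid=3, which can help tie a SessionMovedException back to a specific ensemble member.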
Re: permanent ZSESSIONMOVED
Can you verify that you are using 3.2.2 on all servers? You can do this by running the stat command against each of your servers and looking at the very top of the output (we include the version of the zk server there). http://bit.ly/dglVld

Are you using synchronous or async operations in your client?

I see snippets from the server logs, but I don't see anything from your client logs. It would really help if you could zip up all the logs (server and client) and provide them. Patrick

Łukasz Osipiuk wrote: not really - it happens occasionally, every few days :( I believe it is somewhat connected with our network environment, which suffers from some packet loss that leads to connection timeouts. I can switch on some more logging if you can point me to which categories are worth enabling at DEBUG. Regards, Łukasz Osipiuk
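The stat check Patrick suggests can be scripted across the ensemble. A minimal sketch - it assumes the version appears on a line starting with "Zookeeper version:" (as the 3.x servers print it), and node1..node3 are placeholder hostnames:

```shell
# Pull the version line out of 'stat' four-letter-word output.
zk_version() {
  grep -m1 '^Zookeeper version:'
}

# Against a live ensemble this would be (node1..node3 are placeholders):
#   for host in node1 node2 node3; do
#     printf '%s: ' "$host"
#     echo stat | nc "$host" 2181 | zk_version
#   done

# Demo with canned stat output:
printf 'Zookeeper version: 3.2.2-dev\nClients:\n /10.1.112.61:37197[1]\n' | zk_version
```

A version mismatch across nodes (as turned out to be the case in this thread, 3.2.2 vs 3.2.1) shows up immediately.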
Re: permanent ZSESSIONMOVED
On Tue, Mar 16, 2010 at 17:18, Patrick Hunt ph...@apache.org wrote:

Can you verify that you are using 3.2.2 on all servers? You can do this by running the stat command against each of your servers and looking at the very top of the output (we include the version of the zk server there). http://bit.ly/dglVld

Thanks for the hint. We had the 3.2.2 jars installed on all servers - I checked that previously - but apparently our script which restarts the instance after installing the .deb does not work very well, and node2 and node3 were running version 3.2.1. Could using 3.2.1 be the source of the problem?

Are you using synchronous or async operations in your client?

We are only using async operations.

I see snippets from the server logs, but I don't see anything from your client logs. It would really help if you could zip up all the logs (server and client) and provide them.

We do not really have much logging on the application side :(. Failed operations are logged with their error code (ZSESSIONEXPIRED in our case). We also log session state changes, but we noticed only changes between 1 and 3. Unfortunately we do not have logs coming directly from the ZooKeeper library, as it logs to stderr/stdout and we do not store any of that at the moment. Regards, Łukasz
Re: permanent ZSESSIONMOVED
On Tue, Mar 16, 2010 at 19:22, Patrick Hunt ph...@apache.org wrote:

Łukasz Osipiuk wrote: Thanks for the hint. We had the 3.2.2 jars installed on all servers - I checked that previously - but apparently our script which restarts the instance after installing the .deb does not work very well, and node2 and node3 were running version 3.2.1. Could using 3.2.1 be the source of the problem?

I heard this story where customer service for Dell asks customers to replug the mouse because it might be a dusty connection. Invariably, when people look at the back of the box, they find that the mouse is not actually plugged in. Brilliant ideas from a CS perspective. ;-) I think that issue was fixed in 3.2.1, so I don't think the version is the issue.

Damn :(

Are you using synchronous or async operations in your client?

We are only using async operations.

Ok, good info.

I see snippets from the server logs, but I don't see anything from your client logs. It would really help if you could zip up all the logs (server and client) and provide them.

We do not really have much logging on the application side :(. Failed operations are logged with their error code (ZSESSIONEXPIRED in our case). We also log session state changes, but we noticed only changes between 1 and 3. Unfortunately we do not have logs coming directly from the ZooKeeper library, as it logs to stderr/stdout and we do not store any of that at the moment.

I see. In this case I think the log messages themselves are the problem, i.e. we are warning about this, but it's not a real problem. We warn on things that we don't like but can handle; we error on things we don't like and can't handle. MOVED basically means that a client initiated a request on one server, then moved to another server before the request could be processed. In your case you are using async requests, which can actually make this show up more often (more in-flight requests vs sync operation). So if you are not seeing MOVED on the client, this is fine. If you see them in the server log, it means what I said above. Does that make sense, or am I missing something?

It seems what I wrote was not clear. We are noticing the errors on the client side. Once the session gets into the broken state, any operation fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by mistake in my previous post). We get ZSESSIONMOVED from the callback, in the rc parameter. The session gets broken after the ZooKeeper library reconnects to the cluster, although most reconnections leave the session in a working state. Regards, Łukasz
Re: permanent ZSESSIONMOVED
We'll probably need the ZK server/client logs to hunt this down. Can you tell if the MOVED happens in some particular scenario - say you are connected to a follower and move to a leader, or perhaps you are connected to server A, get disconnected, and reconnect to server A? Is there some pattern that could help us understand what's causing this?

The session gets broken after the ZooKeeper library reconnects to the cluster, although most reconnections leave the session in a working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first part about the session being broken after reconnect; what's the second part saying?) Regards, Patrick
Re: permanent ZSESSIONMOVED
On Tue, Mar 16, 2010 at 20:05, Patrick Hunt ph...@apache.org wrote: We'll probably need the ZK server/client logs to hunt this down. Can you tell if the MOVED happens in some particular scenario, say you are connected to a follower and move to a leader, or perhaps you are connected to server A, get disconnected and reconnected to server A? Is there some pattern that could help us understand what's causing this?

When I get to the office tomorrow I will try to investigate the logs, and maybe I will be able to find out what the error scenario is. But I am not sure whether I will be able to find out what the role of each node was when the problem occurred. Does the ZooKeeper server log when a node's state changes between follower and leader, or can I make it log that? Regards, Łukasz -- Łukasz Osipiuk mailto:luk...@osipiuk.net
Re: permanent ZSESSIONMOVED
Yes, if you search back (older entries) in the server log you will be able to see who the leader is - it will say something like LEADING or FOLLOWING - but this may change over time (which is why you need to search back, as I mention) if leadership within the ZK cluster changes (say, due to a networking issue). This is why I push for the logs so strongly - they really will give us much additional insight into the issue. Here's an example from a 5-server ensemble:

ph...@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep LEAD local*/*.log
localhost:2184/zoo.log:2010-03-16 12:50:13,711 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2184:quorump...@632] - LEADING
ph...@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep FOLLOW local*/*.log
localhost:2181/zoo.log:2010-03-16 12:50:13,649 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - FOLLOWING
localhost:2182/zoo.log:2010-03-16 12:50:13,933 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2182:quorump...@620] - FOLLOWING
localhost:2183/zoo.log:2010-03-16 12:50:13,901 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2183:quorump...@620] - FOLLOWING
localhost:2185/zoo.log:2010-03-16 12:50:13,661 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:quorump...@620] - FOLLOWING

Additionally, if you use the stat four-letter word you will see the current status of the server, leader or follower. (JMX as well.) You might also find this useful: http://github.com/phunt/zktop Patrick
Re: permanent ZSESSIONMOVED
Agree. Patrick

Ted Dunning wrote: This is very different from most uses of ZK. Normally ZooKeeper is used on a private network with almost no packet loss. Your high rate of loss may be tickling a protocol bug that other people just never see.

On Tue, Mar 16, 2010 at 12:24 PM, Łukasz Osipiuk luk...@osipiuk.net wrote: We get reconnects quite often due to large packet loss in our network, but most of the time they do not break the session.
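To put a number on the packet loss being discussed, ping's summary line can be parsed between ensemble nodes. A sketch assuming the Linux iputils summary format ("... 2% packet loss, ..."); node names are placeholders:

```shell
# Pull the "% packet loss" figure out of a ping summary line.
loss_pct() {
  awk -F, '/packet loss/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /packet loss/) { gsub(/[^0-9.]/, "", $i); print $i }
  }'
}

# Against the real network this would be (node2 is a placeholder):
#   ping -c 100 node2 | loss_pct

# Demo with a canned iputils summary line:
echo '100 packets transmitted, 98 received, 2% packet loss, time 99014ms' | loss_pct
```

Run between each pair of nodes, this gives a quick baseline for how lossy the "huge packet drop" periods actually are.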
Re: permanent ZSESSIONMOVED
It will be good to see the logs; however, I had one additional thought. The leader (the ZK leader) is the one checking for session MOVED. It keeps track of which server the session is currently attached to, and will throw the moved exception if the session proposes a request through a server other than the one the leader thinks is the owner. I'm wondering, if/when you see this again, what would happen if you restarted the server that the session is attached to (use netstat on the client to find it). The client will re-attach to the cluster; I'm wondering if this would fix the problem, rather than restarting the client as you have been doing. Not sure if you can try this (production env?) but it would be an interesting additional data point if you can give it a try. Regards, Patrick
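Patrick's netstat suggestion might look like this on the client box; it assumes the ensemble listens on 2181 and that netstat -tn prints the foreign address in column 5 and the state in column 6 (the usual Linux layout):

```shell
# Show which ZooKeeper server this client currently holds a TCP connection to.
zk_peer() {
  awk '$6 == "ESTABLISHED" && $5 ~ /:2181$/ { print $5 }'
}

# On the client host this would be:
#   netstat -tn | zk_peer

# Demo with a canned netstat -tn line (addresses taken from the thread's logs):
printf 'tcp 0 0 10.1.112.61:48336 10.1.112.63:2181 ESTABLISHED\n' | zk_peer
```

That identifies the server to restart for the experiment Patrick describes.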
Re: permanent ZSESSIONMOVED
Hmm... this inspires me to have a thought as well. Łukasz, there isn't any fancy network stuff going on here, is there? No NATing, fancy load balancing, or reassignment of the servers' IP addresses, right?