[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934202#action_12934202 ]

Flavio Junqueira commented on ZOOKEEPER-335:

Radu,

It sounds like the problem you mention has been resolved in ZOOKEEPER-790. I'm not sure which version you're using, but perhaps you should consider moving to 3.3.2.

zookeeper servers should commit the new leader txn to their logs.

Key: ZOOKEEPER-335
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
Project: Zookeeper
Issue Type: Bug
Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
Fix For: 3.4.0
Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz, ZOOKEEPER-790.travis.log.bz2

Currently the zookeeper followers do not commit the new leader election. This will cause problems in a failure scenario where a follower acks the same leader txn id twice, possibly to two different intermittent leaders, allowing them to propose two different txns with the same zxid.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888684#action_12888684 ]

Travis Crawford commented on ZOOKEEPER-335:

Unfortunately I still observed the Leader epoch issue and needed to manually force a leader election for the cluster to recover. This test was performed with the following base+patches, applied in the order listed.

Zookeeper 3.3.1
ZOOKEEPER-744
ZOOKEEPER-790

{code}
2010-07-15 02:43:57,181 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.231ac2
2010-07-15 02:43:57,384 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 154618826848
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 154618826848, 4, 1, LOOKING, LOOKING, 1
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, LEADING, 2
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, FOLLOWING, 3
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2
2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24
2010-07-15 02:43:57,387 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-07-15 02:43:57,387 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{code}

I followed the recipe @vishal provided for recreating.

(a) Stop one follower in a three node cluster
(b) Get some tea while it falls behind
(c) Start the node stopped in (a).

These timestamps show where the follower was stopped. It also shows when it was turned back on.

{code}
2010-07-15 02:35:36,398 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562
2010-07-15 02:39:18,907 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
{code}

This timestamp is the first ``Leader epoch`` line. Everything between these two points will be the interesting bits.

{code}
2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24
{code}
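The decimal zxids in the election notifications above can be related to the epochs in the FATAL line with a little arithmetic. The following is a minimal sketch, assuming the usual zxid layout (epoch in the high 32 bits, per-epoch counter in the low 32 bits) and assuming the FATAL message prints epochs in hexadecimal. The class and method names are made up for illustration and are not part of ZooKeeper.

```java
public class ZxidParts {
    // Assumption: high 32 bits of a zxid hold the epoch.
    public static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Assumption: low 32 bits of a zxid hold the per-epoch counter.
    public static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        // Values copied from the notifications in the log above:
        // the restarted follower's proposed zxid and the zxid voted for by the other peers.
        long[] zxids = {154618826848L, 146030952153L};
        for (long z : zxids) {
            System.out.println(z + " -> epoch 0x" + Long.toHexString(epoch(z))
                    + ", counter 0x" + Long.toHexString(counter(z)));
        }
    }
}
```

Under those assumptions the follower's proposed zxid decodes to epoch 0x24, which lines up with "our epoch 24", while the zxid carried in the other peers' votes decodes to epoch 0x22, one below the "Leader epoch 23" reported by the follower.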
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881168#action_12881168 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-335:

Thanks for the detailed assessment, Vishal.

In Step b, the fact that the process believes it is the leader is not a problem; it happens because we queue notification messages during leader election. The real issue is that the leader code sets the last processed zxid to the first zxid of the new epoch before it has connected to a quorum of followers (Leader.java:281). The follower code, in turn, throws an IOException (Follower.java:73) if the leader epoch is smaller than its own. So when the false leader drops leadership and becomes a follower, it finds a leader with a smaller epoch and kills itself.

I noticed that this follower check was not there before (it is not present in the 3.0 branch), and it might have been introduced when we did the observer reorganization.

For now I propose that we move line Leader.java:281 to Leader.java:470. This simply moves the point at which we set the last processed zxid to one at which we know that a quorum of followers supports the leader. I reasoned a bit about it and verified that tests pass.

A patch for the change I'm proposing is trivial, but a unit test will require some work, so I'd rather hear opinions first. Also, please note that this problem is not related to the topic of this jira, so we might consider working on a different jira from this point on.
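The interaction Flavio describes can be modelled in a few lines. This is only a sketch under stated assumptions: it is not the code at Leader.java:281 or Follower.java:73, the class and method names are hypothetical, and the zxid layout (epoch in the high 32 bits) is assumed. It contrasts setting the new-epoch zxid before a quorum is known with setting it only after a quorum supports the leader.

```java
import java.io.IOException;

public class EpochCheckSketch {
    public static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Hypothetical stand-in for the follower-side check: refuse a leader
    // whose epoch is lower than the epoch of our own last zxid.
    public static void checkLeaderEpoch(long newLeaderZxid, long lastLoggedZxid) throws IOException {
        if (epochOf(newLeaderZxid) < epochOf(lastLoggedZxid)) {
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    // Hypothetical model of a peer that tries to lead. If the zxid is bumped
    // before a quorum is known, the new epoch sticks even when leadership fails.
    public static long lastZxidAfterLeadAttempt(long lastZxid, boolean bumpBeforeQuorum, boolean quorumReached) {
        long firstOfNewEpoch = (epochOf(lastZxid) + 1) << 32;
        if (bumpBeforeQuorum || quorumReached) {
            return firstOfNewEpoch;
        }
        return lastZxid;
    }

    public static void main(String[] args) {
        long realLeaderZxid = 7L << 32;        // established leader, epoch 7
        long peerZxid = (7L << 32) | 5;        // restarted peer, also epoch 7

        long early = lastZxidAfterLeadAttempt(peerZxid, true, false);
        long late = lastZxidAfterLeadAttempt(peerZxid, false, false);

        try {
            checkLeaderEpoch(realLeaderZxid, late);
            System.out.println("bump after quorum: peer can follow the real leader");
            checkLeaderEpoch(realLeaderZxid, early);
        } catch (IOException e) {
            System.out.println("bump before quorum: " + e.getMessage());
        }
    }
}
```

In this model the peer that bumped its zxid early ends up in epoch 8 and rejects the epoch-7 leader, which is the shape of the "Leader epoch 7 is less than our epoch 8" message reported elsewhere in this thread.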
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881236#action_12881236 ]

Patrick Hunt commented on ZOOKEEPER-335:

Vishal, if Flavio provides you with a patch could you apply it and verify with your configuration?

Flavio, please provide an initial patch that people could use to verify. We'll hold off on a release until you add the test(s), but this would be great to start with.

Thanks all for helping to track this down! I'd like to fast track a 3.3.2 release, so if possible please make this a priority.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881244#action_12881244 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-335:

I have created a new jira for this issue: ZOOKEEPER-790. There is a patch there.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881280#action_12881280 ]

Vishal K commented on ZOOKEEPER-335:

I will try out the patch. FYI I am using 3.3.0.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880917#action_12880917 ]

Patrick Hunt commented on ZOOKEEPER-335:

vishal comment on list:

I might be wrong here, but let me try to chip in my few cents. I think the problem is in LearnerHandler.java at the leader for this Follower.

    /* see what other packets from the proposal
     * and tobeapplied queues need to be sent
     * and then decide if we can just send a DIFF
     * or we actually need to send the whole snapshot
     */
    long leaderLastZxid = leader.startForwarding(this, updates);
    // <--- this leaderLastZxid returned is probably incorrect.

    // a special case when both the ids are the same
    if (peerLastZxid == leaderLastZxid) {
        packetToSend = Leader.DIFF;
        zxidToSend = leaderLastZxid;
    }
    QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER, leaderLastZxid, null, null);
    oa.writeRecord(newLeaderQP, "packet");
    bufferedOutput.flush();
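The decision that the quoted comment alludes to (send a DIFF, or send the whole snapshot) can be sketched as follows. This is a simplified illustration, not the LearnerHandler code: the committed-log window parameters and all names are hypothetical, and only the "both ids are the same" special case is taken from the snippet above.

```java
public class SyncChoiceSketch {
    public enum SyncPacket { DIFF, SNAP }

    // Hypothetical decision: a DIFF is enough when the peer is already at the
    // leader's last zxid, or when its last zxid falls inside an assumed window
    // of recently committed proposals that the leader can replay.
    public static SyncPacket choose(long peerLastZxid, long leaderLastZxid,
                                    long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid == leaderLastZxid) {
            return SyncPacket.DIFF;   // special case from the quoted snippet
        }
        if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return SyncPacket.DIFF;   // assumption: leader can replay the gap
        }
        return SyncPacket.SNAP;       // otherwise ship the whole snapshot
    }

    public static void main(String[] args) {
        long leaderLast = (3L << 32) | 10;
        System.out.println(choose(leaderLast, leaderLast, (3L << 32) | 1, leaderLast));
        System.out.println(choose((2L << 32) | 4, leaderLast, (3L << 32) | 1, leaderLast));
    }
}
```

If the value returned as leaderLastZxid were wrong, as the comment suspects, the comparison in the first branch would be made against the wrong zxid, which is why that line was singled out.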
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880918#action_12880918 ]

Patrick Hunt commented on ZOOKEEPER-335:

vishal comment on list:

Nevermind. I am on the wrong track. Flavio's earlier mail did clarify that the follower received the epoch before restart.
Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
Please use the JIRA for followups, otherwise it's hard to track progress/status. Thanks.

Patrick

On 06/18/2010 04:45 PM, Vishal K wrote:

Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the misbehaving follower:

Problem 1: Expected zxid is incorrect

=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen earlier

=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner - Leader epoch 7 is less than our epoch 8

-Vishal
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880919#action_12880919 ]

Patrick Hunt commented on ZOOKEEPER-335:

Vishal comment on list:

Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the misbehaving follower:

Problem 1: Expected zxid is incorrect

=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1

Note expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen earlier

=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner - Leader epoch 7 is less than our epoch 8

-Vishal
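Vishal's guess about lastQueued can be illustrated with a small model of a "next zxid must be last + 1" check. This is a sketch under assumptions, not the Learner code: the class is hypothetical, the field name lastQueued is borrowed from the message above, and the zxid used in main is an illustrative value for epoch 3, counter 2.

```java
public class ProposalOrderCheck {
    // Assumption: starts at 0 after a (re)start, before any proposal is queued.
    private long lastQueued = 0;

    // Returns the warning text if the proposal is out of sequence, else null.
    public String onProposal(long zxid) {
        String warning = null;
        if (zxid != lastQueued + 1) {
            warning = "Got zxid 0x" + Long.toHexString(zxid)
                    + " expected 0x" + Long.toHexString(lastQueued + 1);
        }
        lastQueued = zxid;
        return warning;
    }

    public static void main(String[] args) {
        ProposalOrderCheck check = new ProposalOrderCheck();
        long first = (3L << 32) | 2;                      // hypothetical: epoch 3, counter 2
        System.out.println(check.onProposal(first));      // warns: expected 0x1
        System.out.println(check.onProposal(first + 1));  // in sequence: null
    }
}
```

With such a check, the first proposal seen after every restart would be compared against 0x1 regardless of the current epoch, which would be consistent with "expected 0x1" appearing once per epoch in the log. Whether that warning is related to the epoch failure is not established by this sketch.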
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881028#action_12881028 ]

Vishal K commented on ZOOKEEPER-335:

Hi,

I enabled tracing and did some more debugging. It looks like the restarted peer (the one trying to join the cluster) determines that it is a leader and increments its epoch. However, the rest of the nodes don't acknowledge this node as the leader and hence have an older epoch.

I will attach the log. Unfortunately, I don't have traces from other nodes. I will repeat the experiment later and attach logs from other nodes.

Scenario:
- Form a 3 node cluster. This is not just a ZK cluster. It also involves our application cluster that uses ZK.
- Kill one of the followers
- After a minute or so restart the follower
- Follower rejects leader with "Leader epoch y is less than our epoch y + 1"

From logs:

a) Peer X restarts and starts leader election.

a) For a small window of time, X thinks that it is the new leader! During this window, for some reason, the rest of the nodes tell X that they are also trying to find a leader. I.e., all 3 nodes are in LOOKING state. After seeing that all 3 nodes are in LOOKING state, X decides to be a leader?

2010-06-20 23:22:46,421 - DEBUG [WorkerSender Thread:quorumcnxmana...@346] - Opening channel to server 1
2010-06-20 23:22:46,423 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,424 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 0, 77309411393, 1, 0, LOOKING, LOOKING, 0
2010-06-20 23:22:46,424 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 0, proposed id: 0, zxid: 77309411393, proposed zxid: 77309411393
2010-06-20 23:22:46,424 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 0, Proposed leader = 0, Porposed zxid = 77309411393, Proposed epoch = 1
2010-06-20 23:22:46,426 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (1, 0)
2010-06-20 23:22:46,426 - DEBUG [WorkerSender Thread:quorumcnxmana...@346] - Opening channel to server 2
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] - Connection request /192.168.1.182:46701
2010-06-20 23:22:46,427 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] - Connection request: 0
2010-06-20 23:22:46,428 - DEBUG [Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 1
2010-06-20 23:22:46,428 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (2, 0)
2010-06-20 23:22:46,431 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,432 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 1, 77309411372, 1, 0, LOOKING, LOOKING, 1
2010-06-20 23:22:46,432 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 1, proposed id: 0, zxid: 77309411372, proposed zxid: 77309411393
2010-06-20 23:22:46,432 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 1, Proposed leader = 1, Porposed zxid = 77309411372, Proposed epoch = 1
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@445] - Connection request /192.168.1.183:44310
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$liste...@448] - Connection request: 0
2010-06-20 23:22:46,436 - DEBUG [Thread-1:quorumcnxmanager$sendwor...@504] - Address of remote peer: 2
2010-06-20 23:22:46,440 - DEBUG [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@214] - Receive new notification message. My id = 0
2010-06-20 23:22:46,440 - INFO [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@689] - Notification: 2, 7301097, 1, 0, LOOKING, LOOKING, 2
2010-06-20 23:22:46,440 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@495] - id: 2, proposed id: 0, zxid: 7301097, proposed zxid: 77309411393
2010-06-20 23:22:46,441 - DEBUG [QuorumPeer:/0.0.0.0:2181:fastleaderelect...@717] - Adding vote: From = 2, Proposed leader = 2, Porposed zxid = 7301097, Proposed epoch = 1
2010-06-20 23:22:46,441 - INFO [QuorumPeer:/0.0.0.0:2181:quorump...@647] - LEADING

b) As a result X increments its epoch. Worse, since this node decided to be a leader, it starts doing transactions. The first set of transactions starts removing all ephemeral nodes. But these transactions are only done locally. Other peers do not ack these transactions since they know that this peer is not the leader.

c) After a few seconds (8 secs), X relinquishes leadership since it does not receive any ack from rest of
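One way to read the vote lines in the log above is through the ordering that a leader election round applies to competing votes. The sketch below assumes the common rule "higher zxid wins, ties broken by higher server id"; the class and method names are hypothetical and the vote values are copied as they appear in the log.

```java
public class VoteOrderSketch {
    // Assumed ordering: a new vote supersedes the current one if it carries a
    // higher zxid, or the same zxid and a higher server id.
    public static boolean supersedes(long newId, long newZxid, long curId, long curZxid) {
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }

    // Each vote is {serverId, zxid}; returns the server id of the best vote.
    public static long pickLeader(long[][] votes) {
        long bestId = votes[0][0];
        long bestZxid = votes[0][1];
        for (long[] v : votes) {
            if (supersedes(v[0], v[1], bestId, bestZxid)) {
                bestId = v[0];
                bestZxid = v[1];
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        long[][] votes = {
            {0, 77309411393L},   // X's own vote
            {1, 77309411372L},   // vote received from server 1
            {2, 7301097L},       // vote received from server 2, as logged
        };
        System.out.println("preferred leader: " + pickLeader(votes));
    }
}
```

Under that rule X (id 0) holds the highest zxid among the three logged votes, so none of the received votes makes it change its proposal. This only illustrates why X keeps proposing itself; it does not by itself explain the jump to LEADING.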
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880202#action_12880202 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-335:

Mike,

There is one thing I don't understand. From the logs, it looks like servers 1 and 3 are proposing a zxid of 0 (second field of notification) during election, which makes me think that they had no state at all:

{noformat}
2010-06-17 14:35:40,714 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 2, 8589934884, 2, 2, LOOKING, LOOKING, 2
2010-06-17 14:35:40,714 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 0, 1, 2, LOOKING, FOLLOWING, 1
2010-06-17 14:35:40,714 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 0, 1, 2, LOOKING, LEADING, 3
{noformat}

Server 2 on the other hand had accepted updates based on the zxid it proposes. Were they supposed to have no state at all? Have you deleted your logs and snapshots before restarting the servers?
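For reading these lines in bulk, a small parser can pull the fields apart. Only the meaning of the second field (the proposed zxid) is stated in the comment above; the names given to the other six fields below are assumptions inferred from how the values line up across the logs in this thread, and the class itself is hypothetical.

```java
public class NotificationLine {
    public final long proposedLeader;   // assumption: first field
    public final long zxid;             // second field, per the comment above
    public final long round;            // assumption: election round / logical clock
    public final long myId;             // assumption: id of the peer writing the log
    public final String myState;        // assumption
    public final String senderState;    // assumption
    public final long senderId;         // assumption

    private NotificationLine(String[] f) {
        proposedLeader = Long.parseLong(f[0]);
        zxid = Long.parseLong(f[1]);
        round = Long.parseLong(f[2]);
        myId = Long.parseLong(f[3]);
        myState = f[4];
        senderState = f[5];
        senderId = Long.parseLong(f[6]);
    }

    public static NotificationLine parse(String line) {
        String marker = "Notification:";
        int i = line.indexOf(marker);
        if (i < 0) {
            throw new IllegalArgumentException("not a notification line");
        }
        String[] f = line.substring(i + marker.length()).trim().split("\\s*,\\s*");
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + f.length);
        }
        return new NotificationLine(f);
    }

    public static void main(String[] args) {
        NotificationLine n = parse("- Notification: 3, 0, 1, 2, LOOKING, LEADING, 3");
        System.out.println("sender " + n.senderId + " (" + n.senderState
                + ") votes for " + n.proposedLeader + " with zxid " + n.zxid);
    }
}
```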
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880320#action_12880320 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-335:

Guys, I don't see enough information in these logs to determine what's going on. Let me tell you what I'm seeing so that perhaps other folks can help me out here. One part of the log that is suspicious is this one:

{noformat}
=6693 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30001 expected 0x1
=6693 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30001 expected 0x1
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32]
* NODE RESTARTED HERE **
{noformat}

Before being restarted, the bad node receives a proposal with zxid 3,1 and it expects 0,1. Next in the logs after being restarted, I can see that it is complaining that it has epoch 4 and the leader 3. Something strange apparently happened during the restart. It also seems to be the case that the node was able to talk to the others (first entries in the log before the excerpt above). Do you guys see anything I'm overlooking?
Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
I might be wrong here, but let me try to chip in my few cents. I think the problem is in LearnerHandler.java at the leader fo this Follower. /* see what other packets from the proposal * and tobeapplied queues need to be sent * and then decide if we can just send a DIFF * or we actually need to send the whole snapshot */ long leaderLastZxid = leader.startForwarding(this, updates); --- this leaderLastZxid returned is probably incorrect. // a special case when both the ids are the same if (peerLastZxid == leaderLastZxid) { packetToSend = Leader.DIFF; zxidToSend = leaderLastZxid; } QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER, leaderLastZxid, null, null); oa.writeRecord(newLeaderQP, packet); bufferedOutput.flush() On Fri, Jun 18, 2010 at 4:49 PM, Flavio Paiva Junqueira (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880320#action_12880320] Flavio Paiva Junqueira commented on ZOOKEEPER-335: -- Guys, I don't see enough information in these logs to determine what's going on. Let me tell you what I'm seeing so that perhaps other folks can help me out here. 
One part of the log that is suspicious is this one: {noformat} =6693 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30001 expected 0x1 =6693 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30001 expected 0x1 [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor27] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor22] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor23] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor18] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor20] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor19] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor21] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor26] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor25] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor24] [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32] * NODE RESTARTED HERE ** {noformat} Before being restarted, the bad node receives a proposal with zxid 3,1 and it expects 0,1. Next in the logs after being restarted, I can see that it is complaining that it has epoch 4 and the leader 3. Something strange apparently happened during the restart. It also seems to be the case that the node was being able to talk to the others (first entries in the log before the excerpt above). Do you guys see anything I'm overlooking? 
zookeeper servers should commit the new leader txn to their logs.
-----------------------------------------------------------------

Key: ZOOKEEPER-335
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
Project: Zookeeper
Issue Type: Bug
Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
Fix For: 3.4.0
Attachments: zk.log.gz, zklogs.tar.gz

Currently the zookeeper followers do not commit the new leader election. This will cause problems in failure scenarios where a follower acks the same leader txn id twice, possibly to two different intermittent leaders, allowing them to propose two different txns with the same zxid.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
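The LearnerHandler.java fragment quoted in the message above only shows one branch of the sync decision: send a DIFF when the peer's last zxid equals the leader's. The sketch below is a deliberately simplified model of that one comparison, to make clear why a wrong `leaderLastZxid` would matter. It assumes, based only on the quoted comment, that the fallback is a whole snapshot; the real class may handle further cases that the fragment does not show. `SyncChoice`, `Packet` and `choose` are hypothetical names.

```java
// Simplified model of the quoted fragment, not the actual LearnerHandler logic.
final class SyncChoice {
    private SyncChoice() {}

    // The two options named in the quoted comment.
    enum Packet { DIFF, SNAP }

    static Packet choose(long peerLastZxid, long leaderLastZxid) {
        // "a special case when both the ids are the same"
        if (peerLastZxid == leaderLastZxid) {
            return Packet.DIFF;
        }
        // Assumption: anything else falls back to sending the whole snapshot.
        return Packet.SNAP;
    }

    public static void main(String[] args) {
        long peer = 0x300000001L;
        // Same last zxid on both sides: a DIFF is enough.
        System.out.println(choose(peer, 0x300000001L)); // prints: DIFF
        // If the leader reports a different last zxid, the peer gets a snapshot.
        System.out.println(choose(peer, 0x300000002L)); // prints: SNAP
    }
}
```

In this model an incorrect `leaderLastZxid` affects both the choice of packet and the zxid carried by the NEWLEADER packet, which is the concern raised in the message.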
Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
Nevermind. I am on the wrong track. Flavio's earlier mail did clarify that the follower received the epoch before restart.
Re: [jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
Hi Flavio,

I have 3 sets of logs and they all seem to indicate two problems on the misbehaving follower:

Problem 1: Expected zxid is incorrect

=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x30002 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=2495 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x40001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=191617 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x50001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=0[QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x60001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1
=245016 [QuorumPeer:/0.0.0.0:2181] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x70001 expected 0x1

Note that the expected zxid is always 0x1 (lastQueued is always 0?)

Problem 2: While joining the cluster, the expected epoch is 1 higher than seen earlier

=14991 [QuorumPeer:/0.0.0.0:2181] FATAL org.apache.zookeeper.server.quorum.Learner - Leader epoch 7 is less than our epoch 8

-Vishal
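The FATAL line reported under Problem 2 in this message is consistent with a follower comparing the leader's epoch against the epoch of its own last logged zxid and refusing to follow when the leader is behind. A minimal sketch of such a check follows, assuming the epoch is the high 32 bits of the zxid. `EpochCheck` and `requireLeaderNotBehind` are hypothetical names, and the exception text is modelled on the log line rather than copied from the ZooKeeper source.

```java
import java.io.IOException;

// Illustrative safety check, not ZooKeeper's own implementation.
final class EpochCheck {
    private EpochCheck() {}

    // Assumption: the epoch occupies the high 32 bits of a zxid.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // Throws if the leader's epoch is lower than the epoch of our last logged zxid.
    static void requireLeaderNotBehind(long leaderZxid, long lastLoggedZxid) throws IOException {
        long leaderEpoch = epochOf(leaderZxid);
        long ourEpoch = epochOf(lastLoggedZxid);
        if (leaderEpoch < ourEpoch) {
            throw new IOException("Leader epoch " + Long.toHexString(leaderEpoch)
                    + " is less than our epoch " + Long.toHexString(ourEpoch));
        }
    }

    public static void main(String[] args) {
        long leaderZxid = 7L << 32;            // leader is in epoch 7
        long lastLoggedZxid = (8L << 32) | 1L; // follower has logged something in epoch 8
        try {
            requireLeaderNotBehind(leaderZxid, lastLoggedZxid);
        } catch (IOException e) {
            // prints: Leader epoch 7 is less than our epoch 8
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model the follower can only end up one epoch ahead if its log contains a zxid from an epoch that the current leader never established, which matches the "1 higher than seen earlier" observation.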
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879967#action_12879967 ]

Mike Solomon commented on ZOOKEEPER-335:

I am having this exact issue, but I am not upgrading. I am merely restarting the cluster.

I have a cluster of three. I took down host1 and verified that my application remained and reconnected to host2 and host3. With host1 back online, I took down host2. I noticed that the java process was spinning over 100% CPU and realized it had not come back up.

This is running the 3.3.0 JAR release on a dual proc, quad-core Intel box. I'm running SuSE 10.3, 64-bit, with this version of java:

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)

I will attach a log file.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880001#action_12880001 ]

Patrick Hunt commented on ZOOKEEPER-335:

Thanks for the log Mike. This issue does seem similar to what Charity reported:

2010-06-17 14:35:34,263 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 1 is less than our epoch 2

Unfortunately the attached log shows information only after the problem occurred. Any chance you could upload the logs from the initial event, that is, from when the problem originally started? The logs from the other servers in the ensemble (again, from the time the problem originally occurred) would also really help. Thanks.

Have you been able to clear the problem? It's fairly straightforward to resolve. Charity resolved it by:

1) bringing down the failing server,
2) clearing the data directory of that server (only),
3) starting that server.

You only want to do this for the server that's unable to rejoin the quorum, i.e. the one that's outputting "Leader epoch 1 is less than our epoch 2", _not_ for all servers in the ensemble.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879520#action_12879520 ]

Vishal K commented on ZOOKEEPER-335:

Hi,

We are running into this bug very often (almost 60-75% hit rate) while testing our newly developed application over ZK. This is almost a blocker for us. Would the fix be simpler if backward compatibility were not an issue?

Thanks.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879548#action_12879548 ]

Patrick Hunt commented on ZOOKEEPER-335:

We are unable to reproduce this issue. If you can provide the server logs (all servers) and attach them to this jira it would be very helpful.

Some detail on the approx time of the issue so we can correlate to the logs would help too (summary of what you did/do to cause it, etc... anything that might help us nail this one down).

Detail on ZK version, OS, Java version, HW info, etc... would also be of use to us.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874888#action_12874888 ]

Charity Majors commented on ZOOKEEPER-335:

I ran into this bug this morning, but it also seemed to put my cluster into an unusable state. The cluster stopped accepting all connections, until I restarted node one. After node one departed the cluster, nodes two and three formed a quorum and started serving again. Node one was unable to rejoin, and had this error:

2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower

until I cleared the data directory and restarted again.