[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888304#action_12888304 ]
Flavio Paiva Junqueira commented on ZOOKEEPER-790:
--------------------------------------------------

Hi Vishal,

Thanks for all the information. I haven't been able to reproduce it yet, but here are some thoughts after looking over your logs again:

1- It is not a problem that server 0 is declaring itself leader, even though there is another leader running. Server 0 will be ignored by the others and eventually will drop its leadership, as you have observed;
2- The notifications from 1 and 2 say looking because they were queued while 1 and 2 were looking for a leader. That's not an issue;
3- I don't understand why the patch doesn't work. Let me tell you how I'm interpreting your run. Server 0 is receiving the notifications from 1 and 2, and deciding that it should be the leader. Because in the current trunk code we set the first zxid for the new epoch before hearing from a quorum, once server 0 drops leadership, it has a higher zxid than everyone else. Consequently, it correctly refuses to talk to the current leader.

Now, setting the first epoch zxid prematurely is a problem, and the patch I have uploaded should fix it. The bottom line is that I can't understand why the patch I uploaded does not fix it. Have you made sure to apply it before running your new tests? Either way, I would appreciate it if you could upload logs from a run with the current 790 patch.

Thanks!

> Last processed zxid set prematurely while establishing leadership
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-790
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.1
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Flavio Paiva Junqueira
>            Priority: Blocker
>             Fix For: 3.3.2, 3.4.0
>
>         Attachments: ZOOKEEPER-790.patch
>
>
> The leader code is setting the last processed zxid to the first of the new
> epoch even before connecting to a quorum of followers. Because the leader
> code sets this value before connecting to a quorum of followers
> (Leader.java:281) and the follower code throws an IOException
> (Follower.java:73) if the leader epoch is smaller, we have that when the
> false leader drops leadership and becomes a follower, it finds a smaller
> epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
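
The failure described in the issue can be illustrated with a small Java sketch. This is not the code in Leader.java or Follower.java, and it is not the attached patch: the class name, method names, and the `bumpBeforeQuorum` flag are hypothetical and exist only for illustration. It assumes the usual ZooKeeper zxid layout, with the epoch in the high 32 bits and a counter in the low 32 bits.

```java
import java.io.IOException;

// Illustrative model only; all names here are hypothetical, not ZooKeeper APIs.
public class PrematureZxidSketch {

    // Assumed layout: epoch in the high 32 bits, counter in the low 32 bits.
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Models a server that believes it is leader. With bumpBeforeQuorum set
    // (the behaviour the issue describes), the last processed zxid moves to
    // the first zxid of the next epoch even though no quorum has connected.
    public static long startLeading(long lastProcessedZxid,
                                    boolean quorumConnected,
                                    boolean bumpBeforeQuorum) {
        long firstOfNewEpoch = makeZxid(epochOf(lastProcessedZxid) + 1, 0);
        if (bumpBeforeQuorum || quorumConnected) {
            return firstOfNewEpoch;
        }
        return lastProcessedZxid; // no quorum yet: leave the zxid untouched
    }

    // Models the follower-side check: refuse a leader whose epoch is smaller
    // than our own last processed epoch.
    public static void followLeader(long leaderZxid, long ownLastProcessedZxid)
            throws IOException {
        if (epochOf(leaderZxid) < epochOf(ownLastProcessedZxid)) {
            throw new IOException("leader epoch " + epochOf(leaderZxid)
                    + " is smaller than own epoch " + epochOf(ownLastProcessedZxid));
        }
    }

    public static void main(String[] args) {
        long realLeader = makeZxid(4, 12); // established leader, epoch 4
        long server0 = makeZxid(4, 10);    // server 0 before claiming leadership

        long premature = startLeading(server0, false, true);
        long deferred = startLeading(server0, false, false);

        for (long own : new long[] {premature, deferred}) {
            try {
                followLeader(realLeader, own);
                System.out.println("own epoch " + epochOf(own) + ": follows the leader");
            } catch (IOException e) {
                System.out.println("own epoch " + epochOf(own) + ": " + e.getMessage());
            }
        }
    }
}
```

In this model, the server that bumped its zxid without a quorum ends up one epoch ahead of the real leader and rejects it, while the server that deferred the update can rejoin as a follower.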