[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888688#action_12888688 ]
Travis Crawford commented on ZOOKEEPER-790:
-------------------------------------------

I accidentally posted this in ZOOKEEPER-335 -- reposting here. Sorry for the posting mixup -- the content is correct though :)

Unfortunately I still observed the "Leader epoch" issue and needed to manually force a leader election for the cluster to recover. This test was performed with the following base+patches, applied in the order listed.

Zookeeper 3.3.1
ZOOKEEPER-744
ZOOKEEPER-790

{code}
2010-07-15 02:43:57,181 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.2300001ac2
2010-07-15 02:43:57,384 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 154618826848
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 154618826848, 4, 1, LOOKING, LOOKING, 1
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, LEADING, 2
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, FOLLOWING, 3
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2
2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24
2010-07-15 02:43:57,387 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-07-15 02:43:57,387 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{code}

I followed the recipe @vishal provided for reproducing the issue:

(a) Stop one follower in a three node cluster
(b) Get some tea while it falls behind
(c) Start the node stopped in (a)

These timestamps show when the follower was stopped and when it was turned back on.

{code}
2010-07-15 02:35:36,398 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 0x229aa13cfc6276b with negotiated timeout 10000 for client /10.209.45.114:34562
2010-07-15 02:39:18,907 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
{code}

This timestamp is the first ``Leader epoch`` line. Everything between the restart at 02:39:18 and this line should be the interesting part.

{code}
2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24
{code}

> Last processed zxid set prematurely while establishing leadership
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-790
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
> Project: Zookeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.3.1
> Reporter: Flavio Paiva Junqueira
> Assignee: Flavio Paiva Junqueira
> Priority: Blocker
> Fix For: 3.3.2, 3.4.0
>
> Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2
>
>
> The leader code is setting the last processed zxid to the first of the new
> epoch even before connecting to a quorum of followers.
> Because the leader code sets this value before connecting to a quorum of
> followers (Leader.java:281) and the follower code throws an IOException
> (Follower.java:73) if the leader epoch is smaller, when the false leader
> drops leadership and becomes a follower, it finds a smaller epoch and
> kills itself.
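
For anyone reading the zxids in the election notifications: a zxid is a 64-bit value with the epoch in its high 32 bits and a per-epoch counter in its low 32 bits. The sketch below splits the two zxids from the log and repeats the kind of comparison the description attributes to Follower.java:73. It is not the ZooKeeper source. The class and method names (`ZxidSketch`, `epochOf`, `counterOf`, `checkLeaderEpoch`) are hypothetical, and it assumes the epochs in the FATAL line are printed in hexadecimal, which matches the decoded values.

```java
import java.io.IOException;

// Hypothetical helper for reading the log; not part of ZooKeeper.
final class ZxidSketch {
    // The high 32 bits of a zxid hold the epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // The low 32 bits hold the counter within that epoch.
    static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Sketch of the follower-side check described in the issue:
    // refuse to follow a leader whose epoch is lower than our own.
    static void checkLeaderEpoch(long leaderEpoch, long ourEpoch) throws IOException {
        if (leaderEpoch < ourEpoch) {
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    static String describe(long zxid) {
        return "epoch 0x" + Long.toHexString(epochOf(zxid))
                + ", counter 0x" + Long.toHexString(counterOf(zxid));
    }

    public static void main(String[] args) {
        long proposedByNode1 = 154618826848L; // "Proposed zxid" in the log
        long votedByOthers = 146030952153L;   // zxid in the LEADING/FOLLOWING notifications

        System.out.println("node 1: " + describe(proposedByNode1));
        System.out.println("others: " + describe(votedByOthers));

        try {
            // Epoch 0x23 is taken from the FATAL log line; ours comes from the decoded zxid.
            checkLeaderEpoch(0x23L, epochOf(proposedByNode1));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Decoded this way, the restarted node's proposed zxid is already in epoch 0x24, while the zxid carried by the other two servers' notifications is in epoch 0x22, so the node rejects a leader announcing epoch 0x23.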