This looks a bit like a small bobble we had when upgrading a while ago. I THINK that the answer here is to mind-wipe the misbehaving node and have it resynch from scratch from the other nodes.
Wait for confirmation from somebody real.

On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors <char...@shopkick.com> wrote:

> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
> attempt to get away from a client bug that was crashing my backend services.
>
> Unfortunately, this morning I had a server crash, and it brought down my
> entire cluster. I don't have the logs leading up to the crash, because --
> argghffbuggle -- log4j wasn't set up correctly. But I restarted all three
> nodes, and nodes two and three came back up and formed a quorum.
>
> Node one, meanwhile, does this:
>
> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
> java.io.IOException: Error: Epoch of leader is lower
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
> java.lang.Exception: shutdown Follower
>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>
> All I can find is this,
> http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html,
> which implies that this state should never happen.
>
> Any suggestions? If it happens again, I'll just have to roll everything
> back to 3.2.1 and live with the client crashes.
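
For what it's worth, the zxids in the quoted log are easier to compare once they are split into epoch and counter. Below is a rough sketch in Python, assuming the usual zxid layout (epoch in the high 32 bits, counter in the low 32 bits) and assuming that the snapshot file suffix is the zxid in hex. The helper names `split_zxid` and `describe` are made up for illustration and are not part of ZooKeeper.

```python
# Assumption: a zxid is a 64-bit value with the epoch in the high 32 bits
# and a per-epoch counter in the low 32 bits.
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1


def split_zxid(zxid):
    """Return (epoch, counter) for a numeric zxid."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def describe(label, zxid):
    """Format a zxid as hex together with its epoch and counter."""
    epoch, counter = split_zxid(zxid)
    return "%s: zxid=0x%x epoch=0x%x counter=0x%x" % (label, zxid, epoch, counter)


if __name__ == "__main__":
    # Values copied from the quoted log.
    print(describe("node 1 proposed", 47244640287))
    print(describe("vote for node 3", 38654707048))
    # Assumption: the snapshot file suffix is the zxid written in hex.
    print(describe("node 1 snapshot", int("a00000045", 16)))
```

If that layout holds, node one's proposed zxid decodes to epoch b, while the zxid in the votes for node 3 decodes to epoch 9. My reading is that a new leader starts one epoch above its last zxid, which would give epoch a and match the "Leader epoch a is less than our epoch b" message, but I have not checked that against the 3.3.1 source.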