Hey all - I believe we just suffered an outage from this issue. Short version: while restarting quorum members with the GC flags recommended in the Troubleshooting wiki page, a follower logged messages similar to the ones in the following jiras:
2010-07-06 23:14:01,438 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 20 is less than our epoch 21
2010-07-06 23:14:01,438 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower

https://issues.apache.org/jira/browse/ZOOKEEPER-335
https://issues.apache.org/jira/browse/ZOOKEEPER-790

Reading through the jiras, it's unclear whether the issue is well understood at this point (as there's a patch available) or still being investigated. If it's still being investigated, let me know and I can attach the relevant log lines to the appropriate jira. Or, if the patch looks good, I can make a new release and help test. Let me know :)

--travis

On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira <f...@yahoo-inc.com> wrote:
> I would recommend opening a separate jira issue. I'm not convinced the
> issues are the same, so I'd rather keep them separate and link the issues
> if it is the case.
>
> -Flavio
>
> On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:
>
>> We are unable to reproduce this issue. If you can provide the server
>> logs (all servers) and attach them to the jira it would be very helpful.
>> Some detail on the approx time of the issue so we can correlate to the
>> logs would help too (summary of what you did/do to cause it, etc...
>> anything that might help us nail this one down).
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>
>> Some detail on ZK version, OS, Java version, HW info, etc... would also
>> be of use to us.
>>
>> Patrick
>>
>> On 06/16/2010 02:49 PM, Vishal K wrote:
>>>
>>> Hi,
>>>
>>> We are running into this bug very often (almost a 60-75% hit rate)
>>> while testing our newly developed application over ZK. This is almost
>>> a blocker for us. Would the fix be simplified if backward compatibility
>>> were not an issue?
>>>
>>> Considering that this bug is rarely reported, I am wondering why we are
>>> running into this problem so often.
>>> Also, on a side note, I am curious why the systest that comes with
>>> ZooKeeper did not detect this bug. Can anyone please give an overview
>>> of the problem?
>>>
>>> Thanks.
>>> -Vishal
>>>
>>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors <char...@shopkick.com> wrote:
>>>
>>>> Sure thing.
>>>>
>>>> We got paged this morning because backend services were not able to
>>>> write to the database. Each server discovers the DB master using
>>>> zookeeper, so when zookeeper goes down, they assume they no longer
>>>> know who the DB master is and stop working.
>>>>
>>>> When we realized there were no problems with the database, we logged
>>>> in to the zookeeper nodes. We weren't able to connect to zookeeper
>>>> using zkCli.sh from any of the three nodes, so we decided to restart
>>>> them all, starting with node one. However, after restarting node one,
>>>> the cluster started responding normally again.
>>>>
>>>> (The timestamps on the zookeeper processes on nodes two and three
>>>> *are* dated today, but none of us restarted them. We checked shell
>>>> histories and sudo logs, and they seem to back us up.)
>>>>
>>>> We tried getting node one to come back up and join the cluster, but
>>>> that's when we realized we weren't getting any logs, because
>>>> log4j.properties was in the wrong location. Sorry -- I REALLY wish I
>>>> had those logs for you. We put log4j back in place, and that's when
>>>> we saw the spew I pasted in my first message.
>>>>
>>>> I'll tack this on to ZK-335.
>>>>
>>>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>>>>
>>>>> charity, do you mind going through your scenario again to give a
>>>>> timeline for the failure? i'm a bit confused as to what happened.
>>>>>
>>>>> ben
>>>>>
>>>>> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>>>>>
>>>>>> Thanks. That worked for me. I'm a little confused about why it
>>>>>> threw the entire cluster into an unusable state, though.
>>>>>>
>>>>>> I said before that we restarted all three nodes, but tracing back,
>>>>>> we actually didn't. The zookeeper cluster was refusing all
>>>>>> connections until we restarted node one. But once node one had been
>>>>>> dropped from the cluster, the other two nodes formed a quorum and
>>>>>> started responding to queries on their own.
>>>>>>
>>>>>> Is that expected as well? I didn't see it in ZOOKEEPER-335, so
>>>>>> thought I'd mention it.
>>>>>>
>>>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>>>>>
>>>>>>> Hi Charity, unfortunately this is a known issue not specific to
>>>>>>> 3.3 that we are working to address. See this thread for some
>>>>>>> background:
>>>>>>>
>>>>>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>>>>>
>>>>>>> I've raised the JIRA level to "blocker" to ensure we address this
>>>>>>> asap.
>>>>>>>
>>>>>>> As Ted suggested you can remove the datadir -- only on the affected
>>>>>>> server -- and then restart it. That should resolve the issue (the
>>>>>>> server will d/l a snapshot of the current db from the leader).
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>>>>>
>>>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in
>>>>>>>> an attempt to get away from a client bug that was crashing my
>>>>>>>> backend services.
>>>>>>>>
>>>>>>>> Unfortunately, this morning I had a server crash, and it brought
>>>>>>>> down my entire cluster. I don't have the logs leading up to the
>>>>>>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.
>>>>>>>> But I restarted all three nodes, and nodes two and three came
>>>>>>>> back up and formed a quorum.
>>>>>>>>
>>>>>>>> Node one, meanwhile, does this:
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
>>>>>>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>>>>>> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>>>>>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
>>>>>>>> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
>>>>>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>>>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>>>>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
>>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>>>>>
>>>>>>>> All I can find is this,
>>>>>>>> http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html,
>>>>>>>> which implies that this state should never happen.
>>>>>>>>
>>>>>>>> Any suggestions? If it happens again, I'll just have to roll
>>>>>>>> everything back to 3.2.1 and live with the client crashes.
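[Editor's note] The FATAL line comes from the follower-side epoch check. As a rough standalone illustration (this is NOT the actual ZooKeeper source; the class and method names below are made up): a zxid packs the leader epoch into its high 32 bits and a per-epoch counter into the low 32 bits, and a follower refuses to follow a leader whose epoch is behind its own. The zxids from Charity's election log decode to epoch 0xb = 11 for node 1 and epoch 0x9 = 9 for the leader's last zxid, which lines up with the "our epoch b" in the FATAL message.

```java
import java.io.IOException;

// Standalone sketch, not ZooKeeper source: a zxid is a 64-bit id with the
// epoch in the high 32 bits and a counter in the low 32 bits.
public class EpochCheck {

    // Epoch = high 32 bits of the zxid.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Mirrors the check that produced the FATAL line: a follower rejects
    // a leader whose epoch is lower than its own.
    static void checkLeaderEpoch(long leaderEpoch, long ourEpoch) throws IOException {
        if (leaderEpoch < ourEpoch) {
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    public static void main(String[] args) {
        // Zxids taken from the election log in the thread above:
        long ourZxid = 47244640287L;     // node 1's proposed zxid -> epoch 0xb = 11
        long leaderZxid = 38654707048L;  // the elected leader's zxid -> epoch 0x9 = 9
        System.out.println("our epoch = " + epochOf(ourZxid)
                + ", leader zxid epoch = " + epochOf(leaderZxid));
        try {
            checkLeaderEpoch(epochOf(leaderZxid), epochOf(ourZxid));
        } catch (IOException e) {
            System.out.println(e.getMessage()); // same message as in the outage logs
        }
    }
}
```

Running this prints the two decoded epochs and then the same "Error: Epoch of leader is lower" message from the logs, which is the state ZOOKEEPER-335 says a healthy quorum should never reach.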
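[Editor's note] The workaround Patrick describes (remove the datadir on the affected server only, restart, and let it pull a snapshot from the leader) can be sketched as the following shell steps. The paths are the ones from Charity's logs and the zkServer.sh invocations assume a stock install; both are assumptions, so adjust to your deployment -- and never run this on a healthy quorum member.

```shell
# Recovery sketch for the *affected* follower only.
# Assumption: DATADIR is the dataDir from your zoo.cfg (this value is from
# the logs in this thread), and zkServer.sh is on the PATH.
DATADIR=/services/zookeeper/data/zookeeper

zkServer.sh stop                                   # stop only the broken node
# Move the db aside rather than deleting it, in case the txn logs are
# wanted later for the jira:
mv "$DATADIR/version-2" "$DATADIR/version-2.bak.$(date +%s)"
zkServer.sh start                                  # node rejoins and downloads a fresh snapshot from the leader
```

The move-aside instead of `rm -rf` is deliberate: ZOOKEEPER-335 was hard to reproduce, so keeping the bad snapshot and transaction logs around gives you something to attach to the jira.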