Hi,

We are running into this bug very often (almost a 60-75% hit rate) while testing our newly developed application over ZK. This is almost a blocker for us. Would the fix be simpler if backward compatibility were not an issue?
Considering that this bug is rarely reported, I am wondering why we are running into this problem so often. Also, on a side note, I am curious why the systest that comes with ZooKeeper did not detect this bug. Can anyone please give an overview of the problem?

Thanks.
-Vishal

On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors <char...@shopkick.com> wrote:
> Sure thing.
>
> We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.
>
> When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again.
>
> (The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)
>
> We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message.
>
> I'll tack this on to ZK-335.
>
>
> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>
> > Charity, do you mind going through your scenario again to give a timeline for the failure? I'm a bit confused as to what happened.
> >
> > Ben
> >
> > On 06/02/2010 01:32 PM, Charity Majors wrote:
> >> Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though.
> >>
> >> I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own.
> >>
> >> Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.
> >>
> >>
> >> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
> >>
> >>> Hi Charity, unfortunately this is a known issue, not specific to 3.3, that we are working to address. See this thread for some background:
> >>>
> >>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
> >>>
> >>> I've raised the JIRA level to "blocker" to ensure we address this asap.
> >>>
> >>> As Ted suggested, you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader).
> >>>
> >>> Patrick
> >>>
> >>> On 06/02/2010 11:11 AM, Charity Majors wrote:
> >>>
> >>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
> >>>>
> >>>> Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
> >>>>
> >>>> Node one, meanwhile, does this:
> >>>>
> >>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
> >>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
> >>>> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
> >>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
> >>>> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
> >>>> java.io.IOException: Error: Epoch of leader is lower
> >>>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
> >>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> >>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
> >>>> java.lang.Exception: shutdown Follower
> >>>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
> >>>>
> >>>>
> >>>> All I can find is this, http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
> >>>>
> >>>> Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
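For context on the FATAL line: a ZooKeeper zxid packs the leader epoch into its high 32 bits and a counter into the low 32 bits, so node one's proposed zxid 47244640287 (0xb0000001f) belongs to epoch 0xb, while the elected leader announced epoch 0xa. The following is only a minimal Java sketch of that kind of epoch sanity check, not the actual Follower.java code; the class name and the leader zxid value are made up for illustration, and only node one's zxid is taken from the log above.

    import java.io.IOException;

    public class EpochCheckSketch {

        // The high 32 bits of a zxid are the epoch; the low 32 bits are the counter.
        static long epochOf(long zxid) {
            return zxid >>> 32;
        }

        // A follower that has already logged a zxid from a newer epoch than the
        // elected leader refuses to follow it; this mirrors the FATAL log line.
        static void checkLeaderEpoch(long leaderZxid, long ourLastLoggedZxid) throws IOException {
            long leaderEpoch = epochOf(leaderZxid);
            long ourEpoch = epochOf(ourLastLoggedZxid);
            if (leaderEpoch < ourEpoch) {
                throw new IOException("Error: Epoch of leader is lower: 0x"
                        + Long.toHexString(leaderEpoch) + " < 0x" + Long.toHexString(ourEpoch));
            }
        }

        public static void main(String[] args) throws IOException {
            // Proposed zxid from the log above: 47244640287 = 0xb0000001f, i.e. epoch 0xb.
            long ourZxid = 47244640287L;
            // Hypothetical leader zxid in epoch 0xa (the "a" in the FATAL message).
            long leaderZxid = (0xaL << 32) | 0x1L;

            System.out.println("our epoch    = 0x" + Long.toHexString(epochOf(ourZxid)));
            System.out.println("leader epoch = 0x" + Long.toHexString(epochOf(leaderZxid)));
            checkLeaderEpoch(leaderZxid, ourZxid); // throws IOException, as the follower did
        }
    }

Under that reading, Patrick's suggestion works because wiping the datadir on the affected server discards its too-new epoch, so on restart it simply fetches a fresh snapshot from the leader instead of refusing to follow it.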