Hi Charity,
This is certainly not expected. It would be very useful if you could
provide us with as much information about your issue as possible. I
would suggest either creating a new JIRA and linking it to
ZOOKEEPER-335, or adding your details to ZOOKEEPER-335 directly.
We'll be looking further into why you have seen this problem and
working on a fix.
Cheers,
-Flavio
On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it
threw the entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back,
we actually didn't. The ZooKeeper cluster was refusing all
connections until we restarted node one. But once node one had been
dropped from the cluster, the other two nodes formed a quorum and
started responding to queries on their own.
Is that expected as well? I didn't see it in ZOOKEEPER-335, so
thought I'd mention it.
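(For what it's worth, two of three nodes forming a working ensemble is just
majority-quorum arithmetic: with an ensemble of three, any two servers are a
strict majority. A tiny illustration of that rule -- not ZooKeeper's actual
quorum code, just the arithmetic:)

// Majority-quorum arithmetic: an ensemble of N servers can serve requests as
// long as more than N/2 of them are up and able to talk to each other.
public class QuorumMath {
    static boolean hasQuorum(int ensembleSize, int aliveServers) {
        return aliveServers > ensembleSize / 2;   // strict majority
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(3, 2));   // true: nodes two and three alone suffice
        System.out.println(hasQuorum(3, 1));   // false: a single node cannot serve
    }
}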
On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
Hi Charity, unfortunately this is a known issue, not specific to 3.3,
that we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
I've raised the JIRA's priority to "blocker" to ensure we address this
ASAP.
As Ted suggested, you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the
server will download a snapshot of the current db from the leader).
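Roughly, that recovery step looks like this -- just a sketch, assuming the
dataDir shown in the log below, that the ZooKeeper process on node one is
stopped first, and a made-up backup path (version-2.bak):

import java.io.IOException;
import java.nio.file.*;

// Sketch of "clear the datadir on the affected server only": move the
// version-2 directory aside so the server starts with an empty database and
// pulls a fresh snapshot from the current leader when it rejoins.
public class ClearDataDir {
    public static void main(String[] args) throws IOException {
        Path versionDir = Paths.get("/services/zookeeper/data/zookeeper/version-2");
        Path backupDir  = Paths.get("/services/zookeeper/data/zookeeper/version-2.bak");
        if (Files.exists(versionDir)) {
            Files.move(versionDir, backupDir);   // a rename on the same filesystem
        }
        // Restart the ZooKeeper server afterwards; it should come back as a
        // follower and sync a snapshot of the current db from the leader.
    }
}

Moving the directory aside rather than deleting it keeps a copy around in
case it turns out to be useful for diagnosing the JIRA.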
Patrick
On 06/02/2010 11:11 AM, Charity Majors wrote:
I upgraded my ZooKeeper cluster last week from 3.2.1 to 3.3.1, in an
attempt to get away from a client bug that was crashing my backend
services.
Unfortunately, this morning I had a server crash, and it brought
down my entire cluster. I don't have the logs leading up to the
crash, because -- argghffbuggle -- log4j wasn't set up correctly.
But I restarted all three nodes, and nodes two and three came back up
and formed a quorum.
Node one, meanwhile, does this:
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
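For reference, the epochs in that FATAL line are just the high 32 bits of the
zxids involved (the low 32 bits are a counter within the epoch). A rough
sketch of the arithmetic -- not the actual Follower code, just the
decomposition:

// Decompose a zxid into (epoch, counter). Applied to node one's proposed
// zxid from the election log above, this prints epoch=0xb, matching "our
// epoch b" in the FATAL line; the leader announced epoch 0xa, which is
// lower, so the follower refuses to follow it and shuts down.
public class EpochCheck {
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long proposedZxid = 47244640287L;   // "Proposed zxid" from node one's log
        System.out.printf("epoch=0x%x counter=0x%x%n",
                epochOf(proposedZxid), counterOf(proposedZxid));
    }
}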
All I can find is this, http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
Any suggestions? If it happens again, I'll just have to roll
everything back to 3.2.1 and live with the client crashes.