Hi Charity,
This is certainly not expected. It would be very useful if you could
provide us with as much information about your issue as possible. I
would suggest either creating a new JIRA and linking it to
ZOOKEEPER-335, or adding your details to ZOOKEEPER-335 directly.
We'll be looking further into why you have seen this problem and
working on a fix.
Cheers,
-Flavio
On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it
threw the entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back,
we actually didn't. The ZooKeeper cluster was refusing all
connections until we restarted node one. But once node one had been
dropped from the cluster, the other two nodes formed a quorum and
started responding to queries on their own.
Is that expected as well? I didn't see it in ZOOKEEPER-335, so
thought I'd mention it.
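(For what it's worth, two of three nodes forming a working ensemble is just
majority-quorum arithmetic: with an ensemble of three, any two servers are a
strict majority. A tiny illustration of that rule -- not ZooKeeper's actual
quorum code, just the arithmetic:)

// Majority-quorum arithmetic: an ensemble of N servers can serve requests as
// long as more than N/2 of them are up and able to talk to each other.
public class QuorumMath {
    static boolean hasQuorum(int ensembleSize, int aliveServers) {
        return aliveServers > ensembleSize / 2;   // strict majority
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(3, 2));   // true: nodes two and three alone suffice
        System.out.println(hasQuorum(3, 1));   // false: a single node cannot serve
    }
}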
On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
Hi Charity, unfortunately this is a known issue, not specific to 3.3,
that we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
I've raised the JIRA's priority to "blocker" to ensure we address this
ASAP.
As Ted suggested, you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the
server will download a snapshot of the current db from the leader).
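Roughly, that recovery step looks like this -- just a sketch, assuming the
dataDir shown in the log below, that the ZooKeeper process on node one is
stopped first, and a made-up backup path (version-2.bak):

import java.io.IOException;
import java.nio.file.*;

// Sketch of "clear the datadir on the affected server only": move the
// version-2 directory aside so the server starts with an empty database and
// pulls a fresh snapshot from the current leader when it rejoins.
public class ClearDataDir {
    public static void main(String[] args) throws IOException {
        Path versionDir = Paths.get("/services/zookeeper/data/zookeeper/version-2");
        Path backupDir  = Paths.get("/services/zookeeper/data/zookeeper/version-2.bak");
        if (Files.exists(versionDir)) {
            Files.move(versionDir, backupDir);   // a rename on the same filesystem
        }
        // Restart the ZooKeeper server afterwards; it should come back as a
        // follower and sync a snapshot of the current db from the leader.
    }
}

Moving the directory aside rather than deleting it keeps a copy around in
case it turns out to be useful for diagnosing the JIRA.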
Patrick
On 06/02/2010 11:11 AM, Charity Majors wrote:
I upgraded my ZooKeeper cluster last week from 3.2.1 to 3.3.1, in an
attempt to get away from a client bug that was crashing my backend
services.
Unfortunately, this morning I had a server crash, and it brought
down my entire cluster. I don't have the logs leading up to the
crash, because -- argghffbuggle -- log4j wasn't set up correctly.
But I restarted all three nodes, and nodes two and three came back up
and formed a quorum.
Node one, meanwhile, does this:
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
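For reference, the epochs in that FATAL line are just the high 32 bits of the
zxids involved (the low 32 bits are a counter within the epoch). A rough
sketch of the arithmetic -- not the actual Follower code, just the
decomposition:

// Decompose a zxid into (epoch, counter). Applied to node one's proposed
// zxid from the election log above, this prints epoch=0xb, matching "our
// epoch b" in the FATAL line; the leader announced epoch 0xa, which is
// lower, so the follower refuses to follow it and shuts down.
public class EpochCheck {
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long proposedZxid = 47244640287L;   // "Proposed zxid" from node one's log
        System.out.printf("epoch=0x%x counter=0x%x%n",
                epochOf(proposedZxid), counterOf(proposedZxid));
    }
}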
All I can find is this, http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
Any suggestions? If it happens again, I'll just have to roll
everything back to 3.2.1 and live with the client crashes.