Sure thing.
We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.
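For illustration, the discovery step might look something like the sketch below. The znode path, connect string, and address format are assumptions for the example, not details from this thread; the zkCli.sh line is left commented since it needs a live ensemble.

```shell
# Hypothetical sketch: each backend reads the DB master's address from a
# well-known znode. With the stock CLI that read could be something like:
#   addr=$(zkCli.sh -server zk1:2181,zk2:2181,zk3:2181 get /db/master 2>/dev/null | tail -1)
addr="db-master.example:5432"   # stand-in for the value the znode would hold
host=${addr%:*}                 # strip ":port" to get the host
port=${addr#*:}                 # strip "host:" to get the port
echo "DB master is $host, port $port"
```

When that read fails (e.g. the whole ensemble is refusing connections, as below), a service built this way has no master address and stops writing.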
When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again.
(The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)
We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message.
I'll tack this on to ZK-335.
On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened.
ben
On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own.
Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.
On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
Hi Charity, unfortunately this is a known issue, not specific to 3.3, that we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
I've raised the JIRA level to "blocker" to ensure we address this asap.
As Ted suggested, you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader).
Patrick
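As a sketch, the recovery steps above might look like the following. This is a dry run against a scratch directory, since the real dataDir layout and service scripts vary by install; the zkServer.sh calls are left commented, and the paths here stand in for the dataDir shown in the log.

```shell
# Dry-run sketch of the suggested fix: on the affected server ONLY, set the
# old database aside and let the restarted server fetch a snapshot from the
# leader. A scratch directory stands in for the real dataDir.
datadir=$(mktemp -d)                 # stand-in for e.g. /services/zookeeper/data/zookeeper
mkdir "$datadir/version-2"           # pretend this holds the bad snapshots/txn logs

# zkServer.sh stop                   # 1. stop zookeeper on the affected server only
mv "$datadir/version-2" "$datadir/version-2.bad"   # 2. move the db aside rather than rm it
# zkServer.sh start                  # 3. on restart, the server d/l's a snapshot from the leader
```

Moving rather than deleting the old version-2 directory keeps the bad snapshot around for later debugging.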
On 06/02/2010 11:11 AM, Charity Majors wrote:
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum. Node one, meanwhile, does this:
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
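A note on the FATAL line: a zxid encodes the epoch in its high 32 bits and a per-epoch counter in its low 32 bits, so node one's proposed zxid from the election log can be decoded directly. This is just a quick decoding sketch, not the server's actual check:

```shell
# A zxid's high 32 bits are the epoch, its low 32 bits a counter.
# Node one's proposed zxid from the election log above:
zxid=47244640287
epoch=$(( zxid >> 32 ))
counter=$(( zxid & 0xFFFFFFFF ))
printf 'epoch=0x%x counter=%d\n' "$epoch" "$counter"   # prints "epoch=0xb counter=31"
```

Epoch 0xb matches the "our epoch b" in the FATAL line: node one believes it has seen epoch 0xb, while the leader elected by the surviving quorum is on epoch 0xa, so the follower refuses to follow and shuts down.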
All I can find is this,
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html
which implies that this state should never happen.
Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.