Sure thing.
We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.
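For illustration, the discovery step might look something like the sketch below. The znode path, connect string, and address format are assumptions for the example, not details from this thread; the zkCli.sh line is left commented since it needs a live ensemble.

```shell
# Hypothetical sketch: each backend reads the DB master's address from a
# well-known znode. With the stock CLI that read could be something like:
#   addr=$(zkCli.sh -server zk1:2181,zk2:2181,zk3:2181 get /db/master 2>/dev/null | tail -1)
addr="db-master.example:5432"   # stand-in for the value the znode would hold
host=${addr%:*}                 # strip ":port" to get the host
port=${addr#*:}                 # strip "host:" to get the port
echo "DB master is $host, port $port"
```

When that read fails (e.g. the whole ensemble is refusing connections, as below), a service built this way has no master address and stops writing.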
When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again.
(The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)
We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message.
I'll tack this on to ZK-335.
On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened.
ben
On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own.
Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.
On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
Hi Charity, unfortunately this is a known issue, not specific to 3.3, that we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
I've raised the JIRA level to "blocker" to ensure we address this asap.
As Ted suggested, you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader).
Patrick
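As a sketch, the recovery steps above might look like the following. This is a dry run against a scratch directory, since the real dataDir layout and service scripts vary by install; the zkServer.sh calls are left commented, and the paths here stand in for the dataDir shown in the log.

```shell
# Dry-run sketch of the suggested fix: on the affected server ONLY, set the
# old database aside and let the restarted server fetch a snapshot from the
# leader. A scratch directory stands in for the real dataDir.
datadir=$(mktemp -d)                 # stand-in for e.g. /services/zookeeper/data/zookeeper
mkdir "$datadir/version-2"           # pretend this holds the bad snapshots/txn logs

# zkServer.sh stop                   # 1. stop zookeeper on the affected server only
mv "$datadir/version-2" "$datadir/version-2.bad"   # 2. move the db aside rather than rm it
# zkServer.sh start                  # 3. on restart, the server d/l's a snapshot from the leader
```

Moving rather than deleting the old version-2 directory keeps the bad snapshot around for later debugging.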
On 06/02/2010 11:11 AM, Charity Majors wrote:
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum. Node one, meanwhile, does this:
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
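A note on the FATAL line: a zxid encodes the epoch in its high 32 bits and a per-epoch counter in its low 32 bits, so node one's proposed zxid from the election log can be decoded directly. This is just a quick decoding sketch, not the server's actual check:

```shell
# A zxid's high 32 bits are the epoch, its low 32 bits a counter.
# Node one's proposed zxid from the election log above:
zxid=47244640287
epoch=$(( zxid >> 32 ))
counter=$(( zxid & 0xFFFFFFFF ))
printf 'epoch=0x%x counter=%d\n' "$epoch" "$counter"   # prints "epoch=0xb counter=31"
```

Epoch 0xb matches the "our epoch b" in the FATAL line: node one believes it has seen epoch 0xb, while the leader elected by the surviving quorum is on epoch 0xa, so the follower refuses to follow and shuts down.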
All I can find is this,
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html
which implies that this state should never happen.
Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.