Re: zookeeper quorum falling apart with continuous leader election

kishore g Wed, 12 Feb 2014 09:57:39 -0800

Just for my understanding, what do these messages indicate? Also, I see that
n.zxid keeps incrementing. Does that mean the system is accepting writes?


node 2
2014-02-10 19:49:06,860 [myid:235] - INFO
[WorkerReceiver[myid=235]:FastLeaderElection@594] - Notification: 234
(n.leader), 0x4afe00000001 (n.zxid), 0x4b00 (n.round), LOOKING (n.state),
234 (n.sid), 0x4aff (n.peerEPoch), LOOKING (my state)1 (n.config version)

node 1
2014-02-10 19:42:02,936 [myid:234] - INFO
[WorkerReceiver[myid=234]:FastLeaderElection@594] - Notification: 234
(n.leader), 0x4afa00000001 (n.zxid), 0x4afc (n.round), LOOKING (n.state),
234 (n.sid), 0x4afb (n.peerEPoch), LOOKING (my state)1 (n.config version)
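
One way to read the n.zxid values above: a ZooKeeper zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a per-epoch counter. The sketch below (the helper name split_zxid is made up for illustration, it is not a ZooKeeper API) splits the two values copied from the notifications:

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for illustration only, not part of ZooKeeper.
    """
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: counter within that epoch
    return epoch, counter


# n.zxid values copied from the two notifications above
samples = [("node 1", 0x4afa00000001), ("node 2", 0x4afe00000001)]

for label, zxid in samples:
    epoch, counter = split_zxid(zxid)
    print(f"{label}: epoch={epoch:#x} counter={counter}")
```

In both samples the counter is 1 and only the epoch half differs (0x4afa vs 0x4afe), which would be consistent with the zxid growing because of repeated elections rather than a steady stream of client writes. That is only a reading of these two log lines, not a confirmed diagnosis.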




On Wed, Feb 12, 2014 at 6:47 AM, Flavio Junqueira <[email protected]> wrote:

> It sounds like LE is completing periodically, but the servers are not
> able to complete the synchronization step. We are also getting this
> connection refused exception when the follower tries to connect. This
> is what I spotted for the follower:
>
> 2014-02-10 18:54:04,414 [myid:234] - INFO
>  [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Follower@65] - FOLLOWING -
> LEADER ELECTION TOOK - 1
> 2014-02-10 18:54:04,415 [myid:234] - WARN
>  [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Learner@239] - Unexpected
> exception, tries=0, connecting to 10.0.57.235/10.0.57.235:2888
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
>         at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
>         at java.net.SocksSocketImpl.connect(Unknown Source)
>         at java.net.Socket.connect(Unknown Source)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:231)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:936)
>
> and this:
>
> 2014-02-10 18:55:05,508 [myid:234] - INFO
>  [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Learner@442] - Learner
> received UPTODATE message
> 2014-02-10 18:55:05,508 [myid:234] - WARN
>  [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Follower@92] - Exception when
> following the leader
> java.net.SocketException: Broken pipe
>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>         at java.net.SocketOutputStream.socketWrite(Unknown Source)
>         at java.net.SocketOutputStream.write(Unknown Source)
>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>         at java.io.BufferedOutputStream.flush(Unknown Source)
>         at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:145)
>         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:477)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:936)
>
> On the leader side, we have this:
>
> 2014-02-10 19:48:03,705 [myid:235] - INFO
>  [LearnerHandler-/10.0.57.234:58829:LearnerHandler@328] - Synchronizing
> with Follower sid: 234 maxCommittedLog=0x4afe00000001
> minCommittedLog=0x4afe00000001 peerLastZxid=0x4afd00000001
> 2014-02-10 19:48:03,705 [myid:235] - WARN
>  [LearnerHandler-/10.0.57.234:58829:LearnerHandler@389] - Unhandled
> proposal scenario
> 2014-02-10 19:48:03,705 [myid:235] - INFO
>  [LearnerHandler-/10.0.57.234:58829:LearnerHandler@404] - Sending SNAP
> 2014-02-10 19:48:03,705 [myid:235] - INFO
>  [LearnerHandler-/10.0.57.234:58829:LearnerHandler@435] - Sending
> snapshot last zxid of peer is 0x4afd00000001  zxid of leader is
> 0x4aff00000000sent zxid of db as 0x4afe00000001
> 2014-02-10 19:48:03,724 [myid:235] - WARN
>  [LearnerHandler-/10.0.57.234:58829:Leader@698] - Commiting zxid
> 0x4aff00000000 from /10.0.57.235:2888 not first!
>
> There are a couple of odd warnings there. Just to confirm, the node
> missing in the logs is the one with the bad disk, right?
>
> -Flavio
>
> On 12 Feb 2014, at 02:26, Deepak Jagtap <[email protected]> wrote:
>
> > Hi ,
> >
> > I have a 3-node ZooKeeper 3.5.0.1458648 quorum in my setup.
> > We came across a situation where one of the ZK servers in the cluster went
> > down due to a bad disk.
> > We observed that leader election keeps running in a loop (starts, completes,
> > and starts again). The loop repeats every couple of minutes.
> > Even restarting the ZooKeeper server on both nodes doesn't help it recover
> > from this loop.
> > The network connection looks fine though, as I could telnet to the leader
> > election port and ssh from one node to the other.
> > The ZooKeeper client on each node uses "127.0.0.1:2181" as the quorum string
> > for connecting to the server, so if the local ZooKeeper server is down, the
> > client app is dead.
> >
> > I have uploaded zookeeper.log for both nodes at the following link:
> > https://dl.dropboxusercontent.com/u/36429721/zkSupportLog.tar.gz
> >
> > Any idea what might be wrong with the quorum? Please note that restarting
> > the ZooKeeper server on both nodes doesn't help to recover from this
> > situation.
> >
> > Thanks & Regards,
> > Deepak
>
>
