>> it started a new round but then I seem to see the election messages from
Node 2 again. Any idea why?

My guess is that node 2 came back online and the ZK service was started. In
this use case node 2 does not stay offline permanently, IIUC.
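The doubling "Notification time out" values in the quoted log (400, 800, 1600, ...) match the backoff behavior described below. A minimal sketch of that growth, assuming a 200 ms starting timeout and a 60 s cap (the 3.4.x defaults, an assumption on my part):

```python
# Sketch only, not ZooKeeper source: models how the FastLeaderElection
# notification poll timeout grows between retries. The 200 ms start and
# 60000 ms cap are assumed defaults; the doubling itself matches the
# "Notification time out: 400 / 800 / 1600" lines in the quoted log.
def notification_timeouts(start_ms=200, cap_ms=60000):
    """Return successive poll timeouts: each retry doubles the
    previous one, clamped at cap_ms."""
    timeouts = []
    t = start_ms
    while t < cap_ms:
        t = min(t * 2, cap_ms)
        timeouts.append(t)
    return timeouts

print(notification_timeouts()[:5])  # [400, 800, 1600, 3200, 6400]
```

With the reported tickTime=5000 and initLimit=20, a follower only abandons a round after initLimit * tickTime = 100,000 ms, so a node stuck retrying a dead peer spends most of that window sleeping in these ever-longer polls.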


On Thu, Oct 13, 2016 at 3:41 PM, Ryan Zhang <yangzhangr...@hotmail.com>
wrote:

> Hi Anand, I took a look, and I wonder how you explain the following:
>
> The N1.log starts at around
> 2016-10-03 12:37:38,073 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:QuorumPeer@714] - LOOKING
>
> and it failed to talk to Node 2 (id: 2)
> 2016-10-03 12:37:38,136 [myid:1] - WARN  [WorkerSender[myid=1]:
> QuorumCnxManager@382] - Cannot open channel to 2 at election address
> node2.controller.local/127.0.0.8:5004
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(
> AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(
> AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(
> AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>
> However, after failing to get any election messages from Node 3, it started
> a new round, but then I seem to see election messages from Node 2 again.
> Any idea why?
>
> 2016-10-03 12:38:32,815 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:QuorumPeer@714] - LOOKING
> 2016-10-03 12:38:32,816 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/
> snapshot.20002aa0b
> 2016-10-03 12:38:32,875 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:FastLeaderElection@815] - New election. My id =  1, proposed
> zxid=0x20002bfb5
> 2016-10-03 12:38:32,876 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 1
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:32,878 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 2
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 2
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:32,878 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 2
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:32,917 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 3
> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:33,117 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:FastLeaderElection@849] - Notification time out: 400
> 2016-10-03 12:38:33,118 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 2
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:33,159 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 3
> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:33,559 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:FastLeaderElection@849] - Notification time out: 800
> 2016-10-03 12:38:33,560 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 2
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:33,561 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 2
> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LEADING (n.state), 2
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:33,601 [myid:1] - INFO  [WorkerReceiver[myid=1]:
> FastLeaderElection@597] - Notification: 1 (message format version), 3
> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
> 2016-10-03 12:38:34,402 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
> 5002:FastLeaderElection@849] - Notification time out: 1600
>
>
>
> On Oct 8, 2016, at 10:05 AM, Anand Parthasarathy <anpar...@avinetworks.com
> <mailto:anpar...@avinetworks.com>> wrote:
>
> Hi Flavio,
>
> I have attached the logs from node 1 and node 3. Node 2 was powered off
> around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it
> finally converged.
>
> Thanks,
> Anand.
>
> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org<mailto:
> f...@apache.org>> wrote:
> Hi Anand,
>
> I can't tell whether nodes 1 and 3 were able, or even trying, to connect to
> each other. They should be able to elect a leader between themselves and
> make progress. You might want to upload the logs and let us know.
>
> -Flavio
>
> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <anpar...@avinetworks.com<
> mailto:anpar...@avinetworks.com>> wrote:
> >
> > Hi,
> >
> > We are currently using ZooKeeper version 3.4.6 with a 3-node deployment
> > in our system. We see that occasionally, when a node is powered off (in
> > this instance, it was actually the leader node), the remaining two nodes
> > do not form a quorum for a really long time. Looking at the logs, the
> > sequence appears to be as follows:
> > - Node 2 is the zookeeper leader
> > - Node 2 is powered off
> > - Node 1 and Node 3 detect this and start an election
> > - Node 3 times out after initLimit * tickTime with "Timeout while waiting
> > for quorum" for round N
> > - At the same time, Node 1 times out after initLimit * tickTime with
> > "Exception while trying to follow leader" for round N+1.
> > - The process continues, with N incrementing each round.
> > - This goes on for a long time.
> > - In one instance, we used tickTime=5000 and initLimit=20, and it took
> > around 3.5 hours to converge.
> > - In a given round, Node 1 tries to connect to Node 2 and gets Connection
> > Refused; it then waits for the notification timeout, which doubles every
> > iteration until it hits the initLimit. The Connection Refused happens
> > because node 2 comes back up after the reboot, but the ZooKeeper process
> > is not started (due to a different failure).
> >
> > It looks similar to ZOOKEEPER-2164, but there the failure is a connection
> > timeout because Node 2 is not reachable.
> >
> > Could you please share whether you have seen this issue and, if so, what
> > workaround can be employed in 3.4.6.
> >
> > Thanks,
> > Anand.
>
>
> <log.tgz>
>
>


-- 
Cheers
Michael.
