Hi Michael and Ryan,

Thanks for looking into the logs.

Actually, in this case, node 2 was brought down ungracefully and was down
permanently. In this customer deployment, our product runs as a docker
container in a bare metal environment. It is possible that the docker proxy
that provides external connectivity to the zookeeper ports is down but the
container itself is up and running for several more minutes. If you grep
for "2 (n.sid)" you will see that there is no packets after 12:46.
Convergence after that did not happen until 15:57 or so.

Thanks,
Anand.

On Thu, Oct 13, 2016 at 5:10 PM, Michael Han <h...@cloudera.com> wrote:

> >> it started a new round but then I seem to see the election messages
> from Node 2 again. Any idea why?
>
> My guess is node 2 is back online and ZK service was started. In this use
> case node 2 does not get offline permanently, IIUC.
>
>
> On Thu, Oct 13, 2016 at 3:41 PM, Ryan Zhang <yangzhangr...@hotmail.com>
> wrote:
>
>> Hi, Anand, I took a look and I wonder how do you explain the following
>>
>> The N1.log starts at around
>> 2016-10-03 12:37:38,073 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:QuorumPeer@714] - LOOKING
>>
>> and it failed to talk to Node 2 (id: 2)
>> 2016-10-03 12:37:38,136 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCn
>> xManager@382] - Cannot open channel to 2 at election address
>> node2.controller.local/127.0.0.8:5004
>> java.net.ConnectException: Connection refused
>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:339)
>>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPl
>> ainSocketImpl.java:200)
>>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:182)
>>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>         at java.net.Socket.connect(Socket.java:579)
>>
>> However, after failed to get any election messages from Node 3, it
>> started a new round but then I seem to see the election messages from Node
>> 2 again. Any idea why?
>>
>> 2016-10-03 12:38:32,815 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:QuorumPeer@714] - LOOKING
>> 2016-10-03 12:38:32,816 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/s
>> napshot.20002aa0b
>> 2016-10-03 12:38:32,875 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:FastLeaderElection@815] - New election. My id =  1, proposed
>> zxid=0x20002bfb5
>> 2016-10-03 12:38:32,876 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 1
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:32,878 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 2
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 2
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:32,878 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 2
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:32,917 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 3
>> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:33,117 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:FastLeaderElection@849] - Notification time out: 400
>> 2016-10-03 12:38:33,118 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 2
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:33,159 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 3
>> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:33,559 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:FastLeaderElection@849] - Notification time out: 800
>> 2016-10-03 12:38:33,560 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 2
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LOOKING (n.state), 1
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:33,561 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 2
>> (n.leader), 0x20002bfb5 (n.zxid), 0x6 (n.round), LEADING (n.state), 2
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:33,601 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLe
>> aderElection@597] - Notification: 1 (message format version), 3
>> (n.leader), 0x20002bfb5 (n.zxid), 0x4 (n.round), LEADING (n.state), 3
>> (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
>> 2016-10-03 12:38:34,402 [myid:1] - INFO  [QuorumPeer[myid=1]/127.0.0.1:
>> 5002:FastLeaderElection@849] - Notification time out: 1600
>>
>>
>>
>> On Oct 8, 2016, at 10:05 AM, Anand Parthasarathy <
>> anpar...@avinetworks.com<mailto:anpar...@avinetworks.com>> wrote:
>>
>> Hi Flavio,
>>
>> I have attached the logs from node 1 and node 3. Node 2 was powered off
>> around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it
>> finally converged.
>>
>> Thanks,
>> Anand.
>>
>> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <f...@apache.org<mailto:
>> f...@apache.org>> wrote:
>> Hi Anand,
>>
>> I don't understand whether 1 and 3 were able or even trying to connect to
>> each other. They should be able to elect a leader between them and make
>> progress. You might want to upload logs and let us know.
>>
>> -Flavio
>>
>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <anpar...@avinetworks.com
>> <mailto:anpar...@avinetworks.com>> wrote:
>> >
>> > Hi,
>> >
>> > We are currently using zookeeper 3.4.6 version and use a 3 node
>> solution in
>> > our system. We see that occasionally, when a node is powered off (in
>> this
>> > instance, it was actually a leader node), the remaining two nodes do not
>> > form a quorum for a really long time. Looking at the logs, it appears
>> the
>> > sequence is as follows:
>> > - Node 2 is the zookeeper leader
>> > - Node 2 is powered off
>> > - Node 1 and Node 3 recognize and start the election
>> > - Node 3 times out after initLimit * tickTime with "Timeout while
>> waiting
>> > for quorum" for Round N
>> > - Node 1 times out after initLimit * tickTime with "Exception while
>> trying
>> > to follow leader" for Round N+1 at the same time.
>> > - And the process continues where N is sequentially incrementing.
>> > - This happens for a long time.
>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
>> > around 3.5 hours to converge.
>> > - In a given round, Node 1 will try connecting to Node 2, gets
>> connection
>> > refused waits for notification timeout which increases by 2 every
>> iteration
>> > until it hits the initLimit. Connection Refused is because the node 2
>> comes
>> > up after reboot, but zookeeper process is not started (due to a
>> different
>> > failure).
>> >
>> > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
>> > where Node 2 is not reachable.
>> >
>> > Could you pls. share if you have seen this issue and if so, what is the
>> > workaround that can be employed in 3.4.6.
>> >
>> > Thanks,
>> > Anand.
>>
>>
>> <log.tgz>
>>
>>
>
>
> --
> Cheers
> Michael.
>

Reply via email to