Hello!
As far as I can tell, the provided logs are not enough to determine the
exact root cause of the problem. Maybe someone else will have a better
idea, but my best guess is that some internal listener thread on the
leader (zoo3) had died earlier, so the leader was no longer able to
process the leader election messages from zoo1 and/or zoo2. When you
restarted the leader, the listener threads were re-initialized, so
everything went back to normal.
A couple of issues like this have been reported already:
- https://issues.apache.org/jira/browse/ZOOKEEPER-2938
- https://issues.apache.org/jira/browse/ZOOKEEPER-2186
- https://issues.apache.org/jira/browse/ZOOKEEPER-3016
...
ZooKeeper 3.4.14 should already contain the fixes for the issues above.
However, we fixed a similar issue just recently:
https://issues.apache.org/jira/browse/ZOOKEEPER-3769
This fix will be part of 3.6.1 and 3.5.8.
Of course, it is possible that you were hitting an independent / unknown
issue... We would need all the logs to verify that (the logs from each
ZooKeeper server since its last restart before the rolling upgrade was
started).
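To give you an idea of what we would look for: the election traffic and the connection manager errors are easy to pull out with a grep like the one below (the container names match your config; the log path inside the containers is just an assumption, adjust it to your Docker setup):

```shell
# Collect the leader-election related lines (election notifications plus
# connection errors between the voting members) from each server's log.
# /logs/zookeeper.log is an assumed path inside each container.
for host in zoo1 zoo2 zoo3; do
  echo "=== $host ==="
  docker exec "$host" grep -E 'FastLeaderElection|QuorumCnxManager' /logs/zookeeper.log
done
```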
Anyway, I strongly suggest upgrading your ZooKeeper cluster, as the 3.4
line will reach end-of-life soon; see the announcement:
https://mail-archives.apache.org/mod_mbox/zookeeper-user/202004.mbox/browser
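Also, for future rolling restarts it is worth waiting, after each restart, until every voting member reports its mode again before stopping the next server. With the four-letter words (enabled by default on 3.4) something like this does the trick:

```shell
# Ask each voting member for its current mode (leader / follower).
# Only move on to the next server once all of them answer with a Mode
# line again, i.e. the quorum has re-formed.
for host in zoo1 zoo2 zoo3; do
  echo "$host: $(echo srvr | nc "$host" 2181 | grep Mode)"
done
```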
Kind regards,
Mate
On Wed, Apr 22, 2020 at 2:14 AM blb.dev wrote:
> Hi team,
>
> During a recent patching of our ZooKeeper quorum, we experienced an
> unrecoverable outage. We have performed patches like this many times
> before, and they worked fine in our other environments. The goal was to
> shut down each server one by one, apply the patch updates, and then
> restart it. However, this time, when zoo1 (a follower) was shut down, the
> leader (zoo3) shut down its connection with the remaining follower (zoo2)
> as well.
>
> What would cause the entire quorum to shut down and not recover after
> stopping only zoo1?
>
> Running ZooKeeper 3.4.14 in Docker containers.
> zk_version 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on
> 03/06/2019 16:18 GMT
>
>
> *Config for master nodes:*
> tickTime=2000
> maxClientCnxns=0
> dataDir=/data
> dataLogDir=/datalog
> clientPort=2181
> secureClientPort=2281
> initLimit=10
> syncLimit=5
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
>
> server.1=zoo1:2888:3888
> server.2=zoo2:2888:3888
> server.3=zoo3:2888:3888
> server.4=zoo4:2888:3888:observer
> server.5=zoo5:2888:3888:observer
> server.6=zoo6:2888:3888:observer
>
> *Config for observer nodes:*
> tickTime=2000
> maxClientCnxns=0
> dataDir=/data
> dataLogDir=/datalog
> clientPort=2181
> secureClientPort=2281
> initLimit=10
> syncLimit=5
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> peerType=observer
> server.1=zoo1:2888:3888
> server.2=zoo2:2888:3888
> server.3=zoo3:2888:3888
> server.4=zoo4:2888:3888:observer
> server.5=zoo5:2888:3888:observer
> server.6=zoo6:2888:3888:observer
>
> zoo1-zoo6 are the FQDNs of each server.
>
> Shutdown of zoo1 and quorum outage at 05:02 UTC
>
>
> *Logs on zoo3 (leader):*
> 2020-04-21 05:02:00,096 [myid:3] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id
> 1, my id = 3, error =
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
> 2020-04-21 05:02:00,120 [myid:3] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
> 2020-04-21 05:02:00,143 [myid:3] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@941] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>     at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
>
> 2020-04-21 05:02:00,143 [myid:3] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving
> thread
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>     at