Hello, I have a Kubernetes cluster with 3 ZooKeeper nodes (as a StatefulSet) that, for cost-saving purposes, is scaled down every evening and scaled back up in the morning. Since the upgrade to 3.5.6 (previously I was using the ZooKeeper shipped with the Kafka 2.3.0 archive) the nodes have been experiencing issues with establishing a quorum. I believe it's related to the JIRA ticket https://issues.apache.org/jira/browse/ZOOKEEPER-2164. However, although it seems to me like quite a serious bug, the ticket is stale. Have any other steps been taken to fix that issue, or are there any workarounds?
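For context, the evening scale-down and morning scale-up is automated roughly along the lines of the sketch below (a minimal illustration using the official Kubernetes Python client; the StatefulSet name "zookeeper" and the namespace "default" are placeholders, not the real values):

# Nightly scale-down / morning scale-up, sketched with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

def scale_zookeeper(replicas: int) -> None:
    # Patch only the replica count via the StatefulSet's scale subresource.
    apps.patch_namespaced_stateful_set_scale(
        name="zookeeper",
        namespace="default",
        body={"spec": {"replicas": replicas}},
    )

scale_zookeeper(0)  # evening: shut the ensemble down to save cost
scale_zookeeper(3)  # morning: bring all three members back at once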
Currently, 2 out of 3 nodes have established a quorum, while the second node keeps returning "This ZooKeeper instance is not currently serving requests", although there are no errors in its logs, only some repeated lines about FastLeaderElection:

[2020-01-22 11:38:46,076] INFO Have smaller server identifier, so dropping the connection: (3, 2) (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2020-01-22 11:38:46,076] INFO Notification: 2 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version) (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2020-01-22 11:38:46,078] INFO Notification: 2 (message format version), 1 (n.leader), 0x100000010 (n.zxid), 0x1 (n.round), LEADING (n.state), 1 (n.sid), 0x2 (n.peerEPoch), LOOKING (my state)0 (n.config version) (org.apache.zookeeper.server.quorum.FastLeaderElection)

I understand that scaling ZooKeeper down to 1 node and then scaling back up step by step should allow the ensemble to establish a correct quorum of 3, but I don't want to have to do that every morning. Also, if I understand the issue correctly, if I were now to perform a rolling update of the ZooKeeper pods (which happens in reverse order), the pods wouldn't establish a quorum again.

Thanks,
Jan
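P.S. For completeness, the manual step-by-step recovery I'd like to avoid every morning would look roughly like the sketch below (again using the Kubernetes Python client; the StatefulSet name and namespace are placeholders, and polling ready replicas is just one way to wait between steps):

# Bring the ensemble back one member at a time: 1 -> 2 -> 3.
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_and_wait(replicas: int, name: str = "zookeeper", namespace: str = "default", timeout: int = 300) -> None:
    # Scale the StatefulSet, then wait until the requested number of replicas report ready.
    apps.patch_namespaced_stateful_set_scale(
        name=name, namespace=namespace, body={"spec": {"replicas": replicas}}
    )
    deadline = time.time() + timeout
    while time.time() < deadline:
        sts = apps.read_namespaced_stateful_set(name=name, namespace=namespace)
        if (sts.status.ready_replicas or 0) >= replicas:
            return
        time.sleep(5)
    raise TimeoutError(f"{name} did not reach {replicas} ready replicas within {timeout}s")

for n in (1, 2, 3):
    scale_and_wait(n)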