Hi all,

We've identified some issues related to the problem, and opened 2 PRs:
https://github.com/apache/zookeeper/pull/2040
https://github.com/apache/zookeeper/pull/2041
We'd like to get your feedback on these PRs. Please take a look when you have time.

Thank you.
Luke

On Tue, Jun 27, 2023 at 1:34 PM Paolo Patierno <paolo.patie...@gmail.com> wrote:

> Hi all,
> we are still facing this issue and I opened
> https://issues.apache.org/jira/browse/ZOOKEEPER-4708
> I went through git bisect to try to understand where the "problem" started
> to happen between 3.6.3 and 3.6.4.
> To avoid repeating myself, you can find the details on the JIRA ticket.
> Any help is really appreciated :-)
>
> Thanks,
> Paolo
>
> On Tue, 20 Jun 2023 at 08:46, Paolo Patierno <paolo.patie...@gmail.com> wrote:
>
> > Hi Enrico,
> > we are working on the Strimzi project (deploying Apache Kafka on
> > Kubernetes, so together with ZooKeeper).
> > It had been working fine while Apache Kafka was using ZooKeeper 3.6.3 (or
> > any previous version).
> > With 3.6.4 we are facing the issue I described.
> >
> > Thanks,
> > Paolo
> >
> > On Mon, 19 Jun 2023 at 23:29, Enrico Olivelli <eolive...@gmail.com> wrote:
> >
> >> Paolo,
> >>
> >> On Mon, 19 Jun 2023 at 16:43, Paolo Patierno <paolo.patie...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> > We were able to overcome the binding issue by setting
> >> > quorumListenOnAllIPs=true, but from there we are getting a new issue
> >> > that is preventing leader election from completing on first start-up.
> >> >
> >> > Looking at the log of the current ZooKeeper leader (ID=3), we see the
> >> > following. (The lines starting with ** are additional logs we added to
> >> > org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader in order
> >> > to get more information.)
> >> >
> >> > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0x100000000 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,990 INFO ** newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,992 INFO ** qs.addr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, qs.electionAddr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, qs.clientAddr /127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,993 WARN Restarting Leader Election (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> >
> >> > So the leader is the ZooKeeper node with ID=3, and it was ACKed by the
> >> > ZooKeeper node with ID=1.
> >> > As you can see we are in the Leader#startZkServer method and, because
> >> > reconfiguration is enabled, the designatedLeader is processed.
The problem
> >> > is that Leader#getDesignatedLeader is not returning "self" as the
> >> > leader but another node (ID=1), because of a difference in the quorum
> >> > address. From the above log, it's not an actual difference in terms of
> >> > addresses: self.getQuorumAddress() is returning an <unresolved> one
> >> > (even if it's still the same hostname, belonging to the ZooKeeper-2
> >> > instance). This difference causes allowedToCommit=false; ZooKeeper-2 is
> >> > still reported as leader, but since it's not able to commit it blocks
> >> > all requests and the ZooKeeper ensemble gets stuck.
> >> >
> >> > 2023-06-19 12:32:51,996 WARN Suggested leader: 1 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will be initialized with allowedToCommit = false (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> >
> >> > The overall issue could be related to DNS problems, with DNS records
> >> > not registered yet during pod initialization (ZooKeeper is running on
> >> > Kubernetes). But we don't understand why it's not able to recover
> >> > somehow.
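The `<unresolved>` address behaviour described above can be demonstrated with plain JDK types. This is a minimal sketch, not ZooKeeper code: a resolved and an unresolved `java.net.InetSocketAddress` never compare equal, even for the same hostname and port, which is presumably why the designated-leader address comparison fails when `self.getQuorumAddress()` is unresolved while the voting member's address is resolved.

```java
import java.net.InetSocketAddress;

// Sketch: a resolved and an unresolved InetSocketAddress are never equal,
// even when they refer to the same hostname and port.
public class UnresolvedAddressDemo {
    public static void main(String[] args) {
        // Resolved: the constructor performs a DNS lookup and stores an InetAddress.
        InetSocketAddress resolved = new InetSocketAddress("localhost", 2888);
        // Unresolved: only the literal hostname is stored; no lookup is attempted.
        InetSocketAddress unresolved = InetSocketAddress.createUnresolved("localhost", 2888);

        System.out.println(resolved.isUnresolved());     // false
        System.out.println(unresolved.isUnresolved());   // true
        // equals() compares the resolved InetAddress on one side against
        // a bare hostname on the other, so it is false.
        System.out.println(resolved.equals(unresolved)); // false
    }
}
```

This mirrors the log above: `qs.addr` carries `172.17.0.6` while `self.getQuorumAddress()` shows `<unresolved>`, so the two addresses can never compare equal even though the hostname is identical.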
> >> >
> >> > Instead of using quorumListenOnAllIPs=true, we also tried a different
> >> > approach, using the 0.0.0.0 address for the binding, so something like:
> >> >
> >> > # ZooKeeper nodes configuration
> >> > server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> >> > server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> >> > server.3=my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> >> >
> >> > This way, self.getQuorumAddress() does not suffer from the same
> >> > problem: it doesn't return an <unresolved> address but always an actual
> >> > one. No new leader election is needed and everything works fine.
> >>
> >> This is the release notes page for 3.6.4:
> >> https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> >>
> >> As you are running on k8s, I guess you are using a StatefulSet, maybe
> >> with a service with ClusterIP?
> >>
> >> Is the readiness probe failing? In that case the DNS name should not be
> >> available.
> >>
> >> What are you using to perform the probe? Are you using the four letter
> >> words API? (ruok)
> >>
> >>
> >> Enrico
> >>
> >>
> >> > On 2023/06/14 06:19:24 Szalay-Bekő Máté wrote:
> >> > > Interesting...
> >> > >
> >> > > I am not familiar with strimzi.io.
> >> > > Quickly checking the release notes, I don't see anything suspicious:
> >> > > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> >> > > Also, QuorumCnxManager was not changed for 2+ years on branch 3.6.
> >> > >
> >> > > Are you using the same Java version and ZooKeeper config for 3.6.3
> >> > > and 3.6.4? Can you share the ZooKeeper config?
> >> > >
> >> > > Also: ZooKeeper 3.6 has been deprecated since December 2022. Can you
> >> > > reproduce the issue on newer ZooKeeper versions?
> >> > >
> >> > > best regards,
> >> > > Máté
> >> > >
> >> > > On Tue, Jun 13, 2023 at 10:16 AM Luke Chen <sh...@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > We're running ZooKeeper under minikube using Strimzi <https://strimzi.io/>.
> >> > > > ZooKeeper works well when running with ZK v3.6.3, but when we
> >> > > > upgraded to v3.6.4 we encountered a hostname-unresolved issue. I'm
> >> > > > wondering if this is a regression caused by some change between
> >> > > > v3.6.3 and v3.6.4?
> >> > > >
> >> > > > Logs:
> >> > > > ====
> >> > > > 2023-06-12 12:25:38,149 INFO binding to port /127.0.0.1:12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,194 INFO bound to port 12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,194 INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,195 INFO bound to port 2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,195 INFO Using 4000ms as the quorum cnxn socket timeout (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
> >> > > > 2023-06-12 12:25:38,199 INFO Election port bind maximum retries is infinite (org.apache.zookeeper.server.quorum.QuorumCnxManager) [main]
> >> > > > 2023-06-12 12:25:38,201 INFO Creating TLS-only quorum server socket (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> >> > > > 2023-06-12 12:25:38,202 INFO ZooKeeper audit is disabled.
> >> > > > (org.apache.zookeeper.audit.ZKAuditProvider) [main]
> >> > > > 2023-06-12 12:25:38,202 ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> >> > > > java.net.SocketException: Unresolved address
> >> > > >     at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
> >> > > >     at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1135)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
> >> > > >     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> >> > > >     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> >> > > >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> >> > > >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> >> > > >     at java.base/java.lang.Thread.run(Thread.java:833)
> >> > > > ====
> >> > > >
> >> > > > Any thoughts or suggestions are welcome.
> >> > > >
> >> > > > Thank you.
> >> > > > Luke
> >> > > >
> >> > >
> >>
> >
> > --
> > Paolo Patierno
> >
> > *Senior Principal Software Engineer @ Red Hat*
> > *Microsoft MVP on Azure*
> >
> > Twitter : @ppatierno <http://twitter.com/ppatierno>
> > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> > GitHub : ppatierno <https://github.com/ppatierno>
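For reference, the `java.net.SocketException: Unresolved address` in Luke's stack trace can be reproduced with plain JDK calls. This is a minimal sketch (the hostname is only illustrative; `createUnresolved` never attempts a DNS lookup) mirroring the `ServerSocket.bind` call that `QuorumCnxManager`'s `createNewServerSocket` runs into when the configured quorum address was never resolved:

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketException;

// Sketch: binding a ServerSocket to an unresolved InetSocketAddress
// fails immediately with SocketException, as in the stack trace above.
public class BindUnresolvedDemo {
    public static void main(String[] args) throws Exception {
        // Unresolved address: only the hostname string is stored, no IP.
        InetSocketAddress addr = InetSocketAddress.createUnresolved(
                "my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc", 3888);
        try (ServerSocket ss = new ServerSocket()) {
            ss.bind(addr); // same JDK call as in the QuorumCnxManager trace
            System.out.println("bound (unexpected)");
        } catch (SocketException e) {
            System.out.println(e.getMessage()); // prints "Unresolved address"
        }
    }
}
```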