Hi all,

We've identified some issues related to this problem and opened two PRs:
https://github.com/apache/zookeeper/pull/2040
https://github.com/apache/zookeeper/pull/2041

We'd like to get your feedback on these PRs.
Please take a look when you have a chance.

Thank you.
Luke


On Tue, Jun 27, 2023 at 1:34 PM Paolo Patierno <paolo.patie...@gmail.com>
wrote:

> Hi all,
> we are still facing this issue and I opened
> https://issues.apache.org/jira/browse/ZOOKEEPER-4708
> I used git bisect to try to understand where the "problem" started
> between 3.6.3 and 3.6.4.
> To avoid repeating myself, you can find the details in the JIRA ticket.
> Any help is really appreciated :-)
>
> Thanks,
> Paolo
>
> On Tue, 20 Jun 2023 at 08:46, Paolo Patierno <paolo.patie...@gmail.com>
> wrote:
>
> > Hi Enrico,
> > we are working on the Strimzi project (deploying Apache Kafka, and
> > therefore ZooKeeper, on Kubernetes).
> > Everything worked fine while Apache Kafka was using ZooKeeper 3.6.3 (or
> > any previous version).
> > With 3.6.4 we are facing the issue I described.
> >
> > Thanks,
> > Paolo
> >
> > On Mon, 19 Jun 2023 at 23:29, Enrico Olivelli <eolive...@gmail.com>
> > wrote:
> >
> >> Paolo,
> >>
> >> On Mon, Jun 19, 2023 at 16:43 Paolo Patierno <paolo.patie...@gmail.com>
> >> wrote:
> >>
> >> > Hi all,
> >> > We were able to overcome the binding issue by setting
> >> > quorumListenOnAllIPs=true, but from there we hit a new issue that is
> >> > preventing leader election from completing on first start-up.
> >> >
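> >> > For clarity, that's this single line in zoo.cfg:
> >> >
> >> > # listen for quorum/election connections on all local interfaces
> >> > quorumListenOnAllIPs=true
> >> >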
> >> > In the log of the current ZooKeeper leader (ID=3) we see the
> >> > following. (The lines starting with ** are additional logs we added to
> >> > org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader in order
> >> > to get more information.)
> >> >
> >> > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0x100000000 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,990 INFO ** newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,992 INFO ** qs.addr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, qs.electionAddr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,993 WARN Restarting Leader Election (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> >
> >> > So the leader is the ZooKeeper node with ID=3, and it was ACKed by the
> >> > ZooKeeper node with ID=1.
> >> > As you can see, we are in the Leader#startZkServer method and, because
> >> > reconfiguration is enabled, the designatedLeader is processed. The
> >> > problem is that Leader#getDesignatedLeader does not return "self" as
> >> > leader but another node (ID=1), because of a difference in the quorum
> >> > address. From the above log, it's not an actual difference in the
> >> > addresses: self.getQuorumAddress() returns an <unresolved> one (even if
> >> > it's still the same hostname, related to the ZooKeeper-2 instance).
> >> > This difference causes allowedToCommit=false; ZooKeeper-2 is still
> >> > reported as leader, but it's not able to commit, so it rejects any
> >> > requests and the ZooKeeper ensemble gets stuck.
> >> >
> >> > 2023-06-19 12:32:51,996 WARN Suggested leader: 1 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> > 2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will be initialized with allowedToCommit = false (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> >> >
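> >> > To make the address comparison concrete, here is a minimal sketch of
> >> > the JDK behaviour we believe is at play (plain Java, not ZooKeeper
> >> > code; the IP is the one from the log above):
> >> >
> >> > import java.net.InetSocketAddress;
> >> >
> >> > public class AddressEqualityDemo {
> >> >     public static void main(String[] args) {
> >> >         String host = "my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc";
> >> >         // What self.getQuorumAddress() effectively holds: the hostname
> >> >         // was never resolved, so the address stays "unresolved".
> >> >         InetSocketAddress unresolved = InetSocketAddress.createUnresolved(host, 2888);
> >> >         // What the quorum verifier holds after a successful DNS lookup
> >> >         // (a literal IP here, to keep the demo self-contained).
> >> >         InetSocketAddress resolved = new InetSocketAddress("172.17.0.6", 2888);
> >> >         // InetSocketAddress.equals() never considers a resolved and an
> >> >         // unresolved address equal, even for the same host and port,
> >> >         // which would explain why the designated leader check fails.
> >> >         System.out.println(unresolved.equals(resolved)); // false
> >> >     }
> >> > }
> >> >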
> >> > The overall issue could be related to DNS problems, with DNS records
> >> > not yet registered during pod initialization (ZooKeeper is running on
> >> > Kubernetes). But we don't understand why it's not able to recover on
> >> > its own.
> >> >
> >> > Instead of using quorumListenOnAllIPs=true, we also tried a different
> >> > approach, using the 0.0.0.0 address for the binding, so something like:
> >> >
> >> > # ZooKeeper nodes configuration
> >> > server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> >> > server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> >> > server.3=my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> >> >
> >> > This way, self.getQuorumAddress() does not suffer from the same
> >> > problem: it never returns an <unresolved> address but always an actual
> >> > one. No new leader election is needed and everything works fine.
> >> >
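> >> > Presumably this works because "0.0.0.0" is an IP literal, so no DNS
> >> > lookup is involved and the resulting socket address is always resolved;
> >> > a one-line Java check (our reasoning, not ZooKeeper code):
> >> >
> >> > System.out.println(new java.net.InetSocketAddress("0.0.0.0", 2888).isUnresolved()); // false
> >> >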
> >>
> >> This is the release notes page for 3.6.4.
> >> https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> >>
> >> As you are running on k8s, I guess you are using a StatefulSet, maybe
> >> with a ClusterIP service?
> >>
> >> Is the readiness probe failing? In that case the DNS name should not be
> >> available.
> >>
> >> What are you using to perform the probe? Are you using the four letter
> >> words API (ruok)?
> >>
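> >> For reference, a typical four letter word check from inside the pod
> >> looks something like this (assuming the plaintext client port 2181):
> >>
> >> echo ruok | nc 127.0.0.1 2181
> >>
> >> It prints "imok" when the server is up (on recent versions the command
> >> must be enabled via 4lw.commands.whitelist).
> >>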
> >>
> >> Enrico
> >>
> >>
> >> > On 2023/06/14 06:19:24 Szalay-Bekő Máté wrote:
> >> > > Interesting...
> >> > >
> >> > > I am not familiar with strimzi.io.
> >> > > Quickly checking the release notes, I don't see anything suspicious:
> >> > > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> >> > > Also, QuorumCnxManager has not been changed for 2+ years on branch 3.6.
> >> > >
> >> > > Are you using the same Java version and ZooKeeper config for 3.6.3
> >> > > and 3.6.4?
> >> > > Can you share the ZooKeeper config?
> >> > >
> >> > > Also: ZooKeeper 3.6 has been deprecated since December 2022. Can you
> >> > > reproduce the issue on newer ZooKeeper versions?
> >> > >
> >> > > best regards,
> >> > > Máté
> >> > >
> >> > > On Tue, Jun 13, 2023 at 10:16 AM Luke Chen <sh...@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > We're running ZooKeeper under minikube using Strimzi
> >> > > > <https://strimzi.io/>.
> >> > > > ZooKeeper works well when running ZK v3.6.3, but when we upgraded
> >> > > > to v3.6.4 we encountered a hostname unresolved issue. I'm wondering
> >> > > > if this is a regression, i.e. whether some change between v3.6.3
> >> > > > and v3.6.4 causes this issue?
> >> > > >
> >> > > > Logs:
> >> > > > ====
> >> > > > 2023-06-12 12:25:38,149 INFO binding to port /127.0.0.1:12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,194 INFO bound to port 12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,194 INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,195 INFO bound to port 2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> >> > > > 2023-06-12 12:25:38,195 INFO Using 4000ms as the quorum cnxn socket timeout (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
> >> > > > 2023-06-12 12:25:38,199 INFO Election port bind maximum retries is infinite (org.apache.zookeeper.server.quorum.QuorumCnxManager) [main]
> >> > > > 2023-06-12 12:25:38,201 INFO Creating TLS-only quorum server socket (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> >> > > > 2023-06-12 12:25:38,202 INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider) [main]
> >> > > > 2023-06-12 12:25:38,202 ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> >> > > > java.net.SocketException: Unresolved address
> >> > > >     at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
> >> > > >     at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1135)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
> >> > > >     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
> >> > > >     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> >> > > >     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> >> > > >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> >> > > >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> >> > > >     at java.base/java.lang.Thread.run(Thread.java:833)
> >> > > >
> >> > > > ====
> >> > > >
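> >> > > > For what it's worth, the exception itself is plain JDK behaviour
> >> > > > and easy to reproduce outside ZooKeeper. A minimal sketch (the host
> >> > > > name below is made up):
> >> > > >
> >> > > > import java.net.InetSocketAddress;
> >> > > > import java.net.ServerSocket;
> >> > > >
> >> > > > public class UnresolvedBindDemo {
> >> > > >     public static void main(String[] args) throws Exception {
> >> > > >         // An unresolved address, like the one QuorumCnxManager ends up
> >> > > >         // with when the quorum host has no DNS record yet.
> >> > > >         InetSocketAddress addr =
> >> > > >             InetSocketAddress.createUnresolved("no-such-host.example", 3888);
> >> > > >         try (ServerSocket ss = new ServerSocket()) {
> >> > > >             ss.bind(addr); // java.net.SocketException: Unresolved address
> >> > > >         }
> >> > > >     }
> >> > > > }
> >> > > >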
> >> > > > Any thoughts or suggestions are welcome.
> >> > > >
> >> > > > Thank you.
> >> > > > Luke
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> > --
> > Paolo Patierno
> >
> > Senior Principal Software Engineer @ Red Hat / Microsoft MVP on Azure
> >
> > Twitter : @ppatierno <http://twitter.com/ppatierno>
> > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> > GitHub : ppatierno <https://github.com/ppatierno>
> >
>
>
> --
> Paolo Patierno
>
> Senior Principal Software Engineer @ Red Hat / Microsoft MVP on Azure
>
> Twitter : @ppatierno <http://twitter.com/ppatierno>
> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> GitHub : ppatierno <https://github.com/ppatierno>
>
