[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-07-28 Thread Paolo Patierno (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated ZOOKEEPER-4708:
--
Description: 
We work on the Strimzi project which is about deploying an Apache Kafka cluster 
on Kubernetes together with a ZooKeeper ensamble.

Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
when running on minikube for development purposes.

With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to have 
issues during the quorum formation and leader election.

The first one was about ZooKeeper pods not able to bind the quorum port 3888 to 
the Cluster IP but during the DNS resolution they get the loopback address 
instead.
Following a possible log at ZooKeeper startup where you can see the binding at 
127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a valid 
not loopback IP address).

 
{code:java}
INFO 3 is accepting connections now, my election bind port: 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
(org.apache.zookeeper.server.quorum.QuorumCnxManager) 
[ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]{code}
This specific issue had two solutions: using quorumListenOnAllIPs=true on 
ZooKeeper configuration or binding to 0.0.0.0 address. 

Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
for getting 3.6.4 working. What is changed from this perspective?

Said that, While binding to 0.0.0.0 seems to work fine, using the 
quorumListenOnAllIPs=true doesn't.

Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
ZooKeeper leader (ID=3) we see the following.
(Starting with ** you can see some additional logs added to 
{{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
get more information.)
{code:java}
2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; 
starting up and setting last processed zxid: 0x1 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** 
newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 INFO ** qs.addr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, 
qs.electionAddr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, 
qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,993 WARN Restarting Leader Election 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
ID=1.
As you can see we are in the {{Leader#startZkServer}} method, and because of 
the reconfiguration enabled, the designatedLeader is processed. The problem is 
that the {{Leader#getDesignatedLeader}} is not returning “self” as leader but 
another one (ID=1), because of the difference in the quorum address.
>From the above log, it’s not an actual difference in terms of addresses but 
>the {{self.getQuorumAddress()}} is returning an  (even if it’s 
>still the same hostname related to ZooKeeper-2 instance). This difference 
>causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
>as leader but it’s not able to commit, so prevents any requests and the 
>ZooKeeper ensemble gets stuck.
{code:java}
2023-06-19 12:32:51,996 WARN Suggested leader: 1 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will 
be initialized with allowedToCommit = false 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
The overall issue could be related to DNS problems, with DNS records not 
registered yet during pod initialization (where ZooKeeper is running on 
Kubernetes). But we don’t understand why it’s not able to recover somehow.

What we don't get a reason is why ZooKeeper 3.6.3 didn't need any binding 
specific configuration and was working just 

[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-07-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4708:
--
Labels: pull-request-available  (was: )

> ZooKeeper 3.6.4 quorum failing due to  address
> --
>
> Key: ZOOKEEPER-4708
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4708
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.6.4, 3.8.1
>Reporter: Paolo Patierno
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We work on the Strimzi project which is about deploying an Apache Kafka 
> cluster on Kubernetes together with a ZooKeeper ensamble.
> Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
> when running on minikube for development purposes.
> With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to 
> have issues during the quorum formation and leader election.
> The first one was about ZooKeeper pods not able to bind the quorum port 3888 
> to the Cluster IP but during the DNS resolution they get the loopback address 
> instead.
> Following a possible log at ZooKeeper startup where you can see the binding 
> at 127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a 
> valid not loopback IP address).
>  
> {code:java}
> INFO 3 is accepting connections now, my election bind port: 
> my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
> (org.apache.zookeeper.server.quorum.QuorumCnxManager) 
> [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]
> This specific issue had two solutions: using quorumListenOnAllIPs=true on 
> ZooKeeper configuration or binding to 0.0.0.0 address. {code}
>  
> Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
> for getting 3.6.4 working. What is changed from this perspective?
> Said that, While binding to 0.0.0.0 seems to work fine, using the 
> quorumListenOnAllIPs=true doesn't.
> Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
> ZooKeeper leader (ID=3) we see the following.
> (Starting with ** you can see some additional logs added to 
> {{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
> get more information.)
> {code:java}
> 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 
> 3]]; starting up and setting last processed zxid: 0x1 
> (org.apache.zookeeper.server.quorum.Leader) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> 2023-06-19 12:32:51,990 INFO ** 
> newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr 
> = 
> my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
> (org.apache.zookeeper.server.quorum.Leader) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
> my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888
>  (org.apache.zookeeper.server.quorum.Leader) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> 2023-06-19 12:32:51,992 INFO ** qs.addr 
> my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888,
>  qs.electionAddr 
> my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888,
>  qs.clientAddr/127.0.0.1:12181 
> (org.apache.zookeeper.server.quorum.QuorumPeer) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> 2023-06-19 12:32:51,992 DEBUG zookeeper 
> (org.apache.zookeeper.common.PathTrie) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> 2023-06-19 12:32:51,993 WARN Restarting Leader Election 
> (org.apache.zookeeper.server.quorum.QuorumPeer) 
> [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
> So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
> ID=1.
> As you can see we are in the {{Leader#startZkServer}} method, and because of 
> the reconfiguration enabled, the designatedLeader is processed. The problem 
> is that the {{Leader#getDesignatedLeader}} is not returning “self” as leader 
> but another one (ID=1), because of the difference in the quorum address.
> From the above log, it’s not an actual difference in terms of addresses but 
> the {{self.getQuorumAddress()}} is returning an  (even if it’s 
> still the same hostname related to ZooKeeper-2 instance). This difference 
> causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
> as leader but it’s not able to commit, so prevents any requests and the 
> ZooKeeper ensemble gets stuck.
> {code:java}
> 2023-06-19 12:32:51,996 WARN Suggested leader: 1 
> (org.apache.zookeeper.server.quorum.QuorumPeer) 

[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-06-20 Thread Paolo Patierno (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated ZOOKEEPER-4708:
--
Description: 
We work on the Strimzi project which is about deploying an Apache Kafka cluster 
on Kubernetes together with a ZooKeeper ensamble.

Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
when running on minikube for development purposes.

With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to have 
issues during the quorum formation and leader election.

The first one was about ZooKeeper pods not able to bind the quorum port 3888 to 
the Cluster IP but during the DNS resolution they get the loopback address 
instead.
Following a possible log at ZooKeeper startup where you can see the binding at 
127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a valid 
not loopback IP address).

 
{code:java}
INFO 3 is accepting connections now, my election bind port: 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
(org.apache.zookeeper.server.quorum.QuorumCnxManager) 
[ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]
This specific issue had two solutions: using quorumListenOnAllIPs=true on 
ZooKeeper configuration or binding to 0.0.0.0 address. {code}
 

Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
for getting 3.6.4 working. What is changed from this perspective?

Said that, While binding to 0.0.0.0 seems to work fine, using the 
quorumListenOnAllIPs=true doesn't.

Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
ZooKeeper leader (ID=3) we see the following.
(Starting with ** you can see some additional logs added to 
{{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
get more information.)
{code:java}
2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; 
starting up and setting last processed zxid: 0x1 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** 
newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 INFO ** qs.addr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, 
qs.electionAddr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, 
qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,993 WARN Restarting Leader Election 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
ID=1.
As you can see we are in the {{Leader#startZkServer}} method, and because of 
the reconfiguration enabled, the designatedLeader is processed. The problem is 
that the {{Leader#getDesignatedLeader}} is not returning “self” as leader but 
another one (ID=1), because of the difference in the quorum address.
>From the above log, it’s not an actual difference in terms of addresses but 
>the {{self.getQuorumAddress()}} is returning an  (even if it’s 
>still the same hostname related to ZooKeeper-2 instance). This difference 
>causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
>as leader but it’s not able to commit, so prevents any requests and the 
>ZooKeeper ensemble gets stuck.
{code:java}
2023-06-19 12:32:51,996 WARN Suggested leader: 1 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will 
be initialized with allowedToCommit = false 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
The overall issue could be related to DNS problems, with DNS records not 
registered yet during pod initialization (where ZooKeeper is running on 
Kubernetes). But we don’t understand why it’s not able to recover somehow.

What we don't get a reason is why ZooKeeper 3.6.3 didn't need any binding 
specific configuration and was working just 

[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-06-20 Thread Paolo Patierno (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated ZOOKEEPER-4708:
--
Description: 
We work on the Strimzi project which is about deploying an Apache Kafka cluster 
on Kubernetes together with a ZooKeeper ensamble.

Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
when running on minikube for development purposes.

With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to have 
issues during the quorum formation and leader election.

The first one was about ZooKeeper pods not able to bind the quorum port 3888 to 
the Cluster IP but during the DNS resolution they get the loopback address 
instead.
Following a possible log at ZooKeeper startup where you can see the binding at 
127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a valid 
not loopback IP address).

INFO 3 is accepting connections now, my election bind port: 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
(org.apache.zookeeper.server.quorum.QuorumCnxManager) 
[ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]
This specific issue had two solutions: using quorumListenOnAllIPs=true on 
ZooKeeper configuration or binding to 0.0.0.0 address.

Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
for getting 3.6.4 working. What is changed from this perspective?

Said that, While binding to 0.0.0.0 seems to work fine, using the 
quorumListenOnAllIPs=true doesn't.

Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
ZooKeeper leader (ID=3) we see the following.
(Starting with ** you can see some additional logs added to 
{{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
get more information.)
{code:java}
2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; 
starting up and setting last processed zxid: 0x1 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** 
newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 INFO ** qs.addr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, 
qs.electionAddr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, 
qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,993 WARN Restarting Leader Election 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
ID=1.
As you can see we are in the {{Leader#startZkServer}} method, and because of 
the reconfiguration enabled, the designatedLeader is processed. The problem is 
that the {{Leader#getDesignatedLeader}} is not returning “self” as leader but 
another one (ID=1), because of the difference in the quorum address.
>From the above log, it’s not an actual difference in terms of addresses but 
>the {{self.getQuorumAddress()}} is returning an  (even if it’s 
>still the same hostname related to ZooKeeper-2 instance). This difference 
>causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
>as leader but it’s not able to commit, so prevents any requests and the 
>ZooKeeper ensemble gets stuck.

2023-06-19 12:32:51,996 WARN Suggested leader: 1 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will 
be initialized with allowedToCommit = false 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]

The overall issue could be related to DNS problems, with DNS records not 
registered yet during pod initialization (where ZooKeeper is running on 
Kubernetes). But we don’t understand why it’s not able to recover somehow.

What we don't get a reason is why ZooKeeper 3.6.3 didn't need any binding 
specific configuration and was working just fine, while the new 3.6.4 needs 
that. 

[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-06-20 Thread Paolo Patierno (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated ZOOKEEPER-4708:
--
Description: 
We work on the Strimzi project which is about deploying an Apache Kafka cluster 
on Kubernetes together with a ZooKeeper ensamble.

Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
when running on minikube for development purposes.

With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to have 
issues during the quorum formation and leader election.

The first one was about ZooKeeper pods not able to bind the quorum port 3888 to 
the Cluster IP but during the DNS resolution they get the loopback address 
instead.
Following a possible log at ZooKeeper startup where you can see the binding at 
127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a valid 
not loopback IP address).

 
{code:java}
INFO 3 is accepting connections now, my election bind port: 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
(org.apache.zookeeper.server.quorum.QuorumCnxManager) 
[ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]
This specific issue had two solutions: using quorumListenOnAllIPs=true on 
ZooKeeper configuration or binding to 0.0.0.0 address. {code}
 

Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
for getting 3.6.4 working. What is changed from this perspective?

Said that, While binding to 0.0.0.0 seems to work fine, using the 
quorumListenOnAllIPs=true doesn't.

Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
ZooKeeper leader (ID=3) we see the following.
(Starting with ** you can see some additional logs added to 
{{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
get more information.)
{code:java}
2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; 
starting up and setting last processed zxid: 0x1 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** 
newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 INFO ** qs.addr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, 
qs.electionAddr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, 
qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,993 WARN Restarting Leader Election 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
ID=1.
As you can see we are in the {{Leader#startZkServer}} method, and because of 
the reconfiguration enabled, the designatedLeader is processed. The problem is 
that the {{Leader#getDesignatedLeader}} is not returning “self” as leader but 
another one (ID=1), because of the difference in the quorum address.
>From the above log, it’s not an actual difference in terms of addresses but 
>the {{self.getQuorumAddress()}} is returning an  (even if it’s 
>still the same hostname related to ZooKeeper-2 instance). This difference 
>causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
>as leader but it’s not able to commit, so prevents any requests and the 
>ZooKeeper ensemble gets stuck.
{code:java}
2023-06-19 12:32:51,996 WARN Suggested leader: 1 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will 
be initialized with allowedToCommit = false 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code}
The overall issue could be related to DNS problems, with DNS records not 
registered yet during pod initialization (where ZooKeeper is running on 
Kubernetes). But we don’t understand why it’s not able to recover somehow.

What we don't get a reason is why ZooKeeper 3.6.3 didn't need any binding 
specific configuration and was working just 

[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address

2023-06-20 Thread Paolo Patierno (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paolo Patierno updated ZOOKEEPER-4708:
--
Description: 
We work on the Strimzi project which is about deploying an Apache Kafka cluster 
on Kubernetes together with a ZooKeeper ensamble.

Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues 
when running on minikube for development purposes.

With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to have 
issues during the quorum formation and leader election.

The first one was about ZooKeeper pods not able to bind the quorum port 3888 to 
the Cluster IP but during the DNS resolution they get the loopback address 
instead.
Following a possible log at ZooKeeper startup where you can see the binding at 
127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a valid 
not loopback IP address).


INFO 3 is accepting connections now, my election bind port: 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 
(org.apache.zookeeper.server.quorum.QuorumCnxManager) 
[ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888]
This specific issue had two solutions: using quorumListenOnAllIPs=true on 
ZooKeeper configuration or binding to 0.0.0.0 address.

Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed 
for getting 3.6.4 working. What is changed from this perspective?

Said that, While binding to 0.0.0.0 seems to work fine, using the 
quorumListenOnAllIPs=true doesn't.

Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current 
ZooKeeper leader (ID=3) we see the following.
(Starting with ** you can see some additional logs added to 
{{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to 
get more information.)


2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; 
starting up and setting last processed zxid: 0x1 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** 
newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 INFO ** qs.addr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, 
qs.electionAddr 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, 
qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,993 WARN Restarting Leader Election 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]


So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node 
ID=1.
As you can see we are in the {{Leader#startZkServer}} method, and because of 
the reconfiguration enabled, the designatedLeader is processed. The problem is 
that the {{Leader#getDesignatedLeader}} is not returning “self” as leader but 
another one (ID=1), because of the difference in the quorum address.
>From the above log, it’s not an actual difference in terms of addresses but 
>the {{self.getQuorumAddress()}} is returning an  (even if it’s 
>still the same hostname related to ZooKeeper-2 instance). This difference 
>causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still reported 
>as leader but it’s not able to commit, so prevents any requests and the 
>ZooKeeper ensemble gets stuck.


2023-06-19 12:32:51,996 WARN Suggested leader: 1 
(org.apache.zookeeper.server.quorum.QuorumPeer) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will 
be initialized with allowedToCommit = false 
(org.apache.zookeeper.server.quorum.Leader) 
[QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]

The overall issue could be related to DNS problems, with DNS records not 
registered yet during pod initialization (where ZooKeeper is running on 
Kubernetes). But we don’t understand why it’s not able to recover somehow.

What we don't get a reason is why ZooKeeper 3.6.3 didn't need any binding 
specific configuration and was working just fine, while the new 3.6.4 needs 
that. Other than that,