Hello Vik,

This issue reminds me of https://issues.apache.org/jira/browse/ZOOKEEPER-3940
Can you double-check whether you see the same issue? I think ZOOKEEPER-3940 is Docker related. Are you using a dockerized ZooKeeper?

If you have a different problem, then I recommend filing a Jira ticket and attaching debug logs from all 3 ZooKeeper server processes.

Kind regards,
Mate

On Sat, Nov 7, 2020 at 9:28 PM vikramark s <[email protected]> wrote:

> Hi,
>
> I am relatively new to ZooKeeper and I am struggling to resolve an issue we
> are experiencing. We recently upgraded our ZooKeeper version from 3.4.x to
> 3.5.8. We are experiencing some issues which we think are related to session
> sharing among nodes.
>
> I was able to recreate the issue with a sample ZooKeeper setup. I am not
> able to set up a new session after taking down the leader in a 3 node
> cluster. The same flow works with ZooKeeper 3.4.14 but not with 3.5.8. I am
> hoping there is some setting I am overlooking here, as I don't find anyone
> complaining about this online.
>
> Below are the details:
>
> 3 node cluster. After starting all the zoo nodes:
>
> Zoo1:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 3
>   Sent: 2
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x0
>   Mode: follower
>   Node count: 5
>
> Zoo2:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 3
>   Sent: 2
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000000
>   Mode: leader
>   Node count: 5
>   Proposal sizes last/min/max: -1/-1/-1
>
> Zoo3:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 2
>   Sent: 1
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000000
>   Mode: follower
>   Node count: 5
>
> After starting one session using zkCli.sh on the Zoo1 node:
>
> Zoo1:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 1/9/23
>   Received: 7
>   Sent: 6
>   Connections: 2
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: follower
>   Node count: 5
>
> Zoo2:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 4
>   Sent: 3
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: leader
>   Node count: 5
>   Proposal sizes last/min/max: 36/36/36
>
> Zoo3:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 3
>   Sent: 2
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: follower
>   Node count: 5
>
> *Note: We can see that Zxid is now consistent across all nodes.*
>
> I then shut down the leader node Zoo2. I can see Zoo3 became the leader, but
> for some reason the zxid is not the same between Zoo1 and Zoo3.
>
> I then closed the existing zkCli and started a new zkCli.sh session on the
> same node (Zoo1). The session was not established; the CLI client just
> keeps retrying and created many outstanding requests on Zoo1. The only way
> to resolve this now is to shut down all nodes and restart them again.
> (Currently, if the leader node goes down, our Kafka cluster stops working.)
>
> Zoo1:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/2
>   Received: 50
>   Sent: 43
>   Connections: 2
>   Outstanding: 6
>   Zxid: 0x100000001
>   Mode: follower
>   Node count: 5
>
> Zoo2:
>   down
>
> Zoo3:
>   Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 1
>   Sent: 0
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x200000000
>   Mode: leader
>   Node count: 5
>   Proposal sizes last/min/max: -1/-1/-1
>
> *Question: Why is the client not able to establish the session on Zoo1?*
>
> A similar flow with ZooKeeper 3.4.14 works fine. Details below.
>
> Initial setup:
>
> Zoo1:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 1
>   Sent: 0
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x0
>   Mode: follower
>   Node count: 4
>
> Zoo2:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 1
>   Sent: 0
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000000
>   Mode: leader
>   Node count: 4
>   Proposal sizes last/min/max: -1/-1/-1
>
> Zoo3:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 1
>   Sent: 0
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000000
>   Mode: follower
>   Node count: 4
>
> After connecting with zkCli on Zoo1:
>
> Zoo1:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/14/33
>   Received: 5
>   Sent: 4
>   Connections: 2
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: follower
>   Node count: 4
>
> Zoo2:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 2
>   Sent: 1
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: leader
>   Node count: 4
>   Proposal sizes last/min/max: 36/36/36
>
> Zoo3:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 2
>   Sent: 1
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x100000001
>   Mode: follower
>   Node count: 4
>
> *Note: The zxid is now the same for all the nodes.*
>
> After shutting down the leader node Zoo2, I can see Zoo3 became the leader.
> For some reason the zxid is not the same between Zoo1 and Zoo3 initially.
> Zoo3 has a new zxid, as a new epoch was created, but Zoo1 still has the old
> zxid.
>
> I closed the existing zkCli and started a new zkCli.sh session on the same
> node (Zoo1). This time the session was established and the zxid was synced
> as well.
>
> Zoo1:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/1/4
>   Received: 8
>   Sent: 7
>   Connections: 2
>   Outstanding: 0
>   Zxid: 0x200000001
>   Mode: follower
>   Node count: 4
>
> Zoo2:
>   down
>
> Zoo3:
>   Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
>   Latency min/avg/max: 0/0/0
>   Received: 3
>   Sent: 2
>   Connections: 1
>   Outstanding: 0
>   Zxid: 0x200000001
>   Mode: leader
>   Node count: 4
>   Proposal sizes last/min/max: 36/36/36
>
> Any help with this issue will be greatly appreciated!
>
> --
> Vik
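
For anyone reproducing the comparison above, the per-node checks could be scripted roughly as follows. This is only a sketch under assumptions: it uses the `srvr` four-letter command (whose output has the same fields as the listings in this thread), the host names `zoo1` and `zoo3` and the client port 2181 are placeholders that do not come from the original report, and `fetch_srvr` / `parse_srvr` are illustrative helper names. It splits each zxid into its epoch (high 32 bits) and counter (low 32 bits), which makes it easier to see a follower that is still on the old epoch after a leader change.

```python
import socket


def fetch_srvr(host, port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Extract mode, zxid (split into epoch/counter) and outstanding count."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    zxid = int(fields["Zxid"], 16)  # accepts the leading "0x"
    return {
        "mode": fields.get("Mode"),
        "zxid": zxid,
        "epoch": zxid >> 32,           # high 32 bits
        "counter": zxid & 0xFFFFFFFF,  # low 32 bits
        "outstanding": int(fields.get("Outstanding", 0)),
    }


if __name__ == "__main__":
    # Hypothetical host names; replace with the real ensemble members.
    for host in ("zoo1", "zoo3"):
        try:
            info = parse_srvr(fetch_srvr(host))
            print(host, info["mode"], "epoch", info["epoch"],
                  "counter", info["counter"],
                  "outstanding", info["outstanding"])
        except OSError as exc:
            print(host, "unreachable:", exc)
```

Applied to the 3.5.8 listings after the leader shutdown, Zoo1 reports epoch 1 with 6 outstanding requests while Zoo3 reports epoch 2, which is the mismatch described in the question.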
