Hi,

I am relatively new to ZooKeeper and I am struggling to resolve an issue we
are experiencing. We recently upgraded ZooKeeper from 3.4.x to 3.5.8, and
since then we have seen problems that we think are related to session
sharing among nodes.

I was able to recreate the issue with a sample ZooKeeper setup: in a 3 node
cluster, I am not able to set up a new session after taking down the
leader. The same flow works with ZooKeeper 3.4.14 but not with 3.5.8. I am
hoping there is some setting I am overlooking, as I can't find anyone
reporting this online.

Below are the details:

3 node cluster. After starting all the zoo nodes:

Zoo1:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x0
Mode: follower
Node count: 5

Zoo2:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: leader
Node count: 5
Proposal sizes last/min/max: -1/-1/-1

Zoo3:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: follower
Node count: 5
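
For anyone who wants to reproduce these numbers: the per-node output above has the shape of ZooKeeper's `srvr` four-letter-word command. Below is a minimal Python sketch of one way to collect and parse it. The host names and port in the usage comment are assumptions (they are not stated in this mail), and on 3.5.x other four-letter words such as `stat` may first need to be enabled through `4lw.commands.whitelist`.

```python
import socket


def four_letter_word(host, port, command="srvr", timeout=5.0):
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Turn 'Key: value' lines into a dict; lines without ': ' are skipped."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


# Hypothetical usage (host names and port are assumptions):
# for host in ("zoo1", "zoo2", "zoo3"):
#     print(host, parse_srvr(four_letter_word(host, 2181)))
```

Only `parse_srvr` is needed if the text has already been captured some other way, for example by piping `srvr` through `nc`.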





After starting one session using zkCli.sh on the Zoo1 node:



Zoo1:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 1/9/23
Received: 7
Sent: 6
Connections: 2
Outstanding: 0
Zxid: 0x100000001
Mode: follower
Node count: 5

Zoo2:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 4
Sent: 3
Connections: 1
Outstanding: 0
Zxid: 0x100000001
Mode: leader
Node count: 5
Proposal sizes last/min/max: 36/36/36

Zoo3:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x100000001
Mode: follower
Node count: 5





*Note: We can see that the zxid is now consistent across all nodes.*



I then shut down the leader node, Zoo2. I can see that Zoo3 became the
leader, but for some reason the zxid is not the same between Zoo1 and Zoo3.
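
For reference when reading the zxid values: a zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. A small Python sketch that splits the values reported in this mail (the function name `split_zxid` is just for illustration):

```python
def split_zxid(zxid):
    """Split a zxid (hex string such as '0x100000001', or an int)
    into (epoch, counter)."""
    value = int(zxid, 16) if isinstance(zxid, str) else zxid
    epoch = value >> 32            # high 32 bits: leader epoch
    counter = value & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


print(split_zxid("0x100000001"))  # (1, 1)
print(split_zxid("0x200000000"))  # (2, 0)
```

Read this way, a follower reporting 0x100000001 is still at epoch 1, while a leader reporting 0x200000000 has started epoch 2 and has not yet committed anything in it.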



I then closed the existing zkCli and started a new zkCli.sh session on the
same node (Zoo1). The session was not established; the CLI client just
keeps retrying, which created many outstanding requests on Zoo1. The only
way to recover now is to shut down all nodes and restart them.
(Currently, if the leader node goes down, our Kafka cluster stops working.)



Zoo1:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/2
Received: 50
Sent: 43
Connections: 2
Outstanding: 6
Zxid: 0x100000001
Mode: follower
Node count: 5

Zoo2:
down

Zoo3:
Zookeeper version: 3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:07 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x200000000
Mode: leader
Node count: 5
Proposal sizes last/min/max: -1/-1/-1
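
The comparison of the post-failover numbers above can also be done mechanically. The following Python sketch lists followers whose zxid epoch is lower than the leader's; the dictionary layout and function names are assumptions made for illustration, with the values copied from this mail. A lagging epoch on its own does not prove a fault, since the 3.4.14 run described further down showed the same lag until a new session was opened.

```python
def epoch_of(zxid_hex):
    """Return the epoch (high 32 bits) of a zxid given as a hex string."""
    return int(zxid_hex, 16) >> 32


def followers_behind_leader(stats_by_node):
    """Return the sorted names of followers whose epoch is lower than
    the leader's. Expects exactly one node with Mode == 'leader'."""
    leaders = [s for s in stats_by_node.values() if s.get("Mode") == "leader"]
    if len(leaders) != 1:
        raise ValueError("expected exactly one leader, found %d" % len(leaders))
    leader_epoch = epoch_of(leaders[0]["Zxid"])
    return sorted(
        name
        for name, s in stats_by_node.items()
        if s.get("Mode") == "follower" and epoch_of(s["Zxid"]) < leader_epoch
    )


# Values reported for 3.5.8 after Zoo2 was shut down (Zoo2 omitted: down).
after_failover = {
    "Zoo1": {"Mode": "follower", "Zxid": "0x100000001"},
    "Zoo3": {"Mode": "leader", "Zxid": "0x200000000"},
}
print(followers_behind_leader(after_failover))  # ['Zoo1']
```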



*Question: Why is the client not able to establish a session on Zoo1?*





But a similar flow with zookeeper 3.4.14 works fine. Below is the detail:



First initial setup:



Zoo1:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x0
Mode: follower
Node count: 4

Zoo2:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: leader
Node count: 4
Proposal sizes last/min/max: -1/-1/-1

Zoo3:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x100000000
Mode: follower
Node count: 4



After connecting with zkCli on Zoo1:



Zoo1:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/14/33
Received: 5
Sent: 4
Connections: 2
Outstanding: 0
Zxid: 0x100000001
Mode: follower
Node count: 4

Zoo2:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000001
Mode: leader
Node count: 4
Proposal sizes last/min/max: 36/36/36

Zoo3:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000001
Mode: follower
Node count: 4



*Note: The zxid is now the same on all nodes.*





After shutting down the leader node, Zoo2, I can see that Zoo3 became the
leader. Initially the zxid is not the same between Zoo1 and Zoo3: Zoo3 has
a new zxid because a new epoch was created, but Zoo1 still has the old zxid.



I closed the existing zkCli and started a new zkCli.sh session on the same
node (Zoo1). This time the session was established and the zxid was synced
as well.





Zoo1:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/1/4
Received: 8
Sent: 7
Connections: 2
Outstanding: 0
Zxid: 0x200000001
Mode: follower
Node count: 4

Zoo2:
down

Zoo3:
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/0/0
Received: 3
Sent: 2
Connections: 1
Outstanding: 0
Zxid: 0x200000001
Mode: leader
Node count: 4
Proposal sizes last/min/max: 36/36/36



Any help with this issue would be greatly appreciated!

-- 
Vik
