I have a 3-node ZK cluster (A, B, C). On one of the nodes (node A), I
have a ZK client running that connects to the local server and creates an
ephemeral znode to indicate to clients on other nodes that it is online.
I have a test script that reboots the ZooKeeper server as well as the client on A.
The test does a getstat on the ephemeral znode created by the client on A. I
am seeing that the view of znodes on A is different from the other 2 nodes.
I can tell this from the session ID that the client gets after reconnecting
to the local ZK server.
So the test is simple:
- kill the ZooKeeper server and client processes
- wait for a few seconds
- do zkCli.sh stat ... > test.out
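
To compare what each server reports for the znode, the last step could be run against all three servers. Below is a minimal sketch, assuming zkCli.sh is on the PATH and prints the usual "key = value" stat fields. The function names, host names, and znode path are placeholders for illustration, not part of my actual test script.

```python
import subprocess


def parse_znode_stat(text):
    """Collect the 'key = value' lines of a zkCli.sh stat dump into a dict.

    Lines that do not look like 'name = value' (log output, prompts) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep and key.strip().isidentifier():
            fields[key.strip()] = value.strip()
    return fields


def fetch_znode_stat(server, path, zkcli="zkCli.sh"):
    """Run 'zkCli.sh -server <host:port> stat <path>' and parse its output."""
    result = subprocess.run(
        [zkcli, "-server", server, "stat", path],
        capture_output=True, text=True, timeout=30,
    )
    return parse_znode_stat(result.stdout)


def diverging_fields(stats_by_server, keys=("cZxid", "ctime", "ephemeralOwner")):
    """Return {field: {server: value}} for fields that differ between servers."""
    diffs = {}
    for key in keys:
        values = {srv: st.get(key) for srv, st in stats_by_server.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


# Hypothetical usage (host names and path are placeholders):
#   servers = ["nodeA:2181", "nodeB:2181", "nodeC:2181"]
#   stats = {s: fetch_znode_stat(s, ZNODE_PATH) for s in servers}
#   print(diverging_fields(stats))
```

An empty result means all servers agree on the creation zxid, creation time, and owning session of the znode. Any entry names the field on which a server disagrees.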
What I am seeing is that the ephemeral znode with the old zxid, time, and
session ID is reappearing on node A. I have attached the output of 3
consecutive getstat requests of the test (see client_getstat.out). Notice
that the third output is the same as the first one. That is, the old
ephemeral znode reappeared at A. However, both B and C are showing the
latest znode with the correct time, zxid, and session ID (output not attached).
After this point, all subsequent getstat requests on A show the old
znode, whereas B and C show the correct znode every time the client on A
comes online. This is very perplexing. Earlier I thought this was
a bug in my client implementation, but the test shows that the ZK server on
A is out of sync with the rest of the servers after the reboot.
Running the stat command against each server shows that the servers are in
sync as far as zxids are concerned (see stat.out). So there is something
wrong with A's local database that is causing this problem.
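
For anyone who wants to repeat the zxid check, here is a rough sketch. It assumes the stat in question is the four-letter-word command sent to the client port (2181 is assumed here) and that the reply contains "Zxid:" and "Mode:" lines. The function names are placeholders. On newer ZooKeeper releases, four-letter commands may need to be whitelisted in the server configuration first.

```python
import re
import socket


def four_letter_word(host, port=2181, cmd=b"stat", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_server_stat(text):
    """Extract the last processed zxid (as an int) and the server mode."""
    zxid = re.search(r"^Zxid:\s*(0x[0-9a-fA-F]+)", text, re.MULTILINE)
    mode = re.search(r"^Mode:\s*(\w+)", text, re.MULTILINE)
    return {
        "zxid": int(zxid.group(1), 16) if zxid else None,
        "mode": mode.group(1) if mode else None,
    }


# Hypothetical usage (host names are placeholders):
#   for host in ("nodeA", "nodeB", "nodeC"):
#       print(host, parse_server_stat(four_letter_word(host)))
```

Note that the zxid reported here is only the last transaction each server has processed. Matching values would not by themselves show that the in-memory data trees are identical, which is consistent with what I am observing on A.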
Has anyone seen this before? I will be doing more debugging in the next few
days. Comments/suggestions for further debugging are welcome.