zhihai xu created YARN-3242:
Summary: Old ZK client session watcher event messed up new ZK
client session due to ZooKeeper asynchronously closing client session.
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu
Old ZK client session watcher event messed up new ZK client session due to
ZooKeeper asynchronously closing client session.
A watcher event from the old ZK client session can still be delivered to
ZKRMStateStore after the old ZK client session has been closed.
This causes a serious problem: ZKRMStateStore gets out of sync with ZooKeeper.
We only have one ZKRMStateStore but we can have multiple ZK client sessions.
Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher
event comes from the current session, so a watcher event from an old ZK client
session that has just been closed will still be processed.
For example, if a Disconnected event from the old session is received after
the new session has connected, zkClient will be set to null:
LOG.info("ZKRMStateStore Session disconnected");
oldZkClient = zkClient;
zkClient = null;
ZKRMStateStore then won't receive a SyncConnected event from the new session,
because the new session is already in the SyncConnected state and won't send
another SyncConnected event until it is disconnected and connected again.
As a result, all ZKRMStateStore operations fail with IOException "Wait for
ZKClient creation timed out" until the RM shuts down.
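One way to picture the missing check is a guard that drops any event whose
source is not the currently active client. The sketch below is only an
illustration under stated assumptions and not the actual ZKRMStateStore code
or the committed fix: FakeZkClient, Store, setActiveClient and the
two-argument processWatchEvent are hypothetical names, and the session state
is modelled as a plain string instead of the real ZooKeeper event types.

```java
public class SessionGuardSketch {

    /** Hypothetical stand-in for a ZK client handle; only object identity matters. */
    static final class FakeZkClient {
        final String name;

        FakeZkClient(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for a state store that has one active client at a time. */
    static final class Store {
        private FakeZkClient zkClient;

        /** Called when a new session replaces the previous one. */
        void setActiveClient(FakeZkClient client) {
            zkClient = client;
        }

        FakeZkClient getActiveClient() {
            return zkClient;
        }

        /**
         * Handles a watcher event. Returns false when the event is dropped
         * because it did not come from the currently active client.
         */
        boolean processWatchEvent(FakeZkClient source, String state) {
            if (source != zkClient) {
                // Late event from a session that has already been replaced.
                return false;
            }
            if ("Disconnected".equals(state)) {
                // Only the active session is allowed to clear the client.
                zkClient = null;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        FakeZkClient oldSession = new FakeZkClient("old");
        FakeZkClient newSession = new FakeZkClient("new");

        Store store = new Store();
        store.setActiveClient(oldSession);
        store.setActiveClient(newSession); // the old session is being closed

        // A Disconnected event from the old session arrives late.
        boolean processed = store.processWatchEvent(oldSession, "Disconnected");
        System.out.println("late event processed: " + processed);
        System.out.println("active client kept: "
            + (store.getActiveClient() == newSession));
    }
}
```

With this guard, the late Disconnected event from the old session no longer
sets the client to null, so the store keeps using the new session.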