[ 
https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332037#comment-14332037
 ] 

zhihai xu commented on YARN-3242:
---------------------------------

I uploaded a draft patch that processes watcher events only from the current 
ZooKeeper client session.
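
To illustrate the idea (this is a rough sketch, not the attached patch), the
check could look something like the code below. It deliberately does not use
the ZooKeeper API: FakeZkClient, State, Store, startNewSession and
usableClient are hypothetical stand-ins invented for this example, and the
assumption is that each watcher knows which client instance it was registered
with, so the store can compare the event's source against the client of the
current session.

```java
public class SessionScopedWatcherSketch {

    /** Hypothetical stand-in for a ZooKeeper client handle (not the real API). */
    public static final class FakeZkClient {
        public final String label;
        public FakeZkClient(String label) { this.label = label; }
    }

    /** Hypothetical subset of the connection states mentioned in the issue. */
    public enum State { SYNC_CONNECTED, DISCONNECTED }

    /** Hypothetical stand-in for the state store, reduced to client bookkeeping. */
    public static final class Store {
        private FakeZkClient activeClient; // client of the session we currently own
        private FakeZkClient zkClient;     // null while that session is disconnected

        /** Records the client of a newly created session as the current one. */
        public synchronized void startNewSession(FakeZkClient client) {
            activeClient = client;
        }

        /**
         * Handles a watcher event. Returns false if the event was ignored
         * because it did not come from the current session's client.
         */
        public synchronized boolean processWatchEvent(FakeZkClient source, State state) {
            if (source != activeClient) {
                return false; // stale event from an old session: do not touch zkClient
            }
            switch (state) {
                case SYNC_CONNECTED:
                    zkClient = activeClient;
                    break;
                case DISCONNECTED:
                    zkClient = null;
                    break;
            }
            return true;
        }

        /** Client that operations may use, or null if none is connected. */
        public synchronized FakeZkClient usableClient() {
            return zkClient;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        FakeZkClient oldClient = new FakeZkClient("old");
        FakeZkClient newClient = new FakeZkClient("new");

        store.startNewSession(oldClient);
        store.processWatchEvent(oldClient, State.SYNC_CONNECTED);

        // The old session is replaced and the new one connects.
        store.startNewSession(newClient);
        store.processWatchEvent(newClient, State.SYNC_CONNECTED);

        // A late Disconnected event from the old session arrives afterwards.
        boolean handled = store.processWatchEvent(oldClient, State.DISCONNECTED);
        System.out.println("stale event handled: " + handled);
        System.out.println("client still usable: "
            + (store.usableClient() == newClient));
    }
}
```

With the source check in place, the late Disconnected event from the old
session is dropped and the new session's client stays usable. Without the
check, the same event would set zkClient to null, which is the failure
described in the issue.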

> Old ZK client session watcher event messed up new ZK client session due to 
> ZooKeeper asynchronously closing client session.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3242.000.patch
>
>
> Old ZK client session watcher event messed up new ZK client session due to 
> ZooKeeper asynchronously closing client session.
> A watcher event from the old ZK client session can still be delivered to 
> ZKRMStateStore after the old ZK client session has been closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the 
> ZooKeeper session.
> We have only one ZKRMStateStore, but we can have multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher 
> event comes from the current session, so a watcher event from an old ZK 
> client session that has just been closed will still be processed.
> For example, if a Disconnected event is received from the old session after 
> the new session is connected, zkClient will be set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> ZKRMStateStore then won't receive a SyncConnected event from the new session, 
> because the new session is already in the SyncConnected state and won't send 
> another SyncConnected event until it is disconnected and connected again.
> As a result, all ZKRMStateStore operations fail with the IOException 
> "Wait for ZKClient creation timed out" until the RM shuts down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
