[ 
https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3242:
----------------------------
    Description: 
Old ZK client session watcher event messed up new ZK client session due to 
ZooKeeper asynchronously closing client session.
A watcher event from the old ZK client session can still be delivered to 
ZKRMStateStore after the old ZK client session is closed.
This causes a serious problem: ZKRMStateStore gets out of sync with the 
ZooKeeper session.
We have only one ZKRMStateStore, but it can have multiple ZK client sessions.
Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher 
event is from the current session, so a watcher event from an old ZK client 
session that was just closed will still be processed.
For example, if a Disconnected event is received from the old session after 
the new session is connected, zkClient will be set to null:
{code}
        case Disconnected:
          LOG.info("ZKRMStateStore Session disconnected");
          oldZkClient = zkClient;
          zkClient = null;
          break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, 
because the new session is already in the SyncConnected state and won't send 
another SyncConnected event until it is disconnected and connected again.
From then on, all ZKRMStateStore operations fail with IOException "Wait 
for ZKClient creation timed out" until the RM shuts down.

  was:
Old ZK client session watcher event messed up new ZK client session due to 
ZooKeeper asynchronously closing client session.
The watcher event from old ZK client session can still be sent to 
ZKRMStateStore when the old  ZK client session is closed.
This will cause seriously problem:ZKRMStateStore out of sync with ZooKeeper 
session.
We only have one ZKRMStateStore but we can have multiple ZK client sessions.
Currently ZKRMStateStore#processWatchEvent doesn't check whether this watcher 
event is from current session. So the watcher event from old ZK client session 
which just is closed will still be processed.
For example, If a Disconnected event received from old session after new 
session is connected, the zkClient will be set to null
{code}
        case Disconnected:
          LOG.info("ZKRMStateStore Session disconnected");
          oldZkClient = zkClient;
          zkClient = null;
          break;
{code}
Then ZKRMStateStore won't receive SyncConnected event from new session because 
new session is already in SyncConnected state and it won't send SyncConnected 
event until it is disconnected and connected again.
Then we will see all the ZKRMStateStore operations fail with IOException "Wait 
for ZKClient creation timed out" until  RM shutdown.


> Old ZK client session watcher event messed up new ZK client session due to 
> ZooKeeper asynchronously closing client session.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3242.000.patch, YARN-3242.001.patch
>
>
> Old ZK client session watcher event messed up new ZK client session due to 
> ZooKeeper asynchronously closing client session.
> A watcher event from the old ZK client session can still be delivered to 
> ZKRMStateStore after the old ZK client session is closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the 
> ZooKeeper session.
> We have only one ZKRMStateStore, but it can have multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher 
> event is from the current session, so a watcher event from an old ZK client 
> session that was just closed will still be processed.
> For example, if a Disconnected event is received from the old session after 
> the new session is connected, zkClient will be set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> Then ZKRMStateStore won't receive a SyncConnected event from the new 
> session, because the new session is already in the SyncConnected state and 
> won't send another SyncConnected event until it is disconnected and 
> connected again.
> From then on, all ZKRMStateStore operations fail with IOException 
> "Wait for ZKClient creation timed out" until the RM shuts down.
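
The missing check described above could look roughly like the sketch below. It 
is a simplified, self-contained model and not the code in the attached 
patches: Client, Store, activeClient and startSession are hypothetical 
stand-ins (no real ZooKeeper classes are used), and the only point is that 
processWatchEvent compares the event's originating client with the current 
one before touching zkClient.

```java
import java.util.ArrayList;
import java.util.List;

public class SessionGuardSketch {
    /** Simplified stand-in for the two connection states discussed in the issue. */
    enum State { SYNC_CONNECTED, DISCONNECTED }

    /** Hypothetical stand-in for a ZK client handle; only object identity matters. */
    static final class Client {
        final String name;
        Client(String name) { this.name = name; }
    }

    /** Hypothetical model of a state store that owns one client at a time. */
    static final class Store {
        Client activeClient;              // client of the current session
        Client zkClient;                  // non-null only while connected
        final List<String> log = new ArrayList<>();

        synchronized void startSession(Client c) {
            activeClient = c;
        }

        synchronized void processWatchEvent(Client source, State state) {
            // Assumed guard: drop events that come from any client other
            // than the one belonging to the current session.
            if (source != activeClient) {
                log.add("ignored " + state + " from " + source.name);
                return;
            }
            switch (state) {
                case SYNC_CONNECTED:
                    zkClient = source;
                    break;
                case DISCONNECTED:
                    zkClient = null;
                    break;
            }
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Client oldSession = new Client("old-session");
        Client newSession = new Client("new-session");

        store.startSession(oldSession);
        store.processWatchEvent(oldSession, State.SYNC_CONNECTED);

        // The old session is replaced and the new one connects.
        store.startSession(newSession);
        store.processWatchEvent(newSession, State.SYNC_CONNECTED);

        // A late Disconnected event from the closed session arrives.
        store.processWatchEvent(oldSession, State.DISCONNECTED);

        System.out.println("zkClient = " + store.zkClient.name);
        System.out.println(store.log);
    }
}
```

Without the guard, the last call would set zkClient to null even though the 
new session is healthy, which is the stuck state described in this issue.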



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
