[ 
https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346469#comment-14346469
 ] 

Rohith commented on YARN-3242:
------------------------------

Thanks for the detailed explanation. I was able to reproduce this issue frequently 
on a low-end machine. I deployed the patch in a cluster and verified it: it works 
and the RM continues running without shutting down. The log below shows that 
the Disconnected event comes from the old session and is ignored.
{noformat}
2015-03-04 11:42:06,445 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher 
event type: None with state:SyncConnected for path:null for Service 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-03-04 11:42:06,445 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
ZKRMStateStore Session connected
2015-03-04 11:42:06,445 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Ignore 
watcher event type: None with state:Disconnected for path:null from old session
2015-03-04 11:42:06,460 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-03-04 11:42:06,460 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2015-03-04 11:42:06,987 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 0a0c7961726e2d636c75737465721203726d31
{noformat}
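
The "Ignore watcher event ... from old session" line above suggests the watcher callback now checks which session an event came from. A minimal, self-contained sketch of that kind of guard follows. It does not use the ZooKeeper library and is not the patch itself: `Session`, `StateStore`, `startSession` and `replay` are hypothetical names chosen for illustration, and only the idea of comparing the event's source against the current session is taken from this issue.

```java
import java.util.ArrayList;
import java.util.List;

public class SessionGuardSketch {
    enum KeeperState { SYNC_CONNECTED, DISCONNECTED }

    /** Hypothetical stand-in for a ZK client handle; only object identity matters here. */
    static final class Session {
        final String name;
        Session(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for the single state store that outlives its sessions. */
    static final class StateStore {
        private Session activeSession; // the session the store created most recently
        private Session usableClient;  // null while the store considers itself disconnected
        final List<String> log = new ArrayList<>();

        synchronized void startSession(Session s) {
            activeSession = s;
            usableClient = null; // not usable until its SYNC_CONNECTED event arrives
        }

        synchronized void processWatchEvent(Session source, KeeperState state) {
            // The guard: drop events that did not originate from the current session.
            if (source != activeSession) {
                log.add("ignored " + state + " from old session " + source.name);
                return;
            }
            switch (state) {
                case SYNC_CONNECTED:
                    usableClient = source;
                    log.add("connected " + source.name);
                    break;
                case DISCONNECTED:
                    usableClient = null;
                    log.add("disconnected " + source.name);
                    break;
            }
        }

        synchronized boolean hasUsableClient() { return usableClient != null; }
    }

    /** Replays the ordering seen in the log: the old session's Disconnected arrives last. */
    static StateStore replay() {
        StateStore store = new StateStore();
        Session oldSession = new Session("old");
        Session newSession = new Session("new");

        store.startSession(oldSession);
        store.processWatchEvent(oldSession, KeeperState.SYNC_CONNECTED);

        store.startSession(newSession); // old session closed and replaced
        store.processWatchEvent(newSession, KeeperState.SYNC_CONNECTED);
        store.processWatchEvent(oldSession, KeeperState.DISCONNECTED); // late, stale event
        return store;
    }

    public static void main(String[] args) {
        StateStore store = replay();
        for (String line : store.log) {
            System.out.println(line);
        }
        System.out.println("usable client: " + store.hasUsableClient());
    }
}
```

Without the identity check at the top of `processWatchEvent`, the late DISCONNECTED event would clear `usableClient` even though the new session is healthy, which is the out-of-sync state described in the issue.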


> Old ZK client session watcher event causes ZKRMStateStore out of sync with 
> current ZK client session due to ZooKeeper asynchronously closing client 
> session.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3242.000.patch, YARN-3242.001.patch, 
> YARN-3242.002.patch, YARN-3242.003.patch, YARN-3242.004.patch
>
>
> A watcher event from an old ZK client session can corrupt the state of the new 
> ZK client session, because ZooKeeper closes client sessions asynchronously.
> A watcher event from the old ZK client session can still be delivered to 
> ZKRMStateStore after the old ZK client session is closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the 
> ZooKeeper session.
> There is only one ZKRMStateStore, but there can be multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher 
> event is from the current session, so a watcher event from an old ZK client 
> session that was just closed will still be processed.
> For example, if a Disconnected event is received from the old session after 
> the new session is connected, zkClient will be set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> ZKRMStateStore then won't receive a SyncConnected event from the new session, 
> because the new session is already in the SyncConnected state and won't send 
> another SyncConnected event until it is disconnected and connected again.
> As a result, all ZKRMStateStore operations fail with the IOException 
> "Wait for ZKClient creation timed out" until the RM shuts down.
> The following code from ZooKeeper (ClientCnxn#EventThread) shows that even 
> after receiving eventOfDeath, EventThread still processes events until the 
> waitingEvents queue is empty.
> {code}
>               while (true) {
>                  Object event = waitingEvents.take();
>                  if (event == eventOfDeath) {
>                     wasKilled = true;
>                  } else {
>                     processEvent(event);
>                  }
>                  if (wasKilled)
>                     synchronized (waitingEvents) {
>                        if (waitingEvents.isEmpty()) {
>                           isRunning = false;
>                           break;
>                        }
>                     }
>               }
>       private void processEvent(Object event) {
>           try {
>               if (event instanceof WatcherSetEventPair) {
>                   // each watcher will process the event
>                   WatcherSetEventPair pair = (WatcherSetEventPair) event;
>                   for (Watcher watcher : pair.watchers) {
>                       try {
>                           watcher.process(pair.event);
>                       } catch (Throwable t) {
>                           LOG.error("Error while calling watcher ", t);
>                       }
>                   }
>               } else {
>                   // ... (rest of processEvent omitted; the ClientCnxn methods below are separate excerpts)
>     public void disconnect() {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Disconnecting client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         sendThread.close();
>         eventThread.queueEventOfDeath();
>     }
>     public void close() throws IOException {
>         if (LOG.isDebugEnabled()) {
>             LOG.debug("Closing client for session: 0x"
>                       + Long.toHexString(getSessionId()));
>         }
>         try {
>             RequestHeader h = new RequestHeader();
>             h.setType(ZooDefs.OpCode.closeSession);
>             submitRequest(h, null, null, null);
>         } catch (InterruptedException e) {
>             // ignore, close the send/event threads
>         } finally {
>             disconnect();
>         }
>     }
> {code}
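
To make the drain behaviour in the quoted EventThread loop concrete, here is a small standalone sketch that mirrors that loop with a plain `LinkedBlockingQueue`. It is a simplified model, not ZooKeeper code: the class name, `drain`, `demo`, and the use of strings as events are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventDrainSketch {
    // Sentinel playing the role of eventOfDeath in the quoted loop.
    private static final Object EVENT_OF_DEATH = new Object();

    /** Mirrors the quoted loop: after the sentinel is seen, keep delivering until the queue is empty. */
    static List<Object> drain(BlockingQueue<Object> waitingEvents) throws InterruptedException {
        List<Object> delivered = new ArrayList<>();
        boolean wasKilled = false;
        while (true) {
            Object event = waitingEvents.take();
            if (event == EVENT_OF_DEATH) {
                wasKilled = true;
            } else {
                delivered.add(event); // stands in for watcher.process(event)
            }
            if (wasKilled && waitingEvents.isEmpty()) {
                break;
            }
        }
        return delivered;
    }

    /** Queues one event before and one after the sentinel, then drains the queue. */
    static List<Object> demo() {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        queue.add("SyncConnected");
        queue.add(EVENT_OF_DEATH);
        queue.add("Disconnected"); // still queued when the sentinel is consumed
        try {
            return drain(queue);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [SyncConnected, Disconnected]
    }
}
```

In this model the event queued behind the sentinel is still delivered. That is why a watcher can see a Disconnected event from a session whose close has already been requested.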


