[ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375098#comment-14375098
 ] 

zhihai xu commented on YARN-3385:
---------------------------------

The sequence for the race condition is the following:
1. RM tries to remove the state of application application_1426560404988_0132 
from ZKRMStateStore.
{code}
2015-03-17 19:18:48,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of 
completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, 
removing app application_1426560404988_0132 from state store.
2015-03-17 19:18:48,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
info for app: application_1426560404988_0132
{code}

2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as 
the RM removed the application state from ZK.
The ZooKeeper server deleted the node successfully, but due to the 
ConnectionLoss, the RM didn't know the operation had succeeded.
{code}
2015-03-17 19:18:51,836 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss
{code}

3. RM retried removing the application state from ZK.
{code}
2015-03-17 19:18:51,837 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying 
operation on ZK. Retry no. 1
{code}

4. During the retry, the ZK session was reconnected.
{code}
2015-03-17 19:18:58,924 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server, sessionid = 0x24be28f536e2006, negotiated 
timeout = 10000
{code}

5. Because the node had already been deleted successfully at ZooKeeper by the 
previous operation, the retry failed with a NoNode KeeperException.
{code}
2015-03-17 19:18:58,956 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,956 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
out ZK retries. Giving up!
{code}

6. This NoNode KeeperException caused the app removal to fail in RMStateStore.
{code}
2015-03-17 19:18:58,956 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
removing app: application_1426560404988_0132
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
{code}

7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to 
ResourceManager.
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
    RMFatalEventType type;
    if (failureCause instanceof StoreFencedException) {
      type = RMFatalEventType.STATE_STORE_FENCED;
    } else {
      type = RMFatalEventType.STATE_STORE_OP_FAILED;
    }
    rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED 
RMFatalEvent.
{code}
2015-03-17 19:18:58,958 FATAL 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}
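
To make steps 1 to 5 concrete, here is a minimal, self-contained sketch that reproduces the sequence without ZooKeeper. Everything in it is hypothetical and for illustration only: FakeConnectionLossException and FakeNoNodeException stand in for the real KeeperException subclasses, FlakyStore stands in for the ZK server, and deleteIdempotent shows one possible mitigation (treating NoNode as success once an earlier attempt lost its reply). It is not the actual ZKRMStateStore code and not necessarily the fix chosen for this issue.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for KeeperException$ConnectionLossException / $NoNodeException.
class FakeConnectionLossException extends Exception {}
class FakeNoNodeException extends Exception {}

// Fake server: the first successful delete is applied, but its reply is "lost" (step 2).
class FlakyStore {
    private final Set<String> nodes = new HashSet<>();
    private boolean dropNextReply = true;

    FlakyStore(String... paths) {
        for (String p : paths) {
            nodes.add(p);
        }
    }

    void delete(String path) throws FakeConnectionLossException, FakeNoNodeException {
        if (!nodes.remove(path)) {
            throw new FakeNoNodeException();         // node is already gone (step 5)
        }
        if (dropNextReply) {
            dropNextReply = false;
            throw new FakeConnectionLossException(); // deleted, but the client never hears it
        }
    }

    boolean exists(String path) {
        return nodes.contains(path);
    }
}

class DeleteRace {
    // Naive retry: only ConnectionLoss is retried, so NoNode on the retry propagates.
    static void deleteWithRetries(FlakyStore store, String path, int maxRetries)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                store.delete(path);
                return;
            } catch (FakeConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
            }
        }
    }

    // Tolerant retry: NoNode after a lost reply means our earlier delete went through.
    static void deleteIdempotent(FlakyStore store, String path, int maxRetries)
            throws Exception {
        boolean replyLost = false;
        for (int attempt = 0; ; attempt++) {
            try {
                store.delete(path);
                return;
            } catch (FakeConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                replyLost = true;
            } catch (FakeNoNodeException e) {
                if (replyLost) {
                    return; // the node is gone, which is the state we wanted
                }
                throw e;    // NoNode on the very first attempt is still reported
            }
        }
    }
}
```

With a store holding a single node, deleteWithRetries ends in FakeNoNodeException even though the node was removed, which corresponds to the failure in steps 5 and 6. deleteIdempotent returns normally in the same situation.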


> Race condition: KeeperException$NoNodeException will cause RM shutdown during 
> ZK node deletion(Op.delete).
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3385
>                 URL: https://issues.apache.org/jira/browse/YARN-3385
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition: KeeperException$NoNodeException will cause RM shutdown during 
> ZK node deletion(Op.delete).
> The race condition is similar to YARN-2721 and YARN-3023.
> Since the race condition exists for ZK node creation, it should also exist for 
> ZK node deletion.
> We see this issue with the following stack trace:
> {code}
> 2015-03-17 19:18:58,958 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>       at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
>       at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
