[ 
https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375185#comment-14375185
 ] 

zhihai xu commented on YARN-3385:
---------------------------------

I uploaded a patch, YARN-3385.000.patch, for review. The patch handles 
NoNodeException for both Op.delete and zkClient.delete, and optimizes 
removeRMDelegationTokenState to skip the ZK delete operation if the node 
doesn't exist.
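
To illustrate the idea, here is a minimal sketch of a delete that tolerates an 
already-missing node. It is not the code from the patch: FakeStore, 
NoNodeException and safeDelete are hypothetical stand-ins (an in-memory set 
instead of a real ZooKeeper client), so the example runs without a ZK server.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentDeleteSketch {
    /** Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a ZK client; not the real API. */
    static class FakeStore {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();

        void create(String path) {
            nodes.add(path);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }

        /** Like a ZK delete, fails if the node is already gone. */
        void delete(String path) throws NoNodeException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }
    }

    /**
     * Deletes the node if present.
     * Returns true if this call removed it, false if it was already gone.
     */
    static boolean safeDelete(FakeStore store, String path) {
        // Skip the delete entirely when the node does not exist.
        if (!store.exists(path)) {
            return false;
        }
        try {
            store.delete(path);
            return true;
        } catch (NoNodeException e) {
            // The node vanished between exists() and delete(): the desired
            // end state (node absent) already holds, so treat it as success.
            return false;
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.create("/rmstore/app_1");
        System.out.println(safeDelete(store, "/rmstore/app_1")); // node removed by this call
        System.out.println(safeDelete(store, "/rmstore/app_1")); // node already gone, no exception
    }
}
```

The exists() check alone cannot close the race, since another deleter may win 
between the check and the delete; that is why the sketch also catches the 
exception instead of letting it propagate as a fatal error.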

Without the patch, the test fails with the following message:
{code}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.853 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
testRMAppDeleteNoNodeException(org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore)  Time elapsed: 1.253 sec  <<< FAILURE!
java.lang.AssertionError: NoNodeException should not happen.
        at org.junit.Assert.fail(Assert.java:88)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDeleteNoNodeException(TestZKRMStateStore.java:405)
Results :
Failed tests: 
  TestZKRMStateStore.testRMAppDeleteNoNodeException:405 NoNodeException should not happen.
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:920)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:916)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1080)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1101)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:916)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:928)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:697)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDelete(TestZKRMStateStore.java:401)
{code}

> Race condition: KeeperException$NoNodeException will cause RM shutdown during 
> ZK node deletion.
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-3385
>                 URL: https://issues.apache.org/jira/browse/YARN-3385
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3385.000.patch
>
>
> Race condition: KeeperException$NoNodeException will cause RM shutdown during 
> ZK node deletion (Op.delete).
> The race condition is similar to YARN-2721 and YARN-3023.
> Since the race condition exists for ZK node creation, it should also exist 
> for ZK node deletion.
> We see this issue with the following stack trace:
> {code}
> 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>       at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
>       at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
>       at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)