[ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375098#comment-14375098 ]
zhihai xu commented on YARN-3385:
---------------------------------

The sequence for the race condition is the following:

1. RM tries to remove the state of application application_1426560404988_0132 from ZKRMStateStore.
{code}
2015-03-17 19:18:48,075 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1426560404988_0132 from state store.
2015-03-17 19:18:48,075 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1426560404988_0132
{code}
2. Unluckily, a ConnectionLoss on the ZK session happened at the same time as the RM removed the application state from ZK. The ZooKeeper server deleted the node successfully, but due to the ConnectionLoss the RM didn't know the operation had succeeded.
{code}
2015-03-17 19:18:51,836 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
{code}
3. RM retried the remove operation on ZK.
{code}
2015-03-17 19:18:51,837 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
{code}
4. During the retry, the ZK session was reconnected.
{code}
2015-03-17 19:18:58,924 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x24be28f536e2006, negotiated timeout = 10000
{code}
5. Because the node had already been deleted successfully on the ZooKeeper server by the previous attempt, the retry failed with a NoNode KeeperException.
{code}
2015-03-17 19:18:58,956 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,956 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
{code}
6. This NoNode KeeperException caused the app removal to fail in RMStateStore.
{code}
2015-03-17 19:18:58,956 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error removing app: application_1426560404988_0132
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
{code}
7. RMStateStore then sent an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager.
{code}
protected void notifyStoreOperationFailed(Exception failureCause) {
  RMFatalEventType type;
  if (failureCause instanceof StoreFencedException) {
    type = RMFatalEventType.STATE_STORE_FENCED;
  } else {
    type = RMFatalEventType.STATE_STORE_OP_FAILED;
  }
  rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
}
{code}
8. The ResourceManager killed itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.
{code}
2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}

> Race condition: KeeperException$NoNodeException will cause RM shutdown during
> ZK node deletion (Op.delete).
> ----------------------------------------------------------------------------------------------------------
>
>          Key: YARN-3385
>          URL: https://issues.apache.org/jira/browse/YARN-3385
>      Project: Hadoop YARN
>   Issue Type: Bug
>   Components: resourcemanager
>     Reporter: zhihai xu
>     Assignee: zhihai xu
>     Priority: Critical
>
> Race condition: KeeperException$NoNodeException will cause RM shutdown during
> ZK node deletion (Op.delete).
> The race condition is similar to YARN-2721 and YARN-3023.
> Since the race condition exists for ZK node creation, it should also exist for
> ZK node deletion.
> We see this issue with the following stack trace:
> {code}
> 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
> 	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
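The root of the sequence above is retrying a non-idempotent delete: the first attempt succeeded server-side, so the retry legitimately sees NoNode. A minimal, self-contained Java sketch of the safe-retry pattern (all names here are hypothetical stand-ins, not the actual ZKRMStateStore code): a retry wrapper that treats NoNodeException as success when it occurs on a retried delete, since the node's absence is exactly the state the delete was meant to produce.

```java
// Hypothetical sketch: a retry wrapper for a ZK-style delete that tolerates
// NoNode after a ConnectionLoss-triggered retry. Stand-in exception classes
// mirror ZooKeeper's KeeperException subclasses.
public class SafeDeleteRetry {

    static class ConnectionLossException extends Exception {}
    static class NoNodeException extends Exception {}

    // A ZK operation that may throw ZK-style exceptions.
    interface ZkOp {
        void run() throws Exception;
    }

    static void deleteWithRetries(ZkOp delete, int maxRetries) throws Exception {
        boolean retried = false;
        for (int attempt = 0; ; attempt++) {
            try {
                delete.run();
                return; // delete applied (or re-applied) cleanly
            } catch (NoNodeException e) {
                if (retried) {
                    // A retried delete hitting NoNode most likely means the
                    // earlier attempt succeeded server-side before the
                    // ConnectionLoss was reported: treat it as success.
                    return;
                }
                // NoNode on the very first attempt is a genuine error.
                throw e;
            } catch (ConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                retried = true; // retry the same (non-idempotent) operation
            }
        }
    }
}
```

With this pattern, the retry in step 3 of the sequence above would swallow the NoNode from step 5 instead of escalating it to a STATE_STORE_OP_FAILED fatal event.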