[ 
https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270526#comment-14270526
 ] 

Rohith commented on YARN-3023:
------------------------------

Which version of Hadoop are you using? In trunk this is already handled: if
the node already exists, ZKRMStateStore won't throw NodeExists.
{code}
catch (KeeperException ke) {
  if (ke.code() == Code.NODEEXISTS) {
    LOG.info("znode already exists!");
    return null;
  }
  // other code
}
{code}
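
To make the effect of that check concrete, here is a minimal, self-contained
sketch of the race and of a retry loop that tolerates it. It does not use the
real ZooKeeper client: FakeZk, the two exception classes, and this
createWithRetries signature are hypothetical stand-ins invented for the
example, and the loop is only an approximation of what trunk does.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentCreateDemo {
  // Retries create() on connection loss and treats NodeExists as success,
  // mirroring the check quoted above. Returns the number of attempts used.
  static int createWithRetries(FakeZk zk, String path, byte[] data,
      int maxRetries) {
    int attempt = 0;
    while (true) {
      attempt++;
      try {
        zk.create(path, data);
        return attempt;
      } catch (NodeExistsException e) {
        // An earlier attempt may have been applied before its reply was lost.
        return attempt;
      } catch (ConnectionLossException e) {
        if (attempt > maxRetries) {
          throw e; // retries exhausted, give up
        }
      }
    }
  }

  public static void main(String[] args) {
    FakeZk zk = new FakeZk(true); // lose the reply to the first create
    int attempts = createWithRetries(zk, "/appattempt_000001", new byte[] {1}, 3);
    System.out.println("attempts=" + attempts
        + " exists=" + zk.exists("/appattempt_000001"));
  }
}

// Hypothetical stand-ins for the KeeperException subclasses. They are
// unchecked here only to keep the sketch short.
class ConnectionLossException extends RuntimeException {}
class NodeExistsException extends RuntimeException {}

// In-memory stand-in for a ZooKeeper server. It can apply one create
// successfully and then drop the reply, which is the race in this issue.
class FakeZk {
  private final Map<String, byte[]> nodes = new HashMap<>();
  private boolean dropNextReply;

  FakeZk(boolean dropFirstReply) {
    this.dropNextReply = dropFirstReply;
  }

  void create(String path, byte[] data) {
    if (nodes.containsKey(path)) {
      throw new NodeExistsException();
    }
    nodes.put(path, data);
    if (dropNextReply) {
      dropNextReply = false;
      throw new ConnectionLossException(); // node created, client never told
    }
  }

  boolean exists(String path) {
    return nodes.containsKey(path);
  }
}
```

Treating NodeExists as success is only safe under the assumption that no
other writer creates the same path with different data.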

> Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM 
> crash 
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3023
>                 URL: https://issues.apache.org/jira/browse/YARN-3023
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>
> A race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes 
> the RM to crash.
> The sequence for the race condition is as follows:
> 1. The RM stores the attempt state to ZK by calling createWithRetries.
> {code}
> 2015-01-06 12:37:35,343 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Storing attempt: AppId: application_1418914202950_42363 AttemptId: 
> appattempt_1418914202950_42363_000001 MasterContainer: Container: 
> [ContainerId: container_1418914202950_42363_01_000001,
> {code}
> 2. Unluckily, a ConnectionLoss for the ZK session happened at the same time 
> as the RM stored the attempt state to ZK.
> The ZooKeeper server created the node and stored the data successfully, but 
> due to the ConnectionLoss, the RM didn't know that the operation 
> (createWithRetries) had succeeded.
> {code}
> 2015-01-06 12:37:36,102 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
> {code}
> 3. The RM retried storing the attempt state to ZK after one second.
> {code}
> 2015-01-06 12:37:36,104 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Retrying operation on ZK. Retry no. 1
> {code}
> 4. During the one-second interval, the ZK session was reconnected.
> {code}
> 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established initiating session
> 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated 
> timeout = 10000
> {code}
> 5. Because the node was created successfully at ZooKeeper in the first 
> try (runWithCheck), the second try fails with a NodeExists KeeperException.
> {code}
> 2015-01-06 12:37:37,116 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Exception while executing a ZK operation.
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> 2015-01-06 12:37:37,118 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
> out ZK retries. Giving up!
> {code}
> 6. This NodeExists KeeperException causes storing the AppAttempt to fail in 
> RMStateStore.
> {code}
> 2015-01-06 12:37:37,118 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
> storing appAttempt: appattempt_1418914202950_42363_000001
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> {code}
> 7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to 
> the ResourceManager.
> {code}
>   protected void notifyStoreOperationFailed(Exception failureCause) {
>     RMFatalEventType type;
>     if (failureCause instanceof StoreFencedException) {
>       type = RMFatalEventType.STATE_STORE_FENCED;
>     } else {
>       type = RMFatalEventType.STATE_STORE_OP_FAILED;
>     }
>     rmDispatcher.getEventHandler().handle(new RMFatalEvent(type,
>         failureCause));
>   }
> {code}
> 8. The ResourceManager kills itself after receiving the 
> STATE_STORE_OP_FAILED RMFatalEvent.
> {code}
> 2015-01-06 12:37:37,128 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
> 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
