[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes RM crash

2015-01-08 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270526#comment-14270526
 ] 

Rohith commented on YARN-3023:
--

Which version of Hadoop are you using? In trunk this is handled: if the node 
already exists, ZKRMStateStore won't throw NodeExists.
{code}
catch (KeeperException ke) {
  if (ke.code() == Code.NODEEXISTS) {
    LOG.info("znode already exists!");
    return null;
  }
  // other code
}
{code}
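
For readers hitting this on 2.6, a minimal sketch of the idea (a hypothetical helper, not the actual ZKRMStateStore code) is to treat NODEEXISTS on a retried create as success, since the create that "failed" with ConnectionLoss may already have been applied on the server:
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch, not the actual ZKRMStateStore code: retry a create
// on ConnectionLoss, and treat NODEEXISTS as success because the "lost"
// first attempt may already have been applied on the server.
public class IdempotentCreate {
  private static final int MAX_RETRIES = 3;
  private static final long RETRY_INTERVAL_MS = 1000;

  public static String create(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    KeeperException lost = null;
    for (int retry = 0; retry <= MAX_RETRIES; retry++) {
      try {
        return zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
      } catch (KeeperException.NodeExistsException e) {
        return path; // the earlier, "lost" attempt actually succeeded
      } catch (KeeperException.ConnectionLossException e) {
        lost = e;
        Thread.sleep(RETRY_INTERVAL_MS); // give the session time to reconnect
      }
    }
    throw lost;
  }
}
{code}
One caveat: swallowing NODEEXISTS is only safe when the znode name is unique per logical operation (as it is for per-attempt paths); otherwise a concurrent writer could be silently masked.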

 Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes RM crash
 -

 Key: YARN-3023
 URL: https://issues.apache.org/jira/browse/YARN-3023
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: zhihai xu
Assignee: zhihai xu

 Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes an RM 
 crash.
 The sequence of events in the race condition is as follows:
 1. The RM stored the attempt state to ZK by calling createWithRetries:
 {code}
 2015-01-06 12:37:35,343 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 Storing attempt: AppId: application_1418914202950_42363 AttemptId: 
 appattempt_1418914202950_42363_01 MasterContainer: Container: 
 [ContainerId: container_1418914202950_42363_01_01,
 {code}
 2. Unluckily, a ConnectionLoss on the ZK session happened at the same time the 
 RM stored the attempt state to ZK.
 The ZooKeeper server created the node and stored the data successfully, but 
 due to the ConnectionLoss, the RM didn't know that the operation 
 (createWithRetries) had succeeded:
 {code}
 2015-01-06 12:37:36,102 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss
 {code}
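 The ambiguity here is inherent to ConnectionLoss: the request may or may not have been applied before the session dropped. One way a client can resolve it, sketched below with a hypothetical helper (not ZKRMStateStore code), is to probe for the node after reconnecting instead of blindly reissuing the create:
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: after a create fails with ConnectionLoss, check
// whether the znode was actually written before retrying the create.
public class CreateProbe {
  public static boolean createWasApplied(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    // exists() returns a non-null Stat iff the node reached the server.
    return zk.exists(path, false) != null;
  }
}
{code}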
 3. The RM retried storing the attempt state to ZK after one second:
 {code}
 2015-01-06 12:37:36,104 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Retrying operation on ZK. Retry no. 1
 {code}
 4. During the one-second interval, the ZK session was reconnected:
 {code}
 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket 
 connection established initiating session
 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session 
 establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated 
 timeout = 1
 {code}
 5. Because the node had already been created successfully at ZooKeeper on the 
 first try (runWithCheck),
 the second try failed with a NodeExists KeeperException:
 {code}
 2015-01-06 12:37:37,116 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
 NodeExists
 2015-01-06 12:37:37,118 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
 out ZK retries. Giving up!
 {code}
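 Steps 2 through 5 are exactly what a blind retry wrapper does: it reissues the create verbatim, so when the first attempt was actually applied, the retry surfaces NODEEXISTS and the wrapper eventually gives up. A simplified sketch of that problematic pattern (hypothetical, not the literal ZKRMStateStore code):
{code}
import java.util.concurrent.Callable;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.KeeperException.Code;

// Hypothetical sketch of the problematic pattern: retry only on
// ConnectionLoss and rethrow everything else. If the first create was
// applied on the server, the retried create throws NODEEXISTS, which
// falls into the rethrow branch -- the failure seen in this log.
public class BlindRetry {
  public static <T> T runWithRetries(Callable<T> op, int maxRetries)
      throws Exception {
    for (int retry = 1; ; retry++) {
      try {
        return op.call();
      } catch (KeeperException ke) {
        if (ke.code() != Code.CONNECTIONLOSS || retry > maxRetries) {
          throw ke; // NODEEXISTS from the retried create lands here
        }
        Thread.sleep(1000); // "Retrying operation on ZK. Retry no. " + retry
      }
    }
  }
}
{code}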
 6. This NodeExists KeeperException caused storing the AppAttempt to fail in 
 RMStateStore:
 {code}
 2015-01-06 12:37:37,118 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 storing appAttempt: appattempt_1418914202950_42363_01
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
 NodeExists
 {code}
 7. RMStateStore sent an RMFatalEventType.STATE_STORE_OP_FAILED event to the 
 ResourceManager:
 {code}
 protected void notifyStoreOperationFailed(Exception failureCause) {
   RMFatalEventType type;
   if (failureCause instanceof StoreFencedException) {
     type = RMFatalEventType.STATE_STORE_FENCED;
   } else {
     type = RMFatalEventType.STATE_STORE_OP_FAILED;
   }
   rmDispatcher.getEventHandler().handle(
       new RMFatalEvent(type, failureCause));
 }
 {code}
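 On the receiving side, the fatal-event handler ends the process; a minimal sketch of such a handler (hypothetical; the real dispatcher is an inner class of ResourceManager) would look like:
{code}
import org.apache.hadoop.util.ExitUtil;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent;

// Hypothetical sketch of a fatal-event handler: log-and-exit with
// status 1, matching the FATAL and ExitUtil lines in step 8 below.
public class FatalEventHandler implements EventHandler<RMFatalEvent> {
  @Override
  public void handle(RMFatalEvent event) {
    ExitUtil.terminate(1, "Received a " + event.getClass().getName()
        + " of type " + event.getType());
  }
}
{code}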
 8. The ResourceManager killed itself after receiving the STATE_STORE_OP_FAILED 
 RMFatalEvent:
 {code}
 2015-01-06 12:37:37,128 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
 NodeExists
 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
 status 1
 {code}





[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper causes RM crash

2015-01-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270581#comment-14270581
 ] 

zhihai xu commented on YARN-3023:
-

Yes, you are right. The issue is the same as YARN-2721.
