[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-12 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541287#comment-14541287
 ] 

lachisis commented on YARN-3614:


Yes, it is. But 
yarn.resourcemanager.state-store.max-completed-applications needs to be 
configured to limit the number of applications in rmstore. 
Before I changed that setting, it took ten minutes to switch to active with 
four thousand apps in rmstore, which is not acceptable.
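
The effect of capping completed applications, as described in the issue (old 
applications are removed once the limit is exceeded), can be sketched as 
follows. This is only an illustrative model, not Hadoop code: the class 
CompletedAppCap and its methods are hypothetical, and evicting strictly 
oldest-first is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a capped set of completed applications; not Hadoop code. */
final class CompletedAppCap {
    private final int maxCompleted;
    // Completed application ids, oldest at the head of the queue.
    private final Deque<String> completed = new ArrayDeque<>();

    CompletedAppCap(int maxCompleted) {
        if (maxCompleted < 0) {
            throw new IllegalArgumentException("maxCompleted must be >= 0");
        }
        this.maxCompleted = maxCompleted;
    }

    /** Records a completed app and returns the ids evicted to stay within the cap. */
    List<String> onAppCompleted(String appId) {
        completed.addLast(appId);
        List<String> evicted = new ArrayList<>();
        // Assumption: the oldest entries are the ones removed first.
        while (completed.size() > maxCompleted) {
            evicted.add(completed.removeFirst());
        }
        return evicted;
    }

    int size() {
        return completed.size();
    }

    public static void main(String[] args) {
        CompletedAppCap cap = new CompletedAppCap(2);
        cap.onAppCompleted("app_1");
        cap.onAppCompleted("app_2");
        System.out.println("evicted: " + cap.onAppCompleted("app_3"));
    }
}
```

Each evicted id corresponds to one removal request against the state store, 
which is the step that failed in the log quoted below.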

 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

 FileSystemRMStateStore is only an auxiliary plug-in for rmstore. 
 When it fails to remove an application, I think a warning is enough, but 
 currently the resourcemanager crashes.
 Recently, I configured 
 yarn.resourcemanager.state-store.max-completed-applications to limit the 
 number of applications in rmstore. When the number of applications exceeds 
 the limit, some old applications are removed. If removal fails, the 
 resourcemanager crashes.
 The log follows: 
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
 info for app: application_1430994493305_0053
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
  Removing info for app: application_1430994493305_0053 at: 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 2015-05-11 06:58:43,816 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 removing app: application_1430994493305_0053
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:745)
 2015-05-11 06:58:43,819 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
 STATE_STORE_OP_FAILED. Cause:
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 

[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-12 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541306#comment-14541306
 ] 

nijel commented on YARN-3614:
-

One possible cause is discussed in YARN-868.
Can you try the solution given in that issue?


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537625#comment-14537625
 ] 

lachisis commented on YARN-3614:


Yes, it is ok to check the existence of the directory first.
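
A minimal sketch of that idea, combined with the reporter's suggestion that a 
warning is enough when removal fails. It uses java.nio.file on a local path 
as a stand-in for the store's file system; the class TolerantAppRemover, its 
Result enum and the method removeAppDir are hypothetical and are not the 
actual FileSystemRMStateStore code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.logging.Logger;
import java.util.stream.Stream;

/** Hypothetical helper; illustrates "check existence first, warn on failure". */
final class TolerantAppRemover {
    private static final Logger LOG =
            Logger.getLogger(TolerantAppRemover.class.getName());

    enum Result { REMOVED, ALREADY_ABSENT, FAILED }

    static Result removeAppDir(Path appDir) {
        // Check existence first: a missing directory means nothing is left to remove.
        if (Files.notExists(appDir)) {
            LOG.warning("App directory already absent, skipping: " + appDir);
            return Result.ALREADY_ABSENT;
        }
        try (Stream<Path> entries = Files.walk(appDir)) {
            // Reverse order visits children before their parent directory.
            entries.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            return Result.REMOVED;
        } catch (IOException | UncheckedIOException e) {
            // Report the failure to the caller instead of raising a fatal error.
            LOG.warning("Failed to delete " + appDir + ": " + e);
            return Result.FAILED;
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "no-such-app-dir");
        System.out.println(removeAppDir(dir));
    }
}
```

Returning a result rather than throwing leaves the caller free to decide 
whether a failed removal is fatal; whether that is acceptable for the real 
state store is the question this issue raises.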


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537662#comment-14537662
 ] 

Brahma Reddy Battula commented on YARN-3614:


{quote} when standby resourcemanager try to transitiontoActive, it will cost 
more than ten minutes to load applications{quote}
Did you dig into this, i.e. why it took 10 minutes? Thanks.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537640#comment-14537640
 ] 

lachisis commented on YARN-3614:


I use YARN HA for a stable service. 
Months later, I found that when the standby resourcemanager tries to transition to active, 
it takes more than ten minutes to load applications. So I backed up the 
rmstore in HDFS and changed the configuration 
yarn.resourcemanager.state-store.max-completed-applications to limit the 
number of applications in the rmstore. After that, the transition worked well.
Later my partner restored the backed-up rmstore and submitted a new application, 
and then found that the resourcemanager had crashed.

I know that restoring a backed-up rmstore while the resourcemanager is running is not appropriate. 
But this also shows that the processing logic of FileSystemRMStateStore is a little fragile, 
so I suggest a small change here.
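
To illustrate why the restored backup led to a removal attempt, here is a minimal sketch of a capped list of completed applications. It is not the ResourceManager's code: the class CompletedAppCap, its method names, and the in-memory queue are assumptions made for illustration. It only shows that once the cap is exceeded, the oldest entries are selected for removal, which is the point where the store is asked to delete them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of a cap on completed applications; not Hadoop code. */
final class CompletedAppCap {
    private final int maxCompleted;                 // assumed to mirror the configured limit
    private final Deque<String> completed = new ArrayDeque<>();

    CompletedAppCap(int maxCompleted) {
        this.maxCompleted = maxCompleted;
    }

    /** Records a completed app and returns the ids evicted to stay within the cap. */
    List<String> addCompleted(String appId) {
        completed.addLast(appId);
        List<String> evicted = new ArrayList<>();
        // Evict oldest entries first until the cap is respected again.
        while (completed.size() > maxCompleted) {
            evicted.add(completed.removeFirst());
        }
        return evicted;
    }

    int size() {
        return completed.size();
    }

    public static void main(String[] args) {
        CompletedAppCap cap = new CompletedAppCap(2);
        System.out.println(cap.addCompleted("application_0001")); // prints []
        System.out.println(cap.addCompleted("application_0002")); // prints []
        System.out.println(cap.addCompleted("application_0003")); // prints [application_0001]
    }
}
```

Each evicted id corresponds to a delete request against the state store. If the stored entry is already missing, for example because the store's contents were swapped underneath the running process, that delete is the call that fails.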
 




[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537632#comment-14537632
 ] 

lachisis commented on YARN-3614:


Sorry, terrible network. How can I delete the repeated replies?


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537645#comment-14537645
 ] 

lachisis commented on YARN-3614:


Thanks for the chance to provide the patch.
I will submit the patch later.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-11 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537837#comment-14537837
 ] 

nijel commented on YARN-3614:
-

Hi @lachisis
bq.when standby resourcemanager try to transitiontoActive, it will cost more 
than ten minutes to load applications
Is this a secure cluster?


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537558#comment-14537558
 ] 

Rohith commented on YARN-3614:
--

YARN-3410 adds a way to remove an application from the RMStateStore through an RM 
start-up argument, i.e. {{./yarn resourcemanager 
-remove-application-from-state-store appId}}. 

I am wondering about the use case: why would someone move this application folder 
manually? OTOH, it is better to either check for the path's existence or handle the 
exception and log a WARN message, instead of throwing an exception that crashes the 
RM.
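
The suggestion above (tolerate a missing path and log a warning rather than throw) can be sketched as follows. This is not the FileSystemRMStateStore code and it does not use the Hadoop FileSystem API; it uses java.nio.file on a local temporary file as a stand-in, and the class TolerantRemoval and the method removeIfPresent are hypothetical names. It also handles a single file only, whereas the real store removes an application directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

/** Hypothetical stand-in for a state store's removal path; not Hadoop code. */
final class TolerantRemoval {
    private static final Logger LOG = Logger.getLogger(TolerantRemoval.class.getName());

    /**
     * Deletes the file if it exists. Returns true if something was deleted.
     * A missing path or a failed delete is logged as a warning, not thrown.
     */
    static boolean removeIfPresent(Path appPath) {
        try {
            if (Files.deleteIfExists(appPath)) {
                return true;
            }
            LOG.warning("Nothing to remove, path does not exist: " + appPath);
            return false;
        } catch (IOException e) {
            // Assumption for this sketch: a failed removal is not fatal to the caller.
            LOG.warning("Failed to delete " + appPath + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-demo");
        Path app = dir.resolve("application_0001");
        Files.createFile(app);
        System.out.println(removeIfPresent(app)); // prints true
        System.out.println(removeIfPresent(app)); // prints false, after a warning
        Files.delete(dir);
    }
}
```

Whether every delete failure should be downgraded to a warning, or only the case where the path is already gone, is a design choice this sketch does not settle.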


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537561#comment-14537561
 ] 

Rohith commented on YARN-3614:
--

[~lachisis] Would you be interested in providing a patch? Feel free to take it up! 


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537590#comment-14537590
 ] 

Tsuyoshi Ozawa commented on YARN-3614:
--

[~rohithsharma] thank you for the clarification, I got the point. You're right.

[~lachisis] do you have a chance to create a patch that deals with the following 
things?

* Creating a helper method like checkAndRemovePathWithRetries(), which calls 
existsWithRetries and deleteFileWithRetries internally.
* Updating the callers in the files to call checkAndRemovePathWithRetries().
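
The helper proposed in the list above might look roughly like the sketch below. The three method names come from the comment; everything else is an assumption: the class RetryingRemoval, the retry count and interval, the withRetries wrapper, and the use of java.nio.file on a local path instead of the Hadoop FileSystem API. It is an illustration of the shape of the helper, not the eventual patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a check-then-delete helper with retries; not Hadoop code. */
final class RetryingRemoval {
    private static final int MAX_ATTEMPTS = 3;         // assumed retry count
    private static final long RETRY_INTERVAL_MS = 10;  // assumed pause between attempts

    @FunctionalInterface
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Runs the action, retrying on IOException up to MAX_ATTEMPTS times. */
    static <T> T withRetries(IoAction<T> action) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e;
                if (attempt < MAX_ATTEMPTS) {
                    try {
                        Thread.sleep(RETRY_INTERVAL_MS);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted while retrying", ie);
                    }
                }
            }
        }
        throw last; // all attempts failed
    }

    static boolean existsWithRetries(Path path) throws IOException {
        return withRetries(() -> Files.exists(path));
    }

    static void deleteFileWithRetries(Path path) throws IOException {
        withRetries(() -> {
            Files.delete(path);
            return null;
        });
    }

    /** Returns true if the path existed and was deleted, false if it was absent. */
    static boolean checkAndRemovePathWithRetries(Path path) throws IOException {
        if (!existsWithRetries(path)) {
            return false; // nothing to do; the caller may log a warning
        }
        deleteFileWithRetries(path);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path app = Files.createTempFile("application_0001", ".state");
        System.out.println(checkAndRemovePathWithRetries(app)); // prints true
        System.out.println(checkAndRemovePathWithRetries(app)); // prints false
    }
}
```

Note that the existence check and the delete are two separate steps, so a path removed in between would still surface as an exception from the delete.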




[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537592#comment-14537592
 ] 

Tsuyoshi Ozawa commented on YARN-3614:
--

{quote}
checkAndRemovePathWithRetries
{quote}

Personally, I think checkAndDeleteFileWithRetries would be a more consistent name.


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537580#comment-14537580
 ] 

Rohith commented on YARN-3614:
--

Some methods do not check for the existence of the path, such as 
{{removeRMDTMasterKeyState}}, {{removeApplicationStateInternal}} and 
{{removeRMDelegationTokenState}}. Am I right?


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537573#comment-14537573
 ] 

Tsuyoshi Ozawa commented on YARN-3614:
--

@Rohith FSRMStateStore checks that the path exists before removing it. Am 
I missing something?

@lachisis I would appreciate it if you could provide a patch :-)


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537526#comment-14537526
 ] 

Tsuyoshi Ozawa commented on YARN-3614:
--

Thank you for the clarification. In YARN-3410, whose target is 2.8.0, the 
problem appears to be addressed, since removeApplication checks that the 
directory exists before deleting it. Please correct me if I'm wrong.

{code}
  @Override
  public synchronized void removeApplication(ApplicationId removeAppId)
      throws Exception {
    Path nodeRemovePath = getAppDir(rmAppRoot, removeAppId);
    if (existsWithRetries(nodeRemovePath)) {
      deleteFileWithRetries(nodeRemovePath);
    }
  }
{code}


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537485#comment-14537485
 ] 

Tsuyoshi Ozawa commented on YARN-3614:
--

[~lachisis] thank you for reporting this issue. I think this issue is resolved 
by the operation-level retry of FSRMStateStore implemented in YARN-2820. The 
feature was merged in 2.7.0. I think 2.7.1 is coming soon, so could you use it 
for your development?
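
For readers unfamiliar with the idea, operation-level retry can be sketched roughly as below. This is only an illustration and not the YARN-2820 code: the class name RetrySketch, the method signature and the fixed sleep between attempts are all assumptions made for the example.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of an operation-level retry helper.
// It is NOT the Hadoop implementation; names and parameters are assumptions.
class RetrySketch {

  /**
   * Runs the action up to maxAttempts times, sleeping sleepMillis between
   * failed attempts, and rethrows the last failure if every attempt fails.
   */
  static <T> T runWithRetries(Callable<T> action, int maxAttempts,
      long sleepMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // The action fails twice with a simulated transient error, then succeeds.
    String result = runWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new java.io.IOException("simulated transient failure");
      }
      return "deleted";
    }, 5, 10L);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

Note that retrying only helps with transient failures. If the delete keeps failing for the same reason, the last exception is still thrown after the final attempt.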


[jira] [Commented] (YARN-3614) FileSystemRMStateStore throw exception when failed to remove application, that cause resourcemanager to crash

2015-05-10 Thread lachisis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537499#comment-14537499
 ] 

lachisis commented on YARN-3614:


Thanks for your attention. 
I have downloaded 2.7.0 and reviewed the FileSystemRMStateStore.java 
implementation, but I don't think it fixes the issue I submitted.

The following is the code from 2.7.0. If fs.delete returns false, it still 
throws an Exception. I think a warning is enough here. Otherwise, if someone 
moves the application folder manually, the Exception propagates through 
deleteFile, deleteFileWithRetries and removeApplicationStateInternal.

  @Override
  public synchronized void removeApplicationStateInternal(
      ApplicationStateData appState)
      throws Exception {
    ApplicationId appId =
        appState.getApplicationSubmissionContext().getApplicationId();
    Path nodeRemovePath = getAppDir(rmAppRoot, appId);
    LOG.info("Removing info for app: " + appId + " at: " + nodeRemovePath);
    deleteFileWithRetries(nodeRemovePath);
  }

  private void deleteFileWithRetries(final Path deletePath) throws Exception {
    new FSAction<Void>() {
      @Override
      public Void run() throws Exception {
        deleteFile(deletePath);
        return null;
      }
    }.runWithRetries();
  }

  private void deleteFile(Path deletePath) throws Exception {
    if (!fs.delete(deletePath, true)) {
      throw new Exception("Failed to delete " + deletePath);
    }
  }
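
To make the suggestion concrete, here is a minimal sketch of what "a warning is enough" could look like. It is only an illustration under stated assumptions: it uses java.nio.file instead of the Hadoop FileSystem API so that it runs on its own, the class and method names are hypothetical, and unlike fs.delete(path, true) it only handles a single file or an empty directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: treat an already-missing path as a warning, not an error.
// Uses java.nio.file rather than Hadoop's FileSystem; names are assumptions.
class TolerantDeleteSketch {

  /** Returns true if the path was deleted, false if it was already absent. */
  static boolean deleteIfPresent(Path path) throws IOException {
    if (!Files.exists(path)) {
      // Someone may have moved or removed the folder manually; warn and go on.
      System.err.println("WARN: " + path + " does not exist, nothing to remove");
      return false;
    }
    // deleteIfExists also covers the race where the path disappears
    // between the existence check and the delete itself.
    return Files.deleteIfExists(path);
  }

  public static void main(String[] args) throws IOException {
    Path appDir = Files.createTempDirectory("application_state");
    System.out.println(deleteIfPresent(appDir)); // true: directory removed
    System.out.println(deleteIfPresent(appDir)); // false: already gone, warning only
  }
}
```

A real I/O failure still surfaces as an IOException in this sketch. Only the "path is already gone" case is downgraded to a warning.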





 FileSystemRMStateStore throw exception when failed to remove application, 
 that cause resourcemanager to crash
 -

 Key: YARN-3614
 URL: https://issues.apache.org/jira/browse/YARN-3614
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: lachisis
Priority: Critical

 FileSystemRMStateStore is only a accessorial plug-in of rmstore. 
 When it failed to remove application, I think warning is enough, but now 
 resourcemanager crashed.
 Recently, I configure 
 yarn.resourcemanager.state-store.max-completed-applications  to limit 
 applications number in rmstore. when applications number exceed the limit, 
 some old applications will be removed. If failed to remove, resourcemanager 
 will crash.
 The following is log: 
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing 
 info for app: application_1430994493305_0053
 2015-05-11 06:58:43,815 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore:
  Removing info for app: application_1430994493305_0053 at: 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 2015-05-11 06:58:43,816 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 removing app: application_1430994493305_0053
 java.lang.Exception: Failed to delete 
 /hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1430994493305_0053
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.deleteFile(FileSystemRMStateStore.java:572)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.removeApplicationStateInternal(FileSystemRMStateStore.java:471)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:185)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:171)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
 at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
 at java.lang.Thread.run(Thread.java:745)
 2015-05-11 06:58:43,819 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
 org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of