[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509802#comment-14509802 ]

zhihai xu commented on YARN-3536:
---------------------------------

Is this issue similar to YARN-2834?

ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
--------------------------------------------------------------------------------------

Key: YARN-3536
URL: https://issues.apache.org/jira/browse/YARN-3536
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler, resourcemanager
Affects Versions: 2.4.1
Reporter: gu-chi

In this scenario, the Application status is FAILED/FINISHED but the AppAttempt status is null. This causes an NPE during recovery when yarn.resourcemanager.work-preserving-recovery.enabled is set to true. The RM should handle recovery gracefully.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510320#comment-14510320 ]

gu-chi commented on YARN-3536:
------------------------------

Thanks. Since the exception stack trace is almost the same, I did look into that ticket. Its patch is already merged into the environment I am using, so this is not the same cause.
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508852#comment-14508852 ]

gu-chi commented on YARN-3536:
------------------------------

2015-04-21 03:52:31,395 | INFO | AsyncDispatcher event handler | appattempt_1429597538411_0001_02 State change from RUNNING to FINAL_SAVING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 03:52:31,397 | INFO | AsyncDispatcher event handler | Updating application application_1429597538411_0001 with final state: FINISHING | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.rememberTargetTransitionsAndStoreState(RMAppImpl.java:988)
2015-04-21 03:52:31,397 | WARN | main-SendThread(VM1228:24002) | Session 0xd4cdaa0557f0005 for server VM1228/9.91.12.28:24002, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1126)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:368)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1105)
2015-04-21 03:52:31,499 | INFO | AsyncDispatcher event handler | Exception while executing a ZK operation. | org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1098)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /rmstore/ZKRMStateRoot/RMAppRoot/application_1429597538411_0001/appattempt_1429597538411_0001_02
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:996)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:993)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1085)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:993)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:683)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:792)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:866)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:861)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508855#comment-14508855 ]

gu-chi commented on YARN-3536:
------------------------------

2015-04-21 04:22:33,923 | INFO | main-EventThread | Recovering app: application_1429597538411_0001 with 2 attempts and final state = FINISHED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:700)
2015-04-21 04:22:33,923 | INFO | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_01 with final state: FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Recovering attempt: appattempt_1429597538411_0001_02 with final state: null | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:734)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Create AMRMToken for ApplicationAttempt: appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createAndGetAMRMToken(AMRMTokenSecretManager.java:195)
2015-04-21 04:22:33,924 | INFO | main-EventThread | Creating password for appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager.createPassword(AMRMTokenSecretManager.java:307)
2015-04-21 04:22:33,924 | INFO | main-EventThread | appattempt_1429597538411_0001_01 State change from NEW to FAILED | org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:704)
2015-04-21 04:22:33,925 | INFO | main-EventThread | Registering app attempt : appattempt_1429597538411_0001_02 | org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerAppAttempt(ApplicationMasterService.java:656)
2015-04-21 04:22:33,925 | ERROR | main-EventThread | Failed to load/recover state | org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:533)
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:607)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:941)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:97)
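
The recovery log above shows the application recovered with final state FINISHED while appattempt_1429597538411_0001_02 is recovered with final state null; the attempt is then registered and the scheduler throws the NPE. One way to picture the "handle recovery gracefully" request is a guard that only treats an attempt with no stored final state as live when the application itself has no final state. The Java below is a minimal sketch under that assumption. RecoverySketch, FinalState and attemptsToReRegister are hypothetical names made up for illustration; they are not YARN classes, and this is not necessarily how the issue was or should be fixed in the RM.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are real YARN classes.
class RecoverySketch {

    enum FinalState { FINISHED, FAILED, KILLED }

    /**
     * Returns the attempt IDs that should be re-registered after a restart.
     * storedAttempts maps attempt ID to its stored final state; null means the
     * final state was never persisted (for example, the store update failed).
     */
    static List<String> attemptsToReRegister(FinalState appFinalState,
                                             Map<String, FinalState> storedAttempts) {
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, FinalState> e : storedAttempts.entrySet()) {
            boolean attemptHasNoFinalState = (e.getValue() == null);
            // Guard: an attempt without a final state is only treated as live
            // when the application itself has not reached a final state.
            if (attemptHasNoFinalState && appFinalState == null) {
                live.add(e.getKey());
            }
        }
        return live;
    }

    public static void main(String[] args) {
        // Mirrors the state seen in the recovery log: attempt 01 FAILED, attempt 02 null.
        Map<String, FinalState> attempts = new LinkedHashMap<>();
        attempts.put("appattempt_1429597538411_0001_01", FinalState.FAILED);
        attempts.put("appattempt_1429597538411_0001_02", null);

        // Application already FINISHED: no attempt is re-registered.
        System.out.println(attemptsToReRegister(FinalState.FINISHED, attempts));
        // Application still running: the attempt without a final state is re-registered.
        System.out.println(attemptsToReRegister(null, attempts));
    }
}
```

Whether skipping the attempt is the right behaviour, as opposed to deriving a final state for it from the application's state, is a design question this sketch does not settle.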
[jira] [Commented] (YARN-3536) ZK exception occur when updating AppAttempt status, then NPE thrown when RM do recover
[ https://issues.apache.org/jira/browse/YARN-3536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508900#comment-14508900 ]

gu-chi commented on YARN-3536:
------------------------------

Please assign this issue to me; I would like to fix it.