[ https://issues.apache.org/jira/browse/YARN-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113838#comment-16113838 ]
stefanlee commented on YARN-6107:
---------------------------------

Is there a Flink-type application running in your cluster? This bug is fixed in hadoop-2.6; for reference, see [https://issues.apache.org/jira/browse/YARN-2823]

> ResourceManager recovered with NPE Exception due to zk store failed
> -------------------------------------------------------------------
>
>                 Key: YARN-6107
>                 URL: https://issues.apache.org/jira/browse/YARN-6107
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.5.1
>            Reporter: liuxiangwei
>
> Firstly, the RM is stopped by the exception below:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /nmg01-khan-yarn-on-normandy-rmstore/ZKRMStateRoot/RMAppRoot/application_1484014091623_3711/appattempt_1484014091623_3711_000001
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:960)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:957)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1007)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1026)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:957)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:654)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:236)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:219)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:774)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:845)
>         at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:840)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:662)
> Secondly, restarting the RM never succeeds because of the exception below:
> 2017-01-18 15:07:48,130 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
> The stack trace points to the code below:
>     SchedulerApplication<FiCaSchedulerApp> application =
>         applications.get(appAttemptId.getApplicationId());
> It seems the application does not exist.
> And we found log entries like these:
> 2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1484014091623_3711 with 1 attempts and final state = FINISHED
> 2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1484014091623_3711_000001 with final state: null
> 2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1484014091623_3711_000001 State change from NEW to LAUNCHED
> 2017-01-18 15:11:21,204 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1484014091623_3711 State change from NEW to FINISHED
> The final states do not match: the application is recovered as FINISHED, while its attempt is recovered with a final state of null.
> We have to check whether the application is null to avoid this problem and make the failover succeed.
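
For illustration only, here is a minimal sketch of the kind of null check the report proposes. It is not the actual YARN scheduler code and not the patch from YARN-2823. The class AttemptRecoverySketch, its String-keyed maps, and its method names are hypothetical stand-ins that only mirror the shape of the quoted applications.get(...) lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheduler's bookkeeping. These are NOT the real
// YARN classes; the names only mirror the snippet quoted in the report.
class AttemptRecoverySketch {
    // applicationId -> queue name, filled in when an application is added.
    private final Map<String, String> applications = new HashMap<>();
    // applicationId -> id of the attempt currently registered for it.
    private final Map<String, String> currentAttempts = new HashMap<>();

    void addApplication(String applicationId, String queue) {
        applications.put(applicationId, queue);
    }

    // Registers an attempt. Returns false instead of throwing when the
    // application is unknown, which is the case the report describes: the
    // recovered application is already FINISHED while its attempt is not.
    boolean addApplicationAttempt(String applicationId, String attemptId) {
        String application = applications.get(applicationId);
        if (application == null) {
            // Without this check, dereferencing the lookup result would
            // raise a NullPointerException.
            System.err.println("Application " + applicationId
                + " is unknown to the scheduler; skipping attempt " + attemptId);
            return false;
        }
        currentAttempts.put(applicationId, attemptId);
        return true;
    }

    // Returns the registered attempt id, or null if none was registered.
    String currentAttempt(String applicationId) {
        return currentAttempts.get(applicationId);
    }
}
```

In this sketch an attempt for an unknown application is logged and skipped, so the caller can continue recovering the remaining applications. Whether skipping is the right behaviour for a real ResourceManager is a design decision the sketch does not settle.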