[ https://issues.apache.org/jira/browse/YARN-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000068#comment-17000068 ]
Wilfred Spiegelenburg commented on YARN-10046:
----------------------------------------------

If this is really the version that you say it is, the error happens at this point:
{code}
519    SchedulerApplication<FSAppAttempt> application = applications.get(
520        applicationAttemptId.getApplicationId());
521    String user = application.getUser();
522    FSLeafQueue queue = (FSLeafQueue) application.getQueue();
{code}
This would mean that the application is null, which makes it the same issue as YARN-7913. Please check that one; I have started working on a fix for it.
Normally this failure means that you have changed the scheduler configuration so much that we cannot handle it on recovery.

> RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10046
>                 URL: https://issues.apache.org/jira/browse/YARN-10046
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Yong Xing
>            Priority: Critical
>
> CDH Distribution: Hadoop 3.0.0-cdh6.0.1
> The exception stack is as follows.
> 2019-12-12 17:09:41,422 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:521)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1221)
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1265)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1206)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:907)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1046)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2000(RMAppImpl.java:118)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1110)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1051)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>   at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:875)
>   at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:357)
>   at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:544)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1393)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
>   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1146)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1186)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1182)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1182)
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
>   at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>   at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
>   at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:491)
>
> During the recovery of application attempts, the state of one app attempt is NULL. The following log line shows this:
> 2019-12-12 17:09:41,381 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1576136386231_0742 with 1 attempts and final state = NONE
>
> The corresponding code in rmapp/RMAppImpl.java is:
>     if (recoveredFinalState == null) {
>       LOG.info(String.format(RECOVERY_MESSAGE, getApplicationId(),
>           appState.getAttemptCount(), "NONE"));
>     } else if (LOG.isDebugEnabled()) {
>       LOG.debug(String.format(RECOVERY_MESSAGE, getApplicationId(),
>           appState.getAttemptCount(), recoveredFinalState));
>     }
>
> In rmapp/attempt/RMAppAttemptImpl.java, there is a piece of code that uses the RMAppAttemptState, which is NULL:
>
>   private static class BaseFinalTransition extends BaseTransition {
>     private final RMAppAttemptState finalAttemptState;
>     public BaseFinalTransition(RMAppAttemptState finalAttemptState) {
>       this.finalAttemptState = finalAttemptState;
>     }
>     @Override
>     public void transition(RMAppAttemptImpl appAttempt,
>         RMAppAttemptEvent event) {
>       ApplicationAttemptId appAttemptId = appAttempt.getAppAttemptId();
>       // Tell the AMS. Unregister from the ApplicationMasterService
>       appAttempt.masterService.unregisterAttempt(appAttemptId);
>       // Tell the application and the scheduler
>       ApplicationId applicationId = appAttemptId.getApplicationId();
>       RMAppEvent appEvent = null;
>       boolean keepContainersAcrossAppAttempts = false;
>       switch (finalAttemptState) {
>         case FINISHED:
>         {
>
> In the switch statement, java.lang.NullPointerException is thrown because finalAttemptState is NULL.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
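
The diagnosis in the comment above (a map lookup that returns null, followed by a dereference at FairScheduler.java:521) can be reproduced in isolation. The following is only a minimal sketch under stated assumptions: it is not the YARN code and not the fix being prepared under YARN-7913. The names `RecoveryLookupSketch`, `App`, `userUnguarded` and `userGuarded` are hypothetical, and the map is keyed by a plain string instead of an ApplicationId.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class RecoveryLookupSketch {

    /** Hypothetical stand-in for SchedulerApplication; only the user is modelled. */
    static final class App {
        private final String user;

        App(String user) {
            this.user = user;
        }

        String getUser() {
            return user;
        }
    }

    /** Hypothetical stand-in for the scheduler's application map. */
    static final Map<String, App> applications = new ConcurrentHashMap<>();

    /** Same shape as the quoted lookup: throws NullPointerException for an unknown id. */
    static String userUnguarded(String appId) {
        App application = applications.get(appId); // null when the id is not registered
        return application.getUser();              // dereference fails here
    }

    /** Guarded variant: reports a missing application instead of dereferencing null. */
    static Optional<String> userGuarded(String appId) {
        App application = applications.get(appId);
        if (application == null) {
            return Optional.empty();
        }
        return Optional.of(application.getUser());
    }

    public static void main(String[] args) {
        applications.put("application_1", new App("alice"));

        System.out.println(userGuarded("application_1"));
        System.out.println(userGuarded("application_2"));

        try {
            userUnguarded("application_2");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException for an application that was never added");
        }
    }
}
```

Whether a recovering scheduler should skip, reject, or fail such an attempt is a design decision that this sketch does not try to answer; it only shows why the unguarded lookup fails.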