[jira] [Updated] (YARN-10051) Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE
[ https://issues.apache.org/jira/browse/YARN-10051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Xing updated YARN-10051: - Description: Restarting a NM, I found the active RM crash. The Exception stack is as follows. {code:java} 2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1374) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:345) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:958) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) {code} This issue looks a bit same as [YARN-9552|https://issues.apache.org/jira/browse/YARN-9552],[YARN-7382|https://issues.apache.org/jira/browse/YARN-7382]. But the root cause is different. was: Restarting a NM, I found the active RM crash. The Exception stack is as follows. {code:java} 2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1374) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:345) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:958) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) {code} > Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE > - > > Key: YARN-10051 > URL: https://issues.apache.org/jira/browse/YARN-10051 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0 >Reporter: Yong Xing >Priority: Critical > Fix For: 3.0.0 > > > Restarting a NM, I found the active RM crash. The Exception stack is as > follows. > {code:java} > 2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: > Error in handling event type NODE_UPDATE to the Event Dispatcher > java.util.NoSuchElementException > at > java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) > at > java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) >
[jira] [Created] (YARN-10051) Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE
Yong Xing created YARN-10051: Summary: Throwing NoSuchElementException when even dispatcher handling NODE_UPDATE Key: YARN-10051 URL: https://issues.apache.org/jira/browse/YARN-10051 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Yong Xing Fix For: 3.0.0 Restarting a NM, I found the active RM crash. The Exception stack is as follows. {code:java} 2019-12-16 18:12:20,286 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1374) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:345) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:958) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1180) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10046) RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException
Yong Xing created YARN-10046: Summary: RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException Key: YARN-10046 URL: https://issues.apache.org/jira/browse/YARN-10046 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Yong Xing CDH Distribution: Hadoop 3.0.0-cdh6.0.1 The exception stack is as follows. 2019-12-12 17:09:41,422 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:521) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1221) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1265) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1206) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:907) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1046) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2000(RMAppImpl.java:118) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1110) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1051) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:875) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:357) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:544) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1393) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1146) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1186) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1182) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1182) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:491) During the recovery of Application attempts, the status of one app attempt is NULL. The following LOG describes: 2019-12-12 17:09:41,381 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1576136386231_0742 with 1 attempts and final state =