[ https://issues.apache.org/jira/browse/YARN-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216146#comment-16216146 ]
Arun Suresh commented on YARN-7382:
-----------------------------------

Thanks for raising this [~rkanter] and for the patch. Looks straightforward.
+1 pending Jenkins

> NoSuchElementException in FairScheduler after failover causes RM crash
> ----------------------------------------------------------------------
>
>                 Key: YARN-7382
>                 URL: https://issues.apache.org/jira/browse/YARN-7382
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.9.0, 3.0.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>            Priority: Blocker
>         Attachments: YARN-7382.001.patch
>
>
> If an RM failover occurs while an MR job (e.g. sleep) is running, the
> now-active RM crashes once the maps reach 100%, due to:
> {noformat}
> 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1508361403235_0001_01_000002 Container Transitioned from RUNNING to COMPLETED
> 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1508361403235_0001 CONTAINERID=container_1508361403235_0001_01_000002 RESOURCE=<memory:1024, vCores:1>
> 2017-10-18 15:02:05,349 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
> java.util.NoSuchElementException
>     at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
>     at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:371)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:901)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1326)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:371)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1019)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:887)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1104)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:128)
>     at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>     at java.lang.Thread.run(Thread.java:748)
> 2017-10-18 15:02:05,360 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..
> {noformat}
> This leaves the cluster with no RMs!
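
For readers following the trace: its top two frames show `ConcurrentSkipListSet.first()` throwing `NoSuchElementException`, which is what that method does when the set is empty. The sketch below reproduces only that JDK behavior in isolation and shows one possible way to guard against it. It is not the contents of YARN-7382.001.patch, and the class `EmptySkipListSetDemo` and the helpers `firstOrEmpty` and `firstThrows` are hypothetical names made up for illustration.

```java
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

class EmptySkipListSetDemo {

    // Hypothetical helper (not part of YARN): returns the smallest element,
    // or Optional.empty() if the set has no elements. Catching the exception
    // avoids a separate isEmpty() check, which another thread could
    // invalidate before first() runs.
    static <T> Optional<T> firstOrEmpty(ConcurrentSkipListSet<T> set) {
        try {
            return Optional.of(set.first());
        } catch (NoSuchElementException e) {
            return Optional.empty();
        }
    }

    // Returns true if first() throws NoSuchElementException for this set.
    static boolean firstThrows(ConcurrentSkipListSet<?> set) {
        try {
            set.first();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> pending = new ConcurrentSkipListSet<>();

        // Empty set: the unguarded call throws, the guarded one does not.
        System.out.println("first() throws on empty set: " + firstThrows(pending));
        System.out.println("guarded result present: " + firstOrEmpty(pending).isPresent());

        // Non-empty set: both calls succeed and return the smallest element.
        pending.add(7);
        pending.add(3);
        System.out.println("guarded result: " + firstOrEmpty(pending).get());
    }
}
```

Whether the scheduler should skip the application, return early, or treat an empty set as an error in this situation is a design decision for the actual patch; the sketch only illustrates why an unguarded `first()` can take down the event dispatcher thread.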