[ https://issues.apache.org/jira/browse/YARN-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated YARN-7382:
--------------------------------
    Attachment: YARN-7382.001.patch

The 001 patch fixes the problem by marking the first container as the AM 
earlier than before, during recovery.  That prevents the app from getting into 
the state where it tries to fetch the first key while {{schedulerKeys}} is 
empty.  I verified this manually on a cluster and also updated a unit test in 
which the resources used by the AM weren't being reported correctly before.
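
To sketch the idea (illustrative stand-in code under simplified assumptions, 
not the actual patch; the class, field, and method names below are 
placeholders for the real FSAppAttempt recovery and scheduling paths): setting 
the AM flag while recovering the first container means a later scheduling pass 
no longer takes the AM-share-limit branch, which is the code that peeked at 
{{schedulerKeys}}:

{code:java}
import java.util.concurrent.ConcurrentSkipListSet;

// Illustrative stand-in for an FSAppAttempt-like object; all names and
// structure are simplified assumptions, not the real YARN classes.
class AppAttemptSketch {
    private final ConcurrentSkipListSet<Integer> schedulerKeys =
            new ConcurrentSkipListSet<>();
    private volatile boolean amRunning = false;

    // Recovery path after failover: the idea in the 001 patch is to mark
    // the first recovered container as the AM here, *before* any
    // NODE_UPDATE scheduling pass runs.
    void recoverContainer(boolean isFirstContainer) {
        if (isFirstContainer) {
            amRunning = true;
        }
    }

    // Scheduling path: without the early marking, a recovered app with no
    // pending asks (empty schedulerKeys) still entered this branch.
    boolean isOverAMShareLimit() {
        if (!amRunning) {
            // Peek at the next pending ask; first() throws
            // NoSuchElementException when schedulerKeys is empty.
            Integer nextKey = schedulerKeys.first();
            // ... compare the ask at nextKey against the queue's AM share ...
        }
        return false;
    }
}
{code}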

> NoSuchElementException in FairScheduler after failover causes RM crash
> ----------------------------------------------------------------------
>
>                 Key: YARN-7382
>                 URL: https://issues.apache.org/jira/browse/YARN-7382
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.9.0, 3.0.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>            Priority: Blocker
>         Attachments: YARN-7382.001.patch
>
>
> While running an MR job (e.g. sleep), if an RM failover occurs, then once 
> the maps get to 100%, the newly active RM will crash due to:
> {noformat}
> 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1508361403235_0001_01_000002 Container Transitioned from RUNNING to COMPLETED
> 2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1508361403235_0001    CONTAINERID=container_1508361403235_0001_01_000002      RESOURCE=<memory:1024, vCores:1>
> 2017-10-18 15:02:05,349 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
> java.util.NoSuchElementException
>         at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
>         at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:371)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:901)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1326)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:371)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1019)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:887)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1104)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:128)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:748)
> 2017-10-18 15:02:05,360 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..
> {noformat}
> This leaves the cluster with no RMs!
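
For reference, {{ConcurrentSkipListSet#first()}} (which delegates to 
{{ConcurrentSkipListMap#firstKey()}}, the top two frames in the trace) throws 
{{NoSuchElementException}} on an empty set, unlike {{pollFirst()}}, which 
returns null.  A minimal standalone demo of that JDK behavior (class and 
variable names here are illustrative, not from YARN or the patch):

{code:java}
import java.util.concurrent.ConcurrentSkipListSet;

public class EmptyFirstDemo {
    public static void main(String[] args) {
        // Stand-in for AppSchedulingInfo's schedulerKeys set.
        ConcurrentSkipListSet<Integer> schedulerKeys =
                new ConcurrentSkipListSet<>();

        // pollFirst() is the non-throwing variant: returns null when empty.
        System.out.println(schedulerKeys.pollFirst());   // prints "null"

        // first() delegates to ConcurrentSkipListMap.firstKey() and throws
        // java.util.NoSuchElementException on an empty set -- the exception
        // at the top of the stack trace above.
        schedulerKeys.first();                           // throws
    }
}
{code}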


