[ https://issues.apache.org/jira/browse/YARN-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083927#comment-15083927 ]

Jason Lowe commented on YARN-4546:
----------------------------------

When the overflow occurs, the RM crashes with a stack trace like this:
{noformat}
2015-12-26 20:18:39,731 [ResourceManager Event Processor] FATAL resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.IllegalArgumentException: count cannot be negative: -2147483648
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
        at com.google.common.collect.Multisets.checkNonnegative(Multisets.java:943)
        at com.google.common.collect.AbstractMapBasedMultiset.setCount(AbstractMapBasedMultiset.java:277)
        at com.google.common.collect.HashMultiset.setCount(HashMultiset.java:34)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addSchedulingOpportunity(SchedulerApplicationAttempt.java:485)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:872)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1019)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1061)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:115)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:682)
        at java.lang.Thread.run(Thread.java:745)
2015-12-26 20:18:39,732 [ResourceManager Event Processor] INFO resourcemanager.ResourceManager: Exiting, bbye..
{noformat}

In this particular case the resource request went unsatisfied for a long time
because it used a node label and the application had blacklisted every node
with that label.  At that point no node in the cluster could satisfy the
request: each node either lacked the label or was blacklisted.  So the
resource request kept accumulating scheduling opportunities until the count
eventually overflowed.
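As a rough illustration of the failure mode, the Java sketch below shows why an int counter that is bumped once per missed scheduling opportunity ends up at exactly -2147483648, the value in the stack trace above. It does not use Guava or the real SchedulerApplicationAttempt code: the class name, the helper methods, and the "priority-0" key are hypothetical, and the HashMap-based setCount only mimics the non-negative check that HashMultiset.setCount performs. The saturating increment is one possible way to avoid the crash and is not necessarily the fix committed for this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the overflow; not the actual YARN or Guava code.
final class SchedulingOpportunityDemo {

    // Naive increment: Java int arithmetic wraps from MAX_VALUE to MIN_VALUE.
    static int naiveIncrement(int count) {
        return count + 1;
    }

    // Saturating increment: stays pinned at Integer.MAX_VALUE instead of wrapping.
    static int saturatingIncrement(int count) {
        return count < Integer.MAX_VALUE ? count + 1 : Integer.MAX_VALUE;
    }

    // Hypothetical stand-in for Multiset.setCount, which rejects negative counts.
    static void setCount(Map<String, Integer> counts, String key, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count cannot be negative: " + count);
        }
        counts.put(key, count);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        String priority = "priority-0"; // hypothetical request key

        // Pretend the request has already been passed over 2^31 - 1 times.
        counts.put(priority, Integer.MAX_VALUE);

        // Saturating version: the count simply stays at MAX_VALUE.
        setCount(counts, priority, saturatingIncrement(counts.get(priority)));
        System.out.println("saturating: " + counts.get(priority));

        // Naive version: the count wraps negative and the check throws.
        try {
            setCount(counts, priority, naiveIncrement(counts.get(priority)));
        } catch (IllegalArgumentException e) {
            System.out.println("naive: " + e.getMessage());
        }
    }
}
```

Running main prints "saturating: 2147483647" followed by "naive: count cannot be negative: -2147483648". Capping the count is reasonable here because once a request has been skipped billions of times, the exact number of further skips no longer matters for locality-delay decisions; what matters is that the counter never goes negative.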

> ResourceManager crash due to scheduling opportunity overflow
> ------------------------------------------------------------
>
>                 Key: YARN-4546
>                 URL: https://issues.apache.org/jira/browse/YARN-4546
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>
> If a resource request lingers long enough unsatisfied then the scheduling 
> opportunities count for the request can overflow and cause an RM crash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
