Nathan Roberts created YARN-5540:
------------------------------------

             Summary: Capacity Scheduler spends too much time looking at empty priorities
                 Key: YARN-5540
                 URL: https://issues.apache.org/jira/browse/YARN-5540
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacity scheduler, resourcemanager
    Affects Versions: 2.7.2
            Reporter: Nathan Roberts
            Assignee: Jason Lowe


We're starting to see the capacity scheduler run out of scheduling horsepower 
when running 500-1000 applications on clusters with 4K nodes or so.

This seems to be amplified by Tez applications. Tez applications have many more 
priorities (sometimes in the hundreds) than typical MR applications, so the 
scheduler loop that examines every priority within every running application 
becomes a hotspot. The priorities appear to stay around forever, even when no 
resource request remains at that priority, so we spend a lot of time looking 
at nothing.
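
To make the cost concrete, here is a minimal sketch of the difference between 
keeping a priority whose request count has dropped to zero and removing it. 
PriorityRequests and all of its methods are hypothetical stand-ins invented 
for illustration; they are not the actual SchedulerApplicationAttempt or 
LeafQueue code, and pruning is only one possible way to address the problem.

```java
import java.util.TreeMap;

/**
 * Hypothetical per-application request table (illustration only; not the
 * real YARN data structure).
 */
class PriorityRequests {
    // priority -> number of containers still outstanding at that priority
    private final TreeMap<Integer, Integer> pending = new TreeMap<>();

    /** Record a request for additional containers at the given priority. */
    void request(int priority, int containers) {
        if (containers <= 0) {
            throw new IllegalArgumentException("containers must be positive");
        }
        pending.merge(priority, containers, Integer::sum);
    }

    /** Behavior described in the report: the priority stays even at zero. */
    void allocateKeepingKey(int priority) {
        pending.computeIfPresent(priority, (p, n) -> Math.max(0, n - 1));
    }

    /** Assumed mitigation: returning null drops the priority once it is empty. */
    void allocateAndPrune(int priority) {
        pending.computeIfPresent(priority, (p, n) -> n > 1 ? n - 1 : null);
    }

    /** Number of priorities a scheduling pass would have to examine. */
    int prioritiesToScan() {
        return pending.size();
    }

    /** Total containers still outstanding across all priorities. */
    int outstanding() {
        int total = 0;
        for (int n : pending.values()) {
            total += n;
        }
        return total;
    }
}
```

With 100 priorities of one container each, satisfying 99 of them leaves 
allocateKeepingKey with 100 priorities to scan for a single outstanding 
container, while allocateAndPrune leaves exactly one.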

jstack snippet:
{noformat}
"ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x00007fc2d453e800 nid=0x22f3 runnable [0x00007fc2a8be2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
        - eliminated <0x00000005e73e5dc0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
        - locked <0x00000005e73e5dc0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        - locked <0x00000003006fcf60> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
        - locked <0x00000003001b22f8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
        - locked <0x00000003001b22f8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
        - locked <0x0000000300041e40> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
{noformat}


