[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-09-19 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540-branch-2.8.004.patch

Thanks for the review, Arun!  Posting the branch-2.8 patch again to trigger the 
Jenkins run.

> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540-branch-2.7.004.patch, 
> YARN-5540-branch-2.8.004.patch, YARN-5540-branch-2.8.004.patch, 
> YARN-5540.001.patch, YARN-5540.002.patch, YARN-5540.003.patch, 
> YARN-5540.004.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}






[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-09-16 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540-branch-2.7.004.patch
YARN-5540-branch-2.8.004.patch

Thanks for the reviews!

Attaching the patches for 2.8 and 2.7.  The 2.8 patch was quite a bit different 
since that branch doesn't have the priority -> scheduler key change.  The 2.7 
patch was even simpler than the 2.8 one since that branch doesn't have the 
container increase/decrease functionality, so we don't need to do refcounting 
and can get away with a ConcurrentSkipListSet.
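
As a rough illustration (hypothetical names, not the actual branch-2.7 patch), 
the idea is to keep only the priorities that still have an outstanding ANY ask 
in a sorted, concurrent set, so the scheduler loop never visits empty priorities:

{noformat}
// Rough sketch with hypothetical names -- not the actual branch-2.7 patch.
import java.util.concurrent.ConcurrentSkipListSet;

import org.apache.hadoop.yarn.api.records.Priority;

class ActivePrioritySketch {
  // Only priorities that still have an outstanding ANY ask live in this set.
  private final ConcurrentSkipListSet<Priority> prioritiesWithAsks =
      new ConcurrentSkipListSet<>();

  // Called whenever the ANY ask for a priority is updated.
  void onAnyAskUpdated(Priority priority, int oldNumContainers, int newNumContainers) {
    if (oldNumContainers == 0 && newNumContainers > 0) {
      prioritiesWithAsks.add(priority);     // priority now has something to schedule
    } else if (oldNumContainers > 0 && newNumContainers == 0) {
      prioritiesWithAsks.remove(priority);  // nothing left to ask for at this priority
    }
  }

  // The scheduler's inner loop walks only non-empty priorities, and the
  // deactivation check reduces to an emptiness test.
  boolean hasPendingAsks() {
    return !prioritiesWithAsks.isEmpty();
  }
}
{noformat}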

If someone can verify the patches for branch-2.8 and branch-2.7 look good as 
well then I'd be happy to commit this.


> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540-branch-2.7.004.patch, 
> YARN-5540-branch-2.8.004.patch, YARN-5540.001.patch, YARN-5540.002.patch, 
> YARN-5540.003.patch, YARN-5540.004.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}






[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-09-13 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540.004.patch

Oops, just realized patch 003 is missing the TODO comment removal.  Fixed in 
004.

> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540.001.patch, YARN-5540.002.patch, 
> YARN-5540.003.patch, YARN-5540.004.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}






[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-09-13 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540.003.patch

Thanks for the review, Wangda!

Updated the method names per the suggestion.  I removed the TODO comment, since 
Arun also asked about it above.  This change doesn't alter the behavior 
surrounding the question raised by the TODO; however, I think it's safe to 
assume at this point that we are not going to consider activating apps when the 
total container ask is zero.


> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540.001.patch, YARN-5540.002.patch, 
> YARN-5540.003.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}






[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-08-26 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540.002.patch

Updating the patch to use a ConcurrentSkipListMap instead of a TreeMap.  This is 
not going to be as cheap as having the iterator do the removal, but it's far less 
code change and is more robust if we allow other threads to update the requests 
while the scheduler is examining them.

I also updated the TODO comment and added a check for the 
ConcurrentModificationException (CME) possibility, which catches the error from 
the previous patch.
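
For illustration, a rough sketch of the trade-off with hypothetical names (not 
the patch itself): a ConcurrentSkipListMap keeps the keys sorted like a TreeMap, 
but its iterators are weakly consistent, so another thread can add or remove a 
priority while the scheduler walks the map without triggering a CME.

{noformat}
// Sketch only, hypothetical names -- not the actual patch code.
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

class RequestMapSketch {
  // Sorted like a TreeMap, but safe to mutate while another thread iterates.
  private final ConcurrentSkipListMap<Priority, Map<String, ResourceRequest>> requests =
      new ConcurrentSkipListMap<>();

  // May run on a different thread than the scheduler loop below.
  void removeEmptyPriority(Priority priority) {
    requests.remove(priority);
  }

  void examineOutstandingAsks() {
    // Weakly consistent iteration: sees some or all concurrent changes and
    // never throws ConcurrentModificationException.
    for (Map.Entry<Priority, Map<String, ResourceRequest>> entry : requests.entrySet()) {
      // ... look at entry.getValue() for this priority ...
    }
  }
}
{noformat}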



> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540.001.patch, YARN-5540.002.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}






[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-08-23 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Attachment: YARN-5540.001.patch

The main problem is that a scheduler key is never removed from the collection of 
scheduler keys, even when there are no further asks for that key.  There's also a 
separate issue where we can fail to clean up the underlying hash map keys 
underneath a particular scheduler key, but I believe that's more of a memory issue 
than a performance issue.  The performance issue arises because the scheduler's 
inner loop iterates over the scheduler keys, so it's important to remove keys we 
know are no longer necessary.

When I first started this patch I tried to clean up all of the bookkeeping, 
including all the keys from the underlying requests hashmap.  This made for a much 
larger patch and added new, interesting NPE possibilities, since requests could 
disappear in cases that are impossible today.  For example, the current code goes 
out of its way to avoid removing the ANY request for a scheduler key.  As such, I 
decided to focus just on the scheduler key set size problem, which makes for a 
more focused patch that should still fix the main problem behind this JIRA.

Attaching a patch for trunk for review.  The main idea is to reference-count the 
various scheduler keys and remove them once their refcount goes to zero.  We 
increment the refcount for a key when the corresponding ANY request goes from zero 
to non-zero, or when a container increase request arrives for that scheduler key 
where there wasn't one before.  Similarly, we decrement the refcount for a key 
when the corresponding ANY request goes from non-zero to zero, or when there are 
no longer any container increase requests where there were some before.  When a 
scheduler key's refcount goes from 0 to 1 it is inserted into the collection of 
scheduler keys, and when it goes from 1 to 0 it is removed from the collection.  
This also has the nice property that deactivation checks simply become an isEmpty 
check on the collection of scheduler keys rather than a loop over that collection.
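
To make the bookkeeping concrete, here is a rough sketch of the idea with 
hypothetical names (the actual patch does this inside the existing scheduling-info 
bookkeeping rather than in a separate class):

{noformat}
// Rough sketch with hypothetical names -- not the actual patch code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

class SchedulerKeyTracker<K extends Comparable<K>> {
  // Only keys with a non-zero refcount live here; the scheduler iterates this set.
  private final ConcurrentSkipListSet<K> activeKeys = new ConcurrentSkipListSet<>();
  private final Map<K, Integer> refCounts = new ConcurrentHashMap<>();

  // Called when the ANY ask for this key goes from zero to non-zero, or when a
  // container increase request shows up for the key where there was none before.
  void incrementKeyRef(K key) {
    if (refCounts.merge(key, 1, Integer::sum) == 1) {
      activeKeys.add(key);    // refcount went 0 -> 1: start looking at this key
    }
  }

  // Called when the ANY ask for this key drops back to zero, or when the last
  // container increase request for the key goes away.
  void decrementKeyRef(K key) {
    Integer remaining =
        refCounts.compute(key, (k, v) -> (v == null || v <= 1) ? null : v - 1);
    if (remaining == null) {
      activeKeys.remove(key); // refcount went 1 -> 0: stop looking at this key
    }
  }

  // Deactivation check becomes an emptiness test instead of a loop over all keys.
  boolean hasPendingAsks() {
    return !activeKeys.isEmpty();
  }
}
{noformat}

The useful property for this JIRA is that the set only ever contains keys with 
outstanding work, so the scheduler's per-application loop stays proportional to 
real asks rather than to every priority the app has ever used.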

Once we've agreed on a version for trunk, I'll put up the separate patches for 
branch-2.8 and branch-2.7, which need adjusting due to changes from YARN-5392 and 
YARN-1651, respectively.


> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
> Attachments: YARN-5540.001.patch
>
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}

[jira] [Updated] (YARN-5540) scheduler spends too much time looking at empty priorities

2016-08-22 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-5540:
-
Summary: scheduler spends too much time looking at empty priorities  (was: 
Capacity Scheduler spends too much time looking at empty priorities)

I agree this applies to the FairScheduler as well, so updating the summary 
accordingly.

> scheduler spends too much time looking at empty priorities
> --
>
> Key: YARN-5540
> URL: https://issues.apache.org/jira/browse/YARN-5540
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, fairscheduler, resourcemanager
>Affects Versions: 2.7.2
>Reporter: Nathan Roberts
>Assignee: Jason Lowe
>
> We're starting to see the capacity scheduler run out of scheduling horsepower 
> when running 500-1000 applications on clusters with 4K nodes or so.
> This seems to be amplified by TEZ applications. TEZ applications have many 
> more priorities (sometimes in the hundreds) than typical MR applications, and 
> therefore the loop in the scheduler that examines every priority within 
> every running application starts to become a hotspot. The priorities appear to 
> stay around forever, even when there is no remaining resource request at that 
> priority, causing us to spend a lot of time looking at nothing.
> jstack snippet:
> {noformat}
> "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x7fc2d453e800 
> nid=0x22f3 runnable [0x7fc2a8be2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
> - eliminated <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
> - locked <0x0005e73e5dc0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
> - locked <0x0003006fcf60> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
> - locked <0x0003001b22f8> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
> - locked <0x000300041e40> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
> {noformat}


