[ 
https://issues.apache.org/jira/browse/YARN-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918853#comment-16918853
 ] 

Jonathan Hung commented on YARN-9770:
-------------------------------------

Hi [~eepayne], at a high level, what we observed is this: an app with more than 
10k container requests gets submitted to an underutilized queue A. Queue A then 
takes up the scheduler's allocations for 5-10 seconds. When A's utilization 
reaches that of the other queues (e.g. queue B), queue B starts getting 
allocations too. Queue B allocates to apps in FIFO order, and if the apps at 
the head of B's FIFO queue are at least medium-sized, they consume all of the 
allocations in queue B.

While underutilized queues are receiving allocations, highly utilized queues 
are not, but are still receiving app submissions, increasing activeUsers in 
these highly utilized queues.
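
To make concrete why a growing activeUsers count matters, here is a minimal 
sketch of how the per-user limit shrinks as active users increase. It is a 
deliberately simplified model (an equal share per active user, floored at 
minimum-user-limit-percent of the queue's resources), not the actual 
CapacityScheduler computation; the class name, method name, and the 100-unit 
queue size are assumptions made for illustration.

```java
public class UserLimitSketch {

    /**
     * Simplified, hypothetical model of a per-user limit: each active user
     * gets an equal share of the queue, but never less than
     * minUserLimitPercent of the queue's resources.
     */
    public static double userLimit(double queueResource,
                                   int activeUsers,
                                   double minUserLimitPercent) {
        if (activeUsers <= 0) {
            throw new IllegalArgumentException("activeUsers must be positive");
        }
        double equalShare = queueResource / activeUsers;
        double floor = queueResource * minUserLimitPercent / 100.0;
        return Math.max(equalShare, floor);
    }

    public static void main(String[] args) {
        // Assumed values: a queue with 100 resource units and a 10% minimum.
        for (int users : new int[] {1, 2, 5, 10, 20}) {
            System.out.printf("activeUsers=%d -> userLimit=%.1f%n",
                    users, userLimit(100.0, users, 10.0));
        }
    }
}
```

In this model, the limit falls as users are added until it hits the floor, so 
a small minimum-user-limit-percent lets it drop to a small value.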

Another thing we observed is that if an underutilized queue has high container 
churn, its utilization will remain low, and it will continue to consume a 
majority of the scheduler's overall container allocations.

Attached a screenshot (activeUsers_overlay) which shows activeUsers for an 
impacted queue (blue is post-YARN-9770, red is pre-YARN-9770).

> Create a queue ordering policy which picks child queues with equal probability
> ------------------------------------------------------------------------------
>
>                 Key: YARN-9770
>                 URL: https://issues.apache.org/jira/browse/YARN-9770
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Major
>              Labels: release-blocker
>         Attachments: YARN-9770.001.patch, YARN-9770.002.patch, 
> YARN-9770.003.patch, activeUsers_overlay.png
>
>
> Ran some simulations with the default queue_utilization_ordering_policy:
> An underutilized queue which receives an application with many (thousands of) 
> resource requests will hog scheduler allocations for a long time (on the 
> order of a minute). In the meantime, apps keep getting submitted to all other 
> queues, which increases activeUsers in these queues, which in turn drops the 
> user limit in these queues to small values if minimum-user-limit-percent is 
> configured to a small value (e.g. 10%).
> To avoid this issue, we assign to queues with equal probability, so that no 
> queue goes without allocations for a long time.
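
The contrast between the two orderings can be sketched as follows. This is an 
illustration under stated assumptions, not the code in the attached patches: 
the class and method names are hypothetical, queues are represented by plain 
strings, and "equal probability" is modeled by shuffling the child queues so 
that each one is equally likely to be tried first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class QueueOrderingSketch {

    /** Utilization-based ordering: the least utilized queue always comes first. */
    public static List<String> utilizationOrder(Map<String, Double> utilization) {
        List<String> order = new ArrayList<>(utilization.keySet());
        order.sort(Comparator.comparingDouble((String q) -> utilization.get(q)));
        return order;
    }

    /** Equal-probability ordering: a uniform shuffle that ignores utilization. */
    public static List<String> equalProbabilityOrder(Map<String, Double> utilization,
                                                     Random rng) {
        List<String> order = new ArrayList<>(utilization.keySet());
        Collections.shuffle(order, rng);
        return order;
    }

    public static void main(String[] args) {
        // Assumed utilizations: A is the underutilized queue.
        Map<String, Double> utilization = new LinkedHashMap<>();
        utilization.put("A", 0.1);
        utilization.put("B", 0.8);
        utilization.put("C", 0.7);

        System.out.println("utilization order: " + utilizationOrder(utilization));

        // Count how often each queue is offered resources first.
        Map<String, Integer> firstCount = new LinkedHashMap<>();
        Random rng = new Random(42);
        for (int i = 0; i < 30000; i++) {
            String first = equalProbabilityOrder(utilization, rng).get(0);
            firstCount.merge(first, 1, Integer::sum);
        }
        System.out.println("times first under shuffle: " + firstCount);
    }
}
```

With utilization ordering, queue A stays at the head of the list for as long 
as it remains the least utilized. With the shuffle, every queue is first 
about a third of the time regardless of utilization.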


