[jira] [Comment Edited] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent

Zian Chen (JIRA) Fri, 10 Aug 2018 14:22:06 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573997#comment-16573997
 ]


Zian Chen edited comment on YARN-8509 at 8/10/18 9:21 PM:
----------------------------------------------------------

Hi Eric, thanks for the comments. Discussed with Wangda, the patch uploaded 
before is not correct due to misunderstand of the original problem.

I have changed the Jira title. The intention of this Jira is to fix calculation 
of pending resource consider user-limit in preemption scenario. Currently, 
pending resource calculation in preemption uses the calculation algorithm in 
scheduling which is this one,
{code:java}
user_limit = min(max(current_capacity)/ #active_users, current_capacity * 
user_limit_percent), queue_capacity * user_limit_factor)
{code}
this is good for scheduling cause we want to make sure users can get at least 
"minimum-user-limit-percent" of resource to use, which is more like a lower 
bound of user-limit. However we should not capture total pending resource a 
leaf queue can get by minimum-user-limit-percent, instead, we want to use 
user-limit-factor which is the upper bound to capture pending resource in 
preemption. Cause if we use minimum-user-limit-percent to capture pending 
resource, resource under-utilization will happen in preemption scenario. Thus, 
we suggest the pending resource calculation for preemption should use this 
formula.

 
{code:java}
total_pending(partition,queue) = min {Q_max(partition) - Q_used(partition), Σ 
(min {
User.ulf(partition) - User.used(partition), User.pending(partition})}
{code}
Let me give an example,

 
{code:java}
               Root
            /  |  \  \
           a   b   c  d
          30  30  30  10


 1) Only one node (n1) in the cluster, it has 100G.

 2) app1 submit to queue-a, asks for 10G used, 6G pending.

 3) app2 submit to queue-b, asks for 40G used, 30G pending.

 4) app3 submit to queue-c, asks for 50G used, 30G pending.
{code}
Here we only have one user, and user-limit-factor for queues are

 

 
||Queue name|| minimum-user-limit-percent ||user-limit-factor||
|         a|                      50|        1.0 f|
|         b|                      50|        3.0 f|
|         c|                      50|        3.0 f|
|         d|                      50|        2.0 f|

With old calculation, user-limit for queue-a is 30G, which can let app1 has 6G 
pending, but user-limit for queue-b becomes 40G, which makes headroom become 
zero after subtract 40G used, the 30G pending resource been asked can not be 
accepted, same thing with queue-c too.

However if we see this test case in preemption point of view, we should allow 
queue-b and queue-c take more pending resources. Because even though queue-a 
has 30G guaranteed configured, it's under utilization. And by pending resource 
captured by the old algorithm, queue-b and queue-c can not take available 
resource through preemption which make the cluster resource not used 
effectively. 

To summarize, since user-limit-factor maintains the hard-limit of how much 
resource can be used by a user, we should calculate pending resource consider 
user-limit-factor instead of minimum-user-limit-percent. 

Could you share your opinion on this, [~eepayne]?

 

 


was (Author: zian chen):
Hi Eric, thanks for the comments. Discussed with Wangda, the patch uploaded 
before is not correct due to misunderstand of the original problem.

I have changed the Jira title. The intention of this Jira is to fix calculation 
of pending resource consider user-limit in preemption scenario. Currently, 
pending resource calculation in preemption uses the calculation algorithm in 
scheduling which is this one,
{code:java}
user_limit = min(max(current_capacity)/ #active_users, current_capacity * 
user_limit_percent), queue_capacity * user_limit_factor)
{code}
this is good for scheduling cause we want to make sure users can get at least 
"minimum-user-limit-percent" of resource to use, which is more like a lower 
bound of user-limit. However we should not capture total pending resource a 
leaf queue can get by minimum-user-limit-percent, instead, we want to use 
user-limit-factor which is the upper bound to capture pending resource in 
preemption. Cause if we use minimum-user-limit-percent to capture pending 
resource, resource under-utilization will happen in preemption scenario. Thus, 
we suggest the pending resource calculation for preemption should use this 
formula.

 
{code:java}

total_pending(partition,queue) = min {Q_max(partition) - Q_used(partition), Σ 
(min {
User.ulf(partition) - User.used(partition), User.pending(partition})}
{code}
Let me give an example,

 
{code:java}
               Root
            /  |  \  \
           a   b   c  d
          30  30  30  10


 1) Only one node (n1) in the cluster, it has 100G.

 2) app1 submit to queue-a, asks for 10G used, 6G pending.

 3) app2 submit to queue-b, asks for 40G used, 30G pending.

 4) app3 submit to queue-c, asks for 50G used, 30G pending.
{code}
Here we only have one user, and user-limit-factor for queues are

 

 
||Queue name|| minimum-user-limit-percent ||user-limit-factor||
|         a|                      1|        1.0 f|
|         b|                      1|        2.0 f|
|         c|                      1|        2.0 f|
|         d|                      1|        2.0 f|

With old calculation, user-limit for queue-a is 30G, which can let app1 has 6G 
pending, but user-limit for queue-b becomes 40G, which makes headroom become 
zero after subtract 40G used, the 30G pending resource been asked can not be 
accepted, same thing with queue-c too.

However if we see this test case in preemption point of view, we should allow 
queue-b and queue-c take more pending resources. Because even though queue-a 
has 30G guaranteed configured, it's under utilization. And by pending resource 
captured by the old algorithm, queue-b and queue-c can not take available 
resource through preemption which make the cluster resource not used 
effectively. 

To summarize, since user-limit-factor maintains the hard-limit of how much 
resource can be used by a user, we should calculate pending resource consider 
user-limit-factor instead of minimum-user-limit-percent. 

Could you share your opinion on this, [~eepayne]?

 

 

> Total pending resource calculation in preemption should use user-limit factor 
> instead of minimum-user-limit-percent
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8509
>                 URL: https://issues.apache.org/jira/browse/YARN-8509
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Zian Chen
>            Assignee: Zian Chen
>            Priority: Major
>         Attachments: YARN-8509.001.patch, YARN-8509.002.patch, 
> YARN-8509.003.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total 
> pending resource based on user-limit percent and user-limit factor which will 
> cap pending resource for each user to the minimum of user-limit pending and 
> actual pending. This will prevent queue from taking more pending resource to 
> achieve queue balance after all queue satisfied with its ideal allocation.
>   
>  We need to change the logic to let queue pending can go beyond userlimit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent

Reply via email to