[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11073.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Merged the PR into trunk. Let's add test cases in a separate JIRA.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11073
>                 URL: https://issues.apache.org/jira/browse/YARN-11073
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, scheduler preemption
>    Affects Versions: 2.10.1
>            Reporter: Jian Chen
>            Assignee: Jian Chen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>         Attachments: YARN-11073.tmp-1.patch
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an otherwise idle 
> cluster, preemption kicked in and reclaimed the job's containers even though 
> no other job was running or competing for resources. 
> Take this scenario as an example:
>  * cluster resource: <Memory:168GB, VCores:48>
>  ** queue_low: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the FIFO preemption candidate selection process, the 
> _preemptableAmountCalculator_ first needs to _computeIdealAllocation_, which 
> depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * queue_low: <Memory:(168*0.01)GB, VCores:(48*0.01)> = 
> <Memory:1.68GB, VCores:0.48>; but since the Resource object only holds long 
> values, these double values are cast to long, and the final result becomes 
> *<Memory:1GB, VCores:0>*
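> The following is a minimal, standalone sketch of the truncation (plain Java 
> for illustration only; not the actual _Resources.multiply_ code):
> {code:java}
> // Simplified illustration of the truncating multiply; the real logic lives
> // in org.apache.hadoop.yarn.util.resource.Resources#multiply.
> public class GuaranteedCapacityTruncation {
>   public static void main(String[] args) {
>     long clusterMemoryGb = 168;
>     long clusterVcores = 48;
>     double absCapacity = 0.01; // queue_low's 1% min_capacity
>
>     // Resource stores long values, so the double products are cast to long,
>     // truncating toward zero instead of rounding.
>     long guaranteedMemoryGb = (long) (clusterMemoryGb * absCapacity); // 1.68 -> 1
>     long guaranteedVcores = (long) (clusterVcores * absCapacity);     // 0.48 -> 0
>
>     System.out.println("<Memory:" + guaranteedMemoryGb + "GB, VCores:"
>         + guaranteedVcores + ">"); // prints <Memory:1GB, VCores:0>
>   }
> }
> {code}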
> Because the guaranteed capacity of queue_low is effectively zero (its VCores 
> component is 0), its normalized guaranteed capacity across the active queues 
> is also 0 under the current algorithm in _resetCapacity_. This eventually 
> leads to continuous preemption of the job containers running in queue_low. 
> To work around this corner case, I made a small patch (for my own use case) 
> around _resetCapacity_ to consider a couple of new scenarios (a simplified 
> sketch follows the list): 
>  * if the sum of the absoluteCapacity/minCapacity values of all active 
> queues is zero, we should normalize their guaranteed capacities evenly:
> {code:java}
> 1.0f / num_of_queues
> {code}
>  * if the sum of the pre-normalized guaranteed capacity values (MB or 
> VCores) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value was cast to 0, we should normalize 
> evenly as well, as in the first scenario (if they are all tiny, the 
> difference hardly matters, e.g., 1% vs. 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, we should 
> normalize based on the weight of each queue's configured 
> absoluteCapacity/minCapacity. This makes sure queue_low gets a small but 
> fair normalized value when queue_mid is also active: 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
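> The sketch below walks through the three scenarios end to end; the names 
> (_QueueInfo_, _normalizeGuarantees_) are hypothetical and only illustrate 
> the intended behavior, not the actual patch:
> {code:java}
> import java.util.List;
>
> class QueueInfo {
>   double minCapacity;          // configured absoluteCapacity/minCapacity
>   long preNormalizedGuarantee; // guarantee after the long cast (MB or VCores)
>   double normalizedCapacity;   // output of the normalization
> }
>
> class NormalizationSketch {
>   static void normalizeGuarantees(List<QueueInfo> activeQueues) {
>     double sumMinCapacity = 0;
>     long sumPreNormalized = 0;
>     boolean anyTruncatedToZero = false;
>     for (QueueInfo q : activeQueues) {
>       sumMinCapacity += q.minCapacity;
>       sumPreNormalized += q.preNormalizedGuarantee;
>       if (q.preNormalizedGuarantee == 0 && q.minCapacity > 0) {
>         anyTruncatedToZero = true;
>       }
>     }
>     if (sumMinCapacity == 0 || sumPreNormalized == 0) {
>       // Scenarios 1 and 2: nothing meaningful to weight by, or every
>       // queue's guarantee was truncated to zero -- split evenly.
>       for (QueueInfo q : activeQueues) {
>         q.normalizedCapacity = 1.0 / activeQueues.size();
>       }
>     } else if (anyTruncatedToZero) {
>       // Scenario 3: some queue lost its guarantee to the cast while its
>       // configured capacity is non-zero -- weight by configured capacity.
>       for (QueueInfo q : activeQueues) {
>         q.normalizedCapacity = q.minCapacity / sumMinCapacity;
>       }
>     }
>     // Otherwise the existing normalization applies unchanged (not shown).
>   }
> }
> {code}
> With the example cluster above, if queue_low and queue_mid are the active 
> queues, the third scenario would give queue_low 0.01/0.20 = 0.05 and 
> queue_mid 0.19/0.20 = 0.95.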
> This is how I currently work around the issue; it probably needs someone 
> more familiar with this component to do a systematic review of the entire 
> preemption process and fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, rewrite the Resource class to 
> avoid the casting, or always round up when calculating a queue's guaranteed 
> capacity, etc.
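> As a quick illustration of the round-up option (plain Java, illustrative 
> only; not YARN's actual utilities):
> {code:java}
> public class RoundUpSketch {
>   public static void main(String[] args) {
>     long clusterVcores = 48;
>     double absCapacity = 0.01; // queue_low's 1%
>     long truncated = (long) (clusterVcores * absCapacity);          // 0
>     long roundedUp = (long) Math.ceil(clusterVcores * absCapacity); // 1
>     System.out.println("truncated=" + truncated + ", roundedUp=" + roundedUp);
>   }
> }
> {code}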



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

