[
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akira Ajisaka resolved YARN-11073.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed
Merged the PR into trunk. Let's add test cases in a separate JIRA.
> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> ------------------------------------------------------------------------------
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, scheduler preemption
> Affects Versions: 2.10.1
> Reporter: Jian Chen
> Assignee: Jian Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11073.tmp-1.patch
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster,
> preemption kicked in and preempted the job's containers even though there was
> no other job running and competing for resources.
> Let's take this scenario as an example:
> * cluster resource : <Memory:168GB, VCores:48>
> ** {_}*queue_low*{_}: min_capacity 1%
> ** queue_mid: min_capacity 19%
> ** queue_high: min_capacity 80%
> * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the
> _preemptableAmountCalculator_ first needs to "{_}computeIdealAllocation{_}",
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed
> capacity is currently calculated as
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed
> capacity of queue_low is:
> * {_}*queue_low*{_}: <Memory: (168*0.01)GB, VCores:(48*0.01)> =
> <Memory:1.68GB, VCores:0.48>, but since the Resource object holds only Long
> values, these Double values get cast to Long, and the final result becomes
> *<Memory:1GB, VCores:0>* (see the sketch below)
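> For illustration, here is a minimal, hypothetical sketch of that truncation
> using the GB figures from the example above (the real calculation goes through
> Resources.multiply on a Resource object, but the narrowing cast behaves the
> same way):
> {code:java}
> // Hypothetical sketch of the narrowing cast only, not the actual
> // Resources.multiply implementation; GB units kept for readability.
> public class TruncationDemo {
>   public static void main(String[] args) {
>     double absCapacity = 0.01;     // queue_low min_capacity = 1%
>     long clusterMemoryGb = 168L;
>     long clusterVcores = 48L;
>
>     long guaranteedMemoryGb = (long) (clusterMemoryGb * absCapacity); // 1.68 -> 1
>     long guaranteedVcores = (long) (clusterVcores * absCapacity);     // 0.48 -> 0
>
>     System.out.println("<Memory:" + guaranteedMemoryGb + "GB, VCores:" + guaranteedVcores + ">");
>   }
> }
> {code}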
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed
> capacity across the active queues also comes out as 0 under the current
> algorithm in "{_}resetCapacity{_}". This eventually leads to continuous
> preemption of the job containers running in {_}*queue_low*{_}.
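> As a simplified illustration (not the actual "{_}resetCapacity{_}" code),
> dividing each active queue's truncated guarantee by the active total leaves
> queue_low at exactly zero:
> {code:java}
> // Hypothetical sketch: proportional normalization over the truncated vcore
> // guarantees of queue_low, queue_mid, queue_high from the example above.
> public class NormalizeDemo {
>   public static void main(String[] args) {
>     long[] guaranteedVcores = {0L, 9L, 38L}; // 48*0.01, 48*0.19, 48*0.80, each cast to long
>     long total = 0L;
>     for (long v : guaranteedVcores) {
>       total += v;
>     }
>     for (long v : guaranteedVcores) {
>       float normalized = total == 0 ? 0f : (float) v / total;
>       System.out.println(v + " / " + total + " = " + normalized); // queue_low -> 0.0
>     }
>   }
> }
> {code}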
> In order to work around this corner case, I made a small patch (for my own
> use case) around "{_}resetCapacity{_}" to consider a couple of new scenarios
> (a consolidated sketch follows this list):
> * if the sum of absoluteCapacity/minCapacity of all active queues is zero,
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
> * if the sum of pre-normalized guaranteed capacity values ({_}MB or
> VCores{_}) of all active queues is zero, meaning we might have several queues
> like queue_low whose capacity value got cast to 0, we should normalize evenly
> as well, like in the first scenario (if they are all tiny, it really makes
> no big difference, for example, 1% vs 1.2%).
> * if one of the active queues has a zero pre-normalized guaranteed capacity
> value but its absoluteCapacity/minCapacity is *not* zero, then we should
> normalize based on the weight of each queue's configured
> absoluteCapacity/minCapacity. This makes sure _*queue_low*_ gets a small
> but fair normalized value when _*queue_mid*_ is also active.
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
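> Putting the three scenarios together, the workaround amounts to something like
> the following hypothetical helper (names and signature are illustrative; the
> real patch operates on the scheduler's internal queue snapshots and Resource
> objects, not bare arrays):
> {code:java}
> // Hypothetical sketch of the normalization tweak described above.
> public class ResetCapacitySketch {
>   static float[] normalize(float[] absCapacity, long[] guaranteed) {
>     int n = absCapacity.length;
>     float capSum = 0f;
>     long resSum = 0L;
>     boolean truncatedToZero = false;
>     for (int i = 0; i < n; i++) {
>       capSum += absCapacity[i];
>       resSum += guaranteed[i];
>       if (guaranteed[i] == 0L && absCapacity[i] > 0f) {
>         truncatedToZero = true; // e.g. queue_low after the long cast
>       }
>     }
>     float[] normalized = new float[n];
>     for (int i = 0; i < n; i++) {
>       if (capSum == 0f || resSum == 0L) {
>         normalized[i] = 1.0f / n;                 // scenarios 1 and 2: split evenly
>       } else if (truncatedToZero) {
>         normalized[i] = absCapacity[i] / capSum;  // scenario 3: weight by configured capacity
>       } else {
>         normalized[i] = (float) guaranteed[i] / resSum; // otherwise keep existing behaviour
>       }
>     }
>     return normalized;
>   }
>
>   public static void main(String[] args) {
>     // queue_low, queue_mid, queue_high from the example above
>     float[] out = normalize(new float[]{0.01f, 0.19f, 0.80f}, new long[]{0L, 9L, 38L});
>     System.out.println(out[0] + ", " + out[1] + ", " + out[2]); // 0.01, 0.19, 0.8
>   }
> }
> {code}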
>
> This is how I currently work around this issue; it would take someone more
> familiar with this component to do a systematic review of the entire
> preemption process and fix it properly. Maybe we could always apply the
> weight-based approach using absoluteCapacity, rewrite the Resource code to
> remove the casting, or always round up when calculating a queue's guaranteed
> capacity, etc.