[
https://issues.apache.org/jira/browse/YARN-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407717#comment-16407717
]
kyungwan nam commented on YARN-8020:
------------------------------------
I think the reason this happens is as follows.
{code:java}
// assign all cluster resources until no more demand, or no resources are
// left
while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
    unassigned, Resources.none())) {
  Resource wQassigned = Resource.newInstance(0, 0);
  // we compute normalizedGuarantees capacity based on currently active
  // queues
  resetCapacity(unassigned, orderedByNeed, ignoreGuarantee);

  // For each underserved queue (or set of queues if multiple are equally
  // underserved), offer its share of the unassigned resources based on its
  // normalized guarantee. After the offer, if the queue is not satisfied,
  // place it back in the ordered list of queues, recalculating its place
  // in the order of most under-guaranteed to most over-guaranteed. In this
  // way, the most underserved queue(s) are always given resources first.
  Collection<TempQueuePerPartition> underserved = getMostUnderservedQueues(
      orderedByNeed, tqComparator);
  for (Iterator<TempQueuePerPartition> i = underserved.iterator(); i
      .hasNext();) {
    TempQueuePerPartition sub = i.next();
    Resource wQavail = Resources.multiplyAndNormalizeUp(rc, unassigned,
        sub.normalizedGuarantee, Resource.newInstance(1, 1));
    Resource wQidle = sub.offer(wQavail, rc, totGuarant,
        isReservedPreemptionCandidatesSelector);
    Resource wQdone = Resources.subtract(wQavail, wQidle);

    if (Resources.greaterThan(rc, totGuarant, wQdone, Resources.none())) {
      // The queue is still asking for more. Put it back in the priority
      // queue, recalculating its order based on need.
      orderedByNeed.add(sub);
    }
    Resources.addTo(wQassigned, wQdone);
  }
  Resources.subtractFrom(unassigned, wQassigned);
}
{code}
{quote}
default, 27648, 209, 3072, 1, 207360, 120, 30720, 210, 0, 0, 0, 0
label1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
pri, 4096, 25, 11264, 88, 207360, 120, 15360, 113, 0, 0, 0, 0
{quote}
'unassigned' is handed out in most-underserved-first order, so most of the vcores in
'unassigned' had already been allocated to the pri queue.
Therefore, when offer() is called for the default queue, 'unassigned' holds a large
amount of memory but only a few vcores.
Let's assume 'avail' is <200000, 7>.
Normally, in this case, min(avail, (current + pending - assigned)) should be
'avail', because the available vcores are not enough.
But it resolved to (current + pending - assigned) because of the negative memory component:
min ( <200000, 7>, ( <27648, 209> + <3072, 1> - <207360, 120> ) )
= min ( <200000, 7>, <-176640, 90> ) = <-176640, 90>
As a result, idealAssigned for the default queue becomes <-176640, 90> + <207360, 120> =
<30720, 210>.
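Below is a minimal standalone sketch of just that inner min comparison (the class name is mine, and the totGuarant value is an assumed placeholder since the total guarantee isn't stated above; the other values are the ones from this comment):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DrfMinSketch {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    // assumed placeholder for the total guarantee used as the comparison base
    Resource totGuarant = Resource.newInstance(414720, 240);

    Resource avail    = Resource.newInstance(200000, 7);    // unassigned offered to default
    Resource current  = Resource.newInstance(27648, 209);
    Resource pending  = Resource.newInstance(3072, 1);
    Resource assigned = Resource.newInstance(207360, 120);  // idealAssigned so far

    // current + pending - assigned = <-176640, 90>
    Resource delta = Resources.subtract(Resources.add(current, pending), assigned);

    // DRF compares dominant shares: delta's memory share is negative, so its
    // dominant share is the (small) vcores share, and min() returns delta
    // even though only 7 vcores are actually available
    Resource accepted = Resources.min(rc, totGuarant, avail, delta);
    System.out.println("accepted = " + accepted);  // <memory:-176640, vCores:90>
  }
}
{code}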
> when DRF is used, preemption does not trigger due to incorrect idealAssigned
> ----------------------------------------------------------------------------
>
> Key: YARN-8020
> URL: https://issues.apache.org/jira/browse/YARN-8020
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: kyungwan nam
> Priority: Major
>
> I’ve encountered a case where Inter-Queue Preemption does not work.
> It happens when DRF is used and an application with a large number of vcores is
> submitted.
> IMHO, idealAssigned can be set incorrectly by the following code.
> {code}
>   // This function "accepts" all the resources it can (pending) and return
>   // the unused ones
>   Resource offer(Resource avail, ResourceCalculator rc,
>       Resource clusterResource, boolean considersReservedResource) {
>     Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
>         Resources.subtract(getMax(), idealAssigned),
>         Resource.newInstance(0, 0));
>     // accepted = min{avail,
>     //               max - assigned,
>     //               current + pending - assigned,
>     //               # Make sure a queue will not get more than max of its
>     //               # used/guaranteed, this is to make sure preemption won't
>     //               # happen if all active queues are beyond their guaranteed
>     //               # This is for leaf queue only.
>     //               max(guaranteed, used) - assigned}
>     // remain = avail - accepted
>     Resource accepted = Resources.min(rc, clusterResource,
>         absMaxCapIdealAssignedDelta,
>         Resources.min(rc, clusterResource, avail, Resources
>             /*
>              * When we're using FifoPreemptionSelector (considerReservedResource
>              * = false).
>              *
>              * We should deduct reserved resource from pending to avoid excessive
>              * preemption:
>              *
>              * For example, if an under-utilized queue has used = reserved = 20.
>              * Preemption policy will try to preempt 20 containers (which is not
>              * satisfied) from different hosts.
>              *
>              * In FifoPreemptionSelector, there's no guarantee that preempted
>              * resource can be used by pending request, so policy will preempt
>              * resources repeatedly.
>              */
>             .subtract(Resources.add(getUsed(),
>                 (considersReservedResource ? pending : pendingDeductReserved)),
>                 idealAssigned)));
> {code}
> Let’s say,
> * cluster resource : <Memory:200GB, VCores:20>
> * idealAssigned(assigned): <Memory:100GB, VCores:10>
> * avail: <Memory:181GB, VCores:1>
> * current: <Memory:19GB, VCores:19>
> * pending: <Memory:0, VCores:0>
> current + pending - assigned: <Memory:-181GB, VCores:9>
> min ( avail, (current + pending - assigned) ) : <Memory:-181GB, VCores:9>
> accepted: <Memory:-181GB, VCores:9>
> As a result, idealAssigned will be <Memory:-81GB, VCores:19>, which does not
> trigger preemption.
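To make the quoted numbers concrete, here is a minimal, hypothetical sketch (the class name is mine, and it only models the inner min plus the subsequent add to idealAssigned, not the full offer() logic; the resource values are the ones from the description above):
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class IdealAssignedSketch {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    Resource cluster  = Resource.newInstance(200 * 1024, 20);       // <200GB, 20>
    Resource avail    = Resource.newInstance(181 * 1024, 1);        // <181GB, 1>
    Resource current  = Resource.newInstance(19 * 1024, 19);        // <19GB, 19>
    Resource pending  = Resource.newInstance(0, 0);
    Resource idealAssigned = Resource.newInstance(100 * 1024, 10);  // <100GB, 10>

    // current + pending - assigned = <-181GB, 9>
    Resource delta = Resources.subtract(Resources.add(current, pending), idealAssigned);

    // dominant share of delta is 9/20 = 0.45, of avail is 181/200 = 0.905,
    // so min() returns delta and 'accepted' gets a negative memory component
    Resource accepted = Resources.min(rc, cluster, avail, delta);

    // adding accepted to idealAssigned: <100GB,10> + <-181GB,9> = <-81GB,19>
    Resources.addTo(idealAssigned, accepted);
    System.out.println("idealAssigned = " + idealAssigned);
  }
}
{code}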