[ https://issues.apache.org/jira/browse/YARN-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407717#comment-16407717 ]

kyungwan nam commented on YARN-8020:
------------------------------------

I think the reason this happens is as follows.
{code:java}
// assign all cluster resources until no more demand, or no resources are
// left
while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
    unassigned, Resources.none())) {
  Resource wQassigned = Resource.newInstance(0, 0);
  // we compute normalizedGuarantees capacity based on currently active
  // queues
  resetCapacity(unassigned, orderedByNeed, ignoreGuarantee);

  // For each underserved queue (or set of queues if multiple are equally
  // underserved), offer its share of the unassigned resources based on its
  // normalized guarantee. After the offer, if the queue is not satisfied,
  // place it back in the ordered list of queues, recalculating its place
  // in the order of most under-guaranteed to most over-guaranteed. In this
  // way, the most underserved queue(s) are always given resources first.
  Collection<TempQueuePerPartition> underserved = getMostUnderservedQueues(
      orderedByNeed, tqComparator);
  for (Iterator<TempQueuePerPartition> i = underserved.iterator(); i
      .hasNext();) {
    TempQueuePerPartition sub = i.next();
    Resource wQavail = Resources.multiplyAndNormalizeUp(rc, unassigned,
        sub.normalizedGuarantee, Resource.newInstance(1, 1));
    Resource wQidle = sub.offer(wQavail, rc, totGuarant,
        isReservedPreemptionCandidatesSelector);
    Resource wQdone = Resources.subtract(wQavail, wQidle);

    if (Resources.greaterThan(rc, totGuarant, wQdone, Resources.none())) {
      // The queue is still asking for more. Put it back in the priority
      // queue, recalculating its order based on need.
      orderedByNeed.add(sub);
    }
    Resources.addTo(wQassigned, wQdone);
  }
  Resources.subtractFrom(unassigned, wQassigned);
}
{code}
{quote}
default, 27648, 209, 3072, 1, 207360, 120, 30720, 210, 0, 0, 0, 0
label1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
pri, 4096, 25, 11264, 88, 207360, 120, 15360, 113, 0, 0, 0, 0
{quote}
'unassigned' is handed out in most-underserved order, so most of the vcores in 'unassigned' have already been allocated to the pri queue. As a result, when offer() is called for the default queue, 'unassigned' holds a lot of memory but only a few vcores.
Let's assume 'avail' is <200000, 7>.
Normally, min(avail, (current + pending - assigned)) should be 'avail' here, because the available vcores are not enough. But it came out as (current + pending - assigned), because Resources.min() compares the two Resources as a whole by dominant share, and the large memory component of 'avail' makes 'avail' look like the bigger of the two.

min ( <200000, 7>, ( <27648, 209> + <3072, 1> - <207360, 120> ) )
= min ( <200000, 7>, <-176640, 90> )
= <-176640, 90>

As a result, idealAssigned for the default queue becomes <-176640, 90> + <207360, 120> = <30720, 210>.
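
To make this concrete, here is a minimal, self-contained sketch of the inner min from offer() using the Resources / DominantResourceCalculator APIs. The partition total of <207360, 120> used for the dominant-share comparison is an assumption for illustration, and the componentwiseMin call at the end is shown only for contrast, not as the actual fix.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class Yarn8020Sketch {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    // assumed partition total; only used for the dominant-share comparison
    Resource total = Resource.newInstance(207360, 120);

    Resource avail = Resource.newInstance(200000, 7);      // plenty of memory, few vcores left
    Resource current = Resource.newInstance(27648, 209);   // default queue: used
    Resource pending = Resource.newInstance(3072, 1);      // default queue: pending
    Resource assigned = Resource.newInstance(207360, 120); // idealAssigned so far

    // current + pending - assigned = <-176640, 90>
    Resource delta = Resources.subtract(Resources.add(current, pending), assigned);

    // Resources.min() returns one of the two Resource objects as a whole, based
    // on the DRF comparison; it is not a per-resource minimum, so the 7-vcore
    // cap coming from 'avail' is silently dropped.
    Resource accepted = Resources.min(rc, total, avail, delta);
    System.out.println("accepted      = " + accepted);                            // <-176640, 90>
    System.out.println("idealAssigned = " + Resources.add(assigned, accepted));   // <30720, 210>

    // A per-resource minimum would keep the vcore cap (contrast only):
    System.out.println("componentwise = " + Resources.componentwiseMin(avail, delta)); // <-176640, 7>
  }
}
{code}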

> when DRF is used, preemption does not trigger due to incorrect idealAssigned
> ----------------------------------------------------------------------------
>
>                 Key: YARN-8020
>                 URL: https://issues.apache.org/jira/browse/YARN-8020
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: kyungwan nam
>            Priority: Major
>
> I’ve run into a case where Inter Queue Preemption does not work.
> It happens when DRF is used and an application with a large number of vcores 
> is submitted.
> IMHO, idealAssigned can be set incorrectly by the following code.
> {code}
> // This function "accepts" all the resources it can (pending) and return
> // the unused ones
> Resource offer(Resource avail, ResourceCalculator rc,
>     Resource clusterResource, boolean considersReservedResource) {
>   Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
>       Resources.subtract(getMax(), idealAssigned),
>       Resource.newInstance(0, 0));
>   // accepted = min{avail,
>   //               max - assigned,
>   //               current + pending - assigned,
>   //               # Make sure a queue will not get more than max of its
>   //               # used/guaranteed, this is to make sure preemption won't
>   //               # happen if all active queues are beyond their guaranteed
>   //               # This is for leaf queue only.
>   //               max(guaranteed, used) - assigned}
>   // remain = avail - accepted
>   Resource accepted = Resources.min(rc, clusterResource,
>       absMaxCapIdealAssignedDelta,
>       Resources.min(rc, clusterResource, avail, Resources
>           /*
>            * When we're using FifoPreemptionSelector (considerReservedResource
>            * = false).
>            *
>            * We should deduct reserved resource from pending to avoid excessive
>            * preemption:
>            *
>            * For example, if an under-utilized queue has used = reserved = 20.
>            * Preemption policy will try to preempt 20 containers (which is not
>            * satisfied) from different hosts.
>            *
>            * In FifoPreemptionSelector, there's no guarantee that preempted
>            * resource can be used by pending request, so policy will preempt
>            * resources repeatedly.
>            */
>           .subtract(Resources.add(getUsed(),
>               (considersReservedResource ? pending : pendingDeductReserved)),
>               idealAssigned)));
> {code}
> let’s say,
> * cluster resource : <Memory:200GB, VCores:20>
> * idealAssigned(assigned): <Memory:100GB, VCores:10>
> * avail: <Memory:181GB, Vcores:1>
> * current: <Memory:19GB, Vcores:19>
> * pending: <Memory:0, Vcores:0>
> current + pending - assigned: <Memory:-81GB, Vcores:9>
> min ( avail, (current + pending - assigned) ) : <Memory:-81GB, Vcores:9>
> accepted: <Memory:-81GB, Vcores:9>
> as a result, idealAssigned will be <Memory:19GB, VCores:19>, which does not 
> trigger preemption.


