[ 
https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351537#comment-15351537
 ] 

Wangda Tan commented on YARN-4280:
----------------------------------

If I understand correctly, the current logic is:
- For an application, if allocation fails because of headroom, QUEUE_SKIPPED
is returned.
- For a ParentQueue, when any child queue returns QUEUE_SKIPPED, it deducts
that child's headroom from its own queue limit and returns QUEUE_SKIPPED.

I can see two potential issues:
1) The headroom of a child can be negative because of the continuous
reservation logic, so we need to deduct max(child.headroom, none()) from
parentLimit.

2) It doesn’t work properly when we have a nested queue hierarchy like:
{code}
      root
     /    \
    a      b
   / \
  a1  a2
{code}

Assume every queue’s max capacity is 100 and the capacities are:
{code}
  a.configured = 50
  a.used = 48
     a1.configured = 25
     a1.used = 24
     a2.configured = 25
     a2.used = 24
  b.configured = 50
  b.used = 50

  Total available resource of cluster = 2.
{code}

- Let’s say a node with 2 available resources heartbeats. The scheduler goes
to root->a->a1. The pending request of a1 is 10, so a1 cannot allocate and
returns QUEUE_SKIPPED to queue-a.
- queue-a deducts 1 from its limit and sets the limit of a2 to 25.
- Assume the pending request in a2 is 1, so a2 allocates 1 resource.
- Back in queue-a, the allocation is >0, so it enters:
{code}
      if (Resources.greaterThan(
              resourceCalculator, cluster,
              assignment.getResource(), Resources.none())) {
         ...
      }
{code}
So it will not enter:
{code}
      if (assignment.getSkippedType()
          == CSAssignment.SkippedType.QUEUE_LIMIT_SKIPPED) {
        skippedType = CSAssignment.SkippedType.QUEUE_LIMIT_SKIPPED;
        ...
      }
{code}
Because of this, skippedType is still NO_SKIPPED in queue-a, so it returns 
NO_SKIPPED to root.
- Root then goes to b. Assume the pending request in b is 1, so b gets
1 resource.

When this happens, the large container request in queue-a (the request of 10
in a1) can still be starved.
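
The walk-through above can be sketched roughly as follows. This is only a toy
model under my own assumptions, not the real scheduler code: resources are
plain ints, the "limit" is just the resource still assignable in this pass,
each leaf has one pending request, and Leaf, Assignment and assignParent are
hypothetical names. QUEUE_LIMIT_SKIPPED is what I call QUEUE_SKIPPED in the
text.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; these are not the real CapacityScheduler classes.
final class SkipPropagationSketch {
    enum SkippedType { NO_SKIPPED, QUEUE_LIMIT_SKIPPED }

    // Outcome of one assignment attempt.
    static final class Assignment {
        final int allocated;
        final SkippedType skipped;
        final int headroom;
        Assignment(int allocated, SkippedType skipped, int headroom) {
            this.allocated = allocated;
            this.skipped = skipped;
            this.headroom = headroom;
        }
    }

    // Leaf queue holding one pending request of a fixed size.
    static final class Leaf {
        final int configured;
        final int pending;
        int used;
        Leaf(int configured, int used, int pending) {
            this.configured = configured;
            this.used = used;
            this.pending = pending;
        }
        Assignment assign(int limit) {
            int headroom = configured - used;
            if (pending > limit) {
                // Request does not fit: report the skip and the remaining headroom.
                return new Assignment(0, SkippedType.QUEUE_LIMIT_SKIPPED, headroom);
            }
            used += pending;
            return new Assignment(pending, SkippedType.NO_SKIPPED, headroom);
        }
    }

    // Parent as described above: a skipped child shrinks the limit, but once a
    // later child allocates, only the allocation branch runs and the earlier
    // skip is not reported upward.
    static Assignment assignParent(List<Leaf> children, int limit) {
        SkippedType skipped = SkippedType.NO_SKIPPED;
        for (Leaf child : children) {
            Assignment result = child.assign(limit);
            if (result.allocated > 0) {
                return new Assignment(result.allocated, SkippedType.NO_SKIPPED, 0);
            }
            if (result.skipped == SkippedType.QUEUE_LIMIT_SKIPPED) {
                skipped = SkippedType.QUEUE_LIMIT_SKIPPED;
                limit -= Math.max(result.headroom, 0);
            }
        }
        return new Assignment(0, skipped, 0);
    }

    public static void main(String[] args) {
        Leaf a1 = new Leaf(25, 24, 10);
        Leaf a2 = new Leaf(25, 24, 1);
        Leaf b = new Leaf(50, 50, 1);
        int available = 2; // free resource on the heartbeating node

        Assignment fromA = assignParent(Arrays.asList(a1, a2), available);
        available -= fromA.allocated;
        // Root sees NO_SKIPPED from queue-a, so it holds nothing back for a1.
        Assignment fromB = b.assign(available);
        available -= fromB.allocated;

        System.out.println("queue-a: allocated=" + fromA.allocated
            + ", skipped=" + fromA.skipped);
        System.out.println("queue-b: allocated=" + fromB.allocated);
        System.out.println("left for a1's request of " + a1.pending + ": " + available);
    }
}
```

In this model queue-a reports NO_SKIPPED together with its allocation of 1,
b takes the last unit, and nothing is left over for the request of 10 in a1.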

Even if a child queue allocates something, we probably still need to deduct
from the parentQueue’s limit when the child queue’s skippedType is
QUEUE_SKIPPED.
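
One possible reading of this suggestion, combined with the clamp from issue 1,
is sketched below. Again this is a toy model under my own assumptions and not
a patch: resources are ints, the child results are passed in pre-computed, and
ChildResult, ParentResult, deduct and combine are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; these are not the real CapacityScheduler classes.
final class ProposedDeductionSketch {
    enum SkippedType { NO_SKIPPED, QUEUE_LIMIT_SKIPPED }

    // What a child reported back to its parent.
    static final class ChildResult {
        final int allocated;
        final SkippedType skipped;
        final int headroom;
        ChildResult(int allocated, SkippedType skipped, int headroom) {
            this.allocated = allocated;
            this.skipped = skipped;
            this.headroom = headroom;
        }
    }

    // What the parent reports upward, plus the limit it has left.
    static final class ParentResult {
        final int allocated;
        final SkippedType skipped;
        final int remainingLimit;
        ParentResult(int allocated, SkippedType skipped, int remainingLimit) {
            this.allocated = allocated;
            this.skipped = skipped;
            this.remainingLimit = remainingLimit;
        }
    }

    // Issue 1: a negative child headroom must not increase the parent limit.
    static int deduct(int parentLimit, int childHeadroom) {
        return parentLimit - Math.max(childHeadroom, 0);
    }

    // Issue 2: record the skip and deduct the headroom for every skipped
    // child, whether or not some child ends up allocating.
    static ParentResult combine(int parentLimit, List<ChildResult> children) {
        int allocated = 0;
        SkippedType skipped = SkippedType.NO_SKIPPED;
        for (ChildResult child : children) {
            if (child.skipped == SkippedType.QUEUE_LIMIT_SKIPPED) {
                skipped = SkippedType.QUEUE_LIMIT_SKIPPED;
                parentLimit = deduct(parentLimit, child.headroom);
            }
            if (child.allocated > 0) {
                allocated = child.allocated;
                parentLimit -= child.allocated;
                break;
            }
        }
        return new ParentResult(allocated, skipped, parentLimit);
    }

    public static void main(String[] args) {
        // queue-a from the example: a1 is skipped with headroom 1, a2 allocates 1.
        ParentResult a = combine(2, Arrays.asList(
            new ChildResult(0, SkippedType.QUEUE_LIMIT_SKIPPED, 1),
            new ChildResult(1, SkippedType.NO_SKIPPED, 1)));
        // prints "1 QUEUE_LIMIT_SKIPPED 0"
        System.out.println(a.allocated + " " + a.skipped + " " + a.remainingLimit);
        // prints "10": the negative headroom is clamped to zero
        System.out.println(deduct(10, -3));
    }
}
```

Here queue-a still allocates 1 for a2 but also reports the skip upward, so
root has the information it needs to hold resource back instead of handing
the remainder to b.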

> CapacityScheduler reservations may not prevent indefinite postponement on a 
> busy cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4280
>                 URL: https://issues.apache.org/jira/browse/YARN-4280
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.6.1, 2.8.0, 2.7.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: YARN-4280.001.patch, YARN-4280.002.patch, 
> YARN-4280.003.patch, YARN-4280.004.patch, YARN-4280.005.patch, 
> YARN-4280.006.patch
>
>
> Consider the following scenario:
> There are 2 queues, A (25% of the total capacity) and B (75%); both can run
> at total cluster capacity. There are 2 applications: appX runs on Queue A,
> always asking for 1 GB containers (non-AM), and appY runs on Queue B, asking
> for 2 GB containers.
> The user limit is high enough for the application to reach 100% of the
> cluster resource.
> appX is running at total cluster capacity, full of 1 GB containers, releasing
> only one container at a time. appY comes in with a request for a 2 GB
> container but only 1 GB is free. Ideally, since appY is in the underserved
> queue, it has higher priority and should reserve for its 2 GB request. Since
> this request puts the alloc+reserve above total capacity of the cluster,
> reservation is not made. appX comes in with a 1 GB request and since 1 GB is
> still available, the request is allocated.
> This can continue indefinitely, causing priority inversion.



