[ https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342666#comment-15342666 ]

Jason Lowe commented on YARN-4280:
----------------------------------

I haven't had a chance to look at the patch yet, but I'm not thrilled with 
the thought of storing this state somewhere beyond the allocate call or using 
the existing reserved-container logic for this.  I think it's going to add 
special-case logic to the handling of reserved containers in all sorts of 
places.  In addition, if we skip the reservation whenever the parent queue's 
max resource would be violated, that doesn't solve the original problem.  We 
need the parent queue to completely stop assigning to other leaf queues when 
the only reason we cannot reserve is that doing so would exceed our parent's 
max capacity.

My thinking is that the algorithm implementation should store no extra state 
with a queue.  As soon as we do that, we will have all sorts of cleanup cases 
to handle, like the app-finished case Wangda mentioned above.  I see the same 
issue if we go with actual reserved containers, since we need to make sure 
those get cleaned up in similar situations; otherwise we will "leak" 
reservations.  For example, another app in the leaf queue could become higher 
priority, or a higher-priority app that wasn't asking could start asking 
again.  If we made the "semi-reserved" container on a previous call, we would 
need to release that container to allow the new, higher-priority app to get 
its resource.  In short, I think it will be messy to get this to work 
correctly in all cases.

Instead I see the algorithm as simply having the leaf queue return the 
amount of space it is trying to consume when this situation occurs, capped by 
its max capacity.  The parent queue would deduct that amount from the local 
variable tracking the capability limit it passes to its child queues.  If the 
limit is still non-zero after the deduction, the parent can continue calling 
other child queues with the reduced limit (to keep other children from eating 
into the space the higher-priority queue is trying to reserve).  If the limit 
drops to zero, the parent can return early to its own parent with a similar 
deduction, and so on.  The amount we're holding back to fulfil the 
reservation is only ever stored in local variables, so it never persists 
beyond the current scheduler allocate call.  Therefore there's nothing to 
clean up -- the holdback automatically resets on the next allocate call, 
whether we finally allocate/reserve a real container, a higher-priority app 
starts asking, or the app terminates.  A sketch of this flow is below.
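
To make this concrete, here's a minimal sketch of the flow I have in mind. 
To be clear, none of these names come from the actual CapacityScheduler 
code; Queue, LeafQueue, ParentQueue, and assignContainers are stand-ins, and 
resources are modeled as plain longs (GB) rather than Resource objects:

{code:java}
import java.util.List;

// Minimal sketch of the holdback idea; all names are stand-ins rather
// than actual CapacityScheduler classes, and resources are plain longs
// (GB) instead of Resource objects.
abstract class Queue {
    // Tries to assign within 'limit', the capability the parent passes
    // down.  Returns the amount of space being held back for a pending
    // reservation that could not be made because of the limit.
    abstract long assignContainers(long limit);
}

class LeafQueue extends Queue {
    private final long pendingAsk;  // what the highest-priority app wants
    private final long maxCapacity; // this queue's configured max

    LeafQueue(long pendingAsk, long maxCapacity) {
        this.pendingAsk = pendingAsk;
        this.maxCapacity = maxCapacity;
    }

    @Override
    long assignContainers(long limit) {
        if (pendingAsk <= limit) {
            return 0; // fits: a real allocation/reservation happens here
        }
        // Cannot reserve without violating the parent's limit: report
        // the space we are trying to consume, capped by our max capacity.
        return Math.min(pendingAsk, maxCapacity);
    }
}

class ParentQueue extends Queue {
    private final List<Queue> children; // ordered most-underserved first

    ParentQueue(List<Queue> children) {
        this.children = children;
    }

    @Override
    long assignContainers(long limit) {
        long heldBack = 0;
        for (Queue child : children) {
            long wanted = child.assignContainers(limit);
            // Deduct locally so lower-priority children cannot eat into
            // the space the earlier child is waiting for.
            limit -= wanted;
            heldBack += wanted;
            if (limit <= 0) {
                break; // return early; our parent deducts heldBack too
            }
        }
        // Only local variables were touched, so nothing persists beyond
        // this allocate call and there is nothing to clean up.
        return heldBack;
    }
}
{code}

With the scenario from the description, the parent would call queue B first, 
see it holding back 2GB against a 1GB limit, and return early without ever 
offering the free space to queue A.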


> CapacityScheduler reservations may not prevent indefinite postponement on a 
> busy cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4280
>                 URL: https://issues.apache.org/jira/browse/YARN-4280
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.6.1, 2.8.0, 2.7.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: YARN-4280.001.patch, YARN-4280.002.patch
>
>
> Consider the following scenario:
> There are 2 queues, A (25% of the total capacity) and B (75%), and both can 
> run at total cluster capacity.  There are 2 applications: appX runs on 
> queue A, always asking for 1GB containers (non-AM), and appY runs on queue 
> B, asking for 2GB containers.
> The user limit is high enough for each application to reach 100% of the 
> cluster resource.
> appX is running at total cluster capacity, filling it with 1GB containers 
> and releasing only one container at a time.  appY comes in with a request 
> for a 2GB container, but only 1GB is free.  Ideally, since appY is in the 
> underserved queue, it has higher priority and should get a reservation for 
> its 2GB request.  Since this request puts alloc+reserve above the total 
> capacity of the cluster, the reservation is not made.  appX comes in with a 
> 1GB request, and since 1GB is still available, the request is allocated.
> This can continue indefinitely, causing priority inversion.
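
A toy trace of the postponement loop described above (the 1GB/2GB numbers 
are from the scenario; this is an illustrative model, not code from the 
attached patches):

{code:java}
public class PostponementDemo {
    public static void main(String[] args) {
        final int clusterCapacity = 100; // GB, arbitrary for illustration
        final int appYAsk = 2;           // appY needs one 2GB container
        int used = 100;                  // appX has filled the cluster

        // Each cycle: appX releases one 1GB container, appY cannot
        // reserve 2GB because alloc+reserve would exceed the cluster
        // capacity, and the freed 1GB goes straight back to appX.
        for (int cycle = 1; cycle <= 3; cycle++) {
            used -= 1;                         // appX releases 1GB
            int free = clusterCapacity - used; // 1GB free
            boolean reserveAllowed = used + appYAsk <= clusterCapacity;
            if (!reserveAllowed && free < appYAsk) {
                used += 1;                     // appX's 1GB ask wins again
                System.out.println("cycle " + cycle + ": appY postponed");
            }
        }
    }
}
{code}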



