[ 
https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146757#comment-14146757
 ] 

Jason Lowe commented on YARN-2592:
----------------------------------

IMHO users shouldn't be complaining if they are getting their guarantees (i.e., 
the capacity of the queue).  Anything over capacity is "bonus" and they 
shouldn't rely on the scheduler going out of its way to give them more.  If they 
can't get their stuff done within their configured capacity then they need more 
capacity.

bq. I think promoting proper handling of preemption on the app side (i.e., 
checkpoint your state, or redistribute your computation) is overall a 
healthier direction. 

I agree with the theory.  If preempting is "cheap" then we should leverage it 
more often to solve resource contention.  The problem in practice is that it's 
often outside the hands of ops and even the users.  YARN is becoming more and 
more general, including app frameworks that aren't part of the core Hadoop 
stack, and I think it will be commonplace for quite some time that at least 
some apps won't have checkpoint/migration support.  That makes preemption 
not-so-cheap, which means we don't want to use it unless really necessary.  
Killing containers to give another queue more "bonus" resources is unnecessary, 
so it is better avoided when preemption isn't cheap.  If those resources really 
are necessary then the queue should have more guaranteed capacity rather than 
expecting the scheduler to kill other queues' containers once it is beyond its 
own capacity.
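
To make the distinction concrete, here is a minimal sketch applied to the 
numbers from the issue description quoted below.  It is not the actual 
preemption monitor code: the Queue class, the function names, and the 
cap_at_guarantee flag are all hypothetical, and the real policy computes 
ideal assignments in a more involved way.  The only point is the difference 
between counting all pending demand and counting demand only up to the 
queue's guarantee.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    # Hypothetical model of a queue; units match the issue's example.
    name: str
    guaranteed: int
    current: int
    pending: int


def over_capacity(q):
    # Resources held beyond the guarantee, i.e. the "bonus".
    return max(0, q.current - q.guaranteed)


def demand(q, cap_at_guarantee):
    if cap_at_guarantee:
        # Suggested behavior: pending requests justify preemption
        # only while the queue is still under its guarantee.
        return min(q.pending, max(0, q.guaranteed - q.current))
    # Reported behavior: every pending request counts.
    return q.pending


def plan_preemption(queues, total, cap_at_guarantee):
    """Return {queue name: amount to preempt} for one scheduling pass."""
    free = total - sum(q.current for q in queues)
    wanted = sum(demand(q, cap_at_guarantee) for q in queues)
    to_reclaim = max(0, wanted - free)
    kills = {}
    # Take from the most over-capacity queues first, skipping requesters.
    for q in sorted(queues, key=over_capacity, reverse=True):
        if to_reclaim == 0:
            break
        if demand(q, cap_at_guarantee) > 0:
            continue
        take = min(over_capacity(q), to_reclaim)
        if take > 0:
            kills[q.name] = take
            to_reclaim -= take
    return kills


queues = [Queue("A", 30, 40, 5), Queue("B", 30, 50, 0), Queue("C", 30, 0, 0)]
print(plan_preemption(queues, 90, cap_at_guarantee=False))  # {'B': 5}
print(plan_preemption(queues, 90, cap_at_guarantee=True))   # {}
```

With all pending demand counted, 5 units are taken from B to feed A even 
though A is already over its guarantee.  With demand capped at the guarantee, 
A's request no longer justifies killing anything.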

> Preemption can kill containers to fulfil need of already over-capacity queue.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-2592
>                 URL: https://issues.apache.org/jira/browse/YARN-2592
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.5.1
>            Reporter: Eric Payne
>
> There are scenarios in which one over-capacity queue can cause preemption of 
> another over-capacity queue. However, since killing containers may lose work, 
> it doesn't make sense to me to kill containers to feed an already 
> over-capacity queue.
> Consider the following:
> {code}
> root has A,B,C, total capacity = 90
> A.guaranteed = 30, A.pending = 5, A.current = 40
> B.guaranteed = 30, B.pending = 0, B.current = 50
> C.guaranteed = 30, C.pending = 0, C.current = 0
> {code}
> In this case, the queue preemption monitor will kill 5 resources from queue B 
> so that queue A can pick them up, even though queue A is already over its 
> capacity. This could lose any work that those containers in B had already 
> done.
> Is there a use case for this behavior? It seems to me that if a queue is 
> already over its capacity, it shouldn't destroy the work of other queues. If 
> the over-capacity queue needs more resources, that seems to be a problem that 
> should be solved by increasing its guarantee.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
