[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2015-02-27 Thread Eric Payne (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14340241#comment-14340241 ]

Eric Payne commented on YARN-2592:
--

Closing this. It is expected that, as long as resources are available, queue 
usage grows evenly in proportion to each queue's absolute capacity, and 
preemption may fire to enforce that even growth, provided no queue exceeds its 
absolute max capacity.
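
To make the arithmetic concrete (using the example from the description 
below): queue C is idle, so A and B are the only active queues and share the 
full 90 resources; with equal guarantees of 30 each, the balanced target is 45 
per active queue. Preempting 5 from B (at 50) to satisfy A's pending 5 (moving 
it from 40 to 45) brings both queues to exactly that target, which is the even 
growth described above.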

 Preemption can kill containers to fulfil need of already over-capacity queue.
 -

 Key: YARN-2592
 URL: https://issues.apache.org/jira/browse/YARN-2592
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 2.5.1
Reporter: Eric Payne

 There are scenarios in which one over-capacity queue can cause preemption of 
 another over-capacity queue. However, since killing containers may lose work, 
 it doesn't make sense to me to kill containers to feed an already 
 over-capacity queue.
 Consider the following:
 {code}
 root has A,B,C, total capacity = 90
 A.guaranteed = 30, A.pending = 5, A.current = 40
 B.guaranteed = 30, B.pending = 0, B.current = 50
 C.guaranteed = 30, C.pending = 0, C.current = 0
 {code}
 In this case, the queue preemption monitor will kill 5 resources from queue B 
 so that queue A can pick them up, even though queue A is already over its 
 capacity. This could lose any work that those containers in B had already 
 done.
 Is there a use case for this behavior? It seems to me that if a queue is 
 already over its capacity, it shouldn't destroy the work of other queues. If 
 the over-capacity queue needs more resources, that seems to be a problem that 
 should be solved by increasing its guarantee.
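
A minimal sketch of the proportional logic at play (illustrative only; this is 
not the actual ProportionalCapacityPreemptionPolicy code, and the class and 
field names are made up for the example):
{code}
// Illustrative sketch: split total capacity among active queues in
// proportion to their guarantees, then flag usage above that ideal share.
import java.util.List;

public class PreemptionSketch {
  record Queue(String name, double guaranteed, double used, double pending) {}

  public static void main(String[] args) {
    double total = 90;
    List<Queue> queues = List.of(
        new Queue("A", 30, 40, 5),
        new Queue("B", 30, 50, 0),
        new Queue("C", 30, 0, 0));

    // Idle queues are excluded; A and B split the full 90 between them.
    List<Queue> active = queues.stream()
        .filter(q -> q.used() + q.pending() > 0)
        .toList();
    double activeGuarantee = active.stream()
        .mapToDouble(Queue::guaranteed).sum();

    for (Queue q : active) {
      // Ideal share: capacity split proportionally to guarantees (45/45 here).
      double ideal = total * q.guaranteed() / activeGuarantee;
      double over = q.used() - ideal;
      if (over > 0) {
        System.out.printf("preempt %.0f from %s (used %.0f > ideal %.0f)%n",
            over, q.name(), q.used(), ideal);
      }
    }
    // Prints: preempt 5 from B (used 50 > ideal 45)
  }
}
{code}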





[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2014-09-24 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146391#comment-14146391 ]

Jason Lowe commented on YARN-2592:
--

+1 for at least allowing users to configure no preemption to satisfy over 
capacity queues.
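
A sketch of what such a configuration could look like in yarn-site.xml. The 
first two keys are real settings that enable the scheduler monitor and the 
capacity preemption policy; the last property name is hypothetical, invented 
here only to illustrate the proposal:
{code}
<!-- Real settings: enable the scheduler monitor and the preemption policy. -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>

<!-- HYPOTHETICAL: a knob like the one proposed here. This key does not
     exist; it is illustrative only. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.preempt_for_over_capacity</name>
  <value>false</value>
</property>
{code}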



[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2014-09-24 Thread Carlo Curino (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146728#comment-14146728 ]

Carlo Curino commented on YARN-2592:


Preemption is trying to enforce the scheduler invariants, one of which is how 
over-capacity is distributed among queues (weighted fairly by rightful 
capacity).

I understand the desire to protect individual containers, and there are many 
specific examples we can come up with in which killing a container is a pity 
because it loses some work (unless the app handles the preemption message 
correctly and checkpoints its state), but long term I think enforcing the 
invariants is more important (fair and predictable for users). The opposite 
argument one can make is: why should queue B be allowed to retain more 
over-capacity than A? If that happens systematically, or for long periods of 
time, it is as unnerving for users as some lost work.

Also note that preemption already has a few built-in mechanisms (dead-zones 
and grace-periods) designed to limit the impact on running tasks. Are we sure 
that proper tuning of capacity/max-capacity/dead-zones/grace-periods is not 
enough to remove 99% of the problem? This would only be an issue for 
long-running tasks (exceeding 2x the grace period), when run above the 
capacity + dead-zone of a queue but within max-capacity. And it would only 
trigger for a queue that is more over capacity than any other peer queue, when 
the peer queue also has over-capacity needs exceeding free space, AND no 
under-capacity queue is demanding the same resources. We should make sure this 
is a significant enough scenario in practice to justify the complexity of new 
configurables.
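
For reference, the dead-zones and grace-periods mentioned above correspond to 
real ProportionalCapacityPreemptionPolicy settings in yarn-site.xml; the 
values shown are the usual defaults, for illustration only:
{code}
<!-- Dead-zone: usage within 10% of a queue's guaranteed capacity is
     ignored by the preemption policy. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity</name>
  <value>0.1</value>
</property>

<!-- Grace period: time between asking an app to release a container and
     forcibly killing it (milliseconds). -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
  <value>15000</value>
</property>

<!-- Geometrically dampens how much of the needed resources are preempted
     in each round. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor</name>
  <value>0.2</value>
</property>
{code}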

I am definitely opposed to making this the default behavior, but I agree with 
Jason that we could add config parameters that allow users to prevent 
preemption for over-capacity balancing. I feel, though, that this is a 
slippery slope which might lead to many loopholes (protecting AMs being 
another one) and eventually make configuring preemption, and understanding 
what is happening, very hard for users.

I think promoting proper handling of preemption on the app side (i.e., 
checkpoint your state, or redistribute your computation) is overall a 
healthier direction.
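
As a minimal sketch of what app-side handling could look like: the 
PreemptionMessage API below is the real one delivered in the AM's allocate 
heartbeat, while checkpointAndRelease is a hypothetical app-specific hook, not 
a YARN API.
{code}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

public abstract class PreemptionAwareAM {

  /** Hypothetical app hook: persist the container's work, then release it. */
  protected abstract void checkpointAndRelease(ContainerId id);

  /** Called with each allocate-heartbeat response from the RM. */
  public void onHeartbeat(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null) {
      return; // nothing is being preempted this round
    }
    if (msg.getStrictContract() != null) {
      // Strict contract: these specific containers WILL be killed after the
      // grace period; checkpoint while we still can.
      for (PreemptionContainer c : msg.getStrictContract().getContainers()) {
        checkpointAndRelease(c.getId());
      }
    }
    if (msg.getContract() != null) {
      // Negotiable contract: the app may instead return equivalent resources
      // of its own choosing before the grace period expires.
      for (PreemptionContainer c : msg.getContract().getContainers()) {
        checkpointAndRelease(c.getId());
      }
    }
  }
}
{code}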

My 2 cents..




[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2014-09-24 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146757#comment-14146757 ]

Jason Lowe commented on YARN-2592:
--

IMHO users shouldn't be complaining if they are getting their guarantees 
(i.e., the capacity of the queue). Anything over capacity is a bonus, and they 
shouldn't rely on the scheduler going out of its way to provide more. If they 
can't get their work done within their configured capacity, then they need 
more capacity.

bq. I think promoting proper handling of preemption on the app side (i.e., 
checkpoint your state, or redistribute your computation) is overall a 
healthier direction.

I agree with the theory. If preempting is cheap then we should leverage it 
more often to solve resource contention. The problem in practice is that it's 
often outside the hands of ops and even the users. YARN is becoming more and 
more general, including app frameworks that aren't part of the core Hadoop 
stack, and I think it will be commonplace for quite some time that at least 
some apps won't have checkpoint/migration support. That makes preemption 
not-so-cheap, which means we don't want to use it unless really necessary. 
Killing containers to give another queue more bonus resources is unnecessary, 
and therefore best avoided when preemption isn't cheap. If those resources 
really are necessary, then the queue should have more guaranteed capacity 
rather than expecting the scheduler to kill other containers once it's beyond 
capacity.



[jira] [Commented] (YARN-2592) Preemption can kill containers to fulfil need of already over-capacity queue.

2014-09-24 Thread Carlo Curino (JIRA)

[ https://issues.apache.org/jira/browse/YARN-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146790#comment-14146790 ]

Carlo Curino commented on YARN-2592:


I hear you, and I agree we will need to cope with non-cheap preemption for a 
while; even long term, not everyone will be nicely preemptable (our work on 
YARN-1051, for example, is designed to let people obtain strongly guaranteed 
and protected resources when needed).

However, the compromise you propose means that the over-capacity zone is 
weirdly policed: on one side we expect the handing out of containers to 
respect a notion of fairness (proportional to your rightful capacity), yet 
that same notion is not enforced by preemption. I find this inconsistent.

Moreover, as I was saying, I think this would only spare containers in a 
rather narrow band (when imbalance has occurred among over-capacity queues, no 
under-capacity queue is requesting the resources yet, we are above the 
dead-zone, and tasks run longer than 2x the grace period). Is this a large 
enough use case to justify special-casing? If it is important in practice and 
an adoption show-stopper, I am fine with compromises, but we should make sure 
that is the case.

A way to do this is to enable preemption but run it in observe-only mode, 
where the policy logs what it would like to do without actually doing it... We 
can then see whether, on a real cluster, we are often (or ever) in the 
scenario you are trying to address.
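
For reference, this dry-run mode corresponds to a real policy setting in 
yarn-site.xml:
{code}
<!-- Run the preemption policy in dry-run mode: it computes and logs the
     containers it would preempt, but never actually kills them. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.observe_only</name>
  <value>true</value>
</property>
{code}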



