[ 
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736970#comment-13736970
 ] 

Carlo Curino commented on YARN-624:
-----------------------------------

Robert, you are right, and you provide a compelling example of an application 
that has dynamic resource needs. 

There are ways around this, where you dynamically negotiate an 
increase/decrease of dedicated resources and keep 
the AM as it is. Philosophically, this keeps all AM-RM interaction 
best-effort and partial-ok, while it is the client-RM 
protocol that handles binding negotiation for resources. This would work 
and would match the current preemption 
mechanics well, but I am not sure it is the best design (I haven't thought hard 
about it yet).

If we go with the design where the AM makes gang-like requests, we should make 
the preemption policy aware of
this and have it act accordingly. In a sense, this boils down to a "granularity" 
problem, not too different from the current
mismatch between the size of the containers to preempt and the needed capacity. 
But it stretches the precision issue by a potentially huge factor, making 
the tradeoff between under- and over-preempting a finer line to walk. 
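
To make the granularity point concrete, here is a minimal sketch in Python. 
It is not YARN code: the function name preempt_in_order and all the numbers are 
hypothetical, and it assumes a preemptable unit (a single container or a whole 
gang) can only be taken in its entirety.

```python
def preempt_in_order(unit_sizes, needed):
    """Preempt whole units in order until `needed` is covered.

    Hypothetical helper, not part of YARN. Units are indivisible:
    a gang is either preempted entirely or not at all.
    Returns (units_taken, reclaimed).
    """
    taken, reclaimed = [], 0
    for size in unit_sizes:
        if reclaimed >= needed:
            break
        taken.append(size)
        reclaimed += size
    return taken, reclaimed


# Hypothetical numbers: we need 3 containers' worth of capacity.
single = [1] * 100   # 100 independent one-container units
gangs = [50, 50]     # the same 100 containers, grouped as two gangs of 50

_, got_single = preempt_in_order(single, needed=3)
_, got_gang = preempt_in_order(gangs, needed=3)
print(got_single - 3)  # overshoot with per-container granularity: 0
print(got_gang - 3)    # overshoot with gang granularity: 47
```

With these made-up sizes, the same demand costs 47 extra containers once the 
unit of preemption is a gang of 50 rather than a single container.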

Two ways around this:
* we might want to introduce non-strictly FIFO preemptions in a queue, i.e., skip 
a large gang and preempt containers from the 
next app if the gang is way bigger than my preemption needs. This risks 
breaking reservations, and has possibly funny and 
gameable semantics. It also seems hard to gain experience on how to parametrize 
such heuristics.

* an alternative workaround is to ensure that no gang requests are satisfied 
with over-capacity containers; this keeps the
gangs off the preemption radar. A simple way to enforce this is to set 
max-capacity to the same value as guaranteed capacity for 
the queues that will serve gang requests. (This might combine nicely with the 
dynamic negotiation business as well).
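
The first workaround above (skipping oversized gangs) could look roughly like 
the following Python sketch. This is an illustration under stated assumptions, 
not the YARN preemption policy: select_victims and its skip_factor parameter are 
hypothetical names, and the threshold rule (skip a gang larger than skip_factor 
times the remaining need) is just one possible heuristic.

```python
def select_victims(gang_sizes, needed, skip_factor=2.0):
    """Pick gangs to preempt, walking candidates in preemption order.

    Hypothetical heuristic: a gang is skipped when its size exceeds
    skip_factor * (capacity still needed), so the order is no longer
    strictly FIFO. Returns (indices_of_chosen_gangs, reclaimed).
    """
    chosen, reclaimed = [], 0
    for i, size in enumerate(gang_sizes):
        remaining = needed - reclaimed
        if remaining <= 0:
            break
        if size > skip_factor * remaining:
            continue  # far bigger than what is still needed: try the next app
        chosen.append(i)
        reclaimed += size
    return chosen, reclaimed


# Hypothetical queue: one gang of 50 first, then three small apps.
print(select_victims([50, 2, 2, 4], needed=4))  # ([1, 2], 4)
# If every candidate is oversized, nothing is reclaimed at all.
print(select_victims([50, 60], needed=4))       # ([], 0)
```

The second call shows the under-preemption side of the tradeoff, and the 
choice of skip_factor is exactly the kind of parameter that seems hard to tune.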

Another sub-problem of gang-scheduling is to track which containers belong to 
which gang (and/or which requests they serve). 
This also requires the AM to be consistent in how it uses the containers it 
receives, and possibly a more explicit protocol to 
say "this container I am giving you is part of that gang request"; otherwise a 
single preemption might break multiple topologies. 
In general, this containers-to-requests tracking seems a bit too opaque at the 
moment (I have heard independent complaints from 
ApplicationMaster developers on this before).
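
As a rough illustration of explicit containers-to-gang tracking, here is a 
minimal Python sketch. The GangTracker class and its methods are hypothetical 
and do not correspond to any existing YARN API; the sketch only assumes that 
each container is tagged with exactly one gang when it is handed out.

```python
from collections import defaultdict


class GangTracker:
    """Hypothetical bookkeeping of which container serves which gang."""

    def __init__(self):
        self._gang_of = {}                # container id -> gang id
        self._members = defaultdict(set)  # gang id -> container ids

    def assign(self, container_id, gang_id):
        """Record that a container was handed out for a given gang request."""
        if container_id in self._gang_of:
            raise ValueError(f"{container_id} already belongs to a gang")
        self._gang_of[container_id] = gang_id
        self._members[gang_id].add(container_id)

    def preempt(self, container_id):
        """Remove a container; return the gang it breaks and its survivors."""
        gang_id = self._gang_of.pop(container_id)
        self._members[gang_id].discard(container_id)
        return gang_id, sorted(self._members[gang_id])


tracker = GangTracker()
tracker.assign("c1", "gang-A")
tracker.assign("c2", "gang-A")
tracker.assign("c3", "gang-B")
print(tracker.preempt("c1"))  # ('gang-A', ['c2'])
```

With an explicit mapping like this, a single preemption touches exactly one 
gang, and the remaining members of that gang are known immediately.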
                
> Support gang scheduling in the AM RM protocol
> ---------------------------------------------
>
>                 Key: YARN-624
>                 URL: https://issues.apache.org/jira/browse/YARN-624
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, scheduler
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
> scheduler runs a set of tasks when they can all be run at the same time, 
> would be a useful feature for YARN schedulers to support.
> Currently, AMs can approximate this by holding on to containers until they 
> get all the ones they need.  However, this lends itself to deadlocks when 
> different AMs are waiting on the same containers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
