[
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736970#comment-13736970
]
Carlo Curino commented on YARN-624:
-----------------------------------
Robert, you are right, and you provide a compelling example of an application
that has dynamic needs for resources.
There are ways around this, where you dynamically negotiate an
increase/decrease of dedicated resources and keep
the AM as it is. Philosophically, this keeps all AM-RM interaction
best-effort and partial-ok, while it is the client-RM
protocol that handles binding negotiation for resources. This would work
and match the current preemption
mechanics well, but I am not sure it is the best design (I haven't thought hard
about it yet).
If we go with the design where the AM makes gang-like requests, we should make
the preemption policy aware of
this and have it act accordingly. In a sense, this boils down to a "granularity"
problem, not too different from the current mismatch between the
size of containers to preempt and the needed capacity. But it stretches the
precision issue by a potentially huge factor, making
the tradeoff between under- and over-preempting a more subtle line to walk.
Two ways around this:
* we might want to introduce non-strictly-FIFO preemption in a queue, i.e., skip
a large gang and preempt containers from the
next app if the gang is way bigger than the preemption need. This risks
breaking reservations, and has possibly odd and
gameable semantics. It also seems hard to gain experience on how to parametrize
such heuristics.
* an alternative workaround is to ensure that no gang requests are satisfied
with over-capacity containers; this keeps the
gangs off the preemption radar. A simple way to enforce this is to set
max-capacity equal to the guaranteed capacity for
the queues that will serve gang requests. (This might combine nicely with the
dynamic negotiation business as well.)
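
To make the first option a bit more concrete, here is a rough sketch of
what "skip a large gang" could look like. Everything in it is hypothetical
and for illustration only: App, select_preemptions, and the skip_factor
knob are made-up names, not YARN classes or scheduler settings, and a real
policy would work on resources rather than container counts.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    containers: int      # over-capacity containers held by the app (assumed unit)
    gang: bool = False   # True if the containers form one all-or-nothing gang


def select_preemptions(apps, needed, skip_factor=None):
    """Walk apps in preemption order; return {app name: containers to preempt}.

    A gang is preempted whole or not at all. If skip_factor is set
    (hypothetical knob), a gang larger than skip_factor * remaining need
    is skipped, which makes the order non-strictly FIFO.
    """
    plan = {}
    remaining = needed
    for app in apps:
        if remaining <= 0:
            break
        if app.gang:
            if skip_factor is not None and app.containers > skip_factor * remaining:
                continue  # gang is far bigger than what is still needed
            take = app.containers  # cannot take only part of a gang
        else:
            take = min(app.containers, remaining)
        plan[app.name] = take
        remaining -= take
    return plan


apps = [App("gang-job", 100, gang=True), App("batch-job", 20)]
strict = select_preemptions(apps, needed=10)                  # strict order
relaxed = select_preemptions(apps, needed=10, skip_factor=2)  # may skip gangs
```

With these made-up numbers, the strict order preempts the whole 100-container
gang to recover 10 containers, while the relaxed order skips the gang and takes
10 containers from the next app. Picking skip_factor is exactly the
parametrization problem mentioned above.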
Another sub-problem of gang-scheduling is to track which containers belong to
which gang (and/or which requests they serve).
This also requires the AM to be consistent in how it uses containers it
receives and possibly a more explicit protocol to
say "this container I am giving you is part of that gang request", otherwise a
single preemption might break multiple topologies.
In general, this container-to-request tracking seems a bit too opaque at the
moment (I have heard independent complaints from
ApplicationMaster developers on this before).
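
As a rough illustration of the tracking problem, the sketch below keeps an
explicit container-to-gang mapping and reports which gangs a set of
preemptions would break. GangTracker and its methods are hypothetical names
for illustration, not part of the AM-RM protocol or any YARN API; it assumes
each container belongs to at most one gang.

```python
from collections import defaultdict


class GangTracker:
    """Hypothetical bookkeeping: which container serves which gang request."""

    def __init__(self):
        self.gang_of = {}                  # container id -> gang id
        self.members = defaultdict(set)    # gang id -> container ids

    def assign(self, container_id, gang_id):
        # Corresponds to "this container I am giving you is part of that gang".
        self.gang_of[container_id] = gang_id
        self.members[gang_id].add(container_id)

    def broken_by(self, preempted):
        """Return the gang ids that lose at least one container."""
        return {self.gang_of[c] for c in preempted if c in self.gang_of}


tracker = GangTracker()
for cid in ("c1", "c2"):
    tracker.assign(cid, "gang-A")
for cid in ("c3", "c4"):
    tracker.assign(cid, "gang-B")

spread = tracker.broken_by(["c2", "c3"])     # one container from each gang
focused = tracker.broken_by(["c1", "c2"])    # both containers from one gang
```

Preempting the same number of containers breaks two gangs in the first case
and only one in the second, which is the kind of decision a preemption policy
cannot make without this mapping.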
> Support gang scheduling in the AM RM protocol
> ---------------------------------------------
>
> Key: YARN-624
> URL: https://issues.apache.org/jira/browse/YARN-624
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, scheduler
> Affects Versions: 2.0.4-alpha
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
>
> Per discussion on YARN-392 and elsewhere, gang scheduling, in which a
> scheduler runs a set of tasks when they can all be run at the same time,
> would be a useful feature for YARN schedulers to support.
> Currently, AMs can approximate this by holding on to containers until they
> get all the ones they need. However, this lends itself to deadlocks when
> different AMs are waiting on the same containers.