john lilley commented on YARN-624:

I would like to +1 this feature, and illustrate our use cases.  Currently there 
are two:
-- Finding strongly-connected subgraphs.  This is a central step in 
data-quality/matching applications, because after record-matching is performed 
in a distributed fashion, the match pairs (edges) must be turned into match 
groups (subgraphs).  It is very inefficient to process this using a traditional 
independent-task YARN model.
-- Machine-learning model training.  There are many models that lend themselves 
to distributed processing, and even those that don't can benefit from parallel 
genetic algorithm that competes multiple models and topologies in parallel.

In both these cases we are considering a custom AM that runs like:
-- Asks for M containers
-- Accepts as few as N containers, but only after not getting M for some period 
of time (heuristics TBD).
-- Possibly, after getting non-zero but < N containers for some time, release 
them all, sleep a while, and try again (deadlock avoidance).

This algorithm would be much better run by the RM, because it can:
-- Immediately fail the AM if N containers are impossible.
-- Avoid idle incomplete sets of containers while waiting for a sufficient gang.
-- Avoid deadlock.

> Support gang scheduling in the AM RM protocol
> ---------------------------------------------
>                 Key: YARN-624
>                 URL: https://issues.apache.org/jira/browse/YARN-624
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, scheduler
>    Affects Versions: 2.0.4-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
> Per discussion on YARN-392 and elsewhere, gang scheduling, in which a 
> scheduler runs a set of tasks when they can all be run at the same time, 
> would be a useful feature for YARN schedulers to support.
> Currently, AMs can approximate this by holding on to containers until they 
> get all the ones they need.  However, this lends itself to deadlocks when 
> different AMs are waiting on the same containers.

This message was sent by Atlassian JIRA

Reply via email to