[
https://issues.apache.org/jira/browse/YARN-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733177#comment-13733177
]
Bikas Saha commented on YARN-624:
---------------------------------
The RM currently expects something to start in a container within a timeout
after allocation. Either that needs to change, or the timeout will cap how long
an AM can hold on to containers while waiting for the rest of a gang to be
allocated. The NM could provide an API to launch a process without starting it:
all resource copying etc. could be completed and the process launched in a
suspended state, ready to go. This may help tell the RM that the container is
actually being used. The NM could then un-suspend and start the process when
the AM tells it to.
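
A minimal sketch of the two-phase launch suggested above, as a toy state
machine. Every name here (NodeManagerSketch, prepare, resume, isInUse) is
hypothetical and chosen only for illustration; none of it is an existing YARN
API, and the real NM/RM interaction would involve heartbeats and RPCs that are
left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; these are not real YARN classes or methods. */
final class SuspendedLaunchSketch {

    enum State { ALLOCATED, PREPARED_SUSPENDED, RUNNING }

    /** Stand-in for an NM that supports "launch but do not start". */
    static final class NodeManagerSketch {
        private final Map<String, State> containers = new HashMap<>();

        /** Records a container the RM has allocated on this node. */
        void allocate(String containerId) {
            containers.put(containerId, State.ALLOCATED);
        }

        /** Phase 1: copy resources and create the process in a suspended state. */
        void prepare(String containerId) {
            requireState(containerId, State.ALLOCATED);
            containers.put(containerId, State.PREPARED_SUSPENDED);
        }

        /** Phase 2: un-suspend the process once the AM says the gang is ready. */
        void resume(String containerId) {
            requireState(containerId, State.PREPARED_SUSPENDED);
            containers.put(containerId, State.RUNNING);
        }

        /** Assumed liveness signal for the RM: a prepared container counts as used. */
        boolean isInUse(String containerId) {
            State s = containers.get(containerId);
            return s == State.PREPARED_SUSPENDED || s == State.RUNNING;
        }

        State stateOf(String containerId) {
            return containers.get(containerId);
        }

        private void requireState(String containerId, State expected) {
            State actual = containers.get(containerId);
            if (actual != expected) {
                throw new IllegalStateException(
                    containerId + " is " + actual + ", expected " + expected);
            }
        }
    }

    public static void main(String[] args) {
        NodeManagerSketch nm = new NodeManagerSketch();
        String[] gang = {"c1", "c2", "c3"};

        // Containers arrive one by one; each is prepared as soon as it is allocated,
        // so it already counts as "in use" while the AM waits for the rest.
        for (String id : gang) {
            nm.allocate(id);
            nm.prepare(id);
        }

        // Once the whole gang is held, the AM asks the NM to start every process.
        for (String id : gang) {
            nm.resume(id);
            System.out.println(id + " -> " + nm.stateOf(id));
        }
    }
}
```

In this sketch the allocation timeout would only need to cover the gap between
allocate and prepare, not the full wait for the gang, which is the point of the
suggestion.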
> Support gang scheduling in the AM RM protocol
> ---------------------------------------------
>
> Key: YARN-624
> URL: https://issues.apache.org/jira/browse/YARN-624
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, scheduler
> Affects Versions: 2.0.4-alpha
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
>
> Per discussion on YARN-392 and elsewhere, gang scheduling, in which a
> scheduler runs a set of tasks when they can all be run at the same time,
> would be a useful feature for YARN schedulers to support.
> Currently, AMs can approximate this by holding on to containers until they
> get all the ones they need. However, this lends itself to deadlocks when
> different AMs are waiting on the same containers.
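
The deadlock described in the issue can be illustrated with a toy model. This
is a sketch under simplifying assumptions, not a model of any actual YARN
scheduler: containers are identical, free containers are handed out one at a
time in round-robin order, and an AM never releases what it holds until its
gang is complete. The class and method names are hypothetical.

```java
/** Toy model only; it does not reflect how any YARN scheduler allocates. */
final class GangDeadlockSketch {

    /**
     * AMs hoard containers handed out one at a time, round-robin.
     * Returns true if at least one AM ends up holding its full gang.
     */
    static boolean hoardingCompletesAnyGang(int capacity, int[] gangSizes) {
        int[] held = new int[gangSizes.length];
        int free = capacity;
        boolean progressed = true;
        while (free > 0 && progressed) {
            progressed = false;
            for (int i = 0; i < gangSizes.length && free > 0; i++) {
                if (held[i] < gangSizes[i]) {
                    held[i]++;      // the AM keeps the container and waits for more
                    free--;
                    progressed = true;
                }
            }
        }
        for (int i = 0; i < gangSizes.length; i++) {
            if (held[i] == gangSizes[i]) {
                return true;
            }
        }
        return false;               // every AM is stuck holding a partial gang
    }

    /**
     * All-or-nothing allocation: a gang is granted only as a whole, runs,
     * and then releases everything. Returns how many gangs get to run.
     */
    static int gangsCompletedAllOrNothing(int capacity, int[] gangSizes) {
        int completed = 0;
        for (int size : gangSizes) {
            if (size <= capacity) {
                completed++;        // whole gang fits, runs, frees its containers
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        int[] gangs = {3, 3};       // two AMs, each needing 3 containers
        System.out.println(hoardingCompletesAnyGang(4, gangs));    // false
        System.out.println(gangsCompletedAllOrNothing(4, gangs));  // 2
    }
}
```

With 4 containers and two AMs that each need 3, hoarding leaves each AM
holding 2 and neither can proceed, while all-or-nothing allocation lets both
gangs run one after the other.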