[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990252#comment-13990252 ]
Jian He commented on YARN-2001:
-------------------------------
Consider a simple case in which an application is granted 50% of the cluster
resources and the cluster has 2 nodes. The application has used up its entire
resource quota and launched all of its containers on node1. The RM fails over,
and node2 is the first to re-sync with the RM. Since node2 has no containers
running for this application, when the AM asks for more containers the RM
thinks this AM hasn’t used any resources and grants it more resources on
node2, the only node it knows about so far. Then node1 comes back and the RM
recovers all the containers on node1. The application ends up using more than
its 50% resource limit.
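The sequence above can be replayed with a toy model. This is only a sketch, not
YARN code: ToyRM, its methods, and the numbers (8 container slots in the
cluster, hence a quota of 4) are all hypothetical, and resources are counted as
whole containers for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRM:
    """Hypothetical stand-in for a restarted RM; not the real scheduler."""
    cluster_capacity: int      # total container slots in the cluster (assumed)
    quota_fraction: float      # share of the cluster granted to the application
    used_by_node: dict = field(default_factory=dict)  # node -> containers known to the RM

    def quota(self) -> int:
        return int(self.cluster_capacity * self.quota_fraction)

    def used(self) -> int:
        return sum(self.used_by_node.values())

    def resync(self, node: str, running: int) -> None:
        """A node re-registers and reports the containers it is still running."""
        self.used_by_node[node] = self.used_by_node.get(node, 0) + running

    def allocate(self, node: str, requested: int) -> int:
        """Grant containers on a registered node, up to the quota the RM believes is left."""
        if node not in self.used_by_node:
            return 0  # the RM cannot place containers on a node it has not seen yet
        granted = max(0, min(requested, self.quota() - self.used()))
        self.used_by_node[node] += granted
        return granted


def simulate() -> tuple:
    rm = ToyRM(cluster_capacity=8, quota_fraction=0.5)  # quota is 4 containers
    rm.resync("node2", running=0)      # node2 re-syncs first with nothing running
    granted = rm.allocate("node2", 4)  # the RM sees zero usage and grants 4 more
    rm.resync("node1", running=4)      # node1 returns with the 4 original containers
    return granted, rm.used(), rm.quota()


if __name__ == "__main__":
    granted, used, quota = simulate()
    print(f"granted after failover: {granted}, total used: {used}, quota: {quota}")
```

In this model the application ends up holding 8 containers against a quota of
4, because the allocation happened before node1 reported its usage.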
Another example: the RM needs to generate new container IDs for the new
containers requested by the AM. If the RM accepts new requests from the AM
before the nodes sync back, the new container IDs may overlap with the IDs of
the recovered containers.
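The ID overlap can be illustrated the same way. This is a sketch under
simplifying assumptions: container IDs are plain integers handed out by a
counter that starts over after restart, which is not how YARN actually formats
its container IDs. ToyIdAllocator and issue_then_recover are hypothetical
names.

```python
class ToyIdAllocator:
    """Hypothetical counter-based ID generator for a freshly restarted RM."""

    def __init__(self, start: int = 1):
        self._next = start

    def seed_from(self, ids_in_use: set) -> None:
        """Move the counter past every ID that recovered containers already hold."""
        if ids_in_use:
            self._next = max(self._next, max(ids_in_use) + 1)

    def new_id(self) -> int:
        cid = self._next
        self._next += 1
        return cid


def issue_then_recover(wait_for_nodes: bool) -> tuple:
    """Return (newly issued IDs, IDs that collide with recovered containers)."""
    recovered = {1, 2, 3}          # IDs node1 will report when it re-syncs (assumed)
    allocator = ToyIdAllocator()   # the counter state was lost in the restart

    if wait_for_nodes:
        allocator.seed_from(recovered)  # nodes synced back before any new request

    new_ids = [allocator.new_id() for _ in range(2)]  # the AM asks for 2 containers
    return new_ids, set(new_ids) & recovered


if __name__ == "__main__":
    print(issue_then_recover(wait_for_nodes=False))  # new IDs collide with recovered ones
    print(issue_then_recover(wait_for_nodes=True))   # no collision
```

When the new requests are served only after the nodes have reported their
containers, the allocator can skip past the recovered IDs and no overlap
occurs.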
> Persist NMs info for RM restart
> -------------------------------
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have
> registered with RM. For that, RM needs to remember the previous NMs and wait
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.
--
This message was sent by Atlassian JIRA
(v6.2#6252)