[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990252#comment-13990252
 ] 

Jian He commented on YARN-2001:
-------------------------------

Consider a simple case in which an application is granted 50% of the cluster's
resources and the cluster has 2 nodes. The application has used up its entire
resource quota and launched all of its containers on node1. The RM fails over,
and node2 is the first to re-sync with the RM. Since node2 has no containers
running for this application, when the AM asks for more containers the RM will
think this AM hasn't used any resources and will grant it more resources on
node2, the only node registered so far. Then node1 comes back to the RM, and
the RM recovers all the containers on node1. The application ends up exceeding
its 50% resource limit.
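
To make the ordering problem concrete, here is a minimal toy simulation of the
scenario above. It is only a sketch, not the actual RM code: the ToyRM class,
its methods, and the quota of 4 containers (standing in for "50% of the
cluster") are all hypothetical names and numbers chosen for illustration.

```python
class ToyRM:
    """Hypothetical stand-in for the RM's per-application accounting."""

    def __init__(self, quota):
        self.quota = quota        # max containers the application may hold
        self.used = 0             # containers the RM currently knows about
        self.registered = set()   # nodes that have re-synced after failover

    def register_node(self, node, running_containers):
        # A node re-syncs and reports the app's containers running on it.
        self.registered.add(node)
        self.used += running_containers

    def allocate(self, requested):
        # Grant up to the headroom the RM *believes* the app has left.
        if not self.registered:
            return 0
        granted = max(0, min(requested, self.quota - self.used))
        self.used += granted
        return granted


def simulate(wait_for_all_nodes):
    # Before failover the app held its full quota, all on node1.
    before_failover = {"node1": 4, "node2": 0}
    rm = ToyRM(quota=4)

    rm.register_node("node2", before_failover["node2"])
    if wait_for_all_nodes:
        # RM holds off on allocate requests until every known NM is back.
        rm.register_node("node1", before_failover["node1"])
        rm.allocate(4)
    else:
        # The AM's request is served while only node2 has re-synced.
        rm.allocate(4)
        rm.register_node("node1", before_failover["node1"])
    return rm.used


over_quota = simulate(wait_for_all_nodes=False)    # 8 containers, quota is 4
within_quota = simulate(wait_for_all_nodes=True)   # 4 containers
```

In this toy model, serving the request early leaves the application holding 8
containers against a quota of 4, while waiting for both nodes keeps it at 4.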

Another example: the RM needs to generate new container IDs for the new
containers requested by the AM. If the RM accepts new requests from the AM
before the nodes sync back, the new container IDs may collide with the IDs of
the recovered containers.
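
The ID collision can be sketched the same way. This assumes, purely for
illustration, that container IDs come from a simple counter that restarts at 1
after failover unless it is advanced past the recovered IDs; the issue_ids
function and its parameters are hypothetical, and the real ID scheme is not
modeled here.

```python
def issue_ids(recovered_ids, count, recover_first):
    """Return `count` new container IDs from a toy post-restart counter."""
    next_id = 1  # assumption: the counter is lost when the RM restarts
    if recover_first:
        # Nodes have synced back, so the RM can skip past IDs already in use.
        next_id = max(recovered_ids, default=0) + 1
    return list(range(next_id, next_id + count))


recovered = [1, 2, 3]  # containers still running on the nodes

early = issue_ids(recovered, 2, recover_first=False)  # [1, 2]
late = issue_ids(recovered, 2, recover_first=True)    # [4, 5]

early_clash = set(early) & set(recovered)  # {1, 2}: IDs handed out twice
late_clash = set(late) & set(recovered)    # empty: no overlap
```

Issuing IDs before recovery reuses 1 and 2; issuing them after the nodes have
reported back avoids the overlap in this simplified model.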

> Persist NMs info for RM restart
> -------------------------------
>
>                 Key: YARN-2001
>                 URL: https://issues.apache.org/jira/browse/YARN-2001
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have 
> registered with RM. For that, RM needs to remember the previous NMs and wait 
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
