[ 
https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643099#comment-14643099
 ] 

MENG DING commented on YARN-1644:
---------------------------------

bq.  NM re-registration can still happen between the time the increase action 
is accepted, and the time it's added into increasedContainers. Even 
startContainer has the same problem, newly started container may fall into this 
tiny window that RM won't recover this container.
Yes, you are right that startContainer would have the same problem. 
So to make it clear, RM restart/NM re-registration can happen in the following 
scenarios:
* 1. Container resource increase is already completed. In this case, NM 
re-registration can send the correct (increased) container size (through 
containerStatus object) for RM recovery.
* 2. Container to be increased has been added into increasedContainers, but the 
resource is not yet updated. In this case, NM re-registration can send the 
correct container size through both containerStatus and increasedContainers 
objects for RM recovery.
* 3. The increase action is accepted, but the container to be increased has not 
been added into increasedContainers. In this case, the resource view between NM 
and RM becomes different. The same issue applies to startContainers.
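
To make the three scenarios concrete, here is a minimal sketch in Java. It is a hypothetical model, not the actual NodeManager or ResourceManager code: the class {{ResyncScenarios}}, the {{NmView}} type, and the rule that a pending entry in increasedContainers takes precedence over containerStatus during recovery are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of what the NM can report at re-registration. Not real YARN code. */
public class ResyncScenarios {

    /** Stand-in for the NM-side state that a re-registration would send to the RM. */
    public static final class NmView {
        // Container size currently enforced on the node (containerStatus), in MB.
        public final Map<String, Integer> containerStatusMb = new HashMap<>();
        // Increases accepted and recorded, but possibly not yet applied (increasedContainers), in MB.
        public final Map<String, Integer> increasedContainersMb = new HashMap<>();
    }

    /** Builds the NM view for scenario 1, 2 or 3 from the comment above. */
    public static NmView scenario(int n, String id, int oldMb, int newMb) {
        NmView nm = new NmView();
        switch (n) {
            case 1: // increase completed: containerStatus already carries the new size
                nm.containerStatusMb.put(id, newMb);
                break;
            case 2: // recorded in increasedContainers, resource not yet updated
                nm.containerStatusMb.put(id, oldMb);
                nm.increasedContainersMb.put(id, newMb);
                break;
            case 3: // accepted, but not yet recorded anywhere the NM reports
                nm.containerStatusMb.put(id, oldMb);
                break;
            default:
                throw new IllegalArgumentException("unknown scenario " + n);
        }
        return nm;
    }

    /** Size the RM would recover; assumes a pending increase wins over containerStatus. */
    public static int reportedSizeMb(NmView nm, String id) {
        Integer pending = nm.increasedContainersMb.get(id);
        if (pending != null) {
            return pending;
        }
        return nm.containerStatusMb.getOrDefault(id, 0);
    }

    /** True when the recovered size matches the size the increase was accepted for. */
    public static boolean viewsAgree(NmView nm, String id, int acceptedMb) {
        return reportedSizeMb(nm, id) == acceptedMb;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            NmView nm = scenario(n, "container_1", 1024, 2048);
            System.out.println("scenario " + n + " views agree: " + viewsAgree(nm, "container_1", 2048));
        }
    }
}
```

In this model only scenario 3 leaves the recovered size different from the accepted size, because the new size is not in either object the NM reports.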

I don't have a solution for scenario 3 yet, but I think the chance of it 
happening is very small, especially with the {{blockNewContainerRequests}} and 
matching RM identifier logic right now. Maybe we can log a separate JIRA for 
scenario 3, and fix that for both container increase and container launch?

> RM-NM protocol changes and NodeStatusUpdater implementation to support 
> container resizing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-1644
>                 URL: https://issues.apache.org/jira/browse/YARN-1644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1644-YARN-1197.4.patch, 
> YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch, 
> YARN-1644.3.patch, yarn-1644.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
