[ 
https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641017#comment-14641017
 ] 

MENG DING commented on YARN-1644:
---------------------------------

Thanks [~jianhe] for the review comments, and thanks for pointing out the 
corner case that I missed.

I think the main problem (if I understand it correctly) is this: if the NM 
fails to send {{increasedContainers}} to the RM and then receives a *Resync* 
instruction (due to an RM restart), {{increasedContainers}} is still cleared 
from {{NMContext}}. If the container status has not yet been updated in the NM 
at that moment (it's a very short time window, but still possible), 
{{registerNodeManager}} will send the old container resource info to the RM 
for RM container recovery.

If this is the case, then reusing the {{ContainerStatus}} object for 
{{increasedContainers}} still does not resolve the problem. What we need, I 
believe, is to make sure that we remove a container from 
{{NMContext.increasedContainers}} only *after* its container status has been 
updated in the NM. We *also* need to add {{increasedContainers}} to 
{{RegisterNodeManagerRequestProto}}, so that during node manager 
registration the RM also checks 
{{RegisterNodeManagerRequest.increasedContainers}} and sets the correct 
container size for container recovery.
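
To make the proposal concrete, here is a rough, self-contained sketch of the 
ordering I have in mind. It is a toy model and not actual Hadoop code: the 
class and method names ({{NodeManagerModel}}, {{RegistrationRequest}}, 
{{ResourceManagerModel}}, {{applyIncrease}}, {{recover}}) are hypothetical, 
container sizes are reduced to a single memory value in MB, and the rule that 
the RM prefers a pending increase over the reported status is my assumption 
about how recovery would use the new field.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the NM side; all names here are hypothetical. */
class NodeManagerModel {
    // containerId -> memory (MB) as currently recorded in the container status
    private final Map<String, Integer> containerStatus = new HashMap<>();
    // containerId -> target memory (MB) for increases not yet reflected in status
    private final Map<String, Integer> increasedContainers = new HashMap<>();

    void launch(String containerId, int memoryMb) {
        containerStatus.put(containerId, memoryMb);
    }

    void requestIncrease(String containerId, int memoryMb) {
        increasedContainers.put(containerId, memoryMb);
    }

    /** Updates the status first, and only then drops the pending entry. */
    void applyIncrease(String containerId) {
        Integer target = increasedContainers.get(containerId);
        if (target != null) {
            containerStatus.put(containerId, target);
            increasedContainers.remove(containerId);
        }
    }

    /** Builds the registration payload sent on resync; nothing is cleared here. */
    RegistrationRequest buildRegistration() {
        return new RegistrationRequest(
            new HashMap<>(containerStatus), new HashMap<>(increasedContainers));
    }
}

/** Toy stand-in for a registration request that also carries pending increases. */
class RegistrationRequest {
    final Map<String, Integer> containerStatuses;
    final Map<String, Integer> increasedContainers;

    RegistrationRequest(Map<String, Integer> statuses, Map<String, Integer> increased) {
        this.containerStatuses = statuses;
        this.increasedContainers = increased;
    }
}

/** Toy stand-in for the RM recovery step. */
class ResourceManagerModel {
    /** Recovers sizes from statuses, overriding with any pending increase (assumed rule). */
    static Map<String, Integer> recover(RegistrationRequest request) {
        Map<String, Integer> recovered = new HashMap<>(request.containerStatuses);
        for (Map.Entry<String, Integer> e : request.increasedContainers.entrySet()) {
            if (recovered.containsKey(e.getKey())) {
                recovered.put(e.getKey(), e.getValue());
            }
        }
        return recovered;
    }
}
```

In this model, a resync that happens between {{requestIncrease}} and 
{{applyIncrease}} still lets the RM recover the increased size, because the 
pending entry travels with the registration request instead of being cleared.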

Thoughts?

> RM-NM protocol changes and NodeStatusUpdater implementation to support 
> container resizing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-1644
>                 URL: https://issues.apache.org/jira/browse/YARN-1644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1644-YARN-1197.4.patch, 
> YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch, 
> YARN-1644.3.patch, yarn-1644.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
