MENG DING commented on YARN-1644:

Thanks [~jianhe] for the review comments, and for pointing out the corner 
case that I missed.

I think the main problem (if I understand it correctly) is that if the NM 
fails to send {{increasedContainers}} to the RM and then gets a *Resync* 
instruction (due to an RM restart), {{increasedContainers}} is still cleared 
from {{NMContext}}. If, at that moment, the container status has not yet been 
updated in the NM (it's a very short time window, but still possible), 
{{registerNodeManager}} will send the old container resource info to the RM 
for RM container recovery.

If this is the case, then reusing the {{ContainerStatus}} object for 
{{increasedContainers}} still cannot resolve the problem. What we need, I 
believe, is to make sure that we only remove a container from 
{{NMContext.increasedContainers}} *after* its container status has been 
updated in the NM. We *also* need to add {{increasedContainers}} to 
{{RegisterNodeManagerRequestProto}}, so that during node manager 
registration the RM also checks 
{{RegisterNodeManagerRequest.increasedContainers}} and sets the correct 
container size for container recovery.
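
To make the proposed ordering concrete, here is a rough sketch of the idea. It is not taken from any of the attached patches: {{NmContextSketch}}, {{RegisterRequestSketch}}, {{RmRecoverySketch}} and all of their methods are hypothetical stand-ins, and container sizes are reduced to a single memory value in MB. The NM drops a pending increase only after the container status carries the new size, and the registration request carries both maps so the RM can prefer the pending increase during recovery.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registration payload: reported statuses plus pending increases. */
class RegisterRequestSketch {
    final Map<String, Integer> containerStatuses;   // containerId -> memory (MB)
    final Map<String, Integer> increasedContainers; // containerId -> increased memory (MB)

    RegisterRequestSketch(Map<String, Integer> statuses, Map<String, Integer> increased) {
        // Copy so later NM-side changes do not alter a request that was already built.
        this.containerStatuses = new HashMap<>(statuses);
        this.increasedContainers = new HashMap<>(increased);
    }
}

/** Hypothetical stand-in for the NM-side state; not the real NMContext API. */
class NmContextSketch {
    final Map<String, Integer> containerStatus = new HashMap<>();
    final Map<String, Integer> increasedContainers = new HashMap<>();

    /** Remember an increase whose new size is not yet reflected in the status. */
    void recordIncrease(String containerId, int newMemoryMb) {
        increasedContainers.put(containerId, newMemoryMb);
    }

    /** Update the container status first, and only then drop the pending entry. */
    void applyIncrease(String containerId) {
        Integer newMemoryMb = increasedContainers.get(containerId);
        if (newMemoryMb == null) {
            return; // nothing pending for this container
        }
        containerStatus.put(containerId, newMemoryMb);
        increasedContainers.remove(containerId);
    }

    /** On resync, send the pending increases along instead of clearing them. */
    RegisterRequestSketch buildRegisterRequest() {
        return new RegisterRequestSketch(containerStatus, increasedContainers);
    }
}

/** Hypothetical RM-side recovery rule: a pending increase wins over the status. */
class RmRecoverySketch {
    static int recoveredSizeMb(RegisterRequestSketch request, String containerId) {
        Integer increased = request.increasedContainers.get(containerId);
        if (increased != null) {
            return increased;
        }
        Integer reported = request.containerStatuses.get(containerId);
        if (reported == null) {
            throw new IllegalArgumentException("unknown container: " + containerId);
        }
        return reported;
    }
}
```

With this ordering, a resync that lands inside the short window still lets the RM recover the increased size, because the pending entry is still present in the registration request. Once the status has been updated, the entry is gone and the status alone is sufficient.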


> RM-NM protocol changes and NodeStatusUpdater implementation to support 
> container resizing
> -----------------------------------------------------------------------------------------
>                 Key: YARN-1644
>                 URL: https://issues.apache.org/jira/browse/YARN-1644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1644-YARN-1197.4.patch, 
> YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch, 
> YARN-1644.3.patch, yarn-1644.1.patch

This message was sent by Atlassian JIRA