[
https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641017#comment-14641017
]
MENG DING commented on YARN-1644:
---------------------------------
Thanks [~jianhe] for the review comments, and thanks for bringing up the corner
case that I missed.
I think the main problem (if I understand it correctly) is that if NM fails to
send the {{increasedContainers}} to RM and then gets a *Resync* instruction
(due to RM restart), the {{increasedContainers}} is still cleared from
{{NMContext}}. If, at this moment, the container status has not yet been
updated in NM (it's a very short time window, but still possible),
{{registerNodeManager}} will send the old container resource info to RM for RM
container recovery.
If this is the case, then reusing the {{ContainerStatus}} object for
{{increasedContainers}} still cannot resolve the problem. What we need, I
believe, is to make sure that we only remove the container from
{{NMContext.increasedContainers}} *after* the container status has been updated
in NM. We *also* need to add {{increasedContainers}} to
{{RegisterNodeManagerRequestProto}}, so that during node manager registration
RM also checks {{RegisterNodeManagerRequest.increasedContainers}} and sets the
correct container size for container recovery.
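To make the proposal above concrete, here is a rough, self-contained sketch of
the intended ordering. It is not code from any of the attached patches, and
none of the names below are real YARN types: {{ResizeRecoverySketch}},
{{NodeManagerSide}}, {{RegisterRequest}}, {{applyIncrease}} and
{{recoverContainerSizes}} are all made up for illustration, and container
resources are reduced to a single memory value in MB. It only assumes the two
rules described above: a pending increase is removed after the status update,
and the pending increases travel with the registration request.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; these are NOT real YARN classes.
final class ResizeRecoverySketch {

    /** Stand-in for the registration request carrying both maps to RM. */
    static final class RegisterRequest {
        final Map<String, Integer> containerStatuses;
        final Map<String, Integer> increasedContainers;

        RegisterRequest(Map<String, Integer> statuses, Map<String, Integer> increased) {
            this.containerStatuses = statuses;
            this.increasedContainers = increased;
        }
    }

    /** Stand-in for the NM-side bookkeeping (what NMContext holds in the discussion). */
    static final class NodeManagerSide {
        // containerId -> memory (MB) as currently recorded in the container status
        final Map<String, Integer> containerStatus = new HashMap<>();
        // containerId -> increased memory (MB) that RM may not have seen yet
        final Map<String, Integer> increasedContainers = new HashMap<>();

        void launch(String id, int memMb) {
            containerStatus.put(id, memMb);
        }

        // The increase is recorded first; the status itself is updated later.
        void recordIncrease(String id, int newMemMb) {
            increasedContainers.put(id, newMemMb);
        }

        // Rule 1: drop the pending entry only AFTER the status reflects the new size.
        void applyIncrease(String id) {
            Integer newMem = increasedContainers.get(id);
            if (newMem != null) {
                containerStatus.put(id, newMem);
                increasedContainers.remove(id);
            }
        }

        // Rule 2: on resync, do not clear the pending increases; send them along.
        RegisterRequest buildRegisterRequest() {
            return new RegisterRequest(
                new HashMap<>(containerStatus), new HashMap<>(increasedContainers));
        }
    }

    // RM side: a pending increase overrides the (possibly stale) status size.
    static Map<String, Integer> recoverContainerSizes(RegisterRequest req) {
        Map<String, Integer> recovered = new HashMap<>(req.containerStatuses);
        recovered.putAll(req.increasedContainers);
        return recovered;
    }

    public static void main(String[] args) {
        NodeManagerSide nm = new NodeManagerSide();
        nm.launch("container_1", 1024);
        nm.recordIncrease("container_1", 2048);
        // Resync arrives before the status update: RM still recovers the new size.
        System.out.println(recoverContainerSizes(nm.buildRegisterRequest()));
    }
}
```

In this sketch, a resync that lands inside the short window (increase recorded,
status not yet updated) still lets RM recover the container at the increased
size, because the pending entry has not been removed yet and is included in the
registration request.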
Thoughts?
> RM-NM protocol changes and NodeStatusUpdater implementation to support
> container resizing
> -----------------------------------------------------------------------------------------
>
> Key: YARN-1644
> URL: https://issues.apache.org/jira/browse/YARN-1644
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Wangda Tan
> Assignee: MENG DING
> Attachments: YARN-1644-YARN-1197.4.patch,
> YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch,
> YARN-1644.3.patch, yarn-1644.1.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)