[
https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643099#comment-14643099
]
MENG DING commented on YARN-1644:
---------------------------------
bq. NM re-registration can still happen between the time the increase action
is accepted, and the time it's added into increasedContainers. Even
startContainer has the same problem, newly started container may fall into this
tiny window that RM won't recover this container.
Yes, you are right that startContainer would have the same problem.
So to make it clear, RM restart/NM re-registration can happen in the following
scenarios:
* 1. Container resource increase is already completed. In this case, NM
re-registration can send the correct (increased) container size (through
containerStatus object) for RM recovery.
* 2. Container to be increased has been added into increasedContainers, but the
resource is not yet updated. In this case, NM re-registration can send the
correct container size through both containerStatus and increasedContainers
objects for RM recovery.
* 3. The increase action is accepted, but the container to be increased has not
been added into increasedContainers. In this case, the resource view between NM
and RM becomes different. The same issue applies to startContainers.
I don't have a solution for scenario 3 yet, but I think the chance of it
happening is very small, especially with the {{blockNewContainerRequests}} and
RM-identifier matching logic in place. Maybe we can file a separate JIRA for
scenario 3, and fix it for both container increase and container launch?
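To make the scenario-3 window concrete, here is a minimal sketch (all class,
field, and method names are illustrative, not actual NodeManager code): the
increase is accepted first, and only afterwards recorded in
increasedContainers, so a re-registration snapshot taken in between sees
neither the new container size nor an increasedContainers entry.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of scenario 3: the window between accepting a
// container-increase request and recording it in increasedContainers.
public class IncreaseWindowSketch {
    static Map<String, Integer> containerResource = new HashMap<>();
    static List<String> increasedContainers = new ArrayList<>();
    static List<String> lastSnapshot;

    // Step 1: the increase request is accepted (validated)...
    static void acceptIncrease(String containerId, int newMemMb) {
        // ...but a re-registration snapshot taken at this point sees
        // neither the new size nor an increasedContainers entry.
        snapshotForReRegistration();
        // Step 2: only now is the increase recorded for the RM.
        increasedContainers.add(containerId);
        containerResource.put(containerId, newMemMb);
    }

    // Stand-in for building the NM re-registration request to the RM.
    static void snapshotForReRegistration() {
        lastSnapshot = new ArrayList<>(increasedContainers);
    }

    public static void main(String[] args) {
        containerResource.put("container_1", 1024);
        acceptIncrease("container_1", 2048);
        // The snapshot missed the increase, so the RM's recovered view
        // diverges from the NM's actual view of the container.
        System.out.println(lastSnapshot.isEmpty());
        System.out.println(containerResource.get("container_1"));
    }
}
{code}

The same ordering argument applies to startContainers: a container started
inside that window is likewise absent from the re-registration request.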
> RM-NM protocol changes and NodeStatusUpdater implementation to support
> container resizing
> -----------------------------------------------------------------------------------------
>
> Key: YARN-1644
> URL: https://issues.apache.org/jira/browse/YARN-1644
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Reporter: Wangda Tan
> Assignee: MENG DING
> Attachments: YARN-1644-YARN-1197.4.patch,
> YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch,
> YARN-1644.3.patch, yarn-1644.1.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)