[
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322143#comment-14322143
]
Rohith commented on YARN-3194:
------------------------------
Thanks [~jianhe] for pointing out the container recovery flow! The issue
priority can be decided later; that is not a problem.
I took a deeper look at the NM registration flow. Two scenarios can occur:
# Node added event: here again, two scenarios can occur
## A new node registers with a different ip:port -- NOT A PROBLEM
## An old node re-registers because of a RESYNC command from the RM after an RM
restart -- NOT A PROBLEM
# Node reconnected event:
## An existing node re-registers, i.e. the RM treats it as a reconnecting node
when the RM has not been restarted
### NM RESTART is NOT enabled -- NOT A PROBLEM
### NM RESTART is enabled -- {color:red}Problem is here{color}
When a node reconnects while applications are still running on it, the
NMContainerStatus entries sent with the registration are ignored. I think
RMNodeReconnectEvent should take the NMContainerStatus list into account and
process it.
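To make the proposal concrete, here is a rough, self-contained sketch of what "process the NMContainerStatus on reconnect" could look like. This is not the actual ResourceManager code: ReconnectSketch, ContainerReport, NodeTracker and onReconnect are hypothetical stand-ins for illustration only, and the real fix would have to go through the existing RMNode and scheduler events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; none of these types exist in Hadoop YARN.
public class ReconnectSketch {
    enum State { RUNNING, COMPLETE }

    // Hypothetical stand-in for one container status sent during NM registration.
    static final class ContainerReport {
        final String containerId;
        final State state;

        ContainerReport(String containerId, State state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    // Hypothetical stand-in for the RM-side bookkeeping of one node.
    static final class NodeTracker {
        private final Map<String, Integer> allocatedMb = new HashMap<>();

        void allocate(String containerId, int memoryMb) {
            allocatedMb.put(containerId, memoryMb);
        }

        int totalAllocatedMb() {
            int total = 0;
            for (int mb : allocatedMb.values()) {
                total += mb;
            }
            return total;
        }

        // On reconnect, look at every reported status instead of ignoring them:
        // a container reported as COMPLETE has its allocation released.
        List<String> onReconnect(List<ContainerReport> reports) {
            List<String> released = new ArrayList<>();
            for (ContainerReport report : reports) {
                if (report.state == State.COMPLETE
                        && allocatedMb.remove(report.containerId) != null) {
                    released.add(report.containerId);
                }
            }
            return released;
        }
    }

    public static void main(String[] args) {
        NodeTracker tracker = new NodeTracker();
        tracker.allocate("container_1", 1024);
        tracker.allocate("container_2", 2048);

        // container_1 finished while the NM was down; container_2 is still running.
        List<String> released = tracker.onReconnect(Arrays.asList(
                new ContainerReport("container_1", State.COMPLETE),
                new ContainerReport("container_2", State.RUNNING)));

        System.out.println("released=" + released
                + " stillAllocatedMb=" + tracker.totalAllocatedMb());
    }
}
```

The point of the sketch is only that the reconnect path inspects the reported statuses: completed containers give their resources back, while running ones are left untouched.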
> After NM restart,completed containers are not released which are sent during
> NM registration
> --------------------------------------------------------------------------------------------
>
> Key: YARN-3194
> URL: https://issues.apache.org/jira/browse/YARN-3194
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Environment: NM restart is enabled
> Reporter: Rohith
> Assignee: Rohith
>
> On NM restart, the NM sends all outstanding NMContainerStatus to the RM, but
> the RM processes only ContainerState.RUNNING. If a container completed while
> the NM was down, its resources won't be released, which causes applications
> to hang.