[ 
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14322143#comment-14322143
 ] 

Rohith commented on YARN-3194:
------------------------------

Thanks [~jianhe] for pointing me to the container recovery flow! The issue 
priority can be decided later; that is not a problem.

I took a deeper look at the NM registration flow. Two scenarios can occur:
# Node added event: here again, two scenarios can occur
## A new node registers with a different ip:port -- NOT A PROBLEM
## An old node re-registers because of a RESYNC command from the RM after an 
RM restart -- NOT A PROBLEM
# Node reconnected event:
## An existing node re-registers, i.e. the RM treats it as a reconnecting node 
because the RM has not been restarted
### NM restart is NOT enabled -- NOT A PROBLEM
### NM restart is enabled -- {color:red}Problem is here{color}
When a node reconnects while applications are running on it, the 
NMContainerStatus reports are ignored. I think the RMNodeReconnectEvent 
handling should take the NMContainerStatus reports into account and process them.
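
The scenarios above can be sketched as a small decision function. This is only 
an illustration of the case analysis in this comment, not actual YARN code: the 
class, enum, and method names below are hypothetical.

```java
/** Hypothetical model of the registration scenarios listed above; not YARN code. */
final class RegistrationFlowSketch {

    /** Which RM-side event the registration results in (names are illustrative). */
    enum Event { NODE_ADDED, NODE_RECONNECTED }

    enum Outcome { NOT_A_PROBLEM, CONTAINER_STATUSES_IGNORED }

    /**
     * Classifies a registration according to the case analysis in the comment.
     *
     * @param event            NODE_ADDED covers a new ip:port and a RESYNC after RM restart;
     *                         NODE_RECONNECTED covers re-registration while the RM stayed up
     * @param nmRestartEnabled whether NM restart is enabled on the node
     */
    static Outcome classify(Event event, boolean nmRestartEnabled) {
        if (event == Event.NODE_ADDED) {
            // Both "node added" sub-cases are reported as fine.
            return Outcome.NOT_A_PROBLEM;
        }
        // Node reconnected: only the NM-restart-enabled case is reported as broken.
        return nmRestartEnabled
                ? Outcome.CONTAINER_STATUSES_IGNORED
                : Outcome.NOT_A_PROBLEM;
    }

    public static void main(String[] args) {
        for (Event event : Event.values()) {
            for (boolean enabled : new boolean[] {false, true}) {
                System.out.println(event + ", nmRestartEnabled=" + enabled
                        + " -> " + classify(event, enabled));
            }
        }
    }
}
```

Only the combination of a reconnected node and enabled NM restart maps to the 
problematic outcome; every other path matches a "NOT A PROBLEM" entry in the list.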

> After NM restart,completed containers are not released which are sent during 
> NM registration
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-3194
>                 URL: https://issues.apache.org/jira/browse/YARN-3194
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>         Environment: NM restart is enabled
>            Reporter: Rohith
>            Assignee: Rohith
>
> On NM restart, the NM sends all outstanding NMContainerStatus reports to the 
> RM, but the RM processes only those in ContainerState.RUNNING. If a container 
> completed while the NM was down, its resources are not released, which causes 
> applications to hang.
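
The effect described in the issue can be illustrated with a toy model. This is a 
minimal sketch under stated assumptions, not the RM implementation: the class, 
the Report type, and the processCompleted flag are hypothetical, and the local 
State enum only borrows the RUNNING and COMPLETE names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of container reports at NM registration; not YARN code. */
final class ContainerReleaseSketch {

    /** Local stand-in for the two container states discussed in the issue. */
    enum State { RUNNING, COMPLETE }

    /** Simplified stand-in for one container status report sent by the NM. */
    static final class Report {
        final String containerId;
        final State state;

        Report(String containerId, State state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    /**
     * Returns the container ids that remain allocated after the reports are handled.
     * With processCompleted == false, COMPLETE reports are skipped (the reported bug),
     * so their resources stay allocated.
     */
    static Set<String> remainingAllocations(Set<String> allocated,
                                            List<Report> reports,
                                            boolean processCompleted) {
        Set<String> remaining = new TreeSet<>(allocated);
        for (Report report : reports) {
            if (report.state == State.COMPLETE && processCompleted) {
                remaining.remove(report.containerId); // release the finished container
            }
            // RUNNING containers stay allocated in both modes.
        }
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> allocated = new TreeSet<>(Arrays.asList("c1", "c2"));
        List<Report> reports = Arrays.asList(
                new Report("c1", State.RUNNING),
                new Report("c2", State.COMPLETE)); // finished while the NM was down

        System.out.println("RUNNING only:   " + remainingAllocations(allocated, reports, false));
        System.out.println("with COMPLETE:  " + remainingAllocations(allocated, reports, true));
    }
}
```

In this model, container c2 stays allocated when completed reports are skipped, 
which corresponds to the unreleased resources described in the issue.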



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
