[
https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-3194:
-----------------------------
Fix Version/s: 2.6.2
I committed this to branch-2.6 as well.
> RM should handle NMContainerStatuses sent by NM while registering if NM is
> Reconnected node
> -------------------------------------------------------------------------------------------
>
> Key: YARN-3194
> URL: https://issues.apache.org/jira/browse/YARN-3194
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.0
> Environment: NM restart is enabled
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Priority: Blocker
> Fix For: 2.7.0, 2.6.2
>
> Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch
>
>
> On NM restart, the NM sends all outstanding NMContainerStatus entries to the
> RM during registration. The RM can treat the registration as a new node or a
> reconnecting node, and it triggers the corresponding event (node added or
> node reconnected).
> # Node added event: two scenarios can occur here
> ## New node is registering with different ip:port – NOT A PROBLEM
> ## Old node is re-registering because of a RESYNC command from the RM after
> an RM restart – NOT A PROBLEM
> # Node reconnected event :
> ## Existing node is re-registering, i.e. the RM treats it as a reconnecting
> node when the RM has not restarted
> ### NM RESTART NOT Enabled – NOT A PROBLEM
> ### NM RESTART is Enabled
> #### Some applications are running on this node – *Problem is here*
> #### Zero applications are running on this node – NOT A PROBLEM
> Since the NMContainerStatus entries are not handled, the RM never learns
> about completed containers and never releases the resources they hold. The
> RM will not allocate new containers for pending resource requests until the
> completedContainer event is triggered. As a result, applications wait
> indefinitely because their pending container requests are never served by
> the RM.
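
The case analysis in the description can be sketched in a few lines of Java. This is only an illustration of the decision tree, not the code from the attached patches: the class `ReconnectSketch`, the enum `Registration`, the type `ReportedContainer`, and both methods are hypothetical names and are not part of the YARN API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReconnectSketch {

    /** How the RM classified the registration (illustrative names only). */
    enum Registration { NODE_ADDED, NODE_RECONNECTED }

    /** Hypothetical stand-in for one container status reported by the NM. */
    static final class ReportedContainer {
        final String containerId;
        final boolean completed;

        ReportedContainer(String containerId, boolean completed) {
            this.containerId = containerId;
            this.completed = completed;
        }
    }

    /**
     * True only for the scenario the description marks as the problem:
     * a reconnected node, NM restart enabled, and at least one running app.
     */
    static boolean isProblemCase(Registration kind, boolean nmRestartEnabled, int runningApps) {
        return kind == Registration.NODE_RECONNECTED && nmRestartEnabled && runningApps > 0;
    }

    /** Collects the ids of reported containers that have already completed. */
    static List<String> completedContainerIds(List<ReportedContainer> reported) {
        List<String> ids = new ArrayList<>();
        for (ReportedContainer c : reported) {
            if (c.completed) {
                ids.add(c.containerId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<ReportedContainer> reported = Arrays.asList(
                new ReportedContainer("container_1", true),
                new ReportedContainer("container_2", false));
        if (isProblemCase(Registration.NODE_RECONNECTED, true, 1)) {
            // In the problem case these are the containers whose resources
            // the RM would need to release.
            System.out.println("release: " + completedContainerIds(reported));
        }
    }
}
```

Every other branch of the tree returns false from `isProblemCase`, which matches the NOT A PROBLEM entries in the list above.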