[ https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001800#comment-15001800 ]
zhihai xu commented on YARN-4344:
---------------------------------
Thanks for reporting this issue, [~vvasudev]! Thanks for the review,
[~Jason Lowe]!
[~rohithsharma] tried to clean up this code in YARN-3286. Based on the following
comment from [~jianhe] on YARN-3286,
{code}
I think this has changed the behavior that without any RM/NM restart features
enabled, earlier restarting a node will trigger RM to kill all the containers
on this node, but now it won't ?
{code}
the patch may cause a compatibility issue. Maybe we can merge the
{{rmNode.getHttpPort() == newNode.getHttpPort()}} case with the
{{rmNode.getHttpPort() != newNode.getHttpPort()}} case for noRunningApps.
Thoughts?
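To make the suggestion concrete, here is a minimal sketch of what the merged
handling could look like. It is not the actual RMNodeImpl code: the names
ReconnectSketch, Node, Cluster and reconnectWithNoRunningApps are hypothetical,
and it assumes that "merging" means always removing the old node from the
scheduler's accounting and adding the reconnecting node with its newly reported
capability, whether or not the HTTP port changed.

```java
import java.util.HashMap;
import java.util.Map;

class ReconnectSketch {

    /** Hypothetical stand-in for a node record; not the real RMNodeImpl. */
    static final class Node {
        final String id;
        final int httpPort;
        final long memoryMb;

        Node(String id, int httpPort, long memoryMb) {
            this.id = id;
            this.httpPort = httpPort;
            this.memoryMb = memoryMb;
        }
    }

    /** Hypothetical stand-in for the scheduler's cluster resource accounting. */
    static final class Cluster {
        private final Map<String, Node> nodes = new HashMap<>();
        private long total = 0;

        void addNode(Node node) {
            nodes.put(node.id, node);
            total += node.memoryMb;
        }

        void removeNode(String id) {
            // Subtract the capability recorded when the node was added,
            // not whatever the reconnecting node reports now.
            Node old = nodes.remove(id);
            if (old != null) {
                total -= old.memoryMb;
            }
        }

        long totalMemoryMb() {
            return total;
        }
    }

    /**
     * Assumed merged handling for the noRunningApps case: the same steps run
     * whether or not oldNode.httpPort == newNode.httpPort.
     */
    static void reconnectWithNoRunningApps(Cluster cluster, Node oldNode, Node newNode) {
        cluster.removeNode(oldNode.id);
        cluster.addNode(newNode);
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        Node nm1 = new Node("nm1:45454", 8042, 8192);
        cluster.addNode(nm1);
        cluster.addNode(new Node("nm2:45454", 8042, 4096));

        // nm1 reconnects on the same HTTP port but with more memory.
        reconnectWithNoRunningApps(cluster, nm1, new Node("nm1:45454", 8042, 16384));
        System.out.println(cluster.totalMemoryMb()); // prints 20480
    }
}
```

With a single code path, the cluster total is rebuilt from the old node's
recorded capability and the new node's reported capability in both cases, so
the two branches cannot drift apart.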
> NMs reconnecting with changed capabilities can lead to wrong cluster resource
> calculations
> ------------------------------------------------------------------------------------------
>
> Key: YARN-4344
> URL: https://issues.apache.org/jira/browse/YARN-4344
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1, 2.6.2
> Reporter: Varun Vasudev
> Assignee: Varun Vasudev
> Priority: Critical
> Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM re-connects to the RM with changed capabilities,
> situations can arise where the overall cluster resource calculation is
> incorrect, leading to inconsistencies in scheduling.