[
https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000521#comment-15000521
]
Jason Lowe commented on YARN-4344:
----------------------------------
Thanks for the patch, Varun! I think the change will fix the reported issue,
but I'm a bit skeptical of the vastly different handling of the event based on
whether apps are running or not. For example, if the http port is changing
when the node re-registers, why are we treating it as a node removal then
addition if there aren't any apps running but not if there are apps running?
Seems like that should be consistent.
Comments on the patch itself:
The comment about sending the node removal event at the start of the main block
in the transition is no longer very accurate.
Please don't put large sleeps (on the order of seconds) in tests. These extra
sleep seconds quickly add up to a significant amount of time over many tests.
If we need to sleep for polling reasons the sleep should be much shorter, like
on the order of 10ms. Better than sleep-polling is flushing the event
dispatcher and then checking since we can avoid polling entirely.
Nit: isCapabilityChanged init can be simplified to the following, similar to
the noRunningApps boolean init above it:
{code}
boolean isCapabilityChanged =
!rmNode.getTotalCapability().equals(newNode.getTotalCapability());
{code}
Nit: is this conditional check even necessary? We can just update the total
capability with no semantic effect if it hasn't changed. Since this is just
updating a reference with another precomputed one, it's not like we're avoiding
some expensive code. ;-)
{code}
if (isCapabilityChanged) {
rmNode.totalCapability = newNode.getTotalCapability();
}
{code}
> NMs reconnecting with changed capabilities can lead to wrong cluster resource
> calculations
> ------------------------------------------------------------------------------------------
>
> Key: YARN-4344
> URL: https://issues.apache.org/jira/browse/YARN-4344
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.1, 2.6.2
> Reporter: Varun Vasudev
> Assignee: Varun Vasudev
> Priority: Critical
> Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM re-connects to the RM with changed capabilities,
> there can arise situations where the overall cluster resource calculation for
> the cluster will be incorrect leading to inconsistencies in scheduling.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)