[ 
https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000521#comment-15000521
 ] 

Jason Lowe commented on YARN-4344:
----------------------------------

Thanks for the patch, Varun!  I think the change will fix the reported issue, 
but I'm a bit skeptical of the vastly different handling of the event based on 
whether apps are running or not.  For example, if the http port is changing 
when the node re-registers, why are we treating it as a node removal then 
addition if there aren't any apps running but not if there are apps running?  
Seems like that should be consistent.

Comments on the patch itself:

The comment about sending the node removal event at the start of the main block 
in the transition is no longer very accurate.
 
Please don't put large sleeps (on the order of seconds) in tests.  These extra 
sleep seconds quickly add up to a significant amount of time over many tests.  
If we need to sleep for polling reasons the sleep should be much shorter, like 
on the order of 10ms.  Better than sleep-polling is flushing the event 
dispatcher and then checking since we can avoid polling entirely.

Nit: isCapabilityChanged init can be simplified to the following, similar to 
the noRunningApps boolean init above it:
{code}
      boolean isCapabilityChanged =
          !rmNode.getTotalCapability().equals(newNode.getTotalCapability());
 {code}

Nit: is this conditional check even necessary?  We can just update the total 
capability with no semantic effect if it hasn't changed.  Since this is just 
updating a reference with another precomputed one, it's not like we're avoiding 
some expensive code. ;-)
{code}
        if (isCapabilityChanged) {
          rmNode.totalCapability = newNode.getTotalCapability();
        }
{code}

> NMs reconnecting with changed capabilities can lead to wrong cluster resource 
> calculations
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4344
>                 URL: https://issues.apache.org/jira/browse/YARN-4344
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>            Priority: Critical
>         Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM re-connects to the RM with changed capabilities, 
> there can arise situations where the overall cluster resource calculation for 
> the cluster will be incorrect leading to inconsistencies in scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to