Jason Lowe commented on YARN-4311:

Thanks for the patch, Kuhu!  The test failures look related; please look into them.

In general the patch seems like a reasonable approach.  There needs to be some 
way for admins to remove nodes that are no longer relevant to the cluster, and 
AFAIK there's no supported way to do this short of restarting the 
resourcemanager.  As nodes churn in and out of the cluster, they will simply 
accumulate in the decommissioned or lost nodes buckets until the next 
resourcemanager restart.

My main concern is the behavior when someone botches the include list (e.g.: 
accidentally truncates the includes list file and refreshes).  At that point 
all of the cluster nodes will disappear from the resourcemanager with no 
indication of what happened (except potentially the shutdown metric will 
increment by the number of nodes lost).  Today they will all go into the 
decommissioned bucket, but with this patch they'll simply disappear.  This 
either needs to be "as designed" behavior, or we'd have to implement a separate 
mechanism outside of the include/exclude lists to direct the RM to "forget" a 
node.  I believe HDFS recently was changed to behave this was as well wrt. the 
include/exclude lists and forgetting nodes (see HDFS-8950), so I'm inclined to 
be consistent with that and say it's "as designed."

Some comments on the patch itself:

isInvalidAndAbsent doesn't have the same handling of IPs as isValidNode does.

It might also be clearer if isInvalidAndAbsent were just named isUntracked or 
isUntrackedNode indicating those are nodes we aren't tracking in any way.

isInvalidAndAbsent doesn't lock hostsReader like isValidNode does.
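To make the IP-handling and locking points concrete, here's a self-contained sketch (hypothetical class and method names with a stubbed DNS lookup, not the actual NodesListManager code) of a check that consults both the hostname and its resolved IP, with list access synchronized the way isValidNode locks hostsReader:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for NodesListManager; resolve() stubs the DNS
// lookup that the real isValidNode performs on the hostname.
class HostTracker {
  private final Set<String> includes = new HashSet<>();
  private final Set<String> excludes = new HashSet<>();

  private String resolve(String host) {
    // Stub: map one known host to an IP, otherwise return the input.
    return host.equals("node1.example.com") ? "10.0.0.1" : host;
  }

  synchronized void setLists(Set<String> inc, Set<String> exc) {
    includes.clear(); includes.addAll(inc);
    excludes.clear(); excludes.addAll(exc);
  }

  // A node is "untracked" when neither its hostname nor its resolved IP
  // appears in either list.  Synchronized so a concurrent refresh can't
  // swap the lists mid-check, mirroring isValidNode's lock on hostsReader.
  synchronized boolean isUntrackedNode(String host) {
    String ip = resolve(host);
    return !includes.contains(host) && !includes.contains(ip)
        && !excludes.contains(host) && !excludes.contains(ip);
  }
}
```

Note this sketch doesn't yet cover the case from the issue description where no include list is configured (every node is implicitly included); the real check would need a guard for an empty includes list.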

What about refreshNodesGracefully?  That also refreshes the host 
include/exclude lists and arguably needs similar logic.  We need to discuss 
what it means to gracefully refresh the list when the node completely 
disappears from both the include and exclude list.  Should it still gracefully 
decommission, and how do we make sure that node is properly tracked?  If 
graceful, does it automatically disappear when the decommission completes since 
it's not in either list?
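To frame that question, here's a rough, self-contained sketch (hypothetical names, not a proposal for the actual refreshNodesGracefully code) of the three-way classification a graceful refresh would have to make for each running node:

```java
import java.util.Set;

// Hypothetical outcome of re-checking a running node against freshly
// reloaded host lists during a graceful refresh.
enum RefreshAction { KEEP, GRACEFUL_DECOMMISSION, FORGET }

class GracefulRefreshSketch {
  // An empty includes set models "no include list configured", in which
  // case every node is implicitly included.
  static RefreshAction classify(String host, Set<String> includes,
                                Set<String> excludes) {
    boolean included = includes.isEmpty() || includes.contains(host);
    boolean excluded = excludes.contains(host);
    if (included && !excluded) {
      return RefreshAction.KEEP;
    }
    if (excluded) {
      return RefreshAction.GRACEFUL_DECOMMISSION;  // drain containers first
    }
    // In neither list: this is the open question above -- decommission
    // gracefully and then forget, or forget immediately?
    return RefreshAction.FORGET;
  }
}
```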

Nit: While looping over the nodes, if the node is valid then there's no reason 
to check whether it's invalid and absent.  So it could be simplified to 
something like the following:

    for (NodeId nodeId : rmContext.getRMNodes().keySet()) {
      if (!isValidNode(nodeId.getHost())) {
        RMNodeEventType nodeEventType = isInvalidAndAbsent(nodeId.getHost()) ?
            RMNodeEventType.SHUTDOWN : RMNodeEventType.DECOMMISSION;
        this.rmContext.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(nodeId, nodeEventType));
      }
    }

> Removing nodes from include and exclude lists will not remove them from 
> decommissioned nodes list
> -------------------------------------------------------------------------------------------------
>                 Key: YARN-4311
>                 URL: https://issues.apache.org/jira/browse/YARN-4311
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: YARN-4311-v1.patch
> In order to fully forget about a node, removing it from the include and 
> exclude lists is not sufficient. The RM still lists it under Decomm-ed nodes. 
> The tricky part that [~jlowe] pointed out is the case when include lists are 
> not used; in that case we don't want nodes to fall off the list just because 
> they are not active.
