[
https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990565#comment-14990565
]
Jason Lowe commented on YARN-4311:
----------------------------------
Thanks for the patch, Kuhu! Test failures are related, please look into them.
In general the patch seems like a reasonable approach. There needs to be some
way for admins to remove nodes that are no longer relevant to the cluster, and
AFAIK there's no supported way to do this short of restarting the
resourcemanager. As nodes churn in and out of the cluster, they will simply
accumulate in the decommissioned or lost nodes buckets until the next
resourcemanager restart.
My main concern is the behavior when someone botches the include list (e.g.:
accidentally truncates the includes list file and refreshes). At that point
all of the cluster nodes will disappear from the resourcemanager with no
indication of what happened (except potentially the shutdown metric will
increment by the number of nodes lost). Today they will all go into the
decommissioned bucket, but with this patch they'll simply disappear. This
either needs to be "as designed" behavior, or we'd have to implement a separate
mechanism outside of the include/exclude lists to direct the RM to "forget" a
node. I believe HDFS recently was changed to behave this was as well wrt. the
include/exclude lists and forgetting nodes (see HDFS-8950), so I'm inclined to
be consistent with that and say it's "as designed."
Some comments on the patch itself:
isInvalidAndAbsent doesn't have the same handling of IP's as isValidNode does.
It might also be clearer if isInvalidAndAbsent were just named isUntracked or
isUntrackedNode indicating those are nodes we aren't tracking in any way.
isInvalidAndAbsent doesn't lock hostsReader like isValidNode does.
What about refreshNodesGracefully? That also refreshes the host
include/exclude lists and arguably needs similar logic. We need to discuss
what it means to gracefully refresh the list when the node completely
disappears from both the include and exclude list. Should it still gracefully
decommission, and how do we make sure that node is properly tracked? If
graceful, does it automatically disappear when the decommission completes since
it's not in either list?
Nit: While looping over the nodes, if the node is valid then there's no reason
to check if its not valid and absent. So it could be simplified to the
following:
{code}
for (NodeId nodeId: rmContext.getRMNodes().keySet()) {
if (!isValidNode(nodeId.getHost())) {
RMNodeEventType nodeEventType = isInvalidAndAbsent(nodeId.getHost()) ?
RMNodeEventType.SHUTDOWN : RMNodeEventType.DECOMMISSION;
this.rmContext.getDispatcher().getEventHandler().handle(
new RMNodeEvent(nodeId, nodeEventType));
}
}
{code}
> Removing nodes from include and exclude lists will not remove them from
> decommissioned nodes list
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-4311
> URL: https://issues.apache.org/jira/browse/YARN-4311
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.1
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Attachments: YARN-4311-v1.patch
>
>
> In order to fully forget about a node, removing the node from include and
> exclude list is not sufficient. The RM lists it under Decomm-ed nodes. The
> tricky part that [~jlowe] pointed out was the case when include lists are not
> used, in that case we don't want the nodes to fall off if they are not active.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)