[ https://issues.apache.org/jira/browse/YARN-10890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tarun Parimi reassigned YARN-10890:
-----------------------------------
Assignee: Tarun Parimi
> Node Attributes in Distributed mapping misses update to scheduler when node
> gets decommissioned/recommissioned
> --------------------------------------------------------------------------------------------------------------
>
> Key: YARN-10890
> URL: https://issues.apache.org/jira/browse/YARN-10890
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.2.1
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
>
> NodeAttributesManagerImpl maintains the node-to-attribute mapping, but it
> does not remove the mapping when a node goes down. This makes sense for
> centralized mapping: the attribute mapping is held centrally by the RM, so a
> node going down doesn't affect the mapping.
> In distributed mapping, the node attribute mapping is updated via the NM
> heartbeat to the RM, so these node attributes are valid only as long as the
> node is heartbeating. But when a node is decommissioned or lost, the node
> attribute entry still remains in NodeAttributesManagerImpl.
> After the performance improvement made in YARN-8925, we update distributed
> node attributes only when necessary. However, when a previously
> decommissioned node is recommissioned, NodeAttributesManagerImpl still
> has the old mapping entry belonging to the old SchedulerNode instance that
> was decommissioned.
> This results in ResourceTrackerService#updateNodeAttributesIfNecessary
> skipping the update, since it compares against the attributes belonging to
> the old decommissioned node instance.
> {code:java}
> if (!NodeLabelUtil
>     .isNodeAttributesEquals(nodeAttributes, currentNodeAttributes)) {
>   this.rmContext.getNodeAttributesManager()
>       .replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
>           ImmutableMap.of(nodeId.getHost(), nodeAttributes));
> } else if (LOG.isDebugEnabled()) {
>   LOG.debug("Skip updating node attributes since there is no change for "
>       + nodeId + " : " + nodeAttributes);
> }
> {code}
> We should remove the distributed node attributes whenever a node gets
> deactivated to avoid this issue. The attributes will then be added to the
> scheduler correctly whenever the node becomes active again and
> registers or heartbeats.
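
The stale-mapping behaviour described above, and the effect of the proposed fix, can be illustrated with a small toy model. This is only a sketch under stated assumptions: StaleAttributesDemo, its two maps, and the removeOnDeactivate flag are hypothetical names invented for illustration and are not YARN classes or APIs. It also assumes, as a simplification, that a re-registered node starts with no attributes on the scheduler side.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of the issue; none of these names are real YARN APIs. */
public class StaleAttributesDemo {
    // Stands in for the host -> attributes mapping kept by the attributes manager.
    private final Map<String, Set<String>> managerMapping = new HashMap<>();
    // Stands in for what the scheduler knows about each active node.
    private final Map<String, Set<String>> schedulerView = new HashMap<>();
    // When true, models the proposed fix: drop the mapping on deactivation.
    private final boolean removeOnDeactivate;

    public StaleAttributesDemo(boolean removeOnDeactivate) {
        this.removeOnDeactivate = removeOnDeactivate;
    }

    /** Registration creates a fresh scheduler-side node with no attributes (assumed). */
    public void activate(String host) {
        schedulerView.put(host, Set.of());
    }

    /** Heartbeat: propagate attributes only if they differ from the manager's mapping. */
    public void heartbeat(String host, Set<String> reported) {
        Set<String> current = managerMapping.getOrDefault(host, Set.of());
        if (!reported.equals(current)) {
            managerMapping.put(host, reported);
            schedulerView.put(host, reported);
        }
        // Otherwise the update is skipped, even if the scheduler-side node is new.
    }

    /** Decommission or loss: the scheduler-side node goes away. */
    public void deactivate(String host) {
        schedulerView.remove(host);
        if (removeOnDeactivate) {
            managerMapping.remove(host);
        }
    }

    public Set<String> schedulerAttributes(String host) {
        return schedulerView.getOrDefault(host, Set.of());
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            StaleAttributesDemo rm = new StaleAttributesDemo(fix);
            Set<String> attrs = Set.of("os=linux");
            rm.activate("host1");
            rm.heartbeat("host1", attrs);
            rm.deactivate("host1");       // node decommissioned
            rm.activate("host1");         // node recommissioned
            rm.heartbeat("host1", attrs); // same attributes reported again
            System.out.println("removeOnDeactivate=" + fix
                + " -> scheduler sees " + rm.schedulerAttributes("host1"));
        }
    }
}
```

Without the removal, the second heartbeat matches the stale entry and is skipped, so the recommissioned node ends up with no attributes in the scheduler view. With the removal, the same heartbeat is treated as a change and the attributes are propagated.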