Tarun Parimi created YARN-10890:
-----------------------------------

             Summary: Node Attributes in Distributed mapping misses update to 
scheduler when node gets decommissioned/recommissioned
                 Key: YARN-10890
                 URL: https://issues.apache.org/jira/browse/YARN-10890
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.1, 3.3.0
            Reporter: Tarun Parimi


The NodeAttributesManagerImpl maintains the node-to-attribute mapping, but it 
does not remove the mapping when a node goes down. This makes sense for 
centralized mapping: the mapping is maintained centrally in the RM, so a 
node going down doesn't affect it.

In distributed mapping, the node attribute mapping is updated via NM heartbeat 
to the RM, so these node attributes are only valid as long as the node is 
heartbeating. But when a node is decommissioned or lost, the node attribute 
entry still remains in NodeAttributesManagerImpl.

After the performance improvement in YARN-8925, we only update 
distributed node attributes when necessary. However, when a previously 
decommissioned node is recommissioned, NodeAttributesManagerImpl still 
has the old mapping entry belonging to the old SchedulerNode instance that was 
decommissioned.

This results in ResourceTrackerService#updateNodeAttributesIfNecessary skipping 
the update, since it compares the reported attributes with the attributes 
belonging to the old decommissioned node instance.
{code:java}
if (!NodeLabelUtil
    .isNodeAttributesEquals(nodeAttributes, currentNodeAttributes)) {
  this.rmContext.getNodeAttributesManager()
      .replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
          ImmutableMap.of(nodeId.getHost(), nodeAttributes));
} else if (LOG.isDebugEnabled()) {
  LOG.debug("Skip updating node attributes since there is no change for "
      + nodeId + " : " + nodeAttributes);
}
{code}
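
To make the sequence concrete, here is a toy model of the skipped update. It is 
only a sketch: StaleMappingDemo and all of its methods are hypothetical names 
and not YARN code. One map stands in for the mapping kept by 
NodeAttributesManagerImpl, the other for the attributes held by live 
SchedulerNode instances.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: every name here is hypothetical, none of this is YARN code.
class StaleMappingDemo {
  // Stands in for the host -> attributes map kept by NodeAttributesManagerImpl.
  private final Map<String, Set<String>> managerMapping = new HashMap<>();
  // Stands in for the attributes held by the live SchedulerNode instances.
  private final Map<String, Set<String>> schedulerNodes = new HashMap<>();

  // Node registers: the scheduler creates a fresh node with no attributes.
  void activate(String host) {
    schedulerNodes.put(host, Collections.<String>emptySet());
  }

  // Node is decommissioned or lost: only the scheduler node is dropped,
  // the manager entry is left behind (this is the behaviour reported here).
  void deactivate(String host) {
    schedulerNodes.remove(host);
  }

  // "Update only if changed" check; returns true when an update was pushed.
  boolean heartbeat(String host, String... attrs) {
    Set<String> reported = new HashSet<>(Arrays.asList(attrs));
    Set<String> current =
        managerMapping.getOrDefault(host, Collections.<String>emptySet());
    if (reported.equals(current)) {
      return false; // skipped: the manager believes nothing changed
    }
    managerMapping.put(host, reported);
    schedulerNodes.computeIfPresent(host, (h, old) -> reported);
    return true;
  }

  // Attributes the scheduler currently sees for this host.
  Set<String> schedulerView(String host) {
    return schedulerNodes.getOrDefault(host, Collections.<String>emptySet());
  }
}
```

In this model, calling activate("host1"), heartbeat("host1", "gpu"), 
deactivate("host1"), activate("host1") and then heartbeat("host1", "gpu") 
again makes the second heartbeat return false, and schedulerView("host1") 
stays empty even though the node is reporting the attribute.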

We should remove the distributed node attributes whenever a node gets 
deactivated to avoid this issue. The attributes will then be added properly in 
the scheduler whenever the node becomes active again and registers/heartbeats.
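
One possible shape for this cleanup is sketched below. It is not the committed 
patch: DeactivationCleanupSketch, onNodeDeactivated and the "prefix/name" key 
format are hypothetical, and the prefix string is an assumed value modelled on 
NodeAttribute.PREFIX_DISTRIBUTED. The point is only that the deactivation hook 
drops NM-reported attributes and leaves any centralized ones untouched.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: hypothetical names, not the actual NodeAttributesManagerImpl.
class DeactivationCleanupSketch {
  // Assumed value, modelled on NodeAttribute.PREFIX_DISTRIBUTED.
  static final String DISTRIBUTED = "nm.yarn.io";

  // host -> attribute keys of the (assumed) form "<prefix>/<name>"
  private final Map<String, Set<String>> mapping = new HashMap<>();

  void add(String host, String prefix, String name) {
    mapping.computeIfAbsent(host, h -> new HashSet<>()).add(prefix + "/" + name);
  }

  // Called when a node is decommissioned or lost: drop only the attributes
  // that were reported by the NM, so the next heartbeat is seen as a change.
  void onNodeDeactivated(String host) {
    Set<String> attrs = mapping.get(host);
    if (attrs == null) {
      return;
    }
    attrs.removeIf(a -> a.startsWith(DISTRIBUTED + "/"));
    if (attrs.isEmpty()) {
      mapping.remove(host);
    }
  }

  boolean has(String host, String prefix, String name) {
    Set<String> attrs = mapping.get(host);
    return attrs != null && attrs.contains(prefix + "/" + name);
  }
}
```

With the distributed entries gone, the comparison in 
updateNodeAttributesIfNecessary would see the first heartbeat after 
recommissioning as a change and push the attributes to the new node.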





