Chengbing Liu created YARN-3266:
-----------------------------------

             Summary: RMContext inactiveNodes should have NodeId as map key
                 Key: YARN-3266
                 URL: https://issues.apache.org/jira/browse/YARN-3266
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Chengbing Liu


With the default NM port configuration (port 0, so each start binds an 
ephemeral port), we have observed on the current version that the "lost nodes" 
count can exceed the length of the lost node list. This happens when the same 
NM is restarted twice in a row:
* NM started at port 10001
* NM restarted at port 10002
* NM restarted at port 10003
* NM:10001 times out, {{ClusterMetrics#incrNumLostNMs()}} is called, lost node 
count = 1; 
{{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, 
{{inactiveNodes}} has 1 element
* NM:10002 times out, {{ClusterMetrics#incrNumLostNMs()}} is called, lost node 
count = 2; 
{{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}} 
overwrites the earlier entry for the same host, so {{inactiveNodes}} still has 
1 element
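
The sequence above can be reproduced in isolation with a minimal sketch. It 
does not use the real ResourceManager classes: the class name 
{{LostNodeMismatch}}, the {{simulate()}} method, the host name {{host1}} and 
the string values are hypothetical stand-ins, and the counter only imitates 
what {{ClusterMetrics#incrNumLostNMs()}} does.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LostNodeMismatch {

    /**
     * Simulates two NM instances on the same host timing out.
     * Returns {lostCount, inactiveMapSize}.
     */
    static int[] simulate() {
        // Hypothetical stand-in for the lost-NM counter in ClusterMetrics.
        AtomicInteger numLostNMs = new AtomicInteger();
        // Stand-in for inactiveNodes, keyed by host name only (as reported).
        ConcurrentMap<String, String> inactiveNodes = new ConcurrentHashMap<>();

        String host = "host1";
        int[] timedOutPorts = {10001, 10002};
        for (int port : timedOutPorts) {
            numLostNMs.incrementAndGet();
            // Same key for both ports, so the second put replaces the first.
            inactiveNodes.put(host, host + ":" + port);
        }
        return new int[] {numLostNMs.get(), inactiveNodes.size()};
    }

    public static void main(String[] args) {
        int[] result = simulate();
        System.out.println("lost count = " + result[0]);
        System.out.println("inactive map size = " + result[1]);
    }
}
```

The counter is incremented once per timeout, while the map holds one entry per 
host, which is where the two numbers diverge.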

Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), 
{{inactiveNodes}} should be of type {{ConcurrentMap<NodeId, RMNode>}}. If that 
would break the current API, the string key should at least include the NM's 
port as well as the host.
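
A rough sketch of the two options follows, as an illustration only and not a 
patch. The nested {{NodeId}} class is a simplified, hypothetical stand-in for 
YARN's {{NodeId}} (just host and port with value equality), the map values are 
plain strings instead of {{RMNode}}, and the class and method names are made 
up for this example.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class InactiveNodesKeying {

    // Simplified, hypothetical stand-in for YARN's NodeId (host plus port).
    static final class NodeId {
        final String host;
        final int port;

        NodeId(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NodeId)) {
                return false;
            }
            NodeId other = (NodeId) o;
            return port == other.port && host.equals(other.host);
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    // Option 1: key the map by NodeId, so each host:port pair is distinct.
    static int sizeWithNodeIdKey() {
        ConcurrentMap<NodeId, String> inactiveNodes = new ConcurrentHashMap<>();
        inactiveNodes.put(new NodeId("host1", 10001), "rmNode@10001");
        inactiveNodes.put(new NodeId("host1", 10002), "rmNode@10002");
        return inactiveNodes.size();
    }

    // Option 2: keep a String key but include the port in it.
    static int sizeWithHostPortKey() {
        ConcurrentMap<String, String> inactiveNodes = new ConcurrentHashMap<>();
        inactiveNodes.put("host1" + ":" + 10001, "rmNode@10001");
        inactiveNodes.put("host1" + ":" + 10002, "rmNode@10002");
        return inactiveNodes.size();
    }

    public static void main(String[] args) {
        System.out.println("NodeId key size = " + sizeWithNodeIdKey());
        System.out.println("host:port key size = " + sizeWithHostPortKey());
    }
}
```

With either keying, both timed-out instances remain in the map, so its size 
matches a counter that was incremented twice.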

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
