Hong Zhiguo created YARN-2299:
---------------------------------

             Summary: inconsistency at identifying node
                 Key: YARN-2299
                 URL: https://issues.apache.org/jira/browse/YARN-2299
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Hong Zhiguo
            Assignee: Hong Zhiguo
            Priority: Critical


If no port is specified in "yarn.nodemanager.address" on the NM, the NM chooses 
a random port. If the NM dies ungracefully (OOM kill, kill -9, or OS restart) 
and is restarted within "yarn.nm.liveness-monitor.expiry-interval-ms", 
"host:port1" and "host:port2" both appear in "Active Nodes" on the WebUI for a 
while. After host:port1 expires, host:port1 is in "Lost Nodes" and host:port2 
is in "Active Nodes". If the NM dies ungracefully again, only host:port1 is in 
"Lost Nodes"; host:port2 is in neither "Active Nodes" nor "Lost Nodes".

Another case: when two NMs run on the same host (miniYarnCluster or other test 
purposes) and both are lost, only one node appears under "Lost Nodes" in the 
WebUI.

In both cases, the sum of "Active Nodes" and "Lost Nodes" does not match the 
number of nodes we expect.

The root cause is an inconsistency in how we decide whether two nodes are 
identical.
When we manage active nodes (RMContextImpl.nodes), we use NodeId, which 
contains the port. Two nodes with the same host but different ports are 
treated as different nodes.
But when we manage inactive nodes (RMContextImpl.inactiveNodes), we use only 
the host. Two nodes with the same host but different ports are treated as 
identical.
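
To make the mismatch concrete, here is a minimal standalone sketch. It is not 
the actual RM code: NodeTrackingSketch and its simplified NodeId are 
hypothetical stand-ins, and only the choice of map keys (full NodeId for 
active nodes, host only for inactive nodes) is taken from the description 
above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class NodeTrackingSketch {
    // Simplified stand-in for YARN's NodeId: identity is host plus port.
    static final class NodeId {
        final String host;
        final int port;

        NodeId(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NodeId)) {
                return false;
            }
            NodeId other = (NodeId) o;
            return host.equals(other.host) && port == other.port;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    // Active nodes are keyed by the full NodeId (host and port).
    final Map<NodeId, String> activeNodes = new HashMap<>();
    // Inactive nodes are keyed by host only, so the port is ignored.
    final Map<String, NodeId> inactiveNodes = new HashMap<>();

    void register(NodeId id) {
        activeNodes.put(id, "RUNNING");
    }

    void markLost(NodeId id) {
        if (activeNodes.remove(id) != null) {
            // Assumption: plain put(). Which entry survives is not the point
            // here; the point is that only one entry per host can exist.
            inactiveNodes.put(id.host, id);
        }
    }

    public static void main(String[] args) {
        NodeTrackingSketch rm = new NodeTrackingSketch();
        NodeId first = new NodeId("host", 1001);
        NodeId second = new NodeId("host", 1002);

        rm.register(first);
        rm.register(second); // same host, different port
        System.out.println("active: " + rm.activeNodes.size());

        rm.markLost(first);
        rm.markLost(second);
        // Two nodes were lost, but the host-keyed map holds a single entry.
        System.out.println("lost:   " + rm.inactiveNodes.size());
    }
}
```

Two nodes are counted as active, yet after both are lost only one entry 
remains in the inactive map, so the totals no longer add up.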

We should differentiate two cases:
 - multiple NMs intentionally run on one host
 - NM instances started one after another on the same host

Two possible solutions:
1) Introduce a boolean config like "one-node-per-host" (default "true"), and 
use only the host to differentiate nodes on the RM when it is true.

2) Make it mandatory to specify a valid port in the "yarn.nodemanager.address" 
config. In this situation, NM instances started one after another on the same 
host will have the same NodeId, while multiple NMs intentionally run on one 
host will have different NodeIds.
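
A rough sketch of how option 1 could look, under stated assumptions: the class 
NodeKeySketch, its fields, and the oneNodePerHost flag are hypothetical names 
used for illustration and are not existing YARN code or configuration. The 
idea is only that the active and lost maps share one key function, so the two 
views can never disagree about node identity.

```java
import java.util.HashMap;
import java.util.Map;

public class NodeKeySketch {
    // Hypothetical flag mirroring the proposed "one-node-per-host" config.
    private final boolean oneNodePerHost;

    // Both maps use the same key; the value records the concrete instance.
    final Map<String, String> active = new HashMap<>();
    final Map<String, String> lost = new HashMap<>();

    NodeKeySketch(boolean oneNodePerHost) {
        this.oneNodePerHost = oneNodePerHost;
    }

    // Single place that decides what identifies a node.
    String key(String host, int port) {
        return oneNodePerHost ? host : host + ":" + port;
    }

    void register(String host, int port) {
        String k = key(host, port);
        lost.remove(k); // a node that registers again is no longer lost
        active.put(k, host + ":" + port);
    }

    void markLost(String host, int port) {
        String k = key(host, port);
        String instance = host + ":" + port;
        // Remove only if this exact instance is still the active one, so a
        // late expiry of an old instance does not evict its replacement.
        if (active.remove(k, instance)) {
            lost.put(k, instance);
        }
    }

    public static void main(String[] args) {
        // NM restarted on a new random port: still one node for the host.
        NodeKeySketch single = new NodeKeySketch(true);
        single.register("host", 1001);
        single.register("host", 1002);
        single.markLost("host", 1001); // stale expiry, ignored
        System.out.println(single.active.size() + " active, "
            + single.lost.size() + " lost");

        // Multiple NMs on one host on purpose: each port is its own node.
        NodeKeySketch multi = new NodeKeySketch(false);
        multi.register("host", 1001);
        multi.register("host", 1002);
        multi.markLost("host", 1001);
        multi.markLost("host", 1002);
        System.out.println(multi.active.size() + " active, "
            + multi.lost.size() + " lost");
    }
}
```

With the flag set to true, a restarted NM replaces its predecessor instead of 
appearing twice; with it set to false, both maps fall back to host:port keys 
and every NM on the host is counted separately.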

Personally I prefer option 1 because it's easier for users.



