[
https://issues.apache.org/jira/browse/YARN-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225312#comment-14225312
]
Bruno Alexandre Rosa commented on YARN-2299:
--------------------------------------------
I tried to reproduce the first case on version 2.5.2 and the bug is still
present. However, instead of host:port1 showing under Lost Nodes, I got
host:port2. In the same fashion, I lost track of host:port1. The total of Lost
Nodes remains inconsistent.
> inconsistency at identifying node
> ---------------------------------
>
> Key: YARN-2299
> URL: https://issues.apache.org/jira/browse/YARN-2299
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Hong Zhiguo
> Assignee: Hong Zhiguo
> Priority: Critical
>
> If the port of "yarn.nodemanager.address" is not specified at the NM, the NM
> will choose a random port. If the NM dies ungracefully (OOM kill, kill -9, or
> OS restart) and is then restarted within
> "yarn.nm.liveness-monitor.expiry-interval-ms", "host:port1" and "host:port2"
> will both be present in "Active Nodes" on the WebUI for a while, and after
> host:port1 expires, we get host:port1 in "Lost Nodes" and host:port2 in
> "Active Nodes". If the NM dies ungracefully again, we get only host:port1 in
> "Lost Nodes". "host:port2" is neither in "Active Nodes" nor in "Lost Nodes".
> Another case: if two NMs are running on the same host (miniYarnCluster or
> other test purposes) and both of them are lost, we get only one entry under
> "Lost Nodes" in the WebUI.
> In both cases, the sum of "Active Nodes" and "Lost Nodes" is not the number
> of nodes we expect.
> The root cause is an inconsistency in how we decide whether two nodes are
> identical.
> When we manage active nodes (RMContextImpl.nodes), we use NodeId, which
> contains the port. Two nodes with the same host but different ports are
> considered different nodes.
> But when we manage inactive nodes (RMContextImpl.inactiveNodes), we use only
> the host. Two nodes with the same host but different ports are considered
> identical.
> To fix the inconsistency, we should differentiate the two cases below and be
> consistent for both of them:
> - intentionally running multiple NMs per host
> - NM instances started one after another on the same host
> Two possible solutions:
> 1) Introduce a boolean config like "one-node-per-host" (default "true"),
> and use the host to differentiate nodes on the RM if it is true.
> 2) Make it mandatory to have a valid port in the "yarn.nodemanager.address"
> config. In this situation, NM instances started one after another on the
> same host will have the same NodeId, while intentionally running multiple
> NMs per host will give each a different NodeId.
> Personally I prefer option 1 because it's easier for users.
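
The root cause quoted above can be illustrated with a small standalone sketch.
It is not the actual RMContextImpl code: the class NodeTrackingSketch and its
methods are hypothetical stand-ins, and it only assumes what the description
states, namely that active nodes are keyed by host:port while inactive nodes
are keyed by host alone. Whether an existing inactive entry is overwritten or
kept is also an assumption here (overwriting matches the 2.5.2 observation
above); either way, only one entry per host survives.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the RM's node bookkeeping; NOT the real RMContextImpl.
class NodeTrackingSketch {
    // Active nodes keyed by "host:port" (mirrors keying by NodeId).
    private final Map<String, String> activeNodes = new HashMap<>();
    // Inactive nodes keyed by host only (the inconsistency described above).
    private final Map<String, String> inactiveNodes = new HashMap<>();

    void register(String host, int port) {
        activeNodes.put(host + ":" + port, "RUNNING");
    }

    void markLost(String host, int port) {
        String nodeId = host + ":" + port;
        if (activeNodes.remove(nodeId) != null) {
            // Assumption: a later lost node on the same host replaces the
            // earlier entry, so at most one lost node per host is visible.
            inactiveNodes.put(host, nodeId);
        }
    }

    int activeCount() { return activeNodes.size(); }

    int lostCount() { return inactiveNodes.size(); }

    String lostEntryFor(String host) { return inactiveNodes.get(host); }

    public static void main(String[] args) {
        NodeTrackingSketch rm = new NodeTrackingSketch();
        rm.register("host", 1001); // first NM instance, random port
        rm.register("host", 1002); // restarted NM, different random port
        rm.markLost("host", 1001); // first instance expires
        rm.markLost("host", 1002); // second instance dies as well
        // Two nodes were registered and both were lost, but only one is counted.
        System.out.println("active=" + rm.activeCount()
            + " lost=" + rm.lostCount()
            + " lostEntry=" + rm.lostEntryFor("host"));
    }
}
```

Keying both maps the same way (both by host:port, or both by host) would make
the two counts add up again, which is the consistency both proposed solutions
aim for.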
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)