[
https://issues.apache.org/jira/browse/YARN-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhaoyunjiong reopened YARN-1888:
--------------------------------
The problem here is that our cluster uses port 0 for the NodeManager, so when a
NodeManager restarts, the "Lost Nodes" count becomes inaccurate:
Host A has a NodeManager with ID $HOSTA:$PORTA.
After a restart, the NodeManager has ID $HOSTA:$PORTB.
Because the ID changed, the ResourceManager does not treat it as a reconnected
NodeManager.
A few minutes later, NodeManager $HOSTA:$PORTA expires and is marked as LOST.
This confuses people. At first I did not think it was a bug either, but after
several people asked me why so many nodes were LOST, I came up with this
simple patch: if another NodeManager is already registered on the same node (in
a real production cluster, I don't think people will start more than one
NodeManager on one machine), then don't mark the expired NodeManager as LOST.
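
The check described above could be sketched roughly as follows. This is only an
illustration of the idea, not the contents of YARN-1888.patch: the class
LostNodeSketch, the methods hostOf and shouldMarkLost, and the use of plain
"host:port" strings instead of YARN's own node ID type are all assumptions made
for the example.

```java
import java.util.Set;

// Hypothetical helper, not part of YARN. Node IDs are modeled as plain
// "host:port" strings purely for illustration.
final class LostNodeSketch {

    // Returns the host part of a "host:port" node ID.
    static String hostOf(String nodeId) {
        return nodeId.substring(0, nodeId.lastIndexOf(':'));
    }

    // Decides whether an expired NodeManager should be counted as LOST.
    // It is not counted if a different NodeManager ID on the same host is
    // currently active, which is what happens after a restart with port 0.
    static boolean shouldMarkLost(String expiredId, Set<String> activeIds) {
        String expiredHost = hostOf(expiredId);
        for (String activeId : activeIds) {
            boolean sameHost = hostOf(activeId).equals(expiredHost);
            boolean differentId = !activeId.equals(expiredId);
            if (sameHost && differentId) {
                return false; // a restarted NodeManager has taken its place
            }
        }
        return true; // nothing else is running on that host
    }
}
```

With an active set containing $HOSTA:$PORTB, the expired $HOSTA:$PORTA would
not be marked LOST, while an expired node on a host with no other active
NodeManager still would be. Note that this relies on the assumption stated
above that only one NodeManager runs per machine.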
> Do not add NodeManager to inactiveRMNodes when a rebooted NodeManager has a
> different port
> ----------------------------------------------------------------------------------------
>
> Key: YARN-1888
> URL: https://issues.apache.org/jira/browse/YARN-1888
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.3.0
> Reporter: zhaoyunjiong
> Priority: Minor
> Attachments: YARN-1888.patch
>
>
> When the NodeManager's port is set to 0, rebooting the NodeManager makes the
> "Lost Nodes" count inaccurate.
--
This message was sent by Atlassian JIRA
(v6.2#6252)