Jun Gong created YARN-5063:
------------------------------
Summary: Fail to launch AM continuously on a lost NM
Key: YARN-5063
URL: https://issues.apache.org/jira/browse/YARN-5063
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
If an NM shuts down, the RM does not mark it as LOST until the liveness
monitor detects that it has timed out. Until then, the RM might repeatedly
allocate the AM on that NM.
We hit this case in our cluster: the RM repeatedly allocated the same AM on a
lost NM before it detected that the NM was lost, and AMLauncher failed every
time because it could not connect to the lost NM. To solve the problem, we
could add the NM to the AM blacklist if the RM fails to launch the AM on it.
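
A rough sketch of the proposed behavior follows. It is only an illustration of
the idea, not the actual RM code or a patch: AmBlacklistSketch, AmScheduler,
Launcher, pickNode and launchAm are all hypothetical names, and nodes are
represented as plain strings. It assumes a failed launch is reported
synchronously as a boolean.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class AmBlacklistSketch {

    /** Hypothetical stand-in for the launcher: returns false if the NM is unreachable. */
    interface Launcher {
        boolean launch(String node);
    }

    /** Hypothetical, simplified scheduler that keeps a per-application AM blacklist. */
    static final class AmScheduler {
        private final Set<String> amBlacklist = new HashSet<>();

        /** Picks the first candidate node that is not on the AM blacklist. */
        Optional<String> pickNode(List<String> candidates) {
            return candidates.stream()
                    .filter(node -> !amBlacklist.contains(node))
                    .findFirst();
        }

        /**
         * Tries to launch the AM up to maxAttempts times. A node on which the
         * launch fails is added to the blacklist, so the next attempt cannot
         * land on the same (possibly lost) node again.
         */
        Optional<String> launchAm(List<String> candidates, Launcher launcher, int maxAttempts) {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                Optional<String> node = pickNode(candidates);
                if (!node.isPresent()) {
                    return Optional.empty(); // every candidate is blacklisted
                }
                if (launcher.launch(node.get())) {
                    return node; // AM launched successfully
                }
                amBlacklist.add(node.get()); // launch failed: avoid this node next time
            }
            return Optional.empty();
        }

        Set<String> blacklist() {
            return Collections.unmodifiableSet(amBlacklist);
        }
    }

    public static void main(String[] args) {
        AmScheduler scheduler = new AmScheduler();
        // Assumption for the demo: "nm1" has shut down but is not yet marked LOST.
        Launcher launcher = node -> !node.equals("nm1");
        Optional<String> placed =
                scheduler.launchAm(Arrays.asList("nm1", "nm2"), launcher, 3);
        System.out.println("AM placed on: " + placed.orElse("none"));
        System.out.println("AM blacklist: " + scheduler.blacklist());
    }
}
```

Without the blacklist update, every attempt in this sketch would pick the same
unreachable node, which matches the repeated failures we saw.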