Jun Gong created YARN-5063:
------------------------------

             Summary: Fail to launch AM continuously on a lost NM
                 Key: YARN-5063
                 URL: https://issues.apache.org/jira/browse/YARN-5063
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Jun Gong
            Assignee: Jun Gong


If an NM shuts down, the RM does not mark it as LOST until the liveness 
monitor detects that its heartbeats have timed out. Until then, the RM may 
keep allocating the AM container on that NM.

We hit this case in our cluster: the RM repeatedly allocated the same AM on a 
lost NM before it detected that the NM was lost, and AMLauncher failed every 
time because it could not connect to that NM. To solve the problem, we could 
add the NM to the AM blacklist whenever the RM fails to launch the AM on it.
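
The proposed behaviour can be sketched roughly as follows. This is only an 
illustration of the idea, not the actual RM code: the class and the methods 
tryLaunch, pickNode and launchAm are hypothetical names made up for this 
example, and the real scheduler and AMLauncher are reduced to simple stand-ins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are real YARN classes or methods.
class AmBlacklistSketch {

    // Stand-in for AMLauncher: the launch fails when the NM is unreachable.
    static boolean tryLaunch(String node, Set<String> unreachableNodes) {
        return !unreachableNodes.contains(node);
    }

    // Stand-in for the scheduler: pick the first candidate that is not on
    // the AM blacklist, or null if no eligible node is left.
    static String pickNode(List<String> candidates, Set<String> amBlacklist) {
        for (String node : candidates) {
            if (!amBlacklist.contains(node)) {
                return node;
            }
        }
        return null;
    }

    // Retry the AM launch, blacklisting each node on which a launch fails.
    // Returns the node that ended up hosting the AM, or null if all
    // attempts failed or no eligible node remained.
    static String launchAm(List<String> candidates,
                           Set<String> unreachableNodes,
                           int maxAttempts) {
        Set<String> amBlacklist = new HashSet<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String node = pickNode(candidates, amBlacklist);
            if (node == null) {
                return null;
            }
            if (tryLaunch(node, unreachableNodes)) {
                return node;
            }
            // Launch failed: keep the next attempt away from this node.
            amBlacklist.add(node);
        }
        return null;
    }

    public static void main(String[] args) {
        // nm1 has shut down but is not yet marked LOST, so it is still a
        // candidate from the scheduler's point of view.
        List<String> nodes = Arrays.asList("nm1", "nm2");
        Set<String> unreachable = Collections.singleton("nm1");
        System.out.println(launchAm(nodes, unreachable, 3)); // prints "nm2"
    }
}
```

Without the amBlacklist.add(node) step, pickNode would return nm1 on every 
attempt, which is the repeated failure described above.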



