Jun Gong created YARN-3809:

             Summary: Failed to launch new attempts because 
ApplicationMasterLauncher's threads all hang
                 Key: YARN-3809
                 URL: https://issues.apache.org/jira/browse/YARN-3809
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
            Reporter: Jun Gong
            Assignee: Jun Gong

ApplicationMasterLauncher creates a thread pool of size 10 to handle 
AMLauncherEventType events (LAUNCH and CLEANUP).

In our cluster, there were many NMs with 10+ AMs running on each, and one of 
them shut down for some reason. After the RM found the NM LOST, it cleaned up 
the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
CLEANUP events. They filled up ApplicationMasterLauncher's thread pool, and 
every thread hung in containerMgrProxy.stopContainers(stopRequest) because the 
NM was down and the default RPC timeout is 15 minutes. This means that for 15 
minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, 
and new attempts failed to launch because they timed out.
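
The starvation described above can be reproduced outside YARN with a plain 
fixed-size thread pool. The following is only a minimal sketch under stated 
assumptions, not the ApplicationMasterLauncher code: the class and method 
names are hypothetical, and a CountDownLatch that is never released stands in 
for the stopContainers call to an unreachable NM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from the YARN code base.
public class LauncherStarvationSketch {

    /**
     * Submits cleanupEvents blocking tasks (standing in for CLEANUP events
     * whose RPC hangs), then one LAUNCH task, and reports whether the LAUNCH
     * task started within waitMillis.
     */
    static boolean launchStartsInTime(int poolSize, int cleanupEvents, long waitMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Assumption: released only during teardown, so each CLEANUP task
        // blocks the way a call to a dead NM would until the RPC times out.
        CountDownLatch nmResponds = new CountDownLatch(1);
        CountDownLatch launchStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < cleanupEvents; i++) {
                pool.execute(() -> {
                    try {
                        nmResponds.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The LAUNCH event is queued behind the CLEANUP events.
            pool.execute(launchStarted::countDown);
            return launchStarted.await(waitMillis, TimeUnit.MILLISECONDS);
        } finally {
            nmResponds.countDown(); // unblock the simulated RPCs
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 10 hung CLEANUP events occupy all 10 threads: LAUNCH never starts.
        System.out.println(launchStartsInTime(10, 10, 200)); // prints "false"
        // With only 9 hung events, one thread is still free for LAUNCH.
        System.out.println(launchStartsInTime(10, 9, 200));  // prints "true"
    }
}
```

In the sketch the wait is 200 ms; in the reported case the same queueing 
lasts for the full RPC timeout.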

This message was sent by Atlassian JIRA
