[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

Jason Lowe (JIRA) Fri, 19 Jun 2015 08:08:11 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593490#comment-14593490
 ]


Jason Lowe commented on YARN-3809:
----------------------------------

Thanks for updating the patch, Jun.

ContainerManagementProtocolPBClientImpl is not the appropriate place to make 
this change.  That is used by every client of the ContainerManagementProtocol, 
which includes AMs, etc.  AMLauncher.getContainerMgrProxy is a more appropriate 
place, although there we still don't want to create a new config every time we 
create an NM client proxy.  Creating configs is expensive.  An even better 
place to put this change is in the AMLauncher constructor, where it can create 
a copy of the conf then set the IPC config in it, which in turn will be passed 
to the YarnRPC code to create the proxy.
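
To illustrate the constructor approach, here is a minimal sketch, not the actual patch.  It uses java.util.Properties as a stand-in for Hadoop's Configuration so that it runs without Hadoop on the classpath.  The class name, the default of "10", and the choice of IPC key to override are assumptions made for illustration; the RM-side property name is the one suggested later in this comment.

```java
import java.util.Properties;

/**
 * Sketch only: java.util.Properties stands in for Hadoop's Configuration.
 * LauncherConfSketch is a hypothetical name, not a class in YARN.
 */
final class LauncherConfSketch {
    // RM-side property name as suggested in this review (not yet an existing key).
    static final String RM_NM_CONNECT_RETRIES =
        "yarn.resourcemanager.nodemanager-connect-retries";
    // Assumed IPC client key that the proxy-creation code would read.
    static final String IPC_CONNECT_MAX_RETRIES_ON_TIMEOUTS =
        "ipc.client.connect.max.retries.on.timeouts";
    // Hypothetical default, chosen only for this example.
    static final String DEFAULT_RETRIES = "10";

    private final Properties launcherConf;

    LauncherConfSketch(Properties rmConf) {
        // Copy the conf once, here in the constructor, so the shared RM conf
        // is never modified and no copy is made per proxy.
        Properties copy = new Properties();
        copy.putAll(rmConf);
        // Apply the RM-specific retry setting to the IPC key in the copy only.
        copy.setProperty(IPC_CONNECT_MAX_RETRIES_ON_TIMEOUTS,
            rmConf.getProperty(RM_NM_CONNECT_RETRIES, DEFAULT_RETRIES));
        this.launcherConf = copy;
    }

    /** Returns the same pre-built conf for every proxy that gets created. */
    Properties confForProxy() {
        return launcherConf;
    }
}
```

The point of the design is that the expensive copy happens once per launcher rather than once per NM connection, and other users of the original conf never see the overridden IPC setting.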

The new properties should have entries in yarn-default.xml for documentation 
purposes.

Nit: most users probably won't know what "container-management" means.  I 
think "nodemanager" would be clearer since it's used in many other places, so 
I suggest a property name like 
"yarn.resourcemanager.nodemanager-connect-retries".

Since we're changing the code that sets up the AM launcher thread pool, it'd be 
really nice to give each of the threads in that pool a descriptive name.  It's 
annoying to see a bunch of threads like "pool-1-thread-2" in the jstack all 
waiting for work but no clue in the name or stack as to what service they 
belong to.
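
For the thread naming, something along these lines would do.  This is a sketch using only java.util.concurrent; the class name, the "ApplicationMasterLauncher" prefix, and the " #n" suffix format are illustrative choices, not a description of what the patch must use.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: NamedLauncherPool is a hypothetical helper, not YARN code. */
final class NamedLauncherPool {

    /** Builds a factory whose threads are named "<prefix> #0", "<prefix> #1", ... */
    static ThreadFactory namedFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger(0);
        return runnable ->
            new Thread(runnable, prefix + " #" + counter.getAndIncrement());
    }

    /** Fixed-size pool whose worker threads carry the descriptive prefix. */
    static ExecutorService create(int size, String prefix) {
        return Executors.newFixedThreadPool(size, namedFactory(prefix));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = create(10, "ApplicationMasterLauncher");
        Callable<String> whoAmI = () -> Thread.currentThread().getName();
        // The worker reports its own name instead of "pool-1-thread-1".
        System.out.println(pool.submit(whoAmI).get());
        pool.shutdown();
    }
}
```

With names like these, idle workers in a jstack dump can be traced back to the service that owns them.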

> Failed to launch new attempts because ApplicationMasterLauncher's threads all 
> hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3809.01.patch, YARN-3809.02.patch
>
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle 
> AMLauncherEventType events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on each, and one 
> shut down for some reason. After the RM found the NM LOST, it cleaned up the 
> AMs running on it, so ApplicationMasterLauncher needed to handle these 10+ 
> CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all 
> of its threads hung in containerMgrProxy.stopContainers(stopRequest) because 
> the NM was down and the default RPC timeout is 15 minutes. This means that 
> for 15 minutes ApplicationMasterLauncher could not handle new events such as 
> LAUNCH, so new attempts failed to launch because of timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
