[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
-----------------------------

    Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case 
where all cached RPC proxies are in use.  I think we need to handle this case 
even if it's rare.  Otherwise an AM running on a node that can see the RM but 
has a network cut to the rest of the cluster could go bad very quickly: if we 
don't handle the corner case, we'll continue to grow the proxy cache beyond 
its configured bounds as we do today, and that AM will explode into thousands 
of threads for what may be only a temporary network outage.
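
To be concrete about the corner case, here's a rough sketch of the shape I 
have in mind -- this is not the actual ContainerManagementProtocolProxy code 
or the attached patch, and names like BoundedProxyCache/acquire/release are 
made up for illustration.  The key point is that when every cached proxy is 
checked out, a caller waits for one to be released rather than creating a new 
connection (and its threads) past the configured limit, which would be 
something like yarn.client.max-cached-nodemanagers-proxies:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.function.Function;

public class BoundedProxyCache<K, V> {
  // Upper bound on cached proxies, e.g. the configured cache size.
  private final int maxSize;
  // Access-ordered map so the eviction scan starts at the LRU entry.
  private final LinkedHashMap<K, Entry<V>> cache =
      new LinkedHashMap<K, Entry<V>>(16, 0.75f, true);

  private static final class Entry<V> {
    final V proxy;
    int activeCallers;          // > 0 means the proxy is checked out
    Entry(V proxy) { this.proxy = proxy; }
  }

  public BoundedProxyCache(int maxSize) {
    this.maxSize = maxSize;
  }

  // Get a proxy for key, creating one only if the cache has room (or an
  // idle entry can be evicted).  If every entry is in use, block until
  // someone calls release() instead of growing past maxSize.
  public synchronized V acquire(K key, Function<K, V> factory)
      throws InterruptedException {
    for (;;) {
      Entry<V> e = cache.get(key);
      if (e != null) {
        e.activeCallers++;
        return e.proxy;
      }
      if (cache.size() < maxSize || evictIdleEntry()) {
        // Note: a real implementation would create the proxy outside
        // the lock, since this can involve network I/O.
        e = new Entry<V>(factory.apply(key));
        e.activeCallers = 1;
        cache.put(key, e);
        return e.proxy;
      }
      wait();                   // all proxies busy: wait, don't grow
    }
  }

  public synchronized void release(K key) {
    Entry<V> e = cache.get(key);
    if (e != null && --e.activeCallers == 0) {
      notifyAll();              // an entry just became evictable
    }
  }

  // Evict the least-recently-used idle entry.  Returns false when every
  // cached proxy is in use -- the corner case discussed above.
  private boolean evictIdleEntry() {
    Iterator<Entry<V>> it = cache.values().iterator();
    while (it.hasNext()) {
      Entry<V> e = it.next();
      if (e.activeCallers == 0) {
        it.remove();            // real code would also stop the proxy here
        return true;
      }
    }
    return false;
  }
}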

While debugging this I wrote up a quick prototype patch that keeps the proxy 
cache under its configured limit; attaching it for reference.  However, as I 
mentioned above, simply keeping the NM proxy cache under its configured limit 
means little if we don't also address the problem of connections remaining 
open in the IPC Client layer.
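
To spell out that last point: evicting a proxy from the cache is not enough 
on its own; the proxy also has to be stopped so the IPC layer has a chance to 
release the connection and its threads.  Something like the following 
(hypothetical eviction hook, names are mine and not from the patch; exactly 
when the underlying ipc.Client connection actually closes depends on client 
ref-counting and ipc.client.connection.maxidletime, which is the layer that 
still needs fixing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;
import org.apache.hadoop.yarn.ipc.YarnRPC;

class ProxyEvictionHook {
  private final YarnRPC rpc;
  private final Configuration conf;

  ProxyEvictionHook(YarnRPC rpc, Configuration conf) {
    this.rpc = rpc;
    this.conf = conf;
  }

  void evict(ContainerManagementProtocol proxy) {
    // Stop the proxy on eviction.  Without this the connection thread
    // lingers in the IPC Client until the idle timeout fires, so the
    // cache bound alone doesn't cap the AM's thread count.
    rpc.stopProxy(proxy, conf);
  }
}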

> ContainerManagementProtocolProxy can create thousands of threads for a large 
> cluster
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: nmproxycachefix.prototype.patch
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
> this cache is configurable.  However the cache can grow far beyond the 
> configured size when running on a large cluster and blow AM address/container 
> limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
