[ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065316#comment-14065316
 ] 

Jason Lowe commented on YARN-2314:
----------------------------------

The problem is that the cache doesn't try very hard to remove proxies when the 
cache is at or beyond the maximum configured size.  When a new proxy is added 
and an entry needs to be evicted, the cache simply grabs the 
least-recently-used proxy and tries to close it.  If that entry is currently in 
use then nothing is removed immediately, and that means we're running with a 
cache larger than configured.

This can get far worse on a big cluster.  For example, if the 
least-recently-used proxy is currently performing a call that is stuck on 
socket connection retries, the LRU entry could take quite a while before it 
closes.  During that time each new proxy created will make the same attempt to 
close that proxy and fail to do so.  That means that when the LRU entry finally 
does close, the cache is N-1 entries larger than it should be, where N is the 
number of proxies created while the LRU entry was busy.
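
To make the N-1 arithmetic concrete, here is a minimal toy model of the 
behavior described above.  It is only a sketch, not the actual 
ContainerManagementProtocolProxy code: the class ProxyCacheSketch, the Proxy 
type with its inUse flag, and the method names are all hypothetical, and the 
"stuck" call is modeled as a simple boolean rather than a real RPC.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a proxy cache that only ever tries to evict the LRU entry. */
final class ProxyCacheSketch {
    /** Hypothetical stand-in for a cached NM proxy. */
    static final class Proxy {
        boolean inUse;
        Proxy(boolean inUse) { this.inUse = inUse; }
    }

    private final int maxSize;
    // accessOrder=true: iteration starts at the least-recently-used entry.
    private final LinkedHashMap<String, Proxy> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    ProxyCacheSketch(int maxSize) { this.maxSize = maxSize; }

    /** Adds a proxy after a single eviction attempt against the LRU entry. */
    void add(String node, Proxy proxy) {
        if (cache.size() >= maxSize) {
            evictLruIfIdle();  // result ignored: the add proceeds either way
        }
        cache.put(node, proxy);
    }

    /** Removes the LRU entry only if it is not in use. */
    boolean evictLruIfIdle() {
        Iterator<Map.Entry<String, Proxy>> it = cache.entrySet().iterator();
        if (it.hasNext() && !it.next().getValue().inUse) {
            it.remove();
            return true;
        }
        return false;
    }

    int size() { return cache.size(); }

    /**
     * Fills the cache with a busy LRU entry, adds newProxies more entries,
     * then lets the busy entry finish and be evicted.  Returns how far the
     * cache is over maxSize at that point.
     */
    static int overshootAfterStuckLru(int maxSize, int newProxies) {
        ProxyCacheSketch c = new ProxyCacheSketch(maxSize);
        Proxy stuck = new Proxy(true);
        c.add("node-stuck", stuck);  // inserted first, so it stays the LRU
        for (int i = 1; i < maxSize; i++) {
            c.add("node-" + i, new Proxy(false));
        }
        for (int i = 0; i < newProxies; i++) {
            c.add("new-" + i, new Proxy(false));  // each eviction attempt fails
        }
        stuck.inUse = false;  // the stuck call finally returns
        c.evictLruIfIdle();
        return c.size() - maxSize;
    }

    public static void main(String[] args) {
        System.out.println(overshootAfterStuckLru(10, 1000));
    }
}
```

In this model, 1000 proxies created while the LRU entry is busy leave the 
cache 999 entries over its limit once that entry closes, because no add ever 
looks past the single busy entry for something else to evict.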

On a large cluster with thousands of nodes, a proxy hanging on one node could 
allow the cache to hold thousands more proxies than configured.  Since 
each proxy is a thread, that's thousands of threads, and all those thread 
stacks can blow container limits on the AM (or address limits if it's a 32-bit 
AM).

> ContainerManagementProtocolProxy can create thousands of threads for a large 
> cluster
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
> this cache is configurable.  However the cache can grow far beyond the 
> configured size when running on a large cluster and blow AM address/container 
> limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
