[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065316#comment-14065316 ]
Jason Lowe commented on YARN-2314:
----------------------------------
The problem is that the cache makes only a single attempt to remove a proxy
when it is at or beyond the maximum configured size. When a new proxy is added
and an entry should be removed, the cache simply grabs the least-recently-used
(LRU) proxy and tries to close it. If that entry is currently in use, nothing
is removed immediately, and that means we're running with a cache larger than
configured.
This can get far worse on a big cluster. For example, if the
least-recently-used proxy is currently performing a call that is stuck on
socket connection retries, the LRU entry could take quite a while before it
closes. During that time each new proxy created will make the same attempt to
close that proxy and fail to do so. That means that when the LRU entry finally
does close, the cache is N-1 entries larger than it should be, where N is the
number of proxies created while the LRU entry was busy.
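The growth described above can be reproduced with a small model. This is only
a sketch of the behavior as described in this comment, not the actual
ContainerManagementProtocolProxy code: the class ProxyCacheSketch, its methods,
and the node address strings are all hypothetical names made up for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the eviction behavior described above.
// It is NOT the Hadoop implementation; all names here are illustrative.
class ProxyCacheSketch {
    static final class Proxy {
        final String addr;
        int activeCallers;          // calls currently using this proxy
        boolean scheduledForClose;  // set when eviction was attempted
        Proxy(String addr) { this.addr = addr; }
    }

    private final int maxSize;
    // Access-ordered map: iteration starts at the least-recently-used entry.
    private final LinkedHashMap<String, Proxy> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    ProxyCacheSketch(int maxSize) { this.maxSize = maxSize; }

    Proxy acquire(String addr) {
        Proxy p = cache.get(addr);
        if (p == null) {
            p = new Proxy(addr);
            if (cache.size() >= maxSize) {
                // Single eviction attempt: only the LRU entry is considered.
                Map.Entry<String, Proxy> lru = cache.entrySet().iterator().next();
                lru.getValue().scheduledForClose = true;
                if (lru.getValue().activeCallers == 0) {
                    cache.remove(lru.getKey());
                }
                // If the LRU entry is busy, nothing is removed at all.
            }
            cache.put(addr, p);
        }
        p.activeCallers++;
        return p;
    }

    void release(Proxy p) {
        p.activeCallers--;
        if (p.scheduledForClose && p.activeCallers == 0) {
            cache.remove(p.addr);  // the deferred close finally happens
        }
    }

    int size() { return cache.size(); }

    // Returns {size while the LRU entry is stuck, size after it closes}.
    static int[] simulate(int maxSize, int newProxies) {
        ProxyCacheSketch c = new ProxyCacheSketch(maxSize);
        Proxy stuck = c.acquire("node0");      // stays busy, becomes the LRU entry
        for (int i = 1; i < maxSize; i++) {
            c.release(c.acquire("node" + i));  // fill the cache to its limit
        }
        for (int i = 0; i < newProxies; i++) {
            c.release(c.acquire("new" + i));   // each one fails to evict node0
        }
        int whileBusy = c.size();
        c.release(stuck);
        return new int[] { whileBusy, c.size() };
    }

    public static void main(String[] args) {
        int[] r = simulate(10, 1000);
        System.out.println("while busy: " + r[0] + ", after close: " + r[1]);
    }
}
```

With a limit of 10 and 1000 proxies created while node0 is stuck, this model
holds 1010 entries while the LRU entry is busy and 1009 once it closes, which
is the N-1 overshoot described above.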
On a large cluster with thousands of nodes, a proxy hanging on one node could
allow the cache to hold thousands more proxies than configured. Since each
proxy is a thread, that's thousands of threads, and all those thread stacks
can blow container limits on the AM (or address space limits if it's a 32-bit
AM).
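To get a rough feel for the scale, here is a back-of-the-envelope estimate.
The 1 MiB per-thread stack size and the figure of 4000 extra proxies are
assumptions chosen for illustration, not numbers from this issue; the real
stack size depends on the JVM, platform, and -Xss setting, and the class name
ThreadStackEstimate is hypothetical.

```java
// Rough estimate of stack space reserved by extra proxy threads.
// Assumption: one thread per proxy, as stated in the comment above.
class ThreadStackEstimate {
    // Assumed per-thread stack reservation (1 MiB); not taken from the issue.
    static final long ASSUMED_STACK_BYTES = 1024L * 1024L;

    static long stackBytes(int threads, long bytesPerStack) {
        return (long) threads * bytesPerStack;
    }

    public static void main(String[] args) {
        int extraProxies = 4000;  // hypothetical overshoot on a large cluster
        long total = stackBytes(extraProxies, ASSUMED_STACK_BYTES);
        System.out.printf("%d threads reserve about %d MiB of stack%n",
            extraProxies, total / (1024L * 1024L));
    }
}
```

Under these assumptions, 4000 extra threads reserve about 4000 MiB of stack,
which is already close to the entire 4 GiB address space of a 32-bit process.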
> ContainerManagementProtocolProxy can create thousands of threads for a large
> cluster
> ------------------------------------------------------------------------------------
>
> Key: YARN-2314
> URL: https://issues.apache.org/jira/browse/YARN-2314
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Priority: Critical
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of
> this cache is configurable. However, the cache can grow far beyond the
> configured size when running on a large cluster and blow the AM's address
> space or container limits. More details in the first comment.