[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated YARN-2314:
---
Attachment: tez-yarn-2314.xlsx

Attaching the results of getProxy() calls for Tez on 20 nodes with this patch, for different cache sizes and different data sizes (tested a job at 200 GB and at 10 TB scale). Overall, there is a slight degradation in performance (in milliseconds) when the cache size is set to 0, but it is not significant enough to make an impact on overall job runtime in Tez.

ContainerManagementProtocolProxy can create thousands of threads for a large cluster

Key: YARN-2314
URL: https://issues.apache.org/jira/browse/YARN-2314
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
Attachments: YARN-2314.patch, YARN-2314v2.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, tez-yarn-2314.xlsx

ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
---
Attachment: YARN-2314v2.patch

Updated the patch to deprecate yarn.client.max-nodemanagers-proxies in favor of yarn.client.max-cached-nodemanagers-proxies.
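The key rename described in this update can be illustrated with a minimal sketch. It is not taken from the attached patch and does not use Hadoop's own Configuration class: `ProxyCacheConfig` and `resolveCacheSize` are hypothetical names, and plain `java.util.Properties` stands in for the real configuration. The default of 0 is assumed to carry over from the comment on the first YARN-2314.patch.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hadoop: resolves the NM proxy cache size,
// preferring the new key and falling back to the deprecated one.
class ProxyCacheConfig {
    static final String NEW_KEY = "yarn.client.max-cached-nodemanagers-proxies";
    static final String DEPRECATED_KEY = "yarn.client.max-nodemanagers-proxies";
    // Assumption: 0 (cache disabled) is the default, as stated for the first patch.
    static final int DEFAULT_SIZE = 0;

    static int resolveCacheSize(Properties conf) {
        String value = conf.getProperty(NEW_KEY);
        if (value == null) {
            // Only consult the deprecated key when the new one is unset.
            value = conf.getProperty(DEPRECATED_KEY);
        }
        if (value == null) {
            return DEFAULT_SIZE;
        }
        int size = Integer.parseInt(value.trim());
        if (size < 0) {
            throw new IllegalArgumentException("cache size must be >= 0, got " + size);
        }
        return size;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DEPRECATED_KEY, "500");
        System.out.println(resolveCacheSize(conf)); // deprecated key is still honored
        conf.setProperty(NEW_KEY, "0");
        System.out.println(resolveCacheSize(conf)); // new key takes precedence
    }
}
```

Giving the new key precedence is one plausible design choice here; it means an old setting keeps working until an operator sets the new key explicitly.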
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
---
Attachment: YARN-2314.patch

Attaching a patch that allows the existing yarn.client.max-nodemanagers-proxies to be zero to indicate that the proxy cache is disabled. Also, per Wangda's comment, the default is 0 (i.e., the cache is disabled). If the cache is disabled, the patch sets the idle timeout to zero; otherwise it leaves the timeout untouched and caches the proxy objects. The comment for the property was updated to also mention the issue with lingering connection threads and the potential for the cache to cause problems on large clusters. This patch also includes my earlier prototype fix to keep the cache from accidentally increasing in size if connections are busy.

bq. I'm a little doubt about if there is any other potential bug if we completely remove it.

I'm on the other side of that fence, since we ran for a long time on Hadoop 0.23 without this cache and did not see issues. We've already found two issues with the cache (it grows above the specified size and accumulates lingering connection threads), and I have yet to see evidence that it is needed. If anything, there's some evidence to the contrary from us and Sangjin. But in case someone running on a smaller cluster really is depending upon this cache for some use case, the patch lets large clusters work by default while still allowing small-cluster users to turn the cache on.
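The rule described in this comment (a cache size of zero disables caching and forces the connection idle timeout to zero, while any other size leaves the timeout alone) can be sketched as follows. The class and method names are hypothetical and the code is not taken from the patch; it only illustrates the decision, not how the timeout is applied to the IPC layer.

```java
// Hypothetical sketch of the decision described above, not the patch itself.
class ProxyConnectionSettings {
    // Returns the idle timeout to use for NM connections.
    // maxCachedProxies == 0 means the proxy cache is disabled, so connections
    // should not linger after use and the timeout is forced to zero.
    static long effectiveIdleTimeoutMs(int maxCachedProxies, long configuredIdleTimeoutMs) {
        if (maxCachedProxies < 0) {
            throw new IllegalArgumentException("maxCachedProxies must be >= 0");
        }
        return maxCachedProxies == 0 ? 0L : configuredIdleTimeoutMs;
    }

    public static void main(String[] args) {
        // The value 10000 is an arbitrary example, not a documented default.
        System.out.println(effectiveIdleTimeoutMs(0, 10000L));
        System.out.println(effectiveIdleTimeoutMs(500, 10000L));
    }
}
```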
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
---
Attachment: disable-cm-proxy-cache.patch

Yeah, I don't think there's a good way to fix this short of running a bigger container than necessary or patching the code. Attaching a patch we've been running with recently that disables the CM proxy cache completely and reinstates the fix from MAPREDUCE-. It's not an ideal fix, but it effectively restores the behavior of Hadoop 0.23, which worked OK for us.
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
---
Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case where all RPCs are in use. I think we need to handle this case even if it's rare. An AM running on a node where it can see the RM but has a network cut to the rest of the cluster could go bad very quickly otherwise. If we don't handle the corner case, then we'll continue to grow the proxy cache beyond its boundaries as we do today, and that AM will explode with thousands of threads for what may be a temporary network outage.

While debugging this, I wrote up a quick prototype patch to try to fix the cache so that it stays under the configured limit. Attaching the patch for reference. However, as I mentioned above, simply keeping the NM proxy cache under its configured limit means nothing if we don't address the problems with connections remaining open in the IPC Client layer.
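The corner case discussed in this comment (every cached proxy is busy when a new one is requested) can be illustrated with a small sketch. This is not the attached prototype: `BoundedProxyCache`, `Entry`, and the method names are hypothetical, and a plain string stands in for a real NM proxy. The sketch only shows one way to avoid growing past the limit, by evicting an idle entry when possible and making the caller wait otherwise. It does not close any underlying IPC connection, which the comment identifies as a separate problem.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch: a proxy cache that never grows past maxSize,
// even when every cached entry is in use.
class BoundedProxyCache {
    static final class Entry {
        final String nodeAddress;
        int activeUsers; // number of callers currently using this proxy
        Entry(String nodeAddress) { this.nodeAddress = nodeAddress; }
    }

    private final int maxSize;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, Entry> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    BoundedProxyCache(int maxSize) {
        if (maxSize < 1) throw new IllegalArgumentException("maxSize must be >= 1");
        this.maxSize = maxSize;
    }

    // Returns null instead of growing the cache when it is full and all entries are busy.
    synchronized Entry tryAcquire(String nodeAddress) {
        Entry e = cache.get(nodeAddress);
        if (e == null) {
            if (cache.size() >= maxSize && !evictOneIdle()) {
                return null;
            }
            e = new Entry(nodeAddress);
            cache.put(nodeAddress, e);
        }
        e.activeUsers++;
        return e;
    }

    // Blocks until an entry can be handed out without exceeding maxSize.
    synchronized Entry acquire(String nodeAddress) throws InterruptedException {
        Entry e;
        while ((e = tryAcquire(nodeAddress)) == null) {
            wait();
        }
        return e;
    }

    synchronized void release(Entry e) {
        e.activeUsers--;
        notifyAll(); // wake callers waiting for an idle slot
    }

    synchronized int size() { return cache.size(); }

    // Removes the least recently used idle entry, if any.
    private boolean evictOneIdle() {
        Iterator<Entry> it = cache.values().iterator();
        while (it.hasNext()) {
            if (it.next().activeUsers == 0) {
                it.remove();
                return true;
            }
        }
        return false;
    }
}
```

Blocking the caller trades latency for a hard bound on cache size. A real fix might instead hand out an uncached, short-lived proxy in this situation, which is a different design choice from the one sketched here.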