[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-16 Thread Rajesh Balamohan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated YARN-2314:
---
Attachment: tez-yarn-2314.xlsx

Attaching the results of getProxy() calls for Tez on 20 nodes with this patch,
for different cache sizes and data sizes (tested a job at 200 GB and 10 TB
scale). Overall, setting the cache size to 0 degrades performance slightly (on
the order of milliseconds), but not enough to affect overall job runtime in
Tez.

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2314.patch, YARN-2314v2.patch, 
 disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, 
 tez-yarn-2314.xlsx


 ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
 this cache is configurable.  However the cache can grow far beyond the 
 configured size when running on a large cluster and blow AM address/container 
 limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-15 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2314:
-
Attachment: YARN-2314v2.patch

Updated the patch to deprecate yarn.client.max-nodemanagers-proxies in favor of 
yarn.client.max-cached-nodemanagers-proxies.
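
As a rough illustration of what the rename means for a client reading the
setting, the sketch below resolves the cache size from a plain map standing in
for the Hadoop configuration. The class and method names are hypothetical, and
the rule that the new key wins over the old one is an assumption of this
sketch; Hadoop has its own mechanism for deprecated keys, which the patch may
use instead.

```java
import java.util.Map;

/** Hypothetical helper; hand-rolled fallback, not Hadoop's deprecation support. */
final class ProxyCacheKeys {
    // Both property names are taken from the comment above.
    static final String NEW_KEY = "yarn.client.max-cached-nodemanagers-proxies";
    static final String OLD_KEY = "yarn.client.max-nodemanagers-proxies";

    private ProxyCacheKeys() {}

    /**
     * Returns the configured cache size, preferring the new key and falling
     * back to the deprecated one (precedence is an assumption of this sketch).
     */
    static int resolveMaxCachedProxies(Map<String, String> conf, int defaultValue) {
        String value = conf.get(NEW_KEY);
        if (value == null) {
            value = conf.get(OLD_KEY); // deprecated name still honoured
        }
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }
}
```

With only the old key set, the old value is still honoured; once the new key
is present it takes over.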

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2314.patch, YARN-2314v2.patch, 
 disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch



[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-14 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jason Lowe updated YARN-2314:
-
Attachment: YARN-2314.patch

Attaching a patch that allows the existing yarn.client.max-nodemanagers-proxies
to be zero to indicate the proxy cache is disabled. Also per Wangda's comment
the default is 0 (i.e.: cache is disabled). If disabled it sets the idle
timeout to zero, otherwise it leaves it untouched and caches the proxy objects.
The comment for the property was updated to also mention the issue with
lingering connection threads and the potential for the cache to cause problems
on large clusters. This patch also includes my earlier prototype fix to keep
the cache from accidentally increasing in size if connections are busy.
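
The behaviour described above can be sketched roughly as follows. This is not
the attached patch: the class is hypothetical, a plain map stands in for the
Hadoop configuration, and the idle-timeout property name
ipc.client.connection.maxidletime is an assumption that should be checked
against the Hadoop version in use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the client-side setup; not the patch itself. */
final class ProxyCacheSettings {
    // Property name taken from the comment above.
    static final String MAX_PROXIES_KEY = "yarn.client.max-nodemanagers-proxies";
    // Assumed name of the IPC idle-timeout property; verify before relying on it.
    static final String IDLE_TIME_KEY = "ipc.client.connection.maxidletime";

    final int maxCachedProxies;
    final Map<String, String> effectiveConf;

    ProxyCacheSettings(Map<String, String> conf) {
        // Default of 0 means the proxy cache is disabled.
        int size = Integer.parseInt(conf.getOrDefault(MAX_PROXIES_KEY, "0").trim());
        if (size < 0) {
            throw new IllegalArgumentException(MAX_PROXIES_KEY + " must be >= 0");
        }
        this.maxCachedProxies = size;

        // Work on a copy so the caller's settings are never modified.
        Map<String, String> copy = new HashMap<>(conf);
        if (size == 0) {
            // Cache disabled: ask for idle connections to be dropped right away.
            copy.put(IDLE_TIME_KEY, "0");
        }
        // Cache enabled: the idle timeout is left untouched.
        this.effectiveConf = copy;
    }

    boolean cacheEnabled() {
        return maxCachedProxies > 0;
    }
}
```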

bq. I'm a little doubt about if there is any other potential bug if we
completely remove it.

I'm on the other side of that fence, since we ran for a long time on Hadoop
0.23 without this cache and did not see issues. We've already found two issues
with the cache (grows above the specified size and accumulates lingering
connection threads), and I have yet to see evidence it is needed. If anything
there's some evidence to the contrary from us and Sangjin.

But in case someone running on a smaller cluster really is depending upon this
cache for some use case, the patch lets large clusters work by default while
still allowing small-cluster users to turn the cache on.

ContainerManagementProtocolProxy can create thousands of threads for a large
cluster

Key: YARN-2314
URL: https://issues.apache.org/jira/browse/YARN-2314
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch,
nmproxycachefix.prototype.patch


[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-09-12 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2314:
-
Attachment: disable-cm-proxy-cache.patch

Yeah, I don't think there's a good way to fix this short of running a bigger 
container than necessary or patching the code.

Attaching a patch we've been running with recently that disables the CM proxy 
cache completely and reinstates the fix from MAPREDUCE-.  It's not an ideal 
fix but it effectively restores the behavior to what Hadoop 0.23 did which 
worked OK for us.

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Priority: Critical
 Attachments: disable-cm-proxy-cache.patch, 
 nmproxycachefix.prototype.patch



[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-07-22 Thread Jason Lowe (JIRA)

[
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jason Lowe updated YARN-2314:
-

Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case
where all RPCs are in use. I think we need to handle this case even if it's
rare. An AM running on a node where it can see the RM but has a network cut to
the rest of the cluster could go really bad really quick otherwise. If we
don't handle the corner case then we'll continue to grow the proxy cache beyond
its boundaries as we do today, and that AM will explode with thousands of
threads for what may be a temporary network outage.

While debugging this I wrote up a quick prototype patch that tries to keep the
cache under its configured limit. Attaching the patch for reference. However,
as I mentioned above, simply keeping the NM proxy cache under its configured
limit means nothing if we don't address the problems with connections
remaining open in the IPC Client layer.
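
One way to keep such a cache within its limit even when every cached proxy is
busy is sketched below. It is an illustration only and not the attached
prototype: all names are hypothetical, and the choice to hand out an uncached,
single-use proxy when nothing can be evicted is an assumption of this sketch
(blocking until an entry becomes idle is another option). It also does nothing
about connections lingering in the IPC Client layer.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

/** Hypothetical bounded proxy cache that never evicts an in-use entry. */
final class BoundedProxyCache {
    /** Stand-in for an NM proxy; tracks only callers and open/closed state. */
    static final class Proxy {
        final String node;
        int activeCallers;
        boolean cached;
        boolean closed;
        Proxy(String node) { this.node = node; }
    }

    private final int maxSize;
    // Access-ordered map: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Proxy> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    BoundedProxyCache(int maxSize) { this.maxSize = maxSize; }

    synchronized Proxy acquire(String node) {
        Proxy proxy = cache.get(node);
        if (proxy == null) {
            proxy = new Proxy(node);
            if (makeRoom()) {
                cache.put(node, proxy);
                proxy.cached = true;
            } // otherwise the proxy stays uncached and is closed on release
        }
        proxy.activeCallers++;
        return proxy;
    }

    synchronized void release(Proxy proxy) {
        proxy.activeCallers--;
        if (!proxy.cached && proxy.activeCallers == 0) {
            proxy.closed = true; // single-use proxy: close as soon as it is idle
        }
    }

    synchronized int size() { return cache.size(); }

    /** Frees a slot by evicting the least recently used idle entry, if any. */
    private boolean makeRoom() {
        if (maxSize <= 0) return false;          // cache disabled
        if (cache.size() < maxSize) return true; // free slot available
        Iterator<Proxy> it = cache.values().iterator();
        while (it.hasNext()) {
            Proxy candidate = it.next();
            if (candidate.activeCallers == 0) {
                it.remove();
                candidate.cached = false;
                candidate.closed = true;
                return true;
            }
        }
        return false; // every cached proxy is busy: do not grow the cache
    }
}
```

In the corner case where all RPCs are in use, acquire() still returns a usable
proxy, but the cache size stays at the configured limit.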

ContainerManagementProtocolProxy can create thousands of threads for a large
cluster

Key: YARN-2314
URL: https://issues.apache.org/jira/browse/YARN-2314
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Priority: Critical
Attachments: nmproxycachefix.prototype.patch

--
This message was sent by Atlassian JIRA
(v6.2#6252)
