[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-16 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated YARN-2314:
---
Attachment: tez-yarn-2314.xlsx

Attaching the results of getProxy() calls for Tez on 20 nodes with this patch, 
across different cache sizes and data sizes (tested jobs at 200 GB and 10 TB 
scale).  Overall, there is a slight degradation in performance (on the order of 
milliseconds) with the cache size set to 0, but not significant enough to 
affect overall job runtime in Tez.

 ContainerManagementProtocolProxy can create thousands of threads for a large 
 cluster
 

 Key: YARN-2314
 URL: https://issues.apache.org/jira/browse/YARN-2314
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.1.0-beta
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical
 Attachments: YARN-2314.patch, YARN-2314v2.patch, 
 disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, 
 tez-yarn-2314.xlsx


 ContainerManagementProtocolProxy has a cache of NM proxies, and the size of 
 this cache is configurable.  However the cache can grow far beyond the 
 configured size when running on a large cluster and blow AM address/container 
 limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-15 Thread Jason Lowe (JIRA)


Jason Lowe updated YARN-2314:
-
Attachment: YARN-2314v2.patch

Updated the patch to deprecate yarn.client.max-nodemanagers-proxies in favor of 
yarn.client.max-cached-nodemanagers-proxies.
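
For reference, disabling the cache via the renamed property would look 
something like the following yarn-site.xml fragment (the property name follows 
the rename in this patch; per the v1 patch description, 0 disables the cache 
and is the default):

```xml
<!-- Disable the NM proxy cache entirely; 0 is the default per this patch. -->
<property>
  <name>yarn.client.max-cached-nodemanagers-proxies</name>
  <value>0</value>
</property>
```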



[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-10-14 Thread Jason Lowe (JIRA)


Jason Lowe updated YARN-2314:
-
Attachment: YARN-2314.patch

Attaching a patch that allows the existing yarn.client.max-nodemanagers-proxies 
property to be set to zero to indicate the proxy cache is disabled.  Also, per 
Wangda's comment, the default is 0 (i.e.: the cache is disabled).  When 
disabled, the patch sets the connection idle timeout to zero; otherwise it 
leaves the timeout untouched and caches the proxy objects.  The comment for 
the property was updated to mention the issue with lingering connection 
threads and the potential for the cache to cause problems on large clusters.  
This patch also includes my earlier prototype fix to keep the cache from 
accidentally growing beyond its configured size when connections are busy.
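
The disable behavior described above could be sketched roughly as follows. 
This is an illustrative simplification, not the actual patch: names such as 
ProxyCache and effectiveIdleTimeoutMs are hypothetical, and the real code in 
ContainerManagementProtocolProxy works with YARN's Configuration and RPC 
machinery rather than a bare HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the "cache size 0 disables caching" behavior.
// ProxyCache and its method names are hypothetical, not YARN's actual API.
class ProxyCache {
    private final int maxCacheSize;          // 0 means the cache is disabled
    private final Map<String, Object> cache = new HashMap<>();

    ProxyCache(int maxCacheSize) {
        this.maxCacheSize = maxCacheSize;
    }

    // When the cache is disabled, force the connection idle timeout to zero
    // so connections close right after use and no idle threads linger.
    long effectiveIdleTimeoutMs(long configuredTimeoutMs) {
        return maxCacheSize == 0 ? 0L : configuredTimeoutMs;
    }

    Object getProxy(String nmAddress) {
        if (maxCacheSize == 0) {
            return newProxy(nmAddress);      // never cached, closed after use
        }
        return cache.computeIfAbsent(nmAddress, this::newProxy);
    }

    private Object newProxy(String nmAddress) {
        return new Object();                 // stands in for a real RPC proxy
    }
}
```

With the cache disabled, each getProxy() call yields a fresh proxy and the 
zero idle timeout closes the connection as soon as the RPC completes, which is 
how the patch avoids accumulating lingering connection threads.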

bq. I'm a little doubt about if there is any other potential bug if we 
completely remove it.

I'm on the other side of that fence, since we ran for a long time on Hadoop 
0.23 without this cache and did not see issues.  We've already found two 
issues with the cache (it grows above the specified size and accumulates 
lingering connection threads), and I have yet to see evidence that it is 
needed.  If anything, there's some evidence to the contrary from us and 
Sangjin.

But in case someone running on a smaller cluster really is depending on this 
cache for some use case, the patch lets large clusters work with the cache 
disabled while small-cluster users can still turn it on.



[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-09-12 Thread Jason Lowe (JIRA)


Jason Lowe updated YARN-2314:
-
Attachment: disable-cm-proxy-cache.patch

Yeah, I don't think there's a good way to fix this short of running a bigger 
container than necessary or patching the code.

Attaching a patch we've been running with recently that disables the CM proxy 
cache completely and reinstates the fix from MAPREDUCE-.  It's not an ideal 
fix, but it effectively restores the behavior to what Hadoop 0.23 did, which 
worked OK for us.



[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

2014-07-22 Thread Jason Lowe (JIRA)


Jason Lowe updated YARN-2314:
-

Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case 
where all RPCs are in use.  I think we need to handle this case even if it's 
rare.  Otherwise, an AM running on a node where it can see the RM but has its 
network cut to the rest of the cluster could go really bad really quick: if we 
don't handle the corner case, we'll continue to grow the proxy cache beyond 
its boundaries as we do today, and that AM will explode with thousands of 
threads for what may be a temporary network outage.

While debugging this I wrote up a quick prototype patch that tries to keep the 
cache under its configured limit.  Attaching the patch for reference.  
However, as I mentioned above, simply keeping the NM proxy cache under its 
configured limit means nothing if we don't also address the problem of 
connections remaining open in the IPC Client layer.
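
One way to handle the all-connections-busy corner case while still honoring 
the limit is sketched below.  This is an illustration of the idea, not the 
actual prototype patch: when the cache is full, evict an idle entry if one 
exists; when every entry is busy, hand out a temporary proxy that is never 
added to the cache, so the cache cannot grow past its configured limit.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded proxy cache; names and structure are illustrative,
// not the real ContainerManagementProtocolProxy code.
class BoundedProxyCache {
    static class Entry {
        int activeUsers;                     // > 0 while an RPC is in flight
        final Object proxy = new Object();   // stands in for a real RPC proxy
    }

    private final int limit;
    // access-order LinkedHashMap gives least-recently-used iteration order
    private final Map<String, Entry> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    BoundedProxyCache(int limit) { this.limit = limit; }

    synchronized Object acquire(String nmAddress) {
        Entry e = cache.get(nmAddress);
        if (e == null) {
            if (cache.size() >= limit && !evictIdleEntry()) {
                // Cache full and every entry busy: return an uncached proxy
                // (closed after use) rather than growing the cache.
                return new Object();
            }
            e = new Entry();
            cache.put(nmAddress, e);
        }
        e.activeUsers++;
        return e.proxy;
    }

    synchronized void release(String nmAddress) {
        Entry e = cache.get(nmAddress);
        if (e != null && e.activeUsers > 0) e.activeUsers--;
    }

    // Remove the least-recently-used idle entry; false if all are busy.
    private boolean evictIdleEntry() {
        Iterator<Map.Entry<String, Entry>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Entry> me = it.next();
            if (me.getValue().activeUsers == 0) {
                it.remove();                 // a real impl would close the proxy
                return true;
            }
        }
        return false;
    }

    synchronized int size() { return cache.size(); }
}
```

The key property is that acquire() never inserts into a full cache: a burst of 
concurrent launches (or a network cut leaving every connection busy) degrades 
to uncached one-shot proxies instead of unbounded cache growth.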
