[ 
https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038107#comment-15038107
 ] 

Jason Lowe commented on YARN-4414:
----------------------------------

I noticed that the HA proxies for the namenode and resourcemanager explicitly 
disable connection retries in the RPC layer by default, since the HA proxy 
itself will do the retries.  I think the same should apply to nodemanager 
proxies: even connection timeouts are being retried too often in the RPC 
layer, given that a container allocation is worthless after 10 minutes by 
default.  With retries disabled in the RPC layer, we can add 
ConnectTimeoutException back to the list of exceptions retried at the NM proxy 
layer and simply retry all appropriate exceptions there.
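
To illustrate why retrying at both levels is a problem, here is a minimal 
sketch in plain Java.  It is not Hadoop code: RetryLayers, withRetries and 
countAttempts are hypothetical names, and the retry counts are made up for 
the example.  The inner wrapper stands in for the RPC layer and the outer one 
for the NM proxy layer, and the sketch counts how many connection attempts 
reach a node that never answers.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names are Hadoop APIs.
final class RetryLayers {

    /** Calls op up to (maxRetries + 1) times, retrying only on IOException. */
    static <T> T withRetries(Callable<T> op, int maxRetries) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }

    /** Counts connection attempts made against a node that never answers. */
    static int countAttempts(int rpcRetries, int proxyRetries) {
        AtomicInteger attempts = new AtomicInteger();
        Callable<Void> connect = () -> {
            attempts.incrementAndGet();
            throw new ConnectException("simulated: node unreachable");
        };
        // Inner wrapper plays the RPC layer, outer wrapper the NM proxy layer.
        Callable<Void> rpcLayer = () -> withRetries(connect, rpcRetries);
        try {
            withRetries(rpcLayer, proxyRetries);
        } catch (Exception expected) {
            // Every attempt fails by construction, so this is always reached.
        }
        return attempts.get();
    }
}
```

With 3 retries at each level, countAttempts(3, 3) makes 4 * 4 = 16 attempts, 
while countAttempts(0, 3) makes only 4.  Disabling retries in the inner layer 
leaves the outer layer as the single place that decides which exceptions are 
retried and how often.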

> Nodemanager connection errors are retried at multiple levels
> ------------------------------------------------------------
>
>                 Key: YARN-4414
>                 URL: https://issues.apache.org/jira/browse/YARN-4414
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Jason Lowe
>
> This is related to YARN-3238.  I ran into more scenarios where connection 
> errors, such as NoRouteToHostException, are being retried at multiple levels.  
> The fix for YARN-3238 was too specific, and I think we need a more general 
> solution that catches a wider array of connection errors, so that they are 
> not retried both at the RPC layer and at the NM proxy layer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
