[ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038107#comment-15038107 ]
Jason Lowe commented on YARN-4414:
----------------------------------

I noticed that HA proxies for the namenode and resourcemanager explicitly disable the connection retries in the RPC layer by default since it knows the HA proxy will do the retries. I think the same should apply for nodemanager proxies, since we're seeing even connection timeouts retried too often in the RPC layer given a container allocation is worthless after 10 minutes by default.

By disabling retries in the RPC layer, we can add ConnectTimeoutException back to the list of exceptions retried at the NM proxy layer and simply retry all appropriate exceptions at the NM proxy layer.

> Nodemanager connection errors are retried at multiple levels
> ------------------------------------------------------------
>
>                 Key: YARN-4414
>                 URL: https://issues.apache.org/jira/browse/YARN-4414
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Jason Lowe
>
> This is related to YARN-3238. Ran into more scenarios where connection
> errors are being retried at multiple levels, like NoRouteToHostException.
> The fix for YARN-3238 was too specific, and I think we need a more general
> solution to catch a wider array of connection errors that can occur to avoid
> retrying them both at the RPC layer and at the NM proxy layer.
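
To illustrate the proposal above, here is a minimal, self-contained sketch of retrying connection errors at a single layer only. It is not Hadoop code: the class `SingleLayerRetry` and its methods are hypothetical names invented for this example, and `java.net.SocketTimeoutException` merely stands in for Hadoop's ConnectTimeoutException. The `worstCaseAttempts` helper shows why retrying at both layers is a problem: the attempt counts multiply.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch of single-layer retries; not taken from Hadoop. */
final class SingleLayerRetry {

    /** Connection-level failures the outer (proxy) layer is willing to retry. */
    static final List<Class<? extends IOException>> RETRIABLE = Arrays.asList(
        ConnectException.class,
        NoRouteToHostException.class,
        SocketTimeoutException.class); // stand-in for ConnectTimeoutException

    private SingleLayerRetry() {}

    /** Returns true if the exception is an instance of a retriable class. */
    static boolean isRetriable(Exception e) {
        for (Class<? extends IOException> c : RETRIABLE) {
            if (c.isInstance(e)) {
                return true;
            }
        }
        return false;
    }

    /** Connection attempts made when both layers retry: the counts multiply. */
    static long worstCaseAttempts(int innerRetries, int outerRetries) {
        return (long) (innerRetries + 1) * (outerRetries + 1);
    }

    /**
     * Runs the call, retrying only retriable connection errors. The wrapped
     * call is assumed to perform no retries of its own (inner retries = 0).
     */
    static <T> T callWithRetries(Callable<T> call, int maxRetries, long sleepMillis)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isRetriable(e)) {
                    throw e; // not a connection error: fail immediately
                }
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(sleepMillis); // fixed pause between attempts
                }
            }
        }
        throw last; // retries exhausted: surface the last connection error
    }
}
```

With the inner layer's retries set to zero, `worstCaseAttempts(0, n)` is just `n + 1`, so the outer retry policy alone bounds how long a caller waits before giving up on an unreachable node.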