[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14331991#comment-14331991 ]
Jian He commented on YARN-3238:
-------------------------------

I think this is related to the RetryPolicy library we use from the common module. The implementation of {{RetryPolicies.retryUpToMaximumTimeWithFixedSleep}} doesn't match the semantics its name implies: it should retry based on the overall time taken instead of the number of retries. HADOOP-11398 is trying to fix this.

> Connection timeouts to nodemanagers are retried at multiple levels
> ------------------------------------------------------------------
>
> Key: YARN-3238
> URL: https://issues.apache.org/jira/browse/YARN-3238
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: YARN-3238.001.patch
>
>
> The IPC layer will retry connection timeouts automatically (see Client.java),
> but we are also retrying them with YARN's RetryPolicy put in place when the
> NM proxy is created. This causes a two-level retry mechanism where the IPC
> layer has already retried quite a few times (45 by default) for each YARN
> RetryPolicy error that is retried. The end result is that NM clients can
> wait a very, very long time for the connection to finally fail.
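
To make the difference concrete, here is a rough sketch of the arithmetic rather than the actual Hadoop code. It assumes the count-based behaviour amounts to deriving a retry count from maxTime / sleepTime, and it models each outer attempt as costing a fixed amount of time inside the IPC layer. The class name, method names, and all numbers are hypothetical and chosen only for illustration.

```java
// Hypothetical model of the two retry semantics; not Hadoop's implementation.
class RetryBudgetSketch {

    // Count-based: the retry count is fixed up front as maxTime / sleepTime,
    // so time spent inside each failing attempt is never charged to the budget.
    static long countBasedWorstCaseMillis(long maxTimeMs, long sleepMs, long attemptCostMs) {
        checkArgs(sleepMs, attemptCostMs);
        long maxRetries = maxTimeMs / sleepMs;
        // One initial attempt, then maxRetries rounds of (sleep + attempt).
        return attemptCostMs + maxRetries * (sleepMs + attemptCostMs);
    }

    // Deadline-based: give up once another sleep would reach the overall budget.
    static long deadlineBasedWorstCaseMillis(long maxTimeMs, long sleepMs, long attemptCostMs) {
        checkArgs(sleepMs, attemptCostMs);
        long elapsed = 0;
        while (true) {
            elapsed += attemptCostMs;            // the attempt fails after this long
            if (elapsed + sleepMs >= maxTimeMs) {
                return elapsed;                  // budget exhausted, stop retrying
            }
            elapsed += sleepMs;                  // sleep, then try again
        }
    }

    private static void checkArgs(long sleepMs, long attemptCostMs) {
        if (sleepMs <= 0 || attemptCostMs < 0) {
            throw new IllegalArgumentException("sleepMs must be > 0 and attemptCostMs >= 0");
        }
    }
}
```

With an assumed budget of 900,000 ms, a 10,000 ms sleep, and each outer attempt costing 900,000 ms (say 45 inner retries of 20,000 ms each), the count-based model waits 82,800,000 ms, about 23 hours, while the deadline-based model gives up after 900,000 ms. The two only agree when each attempt fails almost instantly.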