[ 
https://issues.apache.org/jira/browse/YARN-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367167#comment-14367167
 ] 

Andrew Johnson commented on YARN-3364:
--------------------------------------

No, I did not have YARN-3238 applied.  Thanks for that!

Given that and HADOOP-11398, I think this can be closed.

> Clarify Naming of yarn.client.nodemanager-connect.max-wait-ms and 
> yarn.resourcemanager.connect.max-wait.ms 
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3364
>                 URL: https://issues.apache.org/jira/browse/YARN-3364
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Andrew Johnson
>
> I encountered an issue recently where the ApplicationMaster for MapReduce 
> jobs would spend hours attempting to connect to a node in my cluster that had 
> died due to a hardware fault.  After debugging this, I found that the 
> yarn.client.nodemanager-connect.max-wait-ms property did not behave as I had 
> expected.  Based on the name I had thought this would set a maximum time 
> limit for attempting to connect to a NodeManager.  The code in 
> org.apache.hadoop.yarn.client.NMProxy corroborated this thought - it used a 
> RetryUpToMaximumTimeWithFixedSleep policy when a ConnectTimeoutException was 
> thrown, as it was in my case with a dead node.
> However, the RetryUpToMaximumTimeWithFixedSleep policy doesn't actually set a 
> time limit, but instead divides the maximum time by the sleep period to set a 
> total number of retries, regardless of how long those retries take.  As such 
> I was seeing the ApplicationMaster spend much longer attempting to make a 
> connection than I had anticipated.
> The yarn.resourcemanager.connect.max-wait.ms would have the same behavior.  
> These properties would be better named like 
> yarn.client.nodemanager-connect.max.retries and 
> yarn.resourcemanager.connect.max.retries to better align with the actual 
> behavior.
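
The arithmetic described above can be sketched as follows. This is a standalone model, not the Hadoop code itself: the class and method names are hypothetical, the 15 minute max wait, 10 second sleep, and 20 second per-attempt connect timeout are illustrative values only, and it assumes one initial attempt followed by (max wait / sleep) retries, each of which may block for the full connect timeout.

```java
public class RetryBudgetSketch {

    // Retry count derived the way the description says the policy works:
    // the "maximum time" is divided by the sleep period (integer division).
    static long maxRetries(long maxWaitMs, long sleepMs) {
        return maxWaitMs / sleepMs;
    }

    // Worst-case wall-clock time under the assumptions in the lead-in:
    // one initial attempt plus maxRetries retries, a fixed sleep before
    // each retry, and every attempt blocking for attemptTimeoutMs.
    static long worstCaseElapsedMs(long maxWaitMs, long sleepMs, long attemptTimeoutMs) {
        long retries = maxRetries(maxWaitMs, sleepMs);
        long attempts = retries + 1;
        return attempts * attemptTimeoutMs + retries * sleepMs;
    }

    public static void main(String[] args) {
        long maxWaitMs = 15L * 60L * 1000L;   // illustrative "max-wait-ms" value
        long sleepMs = 10L * 1000L;           // illustrative retry interval
        long attemptTimeoutMs = 20L * 1000L;  // assumed cost of one failed connect

        System.out.println("configured max wait (ms): " + maxWaitMs);
        System.out.println("derived retries:          " + maxRetries(maxWaitMs, sleepMs));
        System.out.println("worst-case elapsed (ms):  "
                + worstCaseElapsedMs(maxWaitMs, sleepMs, attemptTimeoutMs));
    }
}
```

With these numbers the derived retry count is 90, so the time spent waiting on failed attempts is added on top of the configured maximum rather than counted against it. The longer each attempt blocks, the further the real elapsed time drifts from the configured value.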



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
