[jira] [Updated] (YARN-542) Change the default global AM max-attempts value to be not one
[ https://issues.apache.org/jira/browse/YARN-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-542:
----------------------------
    Description:
Today, the global AM max-attempts is set to 1, which is a bad choice. AM max-attempts accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes caused by problems outside user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We could also change it to 4 to match the other retry configs.

  was:
Today, the AM max-retries is set to 1, which is a bad choice. AM max-retries accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes caused by problems outside user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We could also change it to 4 to match the other retry configs.

Change the default global AM max-attempts value to be not one
-------------------------------------------------------------
    Key: YARN-542
    URL: https://issues.apache.org/jira/browse/YARN-542
    Project: Hadoop YARN
    Issue Type: Bug
    Reporter: Vinod Kumar Vavilapalli
    Assignee: Zhijie Shen

Today, the global AM max-attempts is set to 1, which is a bad choice. AM max-attempts accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes caused by problems outside user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We could also change it to 4 to match the other retry configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-542) Change the default global AM max-attempts value to be not one
[ https://issues.apache.org/jira/browse/YARN-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-542:
----------------------------
    Summary: Change the default global AM max-attempts value to be not one (was: Change the default AM retry value to be not one)

Change the default global AM max-attempts value to be not one
-------------------------------------------------------------
    Key: YARN-542
    URL: https://issues.apache.org/jira/browse/YARN-542
    Project: Hadoop YARN
    Issue Type: Bug
    Reporter: Vinod Kumar Vavilapalli
    Assignee: Zhijie Shen

Today, the AM max-retries is set to 1, which is a bad choice. AM max-retries accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes caused by problems outside user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We could also change it to 4 to match the other retry configs.
[jira] [Updated] (YARN-542) Change the default global AM max-attempts value to be not one
[ https://issues.apache.org/jira/browse/YARN-542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhijie Shen updated YARN-542:
----------------------------
    Attachment: YARN-542.1.patch

I've drafted a patch, which includes the following modifications:

1. Change the default value of yarn.resourcemanager.am.max-attempts from 1 to 2.
2. In the test cases where more than one attempt is set, use YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS instead of hard-coded values.
3. Assert that maxAttempts is set to 1 in the tests where one attempt versus more than one makes a difference.

Change the default global AM max-attempts value to be not one
-------------------------------------------------------------
    Key: YARN-542
    URL: https://issues.apache.org/jira/browse/YARN-542
    Project: Hadoop YARN
    Issue Type: Bug
    Reporter: Vinod Kumar Vavilapalli
    Assignee: Zhijie Shen
    Attachments: YARN-542.1.patch

Today, the global AM max-attempts is set to 1, which is a bad choice. AM max-attempts accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes caused by problems outside user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We could also change it to 4 to match the other retry configs.
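To illustrate why a default of 1 is fragile, here is a minimal, self-contained sketch of the max-attempts semantics discussed in this issue. It does not use any Hadoop classes and is not taken from the attached patch: the class `AmAttemptSketch` and the helper `runWithAttempts` are hypothetical, and only the property name and the proposed default of 2 come from the messages above.

```java
import java.util.function.IntPredicate;

public class AmAttemptSketch {
    // Property name as quoted in the patch description.
    static final String RM_AM_MAX_ATTEMPTS = "yarn.resourcemanager.am.max-attempts";
    // Default proposed by the patch (the previous default was 1).
    static final int DEFAULT_RM_AM_MAX_ATTEMPTS = 2;

    /**
     * Hypothetical helper: tries to run an AM up to maxAttempts times.
     * Returns the 1-based number of the attempt that succeeded,
     * or -1 if every attempt failed.
     */
    static int runWithAttempts(int maxAttempts, IntPredicate attemptSucceeds) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed scenario: the first attempt is lost to a node failure
        // (not a user-code bug), and any later attempt succeeds.
        IntPredicate lostNodeOnFirstAttempt = attempt -> attempt > 1;

        // With max-attempts = 1 the application fails outright.
        System.out.println(runWithAttempts(1, lostNodeOnFirstAttempt));   // prints -1
        // With the proposed default of 2, the second attempt recovers.
        System.out.println(runWithAttempts(DEFAULT_RM_AM_MAX_ATTEMPTS, lostNodeOnFirstAttempt)); // prints 2
    }
}
```

In a real cluster the value would be set through the `yarn.resourcemanager.am.max-attempts` property rather than a constant like the one above; the sketch only models the counting behavior.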