[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533389#comment-14533389 ] Jian He commented on YARN-3480: --- [~hex108], thanks for your explanations. I can see this will be a problem for long running apps, as the number of attempts will just keep increasing. To be consistent with the attemptFailureValidityWindow for long running apps, instead of introducing a global hard limit, How about removing the attempt records that's beyond the validity window ? Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533731#comment-14533731 ] Jun Gong commented on YARN-3480: [~jianhe] just catch your option. Do you mean that the configure value for max attempts stored in RMAppImpl and RMStateStore is not needed at all, we could just remove the attempts records that's beyond the validity window? If so, I will update the patch. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533705#comment-14533705 ] Jun Gong commented on YARN-3480: [~jianhe], thanks for your comments and suggestion. The latest patch(YARN-3480.03.patch) have been working as expected: just remove the attempt records that's beyond the validity window. For the special case that the validity window is -1, it might be better to remove attempts until the number of attempts is less than hard limit, because yarn.resourcemanager.am.max-attempts might be very large(like our scenario) . What's you option? Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532755#comment-14532755 ] sandflee commented on YARN-3480: one benefit in [~hex108]'s work is we wouldn't worry about AM' failure. In our production env, we don't expect apps to be killed because of AM's failure. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527069#comment-14527069 ] Jian He commented on YARN-3480: --- [~hex108], generally, it's better to avoid a global config for an outlier app. 1. How often do you see an app failed with a large number of attempts? If it's limited to a few apps. I wouldn't worry so much. bq. make RM recover process much slower. 2. How slower it is in reality in your case? we've done some benchmark, recovering 10k apps(with 1 attempt) on ZK is pretty fast, within 20 seconds or so. 3. Limiting the attempts to be recorded means we are losing history. it's a trade off. My main point is that if you can provide some real numbers showing how slow the recovery process in real scenario, we can figure out where the bottleneck is and how to improve it. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527847#comment-14527847 ] Jun Gong commented on YARN-3480: [~jianhe], sorry for not specifying our scenario: RM HA is enabled, use ZK to store apps' info, most apps running in the cluster are long running(service) apps, yarn.resourcemanager.am.max-attempts is set to 1 because we have not patched YARN-611 and we want apps to retry more times. There are 10K apps with 1~1 attempts stored in ZK. It will take about 6 mins to recover those apps when RM HA. {quote} 1. How often do you see an app failed with a large number of attempts? If it's limited to a few apps. I wouldn't worry so much. 2. How slower it is in reality in your case? we've done some benchmark, recovering 10k apps(with 1 attempt) on ZK is pretty fast, within 20 seconds or so. {quote} Please see above. I think it will be OK for map-reduce jobs. But it might not be OK for service apps which have been running several months. {quote} 3. Limiting the attempts to be recorded means we are losing history. it's a trade off. {quote} Yes, I agree. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526195#comment-14526195 ] Jun Gong commented on YARN-3480: [~vinodkv] Thank you for the comments. {quote} No, as you noted later, the right solution is for apps to set the attempt-failures validity-interval. {quote} Yes, I agree with it. {quote} We already have a yarn.resourcemanager.am.max-attempts that acts as a global limit. Is that not sufficient? A more practical problem is the number of apps itself. And we do have an upper limit of 10K by default for this. Is that not enough? Are you seeing issues in a real-life scenario? {quote} yarn.resourcemanager.am.max-attempts just limits the max attempts in the time window which is configured through 'attemptFailuresValidityInterval'. Suppose the following scenario: app's am.max-attempts is set to 2, and its attemptFailuresValidityInterval is set to 30, if app failed at 00:00, 00:31, 00: 62..., it will continue to retry and run because its number of failed attempts at the time window(attemptFailuresValidityInterval) is always 1. Then attempts' number will increase continously. {quote} I think we need to have a lower limit on the failure-validaty interval to avoid situations like this. If others agree too, will file a ticket. {quote} Please see the above scenario. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525491#comment-14525491 ] Hadoop QA commented on YARN-3480: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 37s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 34s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 13s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 37s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 52m 17s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 91m 30s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729945/YARN-3480.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6ae2a0d | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7659/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7659/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7659/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7659/console | This message was automatically generated. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch, YARN-3480.02.patch, YARN-3480.03.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to a small value, retried attempts might be very large. So we need to delete some attempts stored in RMStateStore and RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
[ https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509263#comment-14509263 ] Hadoop QA commented on YARN-3480: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12727636/YARN-3480.01.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 189a63a | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7473/console | This message was automatically generated. Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable Key: YARN-3480 URL: https://issues.apache.org/jira/browse/YARN-3480 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3480.01.patch When RM HA is enabled and running containers are kept across attempts, apps are more likely to finish successfully with more retries(attempts), so it will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make RM recover process much slower. It might be better to set max attempts to be stored in RMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)