[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133169#comment-14133169 ]
Hudson commented on YARN-611:
-----------------------------
SUCCESS: Integrated in Hadoop-Yarn-trunk #680 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/680/])
YARN-611. Added an API to let apps specify an interval beyond which AM failures
should be ignored towards counting max-attempts. Contributed by Xuan Gong.
(vinodkv: rev 14e2639fd0d53f7e0b58f2f4744af44983d4e867)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestHistoryFileManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/impl/pb/ApplicationAttemptStateDataPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationSubmissionContextPBImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestSpeculativeExecutionWithMRApp.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/ApplicationAttemptStateData.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/rm/TestRMContainerAllocator.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttempt.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/pom.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto/yarn_server_resourcemanager_recovery.proto
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/ControlledClock.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/ControlledClock.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java
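For reference, the new field surfaces on ApplicationSubmissionContext. A minimal sketch of how a submitting client might opt in; the direct record construction and the 60-second value are illustrative, not taken from the commit:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

public class ValidityIntervalExample {
  public static void main(String[] args) {
    // In a real client the context comes from YarnClient#createApplication();
    // constructing it directly here keeps the sketch self-contained.
    ApplicationSubmissionContext context =
        Records.newRecord(ApplicationSubmissionContext.class);
    context.setMaxAppAttempts(2);
    // AM failures that finished more than this many milliseconds ago are
    // ignored when counting attempts against max-attempts.
    context.setAttemptFailuresValidityInterval(60 * 1000L);
  }
}
{code}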
> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.0.3-alpha
> Reporter: Chris Riccomini
> Assignee: Xuan Gong
> Fix For: 2.6.0
>
> Attachments: YARN-611.1.patch, YARN-611.10.patch, YARN-611.11.patch,
> YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch,
> YARN-611.4.rebase.patch, YARN-611.5.patch, YARN-611.6.patch,
> YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch, YARN-611.9.rebase.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM
> before failing the whole YARN job. YARN counts an AM as failed if the node
> that it was running on dies (the NM will time out, which counts as a failure
> for the AM), or if the AM itself dies.
> This configuration is insufficient for long-running (or indefinitely running)
> YARN jobs, since the machine (or NM) that the AM is running on will
> eventually need to be restarted (or the machine/NM will fail). In such an
> event the AM has done nothing wrong, but the RM still counts it as a
> "failure". Since the retry count for the AM is never reset, accumulated
> machine/NM failures will eventually push the AM failure count above the
> configured value for yarn.resourcemanager.am.max-retries. Once this happens,
> the RM marks the job as failed and shuts it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM must stay
> failure-free to be considered "well behaved", making it safe to reset its
> failure count back to zero. Every time an AM fails, RMAppImpl would check the
> time of the last AM failure. If that failure was less than
> retry-count-window-ms ago and the new failure count exceeds max-retries, the
> job should fail. If the AM has never failed, the retry count is still below
> max-retries, or the last failure was outside the retry-count-window-ms, the
> job should be restarted. Additionally, if the last failure was outside the
> retry-count-window-ms, the failure count should be reset to 0.
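> A minimal sketch of that check; RetryWindowTracker and its fields are
> hypothetical illustrations, not existing RMAppImpl code:
> {code:java}
> // Hypothetical helper tracking AM failures against a reset window.
> class RetryWindowTracker {
>   private final long windowMs;
>   private final int maxRetries;
>   private long lastFailureTime = -1;
>   private int failureCount = 0;
>
>   RetryWindowTracker(long windowMs, int maxRetries) {
>     this.windowMs = windowMs;
>     this.maxRetries = maxRetries;
>   }
>
>   /** Returns true if the app should fail after this AM failure. */
>   boolean onAmFailure(long now) {
>     // Last failure fell outside the window: the AM was well behaved
>     // for a full window, so forgive the old failures.
>     if (lastFailureTime >= 0 && now - lastFailureTime >= windowMs) {
>       failureCount = 0;
>     }
>     failureCount++;
>     lastFailureTime = now;
>     return failureCount > maxRetries;
>   }
> }
> {code}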
> This would give developers a way to run well-behaved AMs forever, while
> still failing misbehaving AMs after a short period of time.
> I think the work to be done here is to change RMAppImpl to actually look at
> app.attempts and see whether there have been more than max-retries failures
> in the last retry-count-window-ms milliseconds. If there have, the job should
> fail; if not, it should go forward. Additionally, we might also need to add
> an endTime to either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that
> RMAppImpl can check the time of each failure.
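> A sketch of that attempt-scan, assuming the proposed endTime gets added;
> AttemptInfo below is a hypothetical stand-in for RMAppAttempt, which does
> not expose an end time today:
> {code:java}
> import java.util.List;
>
> // Hypothetical stand-in for an attempt record with the proposed endTime.
> class AttemptInfo {
>   final long endTime;
>   final boolean failed;
>
>   AttemptInfo(long endTime, boolean failed) {
>     this.endTime = endTime;
>     this.failed = failed;
>   }
>
>   /** Fail the app only if enough failures fall inside the window. */
>   static boolean shouldFail(List<AttemptInfo> attempts, long now,
>       long windowMs, int maxRetries) {
>     long cutoff = now - windowMs;
>     int recentFailures = 0;
>     for (AttemptInfo a : attempts) {
>       // Count only failures whose end time falls inside the window.
>       if (a.failed && a.endTime >= cutoff) {
>         recentFailures++;
>       }
>     }
>     return recentFailures > maxRetries;
>   }
> }
> {code}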
> Thoughts?