[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740242#comment-13740242 ]
Bikas Saha commented on YARN-1055:
----------------------------------
First of all, whatever needs to be set must be set through the
AppSubmissionContext API for that job. Only that API is job-specific; this
config cannot be global across all jobs.
With MAPREDUCE-4824, on job submission we set a property in the job conf (which
is job-specific) saying not to retry the job.
In YARN, on job submission, in the AppSubmissionContext API (that is job
specific), we say that max-am-retries = 1.
For a job that cannot be restarted (whether due to an AM crash, a node crash, or
an RM restart, all of which are indistinguishable with respect to the job), the
per-job max-am-retries needs to be set to 1. It's probably 2 weeks' worth of
work to remove RM restart from the above list. Even after that, such a job needs
to set max-am-retries = 1 so that the RM does not restart the job when the node
crashes or the AM crashes. Why does a special API related to RM restart need to
be added now?
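To make the argument above concrete, here is a minimal, self-contained sketch of
the attempt accounting being described. It is not ResourceManager code: the
class AttemptPolicySketch, the Failure enum, and startsNewAttempt are
hypothetical names invented for illustration. It also assumes that "removing RM
restart from the list" means an RM restart would no longer be charged to the job
as a failed attempt.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of per-job attempt accounting; not actual YARN code. */
final class AttemptPolicySketch {

    /** The three events the comment says a job cannot tell apart today. */
    enum Failure { AM_CRASH, NODE_CRASH, RM_RESTART }

    /**
     * Returns true if the job would be started again after the given event.
     *
     * @param maxAttempts   per-job limit (the "max-am-retries" of the comment)
     * @param attemptsSoFar attempts already started, including the one that just ended
     * @param failure       the event that occurred
     * @param chargedToJob  events that count as a failed attempt (assumption of this sketch)
     */
    static boolean startsNewAttempt(int maxAttempts, int attemptsSoFar,
                                    Failure failure, Set<Failure> chargedToJob) {
        if (!chargedToJob.contains(failure)) {
            // Assumed behavior: an event that is not charged to the job
            // does not trigger a relaunch of the job.
            return false;
        }
        return attemptsSoFar < maxAttempts;
    }

    public static void main(String[] args) {
        // Today: all three events are charged to the job.
        Set<Failure> today = EnumSet.allOf(Failure.class);
        // Assumed state after the follow-up work: RM restart is no longer charged.
        Set<Failure> later = EnumSet.of(Failure.AM_CRASH, Failure.NODE_CRASH);

        for (Failure f : Failure.values()) {
            System.out.println(f + ", max=1, today: new attempt = "
                    + startsNewAttempt(1, 1, f, today));
            System.out.println(f + ", max=2, later: new attempt = "
                    + startsNewAttempt(2, 1, f, later));
        }
    }
}
```

Under these assumptions, a limit of 1 prevents a relaunch for every event in
both configurations, while a limit of 2 still allows a relaunch after an AM or
node crash even once RM restart is excluded. That is why the per-job limit is
still needed.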
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery
> for AM and RM currently relies on the max-attempts config; tolerating AM
> failures requires it to be > 1 and tolerating RM failure/restart requires it
> to be = 1.
> We should handle these two differently, with two separate configs.