[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

Bikas Saha (JIRA) Wed, 14 Aug 2013 14:50:51 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740242#comment-13740242
 ]


Bikas Saha commented on YARN-1055:
----------------------------------

First of all, whatever needs to be set must be set in the AppSubmissionContext 
API for that job. Only that API is job-specific; this setting cannot be a 
global config across all jobs.

With MAPREDUCE-4824, on job submission we set a property in the job conf (which 
is job-specific) saying not to retry the job.
In YARN, on job submission, in the AppSubmissionContext API (which is 
job-specific), we say that max-am-retries = 1.

For a job that cannot be restarted (whether because of an AM crash, a node 
crash, or an RM restart, all of which are indistinguishable with respect to the 
job), the per-job max-am-retries needs to be set to 1. It's probably 2 weeks' 
worth of work to remove RM restart from the above list. Even after that, such a 
job needs to set max-am-retries = 1 so that the RM does not restart the job 
when the node or the AM crashes. Why does an RM-restart-related special API 
need to be added now?
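
To make the argument concrete, here is a minimal sketch of a per-job attempt 
limit, assuming the RM simply compares the number of attempts made so far 
against that limit. Every class, enum, and method name below is a hypothetical 
stand-in for illustration, not the real YARN API or the actual RM logic. The 
failure cause is passed in but deliberately ignored, to model the point that 
the three causes are indistinguishable with respect to the job.

```java
/** Hypothetical stand-in for the per-job submission context (not the YARN class). */
final class SubmissionContextSketch {
    private final int maxAmAttempts;

    SubmissionContextSketch(int maxAmAttempts) {
        if (maxAmAttempts < 1) {
            throw new IllegalArgumentException("maxAmAttempts must be >= 1");
        }
        this.maxAmAttempts = maxAmAttempts;
    }

    int getMaxAmAttempts() {
        return maxAmAttempts;
    }
}

/** The failure causes listed in the comment above. */
enum FailureCause { AM_CRASH, NODE_CRASH, RM_RESTART }

/** Hypothetical model of the restart decision; an assumption, not RM code. */
final class RetryPolicySketch {
    private RetryPolicySketch() {}

    /**
     * Returns true if another AM attempt may be started.
     * The cause is intentionally unused: the decision depends only on the
     * per-job limit, so a limit of 1 blocks restarts for every cause.
     */
    static boolean shouldStartNewAttempt(SubmissionContextSketch ctx,
                                         int attemptsSoFar,
                                         FailureCause cause) {
        return attemptsSoFar < ctx.getMaxAmAttempts();
    }
}
```

Under this model, a job submitted with a limit of 1 is never restarted after 
its first attempt fails, whatever the cause, while a job with a limit of 2 gets 
exactly one more attempt.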

                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

