[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737487#comment-13737487 ]
Karthik Kambatla commented on YARN-1055:
----------------------------------------

Let me explain what I am getting at with the help of a concrete example.
# A user is trying to run an Oozie workflow that has 10 actions; the 10th one is an MR job with 100 map tasks.
# The launcher job (AM-l) starts and subsequently starts the MR job (AM-mr-1); the max-app-attempts for the launcher (AM-l) is set to > 1, say 3.
# After 95 tasks complete, AM-mr-1 goes down (node or other failure). Ideally, I would not want to restart the entire Oozie workflow because of a single AM (maybe node) failure. To address this, I would want to set max-app-attempts for the MR AM to be > 1, say 3.
# Assuming max-app-attempts = 3, the MR job runs a few more tasks.
# When the MR job still has 1 task to go, the RM goes down.
# Post RM restart, the launcher (AM-l) and MR job (AM-mr-2) are restarted. The launcher re-runs the MR job (AM-mr-3). It is possible that AM-mr-2 and AM-mr-3 run at the same time, leading to any number of issues: performance, correctness, etc. To avoid this, I would want to set max-app-attempts = 1 for the MR action.
# Points 3 (tolerating AM failure) and 6 (tolerating RM failure) require us to set max-app-attempts to > 1 and = 1, respectively, at the same time.

Now, suppose a separate config for recovering apps on RM restart exists. I could use this config to address point 6 (the RM failure) and the current max-app-attempts for point 3 (the AM failure).

Am I overlooking something here? Thoughts?

> App recovery should be configurable per application
> ---------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> In Hadoop-1, the job recovery on JT restart is configurable per-job. For
> parity and its usefulness, we should have the same behavior in YARN as well.
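
To make the conflict in the example concrete, here is a toy model of the relaunch decision. It is only a sketch and not YARN code: `AppPolicy`, `should_relaunch`, and especially `recover_on_rm_restart` are hypothetical names, the last one standing in for the separate per-application recovery config proposed above. It also assumes, for simplicity, that an RM restart consumes an attempt in the same way an AM failure does.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Event(Enum):
    AM_FAILURE = "am_failure"  # the AM (or its node) went down
    RM_RESTART = "rm_restart"  # the RM went down and came back


@dataclass
class AppPolicy:
    # Existing knob from the discussion above.
    max_app_attempts: int
    # Hypothetical per-application flag; None means "no such config exists",
    # so RM restart falls back to max_app_attempts alone.
    recover_on_rm_restart: Optional[bool] = None


def should_relaunch(policy: AppPolicy, event: Event, attempts_so_far: int) -> bool:
    """Return True if this toy model would start another attempt of the app."""
    has_attempts_left = attempts_so_far < policy.max_app_attempts
    if event is Event.RM_RESTART and policy.recover_on_rm_restart is not None:
        # With a separate flag, RM restart is decided independently of AM failure.
        return policy.recover_on_rm_restart and has_attempts_left
    return has_attempts_left


# One knob only: tolerating AM failure also means relaunch on RM restart (point 6).
tolerant = AppPolicy(max_app_attempts=3)
# One knob only: avoiding relaunch on RM restart also gives up AM retries (point 3).
strict = AppPolicy(max_app_attempts=1)
# Two knobs: retry on AM failure, but leave the re-run to the launcher on RM restart.
split = AppPolicy(max_app_attempts=3, recover_on_rm_restart=False)
```

Under these assumptions, only the `split` policy relaunches after an AM failure while staying down after an RM restart, which is the combination that points 3 and 6 ask for.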