[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737487#comment-13737487
 ] 

Karthik Kambatla commented on YARN-1055:
----------------------------------------

Let me explain what I am getting at with the help of a concrete example.

# User is trying to run an Oozie workflow that has 10 actions - the 10th one 
is an MR job with 100 map tasks.
# The launcher job starts (AM-l) and subsequently starts the MR job - 
(AM-mr-1); the max-app-attempts for launcher (AM-l) is set to > 1, say 3.
# After completion of 95 tasks, AM-mr-1 goes down (node or other failure). 
Ideally, I would not want to restart the entire Oozie workflow for a single AM 
(or possibly node) failure. To address this, I would want to set 
max-app-attempts for the MR AM to be > 1, say 3.
# Assuming max-app-attempts = 3, the MR job runs a few more tasks.
# When the MR job still has 1 task to go, the RM goes down.
# Post RM-restart, the launcher (AM-l) and MR job (AM-mr-2) are restarted. The 
launcher re-runs the MR job - (AM-mr-3). It is possible that AM-mr-2 and 
AM-mr-3 run at the same time, leading to any number of issues (performance, 
correctness, etc.). To avoid this, I would want to set max-app-attempts = 1 
for the MR action.
# Points 3 (tolerating AM failure) and 6 (tolerating RM failure) require us to 
set max-app-attempts to > 1 and = 1, respectively, at the same time, which is 
not possible with a single setting.

Now, suppose a separate config for recovering apps on RM restart exists. I 
could use this config to address point 6 (the RM failure) and keep using the 
current max-app-attempts for point 3 (the AM failure).
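
To make the conflict concrete, here is a minimal Python sketch of a deliberately simplified model; it is not YARN code. The names `AppPolicy`, `recover_on_rm_restart`, `survives_am_failure` and `restarted_by_rm` are hypothetical and exist only for illustration. The sketch assumes that, without a separate config, the RM re-runs an app after a restart exactly when the app is allowed more than one attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppPolicy:
    # Existing knob from the discussion: total AM attempts allowed.
    max_app_attempts: int
    # Hypothetical per-app flag proposed in this issue (name is an assumption).
    recover_on_rm_restart: bool = True


def survives_am_failure(policy: AppPolicy) -> bool:
    """Point 3: a single AM (or node) failure is tolerated only if
    more than one attempt is allowed."""
    return policy.max_app_attempts > 1


def restarted_by_rm(policy: AppPolicy, use_separate_flag: bool) -> bool:
    """Point 6: does the RM re-run the app after an RM restart?

    Simplifying assumption: with a single knob, the answer is tied to
    max_app_attempts; with the separate flag, it is decided independently.
    """
    if use_separate_flag:
        return policy.recover_on_rm_restart
    return policy.max_app_attempts > 1


if __name__ == "__main__":
    # Single knob: no value gives "tolerate AM failure" without
    # "RM also restarts the job" (the duplicate AM-mr-2 / AM-mr-3 risk).
    for attempts in (1, 3):
        p = AppPolicy(max_app_attempts=attempts)
        print(attempts, survives_am_failure(p), restarted_by_rm(p, False))

    # Separate flag: both goals can be met for the MR action.
    mr_action = AppPolicy(max_app_attempts=3, recover_on_rm_restart=False)
    print(survives_am_failure(mr_action), restarted_by_rm(mr_action, True))
```

In this model the MR action keeps max-app-attempts = 3 for AM failures while opting out of recovery on RM restart, so only the copy re-launched by the launcher would run.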

Am I overlooking something here? Thoughts?
                
> App recovery should be configurable per application
> ---------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> In Hadoop-1, the job recovery on JT restart is configurable per-job. For 
> parity and its usefulness, we should have the same behavior in YARN as well.

