[
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737487#comment-13737487
]
Karthik Kambatla commented on YARN-1055:
----------------------------------------
Let me explain what I am getting at with the help of a concrete example.
# User is trying to run an Oozie workflow that has 10 actions - the 10th one
is an MR job with 100 map tasks.
# The launcher job (AM-l) starts and subsequently starts the MR job
(AM-mr-1); the max-app-attempts for the launcher (AM-l) is set to > 1, say 3.
# After completion of 95 tasks, AM-mr-1 goes down (node or other failure).
Ideally, I would not want to restart the entire Oozie workflow for a single AM
(maybe node) failure. To address this, I would want to set max-app-attempts
for MR-AM to be > 1, say 3.
# Assuming max-app-attempts = 3, the MR job runs a few more tasks.
# When the MR job still has 1 task to go, the RM goes down.
# Post RM-restart, the launcher (AM-l) and MR job (AM-mr-2) are restarted. The
launcher re-runs the MR job (AM-mr-3). It is possible that AM-mr-2 and
AM-mr-3 run at the same time, leading to any number of issues - performance,
correctness, etc. To avoid this, I would want to set max-app-attempts = 1 for
the MR action.
# Points 3 (tolerating AM failure) and 6 (tolerating RM failure) require us to
set max-app-attempts to > 1 and =1 respectively at the same time.
Now, suppose a separate config for recovering apps on RM restart exists. I
could use this config to address point 6 (the RM failure) and the current
max-app-attempts for point 3 (the AM failure).
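To make the conflict concrete, here is a minimal sketch of points 3 and 6 above. It is a toy model, not the YARN API: the class RecoveryPolicySketch, the AppPolicy type, the recoverOnRmRestart flag, and all method names are hypothetical, and the "single knob" method only encodes the behavior this scenario assumes (max-app-attempts alone decides whether the RM relaunches the app after a restart).

```java
public class RecoveryPolicySketch {

    /** Hypothetical per-application settings; not an actual YARN class. */
    public static final class AppPolicy {
        public final int maxAppAttempts;          // models max-app-attempts
        public final boolean recoverOnRmRestart;  // assumed separate recovery config

        public AppPolicy(int maxAppAttempts, boolean recoverOnRmRestart) {
            this.maxAppAttempts = maxAppAttempts;
            this.recoverOnRmRestart = recoverOnRmRestart;
        }
    }

    /** Point 3: may a new attempt be started after an AM failure? */
    public static boolean retriesAfterAmFailure(AppPolicy p, int attemptsSoFar) {
        return attemptsSoFar < p.maxAppAttempts;
    }

    /** Point 6 with a single knob: only max-app-attempts decides (assumption). */
    public static boolean relaunchedAfterRmRestartSingleKnob(AppPolicy p, int attemptsSoFar) {
        return attemptsSoFar < p.maxAppAttempts;
    }

    /** Point 6 with the separate config: the per-app flag can veto the relaunch. */
    public static boolean relaunchedAfterRmRestartSeparateKnob(AppPolicy p, int attemptsSoFar) {
        return p.recoverOnRmRestart && attemptsSoFar < p.maxAppAttempts;
    }

    public static void main(String[] args) {
        // MR action from the example: tolerate AM failures, but do not let the
        // RM relaunch it, because the launcher re-runs it anyway.
        AppPolicy mrAction = new AppPolicy(3, false);

        System.out.println("retry after AM failure: "
                + retriesAfterAmFailure(mrAction, 1));
        System.out.println("relaunch after RM restart (single knob): "
                + relaunchedAfterRmRestartSingleKnob(mrAction, 2));
        System.out.println("relaunch after RM restart (separate knob): "
                + relaunchedAfterRmRestartSeparateKnob(mrAction, 2));
    }
}
```

With a single knob, any value that allows the retry in point 3 also allows the relaunch in point 6; with the separate flag, the two decisions can be made independently.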
Am I overlooking/missing something here? Thoughts?
> App recovery should be configurable per application
> ---------------------------------------------------
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Karthik Kambatla
>
> In Hadoop-1, the job recovery on JT restart is configurable per-job. For
> parity and its usefulness, we should have the same behavior in YARN as well.