[
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648654#comment-13648654
]
Chris Riccomini commented on YARN-614:
--------------------------------------
The change currently proposed is to keep a counter in RMAppImpl
(ignoredFailures) that tracks the AM failures we wish to ignore. The counter
is incremented whenever RMAppImpl sees an AM fail with a container status of
either ABORTED or DISKS_FAILED. When RMAppImpl decides whether to retry an app
with a new attempt, instead of checking attempts < maxSize, it checks
attempts - ignoredFailures < maxSize.
This patch does not address recovery. If attempts > maxSize on recovery, the
job will not be restarted, even if some of those attempts resulted in ABORTED
or DISKS_FAILED failures.
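
The retry check and the recovery gap described above can be sketched roughly
as follows. This is a standalone illustration, not the patch itself: the class
IgnoredFailureSketch, the ExitStatus enum, and the recover() helper are
hypothetical stand-ins, and the sketch assumes that recovery restores only the
attempt count and not the ignoredFailures counter.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified stand-in for the retry bookkeeping in RMAppImpl.
public class IgnoredFailureSketch {
    // Hypothetical enum; only ABORTED and DISKS_FAILED come from the comment.
    enum ExitStatus { USER_ERROR, ABORTED, DISKS_FAILED }

    // Failure statuses that should not count against the retry limit.
    private static final Set<ExitStatus> IGNORED =
        EnumSet.of(ExitStatus.ABORTED, ExitStatus.DISKS_FAILED);

    private final int maxSize;   // maximum number of counted attempts
    private int attempts;        // all failed attempts seen so far
    private int ignoredFailures; // failed attempts we wish to ignore

    IgnoredFailureSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    // Record a failed AM attempt and bump the counter if it is ignorable.
    void onAttemptFailed(ExitStatus status) {
        attempts++;
        if (IGNORED.contains(status)) {
            ignoredFailures++;
        }
    }

    // The proposed check: attempts - ignoredFailures < maxSize.
    boolean shouldRetry() {
        return attempts - ignoredFailures < maxSize;
    }

    // Assumption: after recovery only the attempt count is known, so the
    // ignoredFailures counter starts again from zero.
    static IgnoredFailureSketch recover(int maxSize, int recoveredAttempts) {
        IgnoredFailureSketch app = new IgnoredFailureSketch(maxSize);
        app.attempts = recoveredAttempts;
        return app;
    }

    public static void main(String[] args) {
        IgnoredFailureSketch app = new IgnoredFailureSketch(1);
        app.onAttemptFailed(ExitStatus.DISKS_FAILED);
        System.out.println("retry after DISKS_FAILED: " + app.shouldRetry());
        app.onAttemptFailed(ExitStatus.USER_ERROR);
        System.out.println("retry after USER_ERROR: " + app.shouldRetry());

        IgnoredFailureSketch recovered = IgnoredFailureSketch.recover(2, 3);
        System.out.println("retry after recovery: " + recovered.shouldRetry());
    }
}
```

With maxSize set to 1, an ABORTED or DISKS_FAILED failure still leaves room for
one more attempt, while any other failure uses up the limit. In the recovered
case the earlier ignorable failures are no longer subtracted, which is the gap
noted above.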
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Assignee: Chris Riccomini
> Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.