[ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648654#comment-13648654
 ] 

Chris Riccomini commented on YARN-614:
--------------------------------------

The proposed change, right now, is to keep a counter in RMAppImpl 
(ignoredFailures) that tracks the AM failures we wish to ignore. This 
counter is incremented whenever the RMAppImpl is notified of an AM failure 
whose container status is either ABORTED or DISKS_FAILED. When the RMAppImpl 
decides whether to retry an app with a new attempt, instead of checking if 
attempts < maxSize, it checks attempts - ignoredFailures < maxSize.
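
A minimal sketch of the counting logic described above, not the patch itself. 
The class name, the ExitStatus enum, and the method names are hypothetical 
stand-ins; only ignoredFailures, attempts, maxSize, ABORTED and DISKS_FAILED 
come from the description.

```java
import java.util.EnumSet;
import java.util.Set;

public class IgnoredFailureRetrySketch {
    // Hypothetical stand-in for the container statuses named in the comment.
    public enum ExitStatus { ABORTED, DISKS_FAILED, OTHER }

    // Failures with these statuses do not count against the retry limit.
    private static final Set<ExitStatus> IGNORED =
        EnumSet.of(ExitStatus.ABORTED, ExitStatus.DISKS_FAILED);

    private final int maxSize;
    private int attempts = 0;
    private int ignoredFailures = 0;

    public IgnoredFailureRetrySketch(int maxSize) {
        this.maxSize = maxSize;
    }

    // Called each time a new attempt is created.
    public void startAttempt() {
        attempts++;
    }

    // Called when the AM of the current attempt fails.
    public void onAttemptFailed(ExitStatus status) {
        if (IGNORED.contains(status)) {
            ignoredFailures++;
        }
    }

    // The modified check: attempts - ignoredFailures < maxSize.
    public boolean shouldRetry() {
        return attempts - ignoredFailures < maxSize;
    }

    public static void main(String[] args) {
        IgnoredFailureRetrySketch app = new IgnoredFailureRetrySketch(1);

        app.startAttempt();
        app.onAttemptFailed(ExitStatus.DISKS_FAILED);
        System.out.println(app.shouldRetry()); // true: 1 - 1 < 1

        app.startAttempt();
        app.onAttemptFailed(ExitStatus.OTHER);
        System.out.println(app.shouldRetry()); // false: 2 - 1 < 1 does not hold
    }
}
```

With maxSize set to 1, a DISKS_FAILED failure still leaves room for another 
attempt, while any other failure uses up the single allowed attempt.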

This patch does not address recovery. If attempts > maxSize on recovery, the 
job will not be restarted, even if some of those attempts resulted in ABORTED 
or DISKS_FAILED failures.
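
To make the recovery gap concrete, here is a small worked example with 
made-up numbers. Both method names are hypothetical, and the recovery check 
is only a reading of the sentence above (attempts > maxSize means no 
restart), not code taken from the RM.

```java
public class RecoveryGapSketch {
    // Check used while the app is running, as described in this comment.
    public static boolean retryWhileRunning(int attempts, int ignoredFailures, int maxSize) {
        return attempts - ignoredFailures < maxSize;
    }

    // Assumed recovery-time behavior: ignoredFailures is not consulted,
    // so the raw attempt count alone decides whether the job is restarted.
    public static boolean restartOnRecovery(int attempts, int maxSize) {
        return !(attempts > maxSize);
    }

    public static void main(String[] args) {
        int attempts = 3;        // illustrative values only
        int ignoredFailures = 2; // two attempts ended in ABORTED or DISKS_FAILED
        int maxSize = 2;

        // 3 - 2 < 2, so a running RM would start another attempt.
        System.out.println(retryWhileRunning(attempts, ignoredFailures, maxSize)); // true

        // 3 > 2, so after recovery the job is not restarted.
        System.out.println(restartOnRecovery(attempts, maxSize)); // false
    }
}
```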
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Chris Riccomini
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
