[ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646993#comment-13646993
 ] 

Chris Riccomini commented on YARN-614:
--------------------------------------

bq. One solution could be to move the check from finishAttempt() to 
createAttempt(). finishAttempt() always enqueues a new attempt. The new attempt 
creation checks if one can still be created based on failed count etc.

This wouldn't fix the problem with RMAppManager.recover(), would it? Whether we 
enqueue attempts in finishAttempt() or createAttempt(), if the attempt count ever 
goes above maxAppAttempts, it seems like RMAppManager would not recover the 
app, right?

Are you proposing we always call appImpl.recover() in RMAppManager, always 
retry in RMAppImpl.AttemptFailedTransition, and call 
RMAppImpl.countFailureToAttemptLimit() inside RMAppImpl.createNewAttempt?
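
A minimal sketch of the arrangement the question above describes, to make it 
concrete. AppSketch and ExitCause are hypothetical stand-ins, not the real 
RMAppImpl or any YARN type, and the body of countFailureToAttemptLimit() is an 
assumption (only user errors count toward the limit). The point is that the 
limit check sits in createNewAttempt(), so the failure path and the recovery 
path go through the same check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for RMAppImpl; not the real YARN class. */
class AppSketch {
    /** Why an attempt ended. These categories are illustrative assumptions. */
    enum ExitCause { USER_ERROR, NODE_LOST, DISK_FAILURE }

    private final int maxAppAttempts;
    private final List<ExitCause> finishedAttempts = new ArrayList<>();
    private int attemptsCreated = 0;

    AppSketch(int maxAppAttempts) {
        this.maxAppAttempts = maxAppAttempts;
    }

    /** Assumption: hardware and YARN failures do not count toward the limit. */
    int countFailureToAttemptLimit() {
        int counted = 0;
        for (ExitCause cause : finishedAttempts) {
            if (cause == ExitCause.USER_ERROR) {
                counted++;
            }
        }
        return counted;
    }

    /** The only place the limit is checked. */
    boolean createNewAttempt() {
        if (countFailureToAttemptLimit() >= maxAppAttempts) {
            return false;
        }
        attemptsCreated++;
        return true;
    }

    /** Always asks for a retry; createNewAttempt() decides whether one happens. */
    boolean attemptFailed(ExitCause cause) {
        finishedAttempts.add(cause);
        return createNewAttempt();
    }

    /** Always recovers stored attempts, then reuses the same check. */
    boolean recover(List<ExitCause> storedAttempts) {
        finishedAttempts.addAll(storedAttempts);
        attemptsCreated = storedAttempts.size();
        return createNewAttempt();
    }

    int attemptsCreated() {
        return attemptsCreated;
    }
}
```

With maxAppAttempts set to 1, an app whose two stored attempts both ended in 
NODE_LOST would still be recovered under this sketch, because the raw attempt 
count (2) is never compared against the limit; only the counted failures (0) 
are.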
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
