[ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647750#comment-13647750 ]
Chris Riccomini commented on YARN-614:
--------------------------------------
Added a new patch. It resolves items 1 (switched justFinishedContainers to a
map for O(1) container status lookup) and 3 (added a shouldIgnoreFailures
method) from my list above.
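For readers following along, here is a rough sketch of what item 1 buys. This is not the patch code. It is a self-contained stand-in in which the hypothetical Status class and String IDs replace YARN's ContainerStatus and ContainerId. Keying the buffer by container ID turns a per-container status lookup into an expected O(1) hash probe instead of a list scan. A LinkedHashMap is assumed so that draining the buffer still returns containers in the order they finished, as a list would.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not RMAppAttemptImpl. String IDs stand in for ContainerId.
public class FinishedContainerTracker {

    // Hypothetical stand-in for YARN's ContainerStatus (not the real class).
    static final class Status {
        final String containerId;
        final int exitStatus;

        Status(String containerId, int exitStatus) {
            this.containerId = containerId;
            this.exitStatus = exitStatus;
        }
    }

    // Insertion-ordered map: O(1) lookup by ID, finish order kept for draining.
    private final Map<String, Status> justFinished = new LinkedHashMap<>();

    public void onContainerFinished(Status status) {
        justFinished.put(status.containerId, status);
    }

    // Expected O(1) lookup instead of scanning a list of statuses.
    public Status lookup(String containerId) {
        return justFinished.get(containerId); // null if not buffered
    }

    // Returns all buffered statuses in finish order and clears the buffer.
    public List<Status> pullJustFinished() {
        List<Status> out = new ArrayList<>(justFinished.values());
        justFinished.clear();
        return out;
    }
}
```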
Bikas, I think we should leave recovery for another ticket.
Do you want me to update RMAppManager.recover() to have the same "if
(app.attempts.size() - app.ignoredFailures >= app.maxAppAttempts)" logic as
RMAppImpl.AttemptFailedTransition?
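The condition quoted above can be read as a retry budget in which ignored failures are refunded. The following is a minimal sketch of that check in isolation. The method name retriesExhausted is hypothetical, and the parameters only mirror the fields in the quoted expression; this is not RMAppImpl code. With the proposed default of maxAppAttempts = 1, a single user-caused failure ends the app, while a failure that was ignored (for example, a lost NM) still leaves room for another attempt.

```java
// Hypothetical sketch of the retry-budget check; not RMAppImpl code.
public class RetryBudget {

    /**
     * Returns true when the app should be failed instead of getting another
     * attempt. Attempts whose failures were ignored (hardware or YARN
     * problems) do not count against maxAppAttempts.
     */
    static boolean retriesExhausted(int attempts, int ignoredFailures, int maxAppAttempts) {
        return attempts - ignoredFailures >= maxAppAttempts;
    }

    public static void main(String[] args) {
        // maxAppAttempts = 1 and one user-caused failure: budget exhausted.
        System.out.println(retriesExhausted(1, 0, 1));
        // Same budget, but that failure was ignored: another attempt is allowed.
        System.out.println(retriesExhausted(1, 1, 1));
    }
}
```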
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Attachments: YARN-614-0.patch, YARN-614-1.patch
>
>
> Attempts can fail because of a wide range of user errors, and those attempts
> should not be retried unnecessarily. YARN should retry an attempt only when
> the hardware fails or YARN itself has an error. A failing NM, a lost NM, and
> NM disk errors are the hardware errors that come to mind.
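One way to picture the split described in the issue is a classifier that separates platform-side causes from application-side ones. The following is only a sketch. The FailureCause enum and the shouldIgnoreFailure method are hypothetical and are not the shouldIgnoreFailures method from the patch. Real YARN reports these conditions through container exit statuses and node events rather than an enum like this. The three ignored causes are the ones named in the description.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; FailureCause and shouldIgnoreFailure are illustrative only.
public class FailureClassifier {

    // Hypothetical failure causes, not YARN types.
    enum FailureCause {
        APP_EXIT_NONZERO, // application code failed (user error)
        NM_FAILED,        // NodeManager crashed
        NM_LOST,          // NodeManager stopped heartbeating
        NM_DISK_ERROR     // local disks on the NodeManager failed
    }

    // Hardware/YARN-side causes named in the issue description.
    private static final Set<FailureCause> PLATFORM_CAUSES = EnumSet.of(
            FailureCause.NM_FAILED, FailureCause.NM_LOST, FailureCause.NM_DISK_ERROR);

    /** True if the failure is not the app's fault and should not count as a retry. */
    static boolean shouldIgnoreFailure(FailureCause cause) {
        return PLATFORM_CAUSES.contains(cause);
    }

    public static void main(String[] args) {
        for (FailureCause c : FailureCause.values()) {
            System.out.println(c + " ignored=" + shouldIgnoreFailure(c));
        }
    }
}
```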