[
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647770#comment-13647770
]
Chris Riccomini commented on YARN-614:
--------------------------------------
Hey Bikas,
I've been looking into the recovery code a bit more. As far as I can tell (I'm
still wrapping my head around this), the RMApp's RECOVER transition currently
moves the app from NEW to SUBMITTED. The transition is triggered by
RMAppManager.recover -> RMAppManager.submitApplication, which sends the RECOVER
event. The submitApplication call happens directly before appImpl.recover() in
RMAppManager:
bq. if (shouldRecover) {
      LOG.info("Recovering application " + appState.getAppId());
      submitApplication(appState.getApplicationSubmissionContext(),
          appState.getSubmitTime(), true);
      // re-populate attempt information in application
      RMAppImpl appImpl = (RMAppImpl) rmContext.getRMApps().get(
          appState.getAppId());
      appImpl.recover(state);
    }
This means that the RECOVER transition (StartAppAttemptTransition) runs before
the RMAppImpl has any recovered state. As a result, we can't add logic to
StartAppAttemptTransition to decide whether to transition to FAILED, since the
attempts variable will still be empty at that point.
I think this rules out your second suggestion ("Another solution
could be to make the RMApp go from NEW to FAILED in the recover transition
based on failed counts etc.").
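To make the ordering concrete, here is a toy sketch of what I think is going on. None of this is the real YARN code: ToyApp, handleRecoverEvent, and maxRetries are hypothetical stand-ins, the dispatch is synchronous for simplicity, and the failed-count check is the kind of logic the suggestion would need rather than anything that exists today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: ToyApp is a hypothetical stand-in for RMAppImpl.
public class RecoverOrderingSketch {

    enum State { NEW, SUBMITTED, FAILED }

    static class ToyApp {
        State state = State.NEW;
        final List<String> attempts = new ArrayList<>();
        final int maxRetries;                      // hypothetical retry limit
        int attemptsSeenByRecoverTransition = -1;  // recorded for illustration

        ToyApp(int maxRetries) {
            this.maxRetries = maxRetries;
        }

        // Stand-in for the RECOVER transition, with the hypothetical
        // "go to FAILED based on failed counts" check added.
        void handleRecoverEvent() {
            attemptsSeenByRecoverTransition = attempts.size();
            state = attempts.size() >= maxRetries ? State.FAILED : State.SUBMITTED;
        }

        // Stand-in for appImpl.recover(state): re-populates attempt information.
        void recover(List<String> storedAttempts) {
            attempts.addAll(storedAttempts);
        }
    }

    // Mirrors the order in the snippet above: the RECOVER event is handled
    // first, and attempt information is restored only afterwards.
    static ToyApp recoverInCurrentOrder(List<String> storedAttempts, int maxRetries) {
        ToyApp app = new ToyApp(maxRetries);
        app.handleRecoverEvent();
        app.recover(storedAttempts);
        return app;
    }

    public static void main(String[] args) {
        ToyApp app = recoverInCurrentOrder(Arrays.asList("attempt_1", "attempt_2"), 2);
        System.out.println("attempts seen by transition: "
            + app.attemptsSeenByRecoverTransition);
        System.out.println("state after recovery: " + app.state);
        System.out.println("attempts after recover(): " + app.attempts.size());
    }
}
```

In this toy version the transition sees zero attempts and lands in SUBMITTED, even though two stored attempts would have hit the (hypothetical) limit of 2 once recover() has run.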
Am I understanding this correctly?
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Attachments: YARN-614-0.patch, YARN-614-1.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.