[ 
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647770#comment-13647770
 ] 

Chris Riccomini commented on YARN-614:
--------------------------------------

Hey Bikas,

I've been looking into the recovery code a bit more. As far as I can tell (I'm 
still wrapping my head around it), the RMApp's RECOVER transition currently 
moves the app from NEW to SUBMITTED. The transition is triggered by 
RMAppManager.recover -> RMAppManager.submitApplication, which sends the RECOVER 
event. The submitApplication call happens directly before appImpl.recover() in 
RMAppManager:

    if(shouldRecover) {
      LOG.info("Recovering application " + appState.getAppId());
      submitApplication(appState.getApplicationSubmissionContext(),
                        appState.getSubmitTime(), true);
      // re-populate attempt information in application
      RMAppImpl appImpl = (RMAppImpl) rmContext.getRMApps().get(
                                                    appState.getAppId());
      appImpl.recover(state);
    }

This means that the RECOVER transition (StartAppAttemptTransition) happens 
before we have any state in the RMAppImpl. As a result, we can't add any logic 
to StartAppAttemptTransition to decide whether we should transition to FAILED, 
since the attempts variable will still be empty when the transition runs. 
I think this means that we can't do your second suggestion ("Another solution 
could be to make the RMApp go from NEW to FAILED in the recover transition 
based on failed counts etc.").
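
To make the ordering problem concrete, here is a toy sketch of what I think is 
going on. None of this is the real YARN code: ToyApp, handleRecoverEvent, 
maxRetries and the attempt names are all made up for illustration, and the 
"recover first" variant is only there for contrast, not as a proposed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoverOrderingSketch {
    enum AppState { NEW, SUBMITTED, FAILED }

    /** Toy stand-in for RMAppImpl (hypothetical, not the real class). */
    static class ToyApp {
        final int maxRetries;                             // hypothetical retry limit
        final List<String> attempts = new ArrayList<>();  // filled only by recover()
        AppState state = AppState.NEW;

        ToyApp(int maxRetries) {
            this.maxRetries = maxRetries;
        }

        /** Stand-in for the RECOVER transition: it only sees attempts loaded so far. */
        void handleRecoverEvent() {
            state = attempts.size() >= maxRetries ? AppState.FAILED : AppState.SUBMITTED;
        }

        /** Stand-in for appImpl.recover(state): re-populates attempt information. */
        void recover(List<String> storedAttempts) {
            attempts.addAll(storedAttempts);
        }
    }

    /** Mirrors the order described above: RECOVER event first, state second. */
    static AppState eventThenRecover(List<String> stored, int maxRetries) {
        ToyApp app = new ToyApp(maxRetries);
        app.handleRecoverEvent();  // attempts is still empty here
        app.recover(stored);
        return app.state;
    }

    /** Hypothetical reversed order, shown only for contrast. */
    static AppState recoverThenEvent(List<String> stored, int maxRetries) {
        ToyApp app = new ToyApp(maxRetries);
        app.recover(stored);
        app.handleRecoverEvent();  // attempts is populated here
        return app.state;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("attempt_1", "attempt_2");
        System.out.println(eventThenRecover(stored, 2));  // prints SUBMITTED
        System.out.println(recoverThenEvent(stored, 2));  // prints FAILED
    }
}
```

With two stored attempts and a limit of two, the first ordering never sees the 
failed attempts and lands in SUBMITTED, which is the situation I'm describing.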

Am I understanding this correctly?
                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch, YARN-614-1.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

