[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749516#comment-13749516 ]
Bikas Saha commented on YARN-540: --------------------------------- What will happen if the RM failed after deleting the app from the store but before the app pulled that information from the RM? I think the RM is not going to recognize that app after restarting and return exception to the finishApplicationMaster response from the running app. The app will fail. And then it will not be re-started by the RM. Comments about the patch itself. Why are we sleeping before checking the value? {code} + do { + response = scheduler.finishApplicationMaster(request); + Thread.sleep(rmPollInterval); + } while (response.getIsRemovedFromRMStateSore()); {code} The state transitions are asynchronous. We cannot expect to always find the app in the FINISHING state. {code} + if (RMAppState.FINISHING.equals(rmContext.getRMApps() + .get(applicationAttemptId.getApplicationId()).getState())) { + return FinishApplicationMasterResponse.newInstance(true); + } {code} Can the application finish on the RM (in between 2 finishApp() requests) such that it never gets a true response? RMAppEventType.ATTEMPT_FINISHING should be renamed to ATTEMPT_UNREGISTERED in a different jira. store.removeApplication() should be in the RMAppImpl transitions (AppRemoving and FinalTransition) instead of ApplicationMasterService and RMAppManager. Can we pick a name that does not expose class names and impl details? eg. isUnregistered() {code} getIsRemovedFromRMStateSore() {code} Is this possible to avoid 2 round trips to store? {code} + if(!fs.exists(deletePath)) + return; if(!fs.delete(deletePath, true)) { throw new Exception("Failed to delete " + deletePath); {code} There is no need for multiple code paths/transitions. It should always go from RUNNING->APP_REMOVING. Please look at NEW->NEW_SAVING. When recovery is not enabled we use the NullRMStateStore to ensure that the main code path in the RM remains the same. {code} + private static final class RMAppFinishingOrRemovingTransition implements + MultipleArcTransition<RMAppImpl, RMAppEvent, RMAppState> { + @Override + public RMAppState transition(RMAppImpl app, RMAppEvent event) { + boolean isRecoveryEnabled = + app.conf.getBoolean(YarnConfiguration.RECOVERY_ENABLED, + YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED); + if (isRecoveryEnabled) { + LOG.info("Removing application with id " + app.applicationId); + app.rmContext.getStateStore().removeApplication(app); + return RMAppState.APP_REMOVING; + } else + new RMAppFinishingTransition().transition(app, event); + return RMAppState.FINISHING; {code} Why add the APP_? {code} RUNNING, + APP_REMOVING, {code} > Race condition causing RM to potentially relaunch already unregistered AMs on > RM restart > ---------------------------------------------------------------------------------------- > > Key: YARN-540 > URL: https://issues.apache.org/jira/browse/YARN-540 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Reporter: Jian He > Assignee: Jian He > Priority: Blocker > Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch, > YARN-540.patch > > > When job succeeds and successfully call finishApplicationMaster, RM shutdown > and restart-dispatcher is stopped before it can process REMOVE_APP event. The > next time RM comes back, it will reload the existing state files even though > the job is succeeded -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira