[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749516#comment-13749516
]
Bikas Saha commented on YARN-540:
---------------------------------
What will happen if the RM failed after deleting the app from the store but
before the app pulled that information from the RM? I think the RM is not going
to recognize that app after restarting and return exception to the
finishApplicationMaster response from the running app. The app will fail. And
then it will not be re-started by the RM.
Comments about the patch itself.
Why are we sleeping before checking the value?
{code}
+ do {
+ response = scheduler.finishApplicationMaster(request);
+ Thread.sleep(rmPollInterval);
+ } while (response.getIsRemovedFromRMStateSore());
{code}
The state transitions are asynchronous. We cannot expect to always find the app
in the FINISHING state.
{code}
+ if (RMAppState.FINISHING.equals(rmContext.getRMApps()
+ .get(applicationAttemptId.getApplicationId()).getState())) {
+ return FinishApplicationMasterResponse.newInstance(true);
+ }
{code}
Can the application finish on the RM (in between 2 finishApp() requests) such
that it never gets a true response?
RMAppEventType.ATTEMPT_FINISHING should be renamed to ATTEMPT_UNREGISTERED in a
different jira.
store.removeApplication() should be in the RMAppImpl transitions (AppRemoving
and FinalTransition) instead of ApplicationMasterService and RMAppManager.
Can we pick a name that does not expose class names and impl details? eg.
isUnregistered()
{code}
getIsRemovedFromRMStateSore()
{code}
Is this possible to avoid 2 round trips to store?
{code}
+ if(!fs.exists(deletePath))
+ return;
if(!fs.delete(deletePath, true)) {
throw new Exception("Failed to delete " + deletePath);
{code}
There is no need for multiple code paths/transitions. It should always go from
RUNNING->APP_REMOVING. Please look at NEW->NEW_SAVING. When recovery is not
enabled we use the NullRMStateStore to ensure that the main code path in the RM
remains the same.
{code}
+ private static final class RMAppFinishingOrRemovingTransition implements
+ MultipleArcTransition<RMAppImpl, RMAppEvent, RMAppState> {
+ @Override
+ public RMAppState transition(RMAppImpl app, RMAppEvent event) {
+ boolean isRecoveryEnabled =
+ app.conf.getBoolean(YarnConfiguration.RECOVERY_ENABLED,
+ YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED);
+ if (isRecoveryEnabled) {
+ LOG.info("Removing application with id " + app.applicationId);
+ app.rmContext.getStateStore().removeApplication(app);
+ return RMAppState.APP_REMOVING;
+ } else
+ new RMAppFinishingTransition().transition(app, event);
+ return RMAppState.FINISHING;
{code}
Why add the APP_?
{code}
RUNNING,
+ APP_REMOVING,
{code}
> Race condition causing RM to potentially relaunch already unregistered AMs on
> RM restart
> ----------------------------------------------------------------------------------------
>
> Key: YARN-540
> URL: https://issues.apache.org/jira/browse/YARN-540
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
> Priority: Blocker
> Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch,
> YARN-540.patch
>
>
> When job succeeds and successfully call finishApplicationMaster, RM shutdown
> and restart-dispatcher is stopped before it can process REMOVE_APP event. The
> next time RM comes back, it will reload the existing state files even though
> the job is succeeded
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira