[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

Bikas Saha (JIRA) Sat, 24 Aug 2013 19:34:32 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749516#comment-13749516
 ]


Bikas Saha commented on YARN-540:
---------------------------------

What will happen if the RM failed after deleting the app from the store but 
before the app pulled that information from the RM? I think the RM is not going 
to  recognize that app after restarting and return exception to the 
finishApplicationMaster response from the running app. The app will fail. And 
then it will not be re-started by the RM.

Comments about the patch itself.

Why are we sleeping before checking the value?
{code}
+      do {
+        response = scheduler.finishApplicationMaster(request);
+        Thread.sleep(rmPollInterval);
+      } while (response.getIsRemovedFromRMStateSore());
{code}

The state transitions are asynchronous. We cannot expect to always find the app 
in the FINISHING state.
{code}
+      if (RMAppState.FINISHING.equals(rmContext.getRMApps()
+        .get(applicationAttemptId.getApplicationId()).getState())) {
+        return FinishApplicationMasterResponse.newInstance(true);
+      }
{code}
Can the application finish on the RM (in between 2 finishApp() requests) such 
that it never gets a true response?

RMAppEventType.ATTEMPT_FINISHING should be renamed to ATTEMPT_UNREGISTERED in a 
different jira.

store.removeApplication() should be in the RMAppImpl transitions (AppRemoving 
and FinalTransition) instead of ApplicationMasterService and RMAppManager.

Can we pick a name that does not expose class names and impl details? eg. 
isUnregistered()
{code}
getIsRemovedFromRMStateSore()
{code}

Is this possible to avoid 2 round trips to store?
{code}
+    if(!fs.exists(deletePath))
+      return;
     if(!fs.delete(deletePath, true)) {
       throw new Exception("Failed to delete " + deletePath);
{code}

There is no need for multiple code paths/transitions. It should always go from 
RUNNING->APP_REMOVING. Please look at NEW->NEW_SAVING. When recovery is not 
enabled we use the NullRMStateStore to ensure that the main code path in the RM 
remains the same.
{code}
+  private static final class RMAppFinishingOrRemovingTransition  implements
+      MultipleArcTransition<RMAppImpl, RMAppEvent, RMAppState> {
+    @Override
+    public RMAppState transition(RMAppImpl app, RMAppEvent event) {
+      boolean isRecoveryEnabled =
+          app.conf.getBoolean(YarnConfiguration.RECOVERY_ENABLED,
+            YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED);
+      if (isRecoveryEnabled) {
+        LOG.info("Removing application with id " + app.applicationId);
+        app.rmContext.getStateStore().removeApplication(app);
+        return RMAppState.APP_REMOVING;
+      } else
+        new RMAppFinishingTransition().transition(app, event);
+        return RMAppState.FINISHING;
{code}

Why add the APP_?
{code}
   RUNNING,
+  APP_REMOVING,
{code}

                
> Race condition causing RM to potentially relaunch already unregistered AMs on 
> RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch, 
> YARN-540.patch
>
>
> When job succeeds and successfully call finishApplicationMaster, RM shutdown 
> and restart-dispatcher is stopped before it can process REMOVE_APP event. The 
> next time RM comes back, it will reload the existing state files even though 
> the job is succeeded

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

Reply via email to