[ 
https://issues.apache.org/jira/browse/YARN-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803984#comment-13803984
 ] 

Bikas Saha commented on YARN-891:
---------------------------------

Is the APP_SAVED event being overloaded for saving both the initial and the 
final state? What happens when the app is killed while waiting for the initial 
state to be saved? A final-state save operation will be sent to the store, 
resulting in 2 pending store operations. When the initial store operation 
returns its APP_SAVED event, the state machine will transition to the final 
state. Will the APP_SAVED from the final store operation then arrive and crash 
the state machine?
{code}
+     // Transitions from FINAL_SAVING state
+    .addTransition(RMAppState.FINAL_SAVING,
+      EnumSet.of(RMAppState.FINISHING, RMAppState.FAILED,
+        RMAppState.KILLED, RMAppState.FINISHED), RMAppEventType.APP_SAVED,
+        new RMAppFinalStateSavedTransition())
{code}
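
The sequence asked about above can be sketched as a toy state machine. This is 
only an illustration under assumptions, not the RMAppImpl code: the class 
AppSavedRaceSketch, its reduced state set, the pendingStoreOps counter, and the 
use of IllegalStateException as a stand-in for an invalid-transition error are 
all hypothetical. It shows why a single APP_SAVED event type cannot tell the 
initial save apart from the final save.

```java
// Toy model only: names echo the patch, but this is NOT the real RM state machine.
final class AppSavedRaceSketch {
    enum State { NEW_SAVING, FINAL_SAVING, SUBMITTED, KILLED }
    enum Event { KILL, APP_SAVED }

    private State state = State.NEW_SAVING;
    // The initial-state save is assumed to be in flight when the app is created.
    private int pendingStoreOps = 1;

    State getState() { return state; }
    int getPendingStoreOps() { return pendingStoreOps; }

    void handle(Event event) {
        switch (state) {
            case NEW_SAVING:
                if (event == Event.KILL) {
                    // Final-state save is issued while the initial save is still pending.
                    pendingStoreOps++;
                    state = State.FINAL_SAVING;
                } else {
                    pendingStoreOps--;
                    state = State.SUBMITTED;
                }
                return;
            case FINAL_SAVING:
                if (event == Event.APP_SAVED) {
                    // The event carries no hint of which save it acknowledges,
                    // so the first ack to arrive completes the transition.
                    pendingStoreOps--;
                    state = State.KILLED;
                    return;
                }
                break;
            default:
                break;
        }
        // Stand-in for an invalid state transition error.
        throw new IllegalStateException(
            "Invalid event " + event + " in state " + state);
    }

    public static void main(String[] args) {
        AppSavedRaceSketch app = new AppSavedRaceSketch();
        app.handle(Event.KILL);       // killed while the initial save is pending
        app.handle(Event.APP_SAVED);  // ack of the INITIAL save moves the app to KILLED
        try {
            app.handle(Event.APP_SAVED);  // ack of the FINAL save has no valid transition
        } catch (IllegalStateException e) {
            System.out.println("second APP_SAVED rejected: " + e.getMessage());
        }
    }
}
```

In this sketch, one way out would be to use distinct event types for the two 
saves or to have the terminal states ignore a late APP_SAVED. Whether either 
fits the patch is a question for the author.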

The comment is quite correct. Simply returning the ACCEPTED state may not be 
enough, since the scheduler first has to actually accept the existing app 
attempt before the app can be in the ACCEPTED state.
{code}
+      // For YARN-1210, simply return ACCECPTED state waiting for AM to
+      // reregister or fail. and remove the following code.
+      return new AttemptFailedTransition(RMAppState.SUBMITTED).transition(app,
+        event);
{code}

Unless I am missing something, the following may change the final state from 
the one originally intended, i.e. the one stored as the final state when the 
attempt first moved to the FINAL_SAVING state.
{code}
+        // pass in the earlier AMUnregistered Event also, as this is needed for
+        // AMFinishedFinalStateSavedTransition later on
+        appAttempt.rememberTargetTransitions(event, new AMFinishedAfterFinalSavingTransition(
+            appAttempt.eventCausingFinalSaving), RMAppAttemptState.FINISHED);
{code}

There is no code that ever removes the stored apps, so apps will keep 
accumulating in the store and lead to long recovery times. Will that be done in 
a separate JIRA? If yes, what is the JIRA number? It would be close to a 
blocker for the next release.

> Store completed application information in RM state store
> ---------------------------------------------------------
>
>                 Key: YARN-891
>                 URL: https://issues.apache.org/jira/browse/YARN-891
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Jian He
>         Attachments: YARN-891.1.patch, YARN-891.2.patch, YARN-891.3.patch, 
> YARN-891.4.patch, YARN-891.5.patch, YARN-891.6.patch, YARN-891.patch, 
> YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch
>
>
> Store completed application/attempt info in RMStateStore when 
> application/attempt completes. This solves some problems, such as finished 
> applications getting lost after RM restart, and some other races like YARN-1195.



--
This message was sent by Atlassian JIRA
(v6.1#6144)