[
https://issues.apache.org/jira/browse/YARN-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13803984#comment-13803984
]
Bikas Saha commented on YARN-891:
---------------------------------
Is the APP_SAVED event being overloaded for saving both the initial and the
final state? What happens when the app is killed while waiting for the initial
state to be saved? A final state save operation will be sent to the store,
resulting in 2 pending store operations. When the initial store operation
returns its APP_SAVED event, the state machine will transition to the final
state. Won't the APP_SAVED event from the final store operation then arrive
and crash the state machine?
{code}
+ // Transitions from FINAL_SAVING state
+ .addTransition(RMAppState.FINAL_SAVING,
+ EnumSet.of(RMAppState.FINISHING, RMAppState.FAILED,
+ RMAppState.KILLED, RMAppState.FINISHED), RMAppEventType.APP_SAVED,
+ new RMAppFinalStateSavedTransition())
{code}
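To make the concern concrete, here is a minimal toy model of the race. It is not the real RMAppImpl state machine: the class, the reduced set of states, and the INITIAL_SAVED/FINAL_SAVED event names are hypothetical and only illustrate one possible way to tell the two acknowledgements apart.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the double-APP_SAVED race; not the real RMAppImpl. */
public class SavedEventRace {
  enum State { NEW_SAVING, FINAL_SAVING, KILLED }

  // INITIAL_SAVED and FINAL_SAVED are hypothetical names for split events.
  enum Event { APP_SAVED, INITIAL_SAVED, FINAL_SAVED }

  /** Replays the store acknowledgements after a kill arrives in NEW_SAVING. */
  static State run(boolean splitEvents) {
    State state = State.NEW_SAVING;
    Queue<Event> acks = new ArrayDeque<>();
    // The initial save is still pending in the store.
    acks.add(splitEvents ? Event.INITIAL_SAVED : Event.APP_SAVED);
    // The kill arrives: queue the final save and wait in FINAL_SAVING.
    state = State.FINAL_SAVING;
    acks.add(splitEvents ? Event.FINAL_SAVED : Event.APP_SAVED);

    for (Event ack : acks) {
      state = transition(state, ack);
    }
    return state;
  }

  static State transition(State state, Event event) {
    switch (state) {
      case FINAL_SAVING:
        if (event == Event.INITIAL_SAVED) {
          return state; // stale ack from the initial save: ignore it
        }
        return State.KILLED; // APP_SAVED or FINAL_SAVED completes the save
      default:
        // No transition is registered for any event in the other states.
        throw new IllegalStateException(
            "Invalid event " + event + " at " + state);
    }
  }

  public static void main(String[] args) {
    try {
      run(false);
    } catch (IllegalStateException e) {
      System.out.println("overloaded: " + e.getMessage());
    }
    System.out.println("split: " + run(true));
  }
}
```

With the overloaded event, the first acknowledgement moves the toy machine to KILLED and the second one hits a state with no matching transition. With separate events, the stale acknowledgement can be recognized and ignored.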
The comment is not quite correct. Simply returning the ACCEPTED state may not
be enough since the scheduler has to first actually accept the existing app
attempt before the app can be in the ACCEPTED state.
{code}
+ // For YARN-1210, simply return ACCECPTED state waiting for AM to
+ // reregister or fail. and remove the following code.
+ return new AttemptFailedTransition(RMAppState.SUBMITTED).transition(app,
+ event);
{code}
Unless I am missing something, it looks like the following may change the final
state from what was originally intended (and stored) as the final state when
the attempt first moved to the FINAL_SAVING state.
{code}
+ // pass in the earlier AMUnregistered Event also, as this is needed for
+ // AMFinishedFinalStateSavedTransition later on
+ appAttempt.rememberTargetTransitions(event,
+ new AMFinishedAfterFinalSavingTransition(
+ appAttempt.eventCausingFinalSaving), RMAppAttemptState.FINISHED);
{code}
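One possible guard, sketched as a toy model: fix the target state once on entering FINAL_SAVING and refuse later attempts to change it. The class and method names here are hypothetical and are not taken from the patch or from RMAppAttemptImpl.

```java
/** Toy model of an attempt that fixes its target on entering FINAL_SAVING. */
public class FinalSavingTarget {
  enum State { RUNNING, FINAL_SAVING, FINISHED, FAILED }

  private State current = State.RUNNING;
  private State target; // the state that was written to the store

  /** First caller wins; later events cannot change the stored target. */
  boolean rememberTarget(State requested) {
    if (current == State.FINAL_SAVING) {
      return false; // a final state is already being saved
    }
    current = State.FINAL_SAVING;
    target = requested;
    return true;
  }

  /** The store acknowledged the save: move to the remembered target. */
  State onSaved() {
    if (current != State.FINAL_SAVING) {
      throw new IllegalStateException("No final save pending at " + current);
    }
    current = target;
    return current;
  }

  public static void main(String[] args) {
    FinalSavingTarget attempt = new FinalSavingTarget();
    attempt.rememberTarget(State.FAILED);   // FAILED is sent to the store
    attempt.rememberTarget(State.FINISHED); // late unregister: rejected
    System.out.println(attempt.onSaved());  // prints FAILED
  }
}
```

The point of the sketch is only that the in-memory final state and the stored final state stay the same once the save has been issued.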
There is no code that ever removes the stored apps. So apps will keep
accumulating in the store and lead to long recovery times. Will it be done in a
separate JIRA? If yes, what is the JIRA number? It would be close to a blocker
for the next release.
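As a rough illustration of the kind of cleanup meant here, the sketch below caps the number of completed apps kept in a toy in-memory store and evicts the oldest ones. The class, its methods, and the retention limit are hypothetical; this is not the RMStateStore API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory store that caps how many completed apps it retains. */
public class BoundedCompletedAppStore {
  private final int maxCompleted; // hypothetical retention limit
  private final Map<String, String> finalStates = new HashMap<>();
  private final Deque<String> completionOrder = new ArrayDeque<>();

  public BoundedCompletedAppStore(int maxCompleted) {
    this.maxCompleted = maxCompleted;
  }

  /** Records an app's final state, evicting the oldest apps over the cap. */
  public void storeCompleted(String appId, String finalState) {
    if (finalStates.put(appId, finalState) == null) {
      completionOrder.addLast(appId); // first time this app is stored
    }
    while (completionOrder.size() > maxCompleted) {
      finalStates.remove(completionOrder.removeFirst());
    }
  }

  public int size() {
    return finalStates.size();
  }

  public boolean contains(String appId) {
    return finalStates.containsKey(appId);
  }

  public static void main(String[] args) {
    BoundedCompletedAppStore store = new BoundedCompletedAppStore(2);
    store.storeCompleted("app_1", "FINISHED");
    store.storeCompleted("app_2", "KILLED");
    store.storeCompleted("app_3", "FAILED");
    // app_1 is the oldest completed app, so it has been evicted.
    System.out.println(store.size() + " " + store.contains("app_1")); // 2 false
  }
}
```

Bounding the store this way keeps the number of apps to replay on recovery constant instead of growing with cluster uptime.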
> Store completed application information in RM state store
> ---------------------------------------------------------
>
> Key: YARN-891
> URL: https://issues.apache.org/jira/browse/YARN-891
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Assignee: Jian He
> Attachments: YARN-891.1.patch, YARN-891.2.patch, YARN-891.3.patch,
> YARN-891.4.patch, YARN-891.5.patch, YARN-891.6.patch, YARN-891.patch,
> YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch, YARN-891.patch
>
>
> Store completed application/attempt info in RMStateStore when an
> application/attempt completes. This solves problems such as finished
> applications getting lost after RM restart, and some other races like YARN-1195.