[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749734#comment-13749734
]
Jian He commented on YARN-540:
------------------------------
bq. What will happen if the RM failed after deleting the app from the store but
before the app pulled that information from the RM?
The app will not fail, because RM unregister ignores any exceptions coming from
finishApp(). JobClient can also get the final status of the app regardless of
whether finishApp() fails.
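To illustrate the point above, here is a minimal sketch of a best-effort unregister, assuming the final status is recorded before finishApp() is attempted. The class `BestEffortUnregisterSketch` and its methods are hypothetical names for illustration only; they are not taken from the patch or from the YARN API.

```java
/** Hypothetical sketch; these names are not taken from the YARN-540 patch. */
final class BestEffortUnregisterSketch {

    /** Final status a client could query; assumed to be recorded before unregistering. */
    private String finalStatus = "UNDEFINED";

    String finalStatus() {
        return finalStatus;
    }

    /**
     * Records the final status, then attempts finishApp. Any runtime exception
     * from finishApp is logged and ignored so that the app itself does not fail.
     * Returns true if finishApp completed, false if it threw.
     */
    boolean unregister(Runnable finishApp) {
        finalStatus = "SUCCEEDED";
        try {
            finishApp.run();
            return true;
        } catch (RuntimeException e) {
            System.err.println("Ignoring finishApp() failure: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        BestEffortUnregisterSketch am = new BestEffortUnregisterSketch();
        boolean ok = am.unregister(() -> {
            throw new IllegalStateException("RM unavailable");
        });
        System.out.println("finishApp ok=" + ok + ", finalStatus=" + am.finalStatus());
    }
}
```

In this sketch a failing finishApp() only changes the boolean result; the recorded final status is unaffected.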
bq. The state transitions are asynchronous. We cannot expect to always find the
app in the FINISHING state.
Given the currently implemented state transitions, FINISHING is the only state
after the unregister call in which we can reliably say the app has been removed
from the state store. Tell me if I missed something.
bq. Can the application finish on the RM (in between 2 finishApp() requests)
such that it never gets a true response?
The application will not go to the FINISHED state unless the AM process exits
or the AM expires. So I think it can reliably get the true response as long as
the RM is available.
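The retry behaviour described above can be sketched roughly as follows. This is a toy model under stated assumptions: `FakeRm`, `AppState`, and `unregisterUntilAcknowledged` are hypothetical names, and the fake RM advances one state per call, whereas the real transitions are driven asynchronously by state store events.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from the YARN-540 patch. */
final class UnregisterRetrySketch {

    /** Simplified stand-in for the RM-side application states. */
    enum AppState { RUNNING, REMOVING, FINISHING, FINISHED }

    /** Toy RM: each finishApp() call advances one step toward FINISHING. */
    static final class FakeRm {
        AppState state = AppState.RUNNING;

        /** Returns true only once the app is in FINISHING (store entry removed). */
        boolean finishApp() {
            if (state == AppState.RUNNING) {
                state = AppState.REMOVING;   // removal from the store requested
            } else if (state == AppState.REMOVING) {
                state = AppState.FINISHING;  // removal assumed complete
            }
            return state == AppState.FINISHING;
        }
    }

    /**
     * Calls finishApp until it returns true.
     * Returns the number of attempts used, or -1 if the attempt budget ran out
     * or the thread was interrupted while waiting.
     */
    static int unregisterUntilAcknowledged(BooleanSupplier finishApp,
                                           int maxAttempts,
                                           long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (finishApp.getAsBoolean()) {
                return attempt;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        FakeRm rm = new FakeRm();
        int attempts = unregisterUntilAcknowledged(rm::finishApp, 10, 10L);
        System.out.println("acknowledged after " + attempts
                + " attempt(s), state=" + rm.state);
    }
}
```

The AM keeps its process alive while polling, so in this model the app cannot reach FINISHED before a true response is seen, provided the RM stays available.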
bq. Is this possible to avoid 2 round trips to store?
Are you asking whether the following code can handle duplicate APP_REMOVE
events?
bq. There is no need for multiple code paths/transitions.
I did in fact notice this while writing the patch; the intention was to avoid
an unnecessary trip to RMStateStore. Thoughts?
Agree with other comments, will post a new patch soon.
> Race condition causing RM to potentially relaunch already unregistered AMs on
> RM restart
> ----------------------------------------------------------------------------------------
>
> Key: YARN-540
> URL: https://issues.apache.org/jira/browse/YARN-540
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
> Priority: Blocker
> Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch,
> YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, and the RM
> then shuts down and restarts, the dispatcher is stopped before it can process
> the REMOVE_APP event. The next time the RM comes back, it reloads the existing
> state files even though the job has succeeded.