[
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681
]
Bikas Saha commented on YARN-540:
---------------------------------
This is a known issue. The problem here is that the RM state store is
essentially a write-ahead log. But in the application unregister/finish case,
the application has already finished before the RM stores that fact in its
state. So the RM by itself cannot avoid this problem. Since it's a race
condition, we may choose not to fix it unless we see this happen often in
practice.
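The race can be illustrated with a toy model. The sketch below is not YARN
code: StateStoreRaceSketch, Store, ToyRm and drainDispatcher are hypothetical
names, and the dispatcher is reduced to a plain in-memory queue. It only
assumes that the removal from the store is queued rather than applied when
the application unregisters.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the race; all names here are hypothetical, not YARN classes. */
class StateStoreRaceSketch {
    /** Stands in for the persistent RM state store. */
    static final class Store {
        final Set<String> apps = new HashSet<>();
    }

    /** Stands in for an RM whose store updates are applied asynchronously. */
    static final class ToyRm {
        final Store store;
        final Queue<String> pendingRemovals = new ArrayDeque<>();

        ToyRm(Store store) {
            this.store = store;
        }

        void submit(String appId) {
            store.apps.add(appId);
        }

        // The app sees its unregister succeed; the removal is only queued.
        void finishApplicationMaster(String appId) {
            pendingRemovals.add(appId);
        }

        // Applies queued removals, as the dispatcher would if it kept running.
        void drainDispatcher() {
            String appId;
            while ((appId = pendingRemovals.poll()) != null) {
                store.apps.remove(appId);
            }
        }
    }

    /** Returns the apps a restarted RM would reload from the store. */
    static Set<String> recoverAfterStop(boolean dispatcherRanBeforeStop) {
        Store store = new Store();
        ToyRm rm = new ToyRm(store);
        rm.submit("app_1");
        rm.finishApplicationMaster("app_1");
        if (dispatcherRanBeforeStop) {
            rm.drainDispatcher();
        }
        // The RM stops here: queued events are lost, only the store survives.
        ToyRm restarted = new ToyRm(store);
        return new HashSet<>(restarted.store.apps);
    }

    public static void main(String[] args) {
        System.out.println("dispatcher ran:     " + recoverAfterStop(true));
        System.out.println("dispatcher stopped: " + recoverAfterStop(false));
    }
}
```

In the second case the restarted RM still finds app_1 in the store although
the application has already exited, which is the behaviour reported here.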
The solutions that come to mind are:
1) finishApplicationMaster() blocks until the finish is stored in the store.
This has issues of getting blocked on a slow/unavailable store. Also, the RM
does a bunch of other things before an application finishes. The RM may not be
able to remove the application from the store until all those steps are
complete.
2) finishApplicationMaster() becomes a 2-step process in which, in the second
step, the app waits for the RM to change the app's state to "FINISHED" before
exiting.
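Option 2 could look roughly like the following sketch. It is a toy model
under stated assumptions, not the YARN API: TwoStepFinishSketch, ToyRm,
requestFinish, getState and processOneEvent are hypothetical names, and the
RM's progress between polls is simulated inline instead of running on its
own thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of a two-step unregister; names are hypothetical, not YARN APIs. */
class TwoStepFinishSketch {
    enum AppState { RUNNING, FINISHING, FINISHED }

    static final class ToyRm {
        final Map<String, AppState> states = new HashMap<>();
        final Queue<String> pendingStoreUpdates = new ArrayDeque<>();

        void register(String appId) {
            states.put(appId, AppState.RUNNING);
        }

        // Step 1: note the request and return without waiting on the store.
        void requestFinish(String appId) {
            states.put(appId, AppState.FINISHING);
            pendingStoreUpdates.add(appId);
        }

        AppState getState(String appId) {
            return states.get(appId);
        }

        // Simulates the RM completing one store update in the background.
        void processOneEvent() {
            String appId = pendingStoreUpdates.poll();
            if (appId != null) {
                states.put(appId, AppState.FINISHED);
            }
        }
    }

    /**
     * Step 2: the app polls until the RM reports FINISHED.
     * Returns the number of polls used, or -1 if maxPolls was exhausted.
     */
    static int unregisterAndWait(ToyRm rm, String appId, int maxPolls) {
        rm.requestFinish(appId);
        for (int polls = 1; polls <= maxPolls; polls++) {
            if (rm.getState(appId) == AppState.FINISHED) {
                return polls;
            }
            // A real app would sleep here; the toy RM makes progress instead.
            rm.processOneEvent();
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyRm rm = new ToyRm();
        rm.register("app_1");
        int polls = unregisterAndWait(rm, "app_1", 5);
        System.out.println("polls=" + polls + " state=" + rm.getState("app_1"));
    }
}
```

The bounded poll count reflects the concern raised under option 1: the app
should not wait forever if the store is slow or unavailable, so it needs some
limit after which it gives up and exits anyway.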
> RM state store not cleaned if job succeeds but RM shutdown and
> restart-dispatcher stopped before it can process REMOVE_APP event
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-540
> URL: https://issues.apache.org/jira/browse/YARN-540
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
>
> When a job succeeds and successfully calls finishApplicationMaster, the RM
> shuts down and restarts, and the dispatcher is stopped before it can process
> the REMOVE_APP event. The next time the RM comes back, it reloads the
> existing state files even though the job has succeeded.