[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759395#comment-13759395 ]
Jason Lowe commented on YARN-540:
---------------------------------
Ah, if the NM can notify the RM after the RM restarts that the AM container
exited, then that would pretty much fix it. We'd only have an issue if the NM
went down at the same time the RM did. I'm still a bit unclear on the
specifics of how the RM recovers container states in work-preserving
restart, but assuming that upon RM recovery/re-registration the NMs report not
only active containers but also those that have exited since the last
successful heartbeat, then we should be OK.
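
To make that assumption concrete, here is a rough sketch of what I have in mind. All class and method names below (ReRegistrationSketch, NodeManager, buildReport, recover) are made up for illustration and are not the real YARN classes or protocol records. The NM keeps exited containers around until a heartbeat is acknowledged, and its re-registration report covers both running and not-yet-acknowledged exited containers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the real YARN classes or protocol records.
class ReRegistrationSketch {
    enum State { RUNNING, EXITED }
    enum Recovered { RUNNING, EXITED, UNKNOWN }

    // NM side: tracks containers and which exits the RM has not yet acknowledged.
    static class NodeManager {
        private final Map<String, State> containers = new LinkedHashMap<>();
        private final List<String> unackedExits = new ArrayList<>();

        void start(String id) { containers.put(id, State.RUNNING); }

        void exit(String id) {
            containers.put(id, State.EXITED);
            unackedExits.add(id);
        }

        // A heartbeat succeeded: the RM has seen these exits, so forget them.
        void heartbeatAcked() {
            containers.keySet().removeAll(unackedExits);
            unackedExits.clear();
        }

        // Report sent on re-registration: active containers plus unacked exits.
        Map<String, State> buildReport() {
            return new LinkedHashMap<>(containers);
        }
    }

    // RM side: what the recovering RM can conclude about one container.
    static Recovered recover(String containerId, Map<String, State> report) {
        State s = report.get(containerId);
        if (s == null) {
            return Recovered.UNKNOWN; // e.g. the NM went down together with the RM
        }
        return s == State.EXITED ? Recovered.EXITED : Recovered.RUNNING;
    }

    public static void main(String[] args) {
        NodeManager nm = new NodeManager();
        nm.start("am_container");
        nm.start("task_container");
        nm.exit("am_container"); // AM exits while the RM is down; nothing acked
        Map<String, State> report = nm.buildReport();
        System.out.println("am_container: " + recover("am_container", report));
        System.out.println("task_container: " + recover("task_container", report));
    }
}
```

The UNKNOWN branch is the remaining gap: if the NM is lost along with the RM, nobody is left to report that the AM container exited.
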
> Race condition causing RM to potentially relaunch already unregistered AMs on
> RM restart
> ----------------------------------------------------------------------------------------
>
> Key: YARN-540
> URL: https://issues.apache.org/jira/browse/YARN-540
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
> Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch,
> YARN-540.patch, YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, the RM may
> shut down and restart, with the dispatcher stopped before it can process the
> REMOVE_APP event. The next time the RM comes back, it will reload the existing
> state files even though the job has succeeded.
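
The race in the description can be reproduced in a toy model. This is only a sketch under simplifying assumptions: the names (RemoveAppRaceSketch, ResourceManager, drainDispatcher, stateStore) are hypothetical and do not mirror the real RM classes, and the state store is a plain in-memory set standing in for the persisted state files.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of the race; names do not mirror the real RM classes.
class RemoveAppRaceSketch {
    // Stands in for the persisted RM state files; survives an RM restart.
    static final Set<String> stateStore = new HashSet<>();

    static class ResourceManager {
        private final Queue<String> pendingRemovals = new ArrayDeque<>();
        private final Set<String> liveApps = new HashSet<>();

        // On start, reload every application still present in the state store.
        ResourceManager() { liveApps.addAll(stateStore); }

        void submit(String appId) {
            stateStore.add(appId);
            liveApps.add(appId);
        }

        // The AM unregisters successfully; removal from the store is only queued.
        void finishApplicationMaster(String appId) { pendingRemovals.add(appId); }

        // Dispatcher body: process the queued REMOVE_APP events.
        void drainDispatcher() {
            String appId;
            while ((appId = pendingRemovals.poll()) != null) {
                stateStore.remove(appId);
                liveApps.remove(appId);
            }
        }

        boolean knows(String appId) { return liveApps.contains(appId); }
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.submit("app_1");
        rm.finishApplicationMaster("app_1");
        // The RM stops here without draining the dispatcher, so the event is lost.
        ResourceManager restarted = new ResourceManager();
        System.out.println("reloaded after restart: " + restarted.knows("app_1"));
    }
}
```

If drainDispatcher() runs before the restart, the application is gone from the store and the new RM instance no longer sees it.
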