[ https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759412#comment-13759412 ]
Jason Lowe commented on YARN-540:
---------------------------------

bq. Then RM will also need to somehow remember that unregister came in but the state-store app removal isn't done. Which is not possible without more state-store writes?

Argh, right, I forgot. It will simply see the container exit but not understand the context of that exit, and misinterpret it as a crash-and-recover scenario. Darn, I thought we had it. :-)

I think the existing unregister call should be blocking from the AM's perspective, as that's the simplest and most compatible way to fix it. We could always add an asynchronous form of that API later. If most AMs are expected to communicate through a wrapper layer where we can hide this behavior then that's probably fine too -- RM and low-level API could be async but most AMs still see it as a blocking call.

Part of the issue with making it async is that at some point we need to have some flow control. If apps are churning faster than we can persist them then there are going to be issues (backup of store dispatcher queue, etc.). At some point we have to block something.

> Race condition causing RM to potentially relaunch already unregistered AMs on RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, YARN-540.patch, YARN-540.patch
>
>
> When job succeeds and successfully call finishApplicationMaster, RM shutdown and restart-dispatcher is stopped before it can process REMOVE_APP event. The next time RM comes back, it will reload the existing state files even though the job is succeeded
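
To make the blocking-unregister and flow-control points from the comment concrete, here is a minimal sketch. It is not the YARN implementation or its API: `SketchStateStore`, `SketchUnregister`, `removeAppAsync`, and `unregisterBlocking` are hypothetical names, and the "state store" is just an in-memory set. It only assumes a single dispatcher thread draining a bounded queue of removal requests, with the caller waiting until its removal has been applied.

```java
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the RM state store; NOT the real YARN state-store API.
class SketchStateStore {
    // One pending removal plus a future that completes once it has been applied.
    private static final class Removal {
        final String appId;
        final CompletableFuture<Void> done = new CompletableFuture<>();
        Removal(String appId) { this.appId = appId; }
    }

    private final Set<String> persistedApps = ConcurrentHashMap.newKeySet();
    // Bounded queue: put() blocks producers when the dispatcher falls behind.
    private final BlockingQueue<Removal> queue;

    SketchStateStore(int queueCapacity) {
        queue = new ArrayBlockingQueue<>(queueCapacity);
        Thread dispatcher = new Thread(this::drain, "store-dispatcher");
        dispatcher.setDaemon(true); // do not keep the JVM alive for the sketch
        dispatcher.start();
    }

    void storeApp(String appId) { persistedApps.add(appId); }

    boolean isPersisted(String appId) { return persistedApps.contains(appId); }

    // Asynchronous form: the returned future completes after the removal is applied.
    CompletableFuture<Void> removeAppAsync(String appId) throws InterruptedException {
        Removal r = new Removal(appId);
        queue.put(r); // flow control: blocks while the queue is full
        return r.done;
    }

    // Dispatcher loop: applies removals one at a time, in arrival order.
    private void drain() {
        try {
            while (true) {
                Removal r = queue.take();
                persistedApps.remove(r.appId); // simulated state-store write
                r.done.complete(null);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Hypothetical wrapper: what a blocking unregister could look like to an AM.
class SketchUnregister {
    static void unregisterBlocking(SketchStateStore store, String appId, long timeoutMs) {
        try {
            // Return only after the store has confirmed the app was removed.
            store.removeAppAsync(appId).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while unregistering " + appId, e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("unregister of " + appId + " did not complete", e);
        }
    }
}
```

In this sketch the low-level call stays asynchronous and the wrapper hides that from the caller, which mirrors the "wrapper layer" option in the comment. The bounded queue is one possible place to "block something" when apps churn faster than they can be persisted.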