[jira] [Commented] (YARN-540) Race condition causing RM to potentially relaunch already unregistered AMs on RM restart

Jason Lowe (JIRA) Thu, 05 Sep 2013 13:47:35 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759412#comment-13759412
 ]


Jason Lowe commented on YARN-540:
---------------------------------

bq.  Then RM will also need to somehow remember that unregister came in but the 
state-store app removal isn't done. Which is not possible without more 
state-store writes?

Argh, right, I forgot.  The RM will simply see the container exit but not 
understand the context of that exit, and will misinterpret it as a 
crash-and-recover scenario.  Darn, I thought we had it.  :-)

I think the existing unregister call should be blocking from the AM's 
perspective, as that's the simplest and most compatible way to fix it.  We 
could always add an asynchronous form of that API later.  If most AMs are 
expected to communicate through a wrapper layer where we can hide this behavior 
then that's probably fine too -- the RM and low-level API could be async but 
most AMs would still see it as a blocking call.
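
To make the wrapper-layer idea concrete, here is a rough sketch of how a 
client library could hide an async low-level call behind a blocking one.  All 
of the names here (LowLevelRmClient, tryUnregister, unregisterBlocking) are 
made up for illustration and are not existing YARN APIs; the assumption is 
that the low-level call reports whether the RM has finished removing the app 
from the state store.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; none of these names are real YARN APIs. */
class BlockingUnregisterSketch {

    /** Hypothetical non-blocking call: true once the RM confirms the app is removed. */
    interface LowLevelRmClient {
        boolean tryUnregister();
    }

    /**
     * Blocking view for the AM: poll the low-level call until the RM confirms.
     * Returns false if the timeout expires or the thread is interrupted.
     */
    static boolean unregisterBlocking(LowLevelRmClient client,
                                      long pollMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!client.tryUnregister()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // RM never confirmed; caller decides what to do
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }

    /** Fake RM for experimentation: confirms on the n-th request. */
    static LowLevelRmClient fakeRm(final int confirmOnCall, final AtomicInteger calls) {
        return () -> calls.incrementAndGet() >= confirmOnCall;
    }
}
```

With something like this, the AM does not exit until the removal is 
confirmed, so a container exit seen after a restart is less likely to be 
mistaken for a crash.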

Part of the issue with making it async is that at some point we need some flow 
control.  If apps are churning faster than we can persist them then there are 
going to be issues (a backed-up store dispatcher queue, etc.).  At some point 
we have to block something.
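
As a rough illustration of the flow-control point, a bounded queue in front 
of the store writer is one way the blocking could show up.  This is only a 
sketch under assumed names (BoundedStoreQueueSketch, enqueueRemoval, 
completeOne are hypothetical, not the RM's actual store dispatcher); it uses 
the standard java.util.concurrent.ArrayBlockingQueue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not the RM's real state-store dispatcher. */
class BoundedStoreQueueSketch {
    // Pending app removals that have not been persisted yet.
    private final BlockingQueue<String> pending;

    BoundedStoreQueueSketch(int capacity) {
        pending = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Unregister path: waits up to waitMillis for room in the queue.
     * Returns false if the store is still backed up after the wait.
     */
    boolean enqueueRemoval(String appId, long waitMillis) {
        try {
            return pending.offer(appId, waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    /** Store-writer side: take the oldest pending removal, or null if none. */
    String completeOne() {
        return pending.poll();
    }

    int backlog() {
        return pending.size();
    }
}
```

Once the queue is full, callers wait (or are turned away) until the writer 
catches up, which is the "block something" part.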
                
> Race condition causing RM to potentially relaunch already unregistered AMs on 
> RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
> YARN-540.patch, YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, the RM 
> may shut down and restart, with the dispatcher stopped before it can process 
> the REMOVE_APP event. The next time the RM comes back, it will reload the 
> existing state files even though the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
