[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758037#comment-13758037
 ] 

Bikas Saha commented on YARN-540:
---------------------------------

1) and 2) are basically the same thing. 1) blocks the unregister call until 
the store operation completes. 2) requires the AM to keep retrying unregister 
until it succeeds. 2) simply lets the RM perform the store operation 
asynchronously, which keeps RPC threads from being blocked.
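
To make option 2) concrete, here is a minimal sketch of an AM-side retry loop 
against a non-blocking unregister call. ToyResourceManager, 
unregister_until_done and the "store_latency_calls" knob are hypothetical names 
made up for illustration; this is not the YARN API and not what any of the 
attached patches does.

```python
import time


class ToyResourceManager:
    """Hypothetical stand-in for the RM; not the real YARN interface."""

    def __init__(self, store_latency_calls=3):
        self.store = {"app_1"}            # apps currently persisted in the store
        self._pending = {}                # app_id -> polls left until removal lands
        self._latency = store_latency_calls

    def unregister(self, app_id):
        """Return immediately; True only once the app is gone from the store."""
        if app_id not in self.store:
            return True
        # The first call schedules the (simulated) asynchronous removal;
        # each later call just checks whether it has completed yet.
        remaining = self._pending.setdefault(app_id, self._latency)
        if remaining <= 1:
            self.store.discard(app_id)
            del self._pending[app_id]
            return True
        self._pending[app_id] = remaining - 1
        return False


def unregister_until_done(rm, app_id, poll_interval=0.0, max_attempts=100):
    """AM side: keep calling unregister until the RM confirms the removal."""
    for attempt in range(1, max_attempts + 1):
        if rm.unregister(app_id):
            return attempt
        time.sleep(poll_interval)
    raise TimeoutError("RM never confirmed unregistration of " + app_id)


if __name__ == "__main__":
    rm = ToyResourceManager(store_latency_calls=3)
    attempts = unregister_until_done(rm, "app_1")
    print("unregistered after", attempts, "attempts")
```

The RM-side call never waits on the store, so no RPC thread is held; the cost 
is that the AM has to poll.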
The core issue is that the RM can crash before removing the app from the store. 
When it restarts, it therefore believes the app is still running and tries to 
re-launch it. This is the core issue in this jira, and it should be a rare event.
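
The race can be reproduced in a toy model. The sketch below assumes a simple 
in-memory store and an event queue; ToyStateStore, ToyRM and 
apps_relaunched_after_restart are hypothetical names for illustration only, 
and only the REMOVE_APP event name comes from this jira.

```python
class ToyStateStore:
    """Hypothetical persistent store; it survives an RM 'crash'."""

    def __init__(self):
        self.apps = set()


class ToyRM:
    """Hypothetical RM that removes apps from the store asynchronously."""

    def __init__(self, store):
        self.store = store
        self.pending_events = []      # in-memory only, lost on crash

    def submit(self, app_id):
        self.store.apps.add(app_id)

    def finish_application_master(self, app_id):
        # The AM is told it is done, but the removal is only queued.
        self.pending_events.append(("REMOVE_APP", app_id))

    def drain_events(self):
        while self.pending_events:
            event, app_id = self.pending_events.pop(0)
            if event == "REMOVE_APP":
                self.store.apps.discard(app_id)

    def recover(self):
        # On restart, every app still in the store is treated as running.
        return sorted(self.store.apps)


def apps_relaunched_after_restart(crash_before_drain):
    store = ToyStateStore()
    rm = ToyRM(store)
    rm.submit("app_1")
    rm.finish_application_master("app_1")
    if not crash_before_drain:
        rm.drain_events()             # removal reaches the store in time
    restarted = ToyRM(store)          # new RM instance, same store
    return restarted.recover()


if __name__ == "__main__":
    print(apps_relaunched_after_restart(crash_before_drain=True))
    print(apps_relaunched_after_restart(crash_before_drain=False))
```

With the crash placed before the queued REMOVE_APP event is processed, the 
restarted RM still finds app_1 in the store and would re-launch it.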
The MR app master sleeps for 5s before unregistering with the RM and, in the 
meantime, reports success to the client. This widens the window for the rare 
issue above and makes it easier to reproduce.
> Race condition causing RM to potentially relaunch already unregistered AMs on 
> RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.patch, 
> YARN-540.patch
>
>
> When a job succeeds and successfully calls finishApplicationMaster, and the 
> RM is then shut down and restarted, the dispatcher is stopped before it can 
> process the REMOVE_APP event. The next time the RM comes back up, it reloads 
> the existing state files even though the job has succeeded.
