[jira] [Commented] (YARN-540) RM state store not cleaned if job succeeds but RM shutdown and restart-dispatcher stopped before it can process REMOVE_APP event

Bikas Saha (JIRA) Thu, 04 Apr 2013 12:20:16 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622681#comment-13622681
 ]


Bikas Saha commented on YARN-540:
---------------------------------

This is a known issue. The problem here is that the RM state store is 
essentially a write-ahead log. But in the application unregister/finish case, 
the application has already finished before the RM stores that fact in its 
state, so the RM by itself cannot avoid this problem. Since it's a race 
condition, we may choose not to fix it unless we see this happen often in 
practice.
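
To make the race concrete, here is a toy simulation in Python. It is not YARN 
code: ToyStateStore, ToyRM, submit() and drain() are hypothetical names made up 
for illustration, and only finishApplicationMaster and REMOVE_APP come from this 
issue. It assumes the store removal is merely queued when the app reports 
finish, which is the behaviour described above.

```python
from collections import deque


class ToyStateStore:
    """Hypothetical stand-in for the persistent RM state store."""

    def __init__(self):
        self.apps = set()


class ToyRM:
    """Hypothetical RM that queues store removals on a dispatcher."""

    def __init__(self, store):
        self.store = store
        self.events = deque()             # events not yet processed
        self.recovered = set(store.apps)  # apps reloaded at startup

    def submit(self, app_id):
        # Write-ahead: the app is persisted before it runs.
        self.store.apps.add(app_id)

    def finish_application_master(self, app_id):
        # Returns right away; the removal is only queued.
        self.events.append(("REMOVE_APP", app_id))

    def drain(self):
        # What the dispatcher would do if it kept running.
        while self.events:
            kind, app_id = self.events.popleft()
            if kind == "REMOVE_APP":
                self.store.apps.discard(app_id)


store = ToyStateStore()
rm = ToyRM(store)
rm.submit("app_1")
rm.finish_application_master("app_1")  # the app exits, believing it is done

# The RM shuts down here, so rm.drain() is never called.
restarted = ToyRM(store)
print(restarted.recovered)  # {'app_1'}
```

The restarted RM reloads app_1 even though it finished, because the REMOVE_APP 
event was still sitting in the queue when the first RM stopped.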
The solutions that come to mind are:
1) finishApplicationMaster() blocks until the finish is stored in the store. 
This risks blocking on a slow or unavailable store. Also, the RM does a bunch 
of other things before an application finishes, so the RM may not be able to 
remove the application from the store until all those steps are complete.
2) finishApplicationMaster() becomes a 2-step process in which, in the second 
step, the app waits for the RM to change the app's state to "FINISHED" before 
exiting.
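
A rough sketch of what option 2 could look like, again as a toy Python 
simulation rather than YARN code. TwoStepRM, request_finish(), is_finished(), 
process_events() and finish_two_step() are hypothetical names, and the bounded 
poll count is my own assumption about how the app might avoid waiting forever 
on a slow store.

```python
class TwoStepRM:
    """Hypothetical RM exposing a 2-step finish protocol."""

    def __init__(self):
        self.store = set()     # apps persisted in the state store
        self.pending = []      # finish requests not yet processed
        self.finished = set()  # apps whose state is "FINISHED"

    def submit(self, app_id):
        self.store.add(app_id)

    def request_finish(self, app_id):
        # Step 1: the app announces it is done; nothing is removed yet.
        self.pending.append(app_id)

    def process_events(self):
        # Stands in for the RM dispatcher handling queued events.
        for app_id in self.pending:
            self.store.discard(app_id)
            self.finished.add(app_id)
        self.pending.clear()

    def is_finished(self, app_id):
        # Step 2: the app polls this before exiting.
        return app_id in self.finished


def finish_two_step(rm, app_id, max_polls=10):
    """Return True once the RM has recorded the finish, else False."""
    rm.request_finish(app_id)
    for _ in range(max_polls):
        if rm.is_finished(app_id):
            return True
        # In a real system the RM would process events on its own thread;
        # here the call is made inline to keep the sketch single-threaded.
        rm.process_events()
    return False


rm = TwoStepRM()
rm.submit("app_1")
done = finish_two_step(rm, "app_1")
print(done, rm.store)  # True set()
```

The app only exits after the RM has removed it from the store, so an RM 
restart after that point has nothing stale to reload. If the RM goes away 
during the wait, the app is still alive to retry, which is the point of the 
second step.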
                
> RM state store not cleaned if job succeeds but RM shutdown and 
> restart-dispatcher stopped before it can process REMOVE_APP event
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> When a job succeeds and successfully calls finishApplicationMaster, RM shutdown 
> and restart-dispatcher is stopped before it can process the REMOVE_APP event. The 
> next time the RM comes back, it will reload the existing state files even though 
> the job has succeeded.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
