[jira] [Commented] (YARN-917) Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM

Jason Lowe (JIRA) Tue, 30 Jul 2013 07:18:29 -0700

    [ https://issues.apache.org/jira/browse/YARN-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723891#comment-13723891 ]


Jason Lowe commented on YARN-917:
---------------------------------

Yes, that's exactly what I was proposing with my first comment.

Originally the staging directory cleanup happened after the unregister, but 
that caused a problem.  Before the FINISHING state was added, the RM would kill 
the AM container as soon as the AM unregistered.  This meant that sometimes the 
AM would be killed by the NM (if the NM happened to heartbeat soon enough) 
before it had a chance to clean up the staging directory, and over time staging 
directories would pile up and fill user quotas.  The initial fix was to move 
the staging directory cleanup from just after unregistering to just before.

Now that there's a FINISHING state that allows the AM some time to clean up 
before it is killed, the AM should have plenty of time after unregistering to 
remove the staging directory before the RM tries to kill it.
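
To make the ordering argument concrete, here is a toy model of the two 
shutdown orders.  It is only a sketch and not Hadoop code: simulate(), Outcome, 
and the flag names are hypothetical, and it assumes that an RM which restarts 
before the AM has unregistered will re-run the job, which then needs the 
staging directory.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    job_failed: bool          # a re-attempt found no staging directory
    staging_dir_leaked: bool  # the staging directory was never removed


CLEANUP_FIRST = ["cleanup", "unregister"]
UNREGISTER_FIRST = ["unregister", "cleanup"]


def simulate(steps, finishing_state, rm_restart_after_first_step=False,
             killed_right_after_unregister=False):
    """Run the AM shutdown steps in order under a toy failure model."""
    staging_dir = True
    unregistered = False
    for index, step in enumerate(steps):
        if step == "cleanup":
            staging_dir = False
        elif step == "unregister":
            unregistered = True
            # Without FINISHING, the AM may be killed before any later step.
            if killed_right_after_unregister and not finishing_state:
                break
        if index == 0 and rm_restart_after_first_step and not unregistered:
            # Assumption: the restarted RM still sees the job as unfinished
            # and re-runs it; the new attempt needs the staging directory.
            return Outcome(job_failed=not staging_dir,
                           staging_dir_leaked=False)
    return Outcome(job_failed=False, staging_dir_leaked=staging_dir)


# Old order, no FINISHING state: the AM is killed and the directory leaks.
print(simulate(UNREGISTER_FIRST, finishing_state=False,
               killed_right_after_unregister=True))
# Current order: an RM restart between the two steps fails the job.
print(simulate(CLEANUP_FIRST, finishing_state=False,
               rm_restart_after_first_step=True))
# Unregister first with FINISHING: neither problem occurs in this model.
print(simulate(UNREGISTER_FIRST, finishing_state=True,
               rm_restart_after_first_step=True,
               killed_right_after_unregister=True))
```

In this model only the last combination, unregistering first while relying on 
the FINISHING state, avoids both the leaked directory and the failed re-run.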
                
> Job can fail when RM restarts after staging dir is cleaned but before MR 
> successfully unregister with RM
> --------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-917
>                 URL: https://issues.apache.org/jira/browse/YARN-917
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
