[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605953#comment-13605953
 ] 

Jason Lowe commented on YARN-472:
---------------------------------

Another cause for the AM to receive a reboot command from the RM is a 
split-brain situation where the RM has expired the AM (e.g., due to a network 
cut) but the AM has not killed itself (e.g., because it is thrashing in garbage 
collection).  If this is the case and the AM isn't the last attempt, it needs 
to get out of the way and not do any damage (e.g., not try to commit or create 
history) because the RM could have already started another attempt.

Attempting to unregister is likely a fruitless effort since the RM has 
basically said via the reboot directive it has no idea what this AM is trying 
to do.  If it does succeed in unregistering then that would prevent further app 
attempts from launching as Bikas noted, and that's not desirable.

I agree that adding a new state seems unnecessary.  I've always interpreted the 
reboot directive to indicate the AM is in a bad state and needs to get out, 
fast.  As such, I'd rather keep this simple.  If the attempt isn't the last, 
have the AM log the reception of the reboot and crash without doing any 
filesystem damage.  If it is the last attempt then we can do something like we 
do today, e.g., clean up staging and generate history with an error status.
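
To make the proposal above concrete, here is a minimal sketch of the decision, 
not the actual MR app master code.  Every name in it (RebootHandlingSketch, 
Action, onReboot, attemptNumber, maxAttempts) is hypothetical, and the steps 
are only recorded in a list rather than executed, so that the branching is 
easy to see.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
public class RebootHandlingSketch {

    /** Steps the AM would take, recorded here instead of executed. */
    public enum Action {
        LOG_REBOOT,
        EXIT_WITHOUT_CLEANUP,
        CLEAN_STAGING_DIR,
        WRITE_HISTORY_WITH_ERROR
    }

    /**
     * Decide what to do when the RM sends a reboot directive.
     * Assumption: the AM knows its own attempt number and the maximum
     * number of attempts allowed for the application.
     */
    public static List<Action> onReboot(int attemptNumber, int maxAttempts) {
        List<Action> plan = new ArrayList<>();
        // Always log that the reboot was received.
        plan.add(Action.LOG_REBOOT);

        boolean lastAttempt = attemptNumber >= maxAttempts;
        if (!lastAttempt) {
            // A later attempt may already be running: no commit, no history,
            // no staging dir deletion. Just get out.
            plan.add(Action.EXIT_WITHOUT_CLEANUP);
        } else {
            // Nobody will retry, so clean up and record the failure.
            plan.add(Action.CLEAN_STAGING_DIR);
            plan.add(Action.WRITE_HISTORY_WITH_ERROR);
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(onReboot(1, 2)); // non-final attempt
        System.out.println(onReboot(2, 2)); // final attempt
    }
}
```

Note that neither branch unregisters from the RM, in line with the concern 
that a successful unregister would prevent further app attempts.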

                
> MR app master deletes staging dir when sent a reboot command from the RM
> ------------------------------------------------------------------------
>
>                 Key: YARN-472
>                 URL: https://issues.apache.org/jira/browse/YARN-472
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: jian he
>            Assignee: jian he
>         Attachments: YARN-472.1.patch
>
>
> If the RM is restarted when the MR job is running, then it sends a reboot 
> command to the job. The job ends up deleting the staging dir and that causes 
> the next attempt to fail.
