[
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605953#comment-13605953
]
Jason Lowe commented on YARN-472:
---------------------------------
Another reason the AM can receive a reboot command from the RM is a
split-brain situation where the RM has expired the AM (e.g., due to a network
cut) but the AM has not killed itself (e.g., because it is thrashing in
garbage collection). If that is the case and this isn't the last attempt, the
AM needs to get out of the way without doing any damage (e.g., no commit, no
history creation) because the RM could have already started another attempt.
Attempting to unregister is likely fruitless, since the RM has basically said
via the reboot directive that it has no idea what this AM is trying to do. If
the unregister does succeed, it would prevent further app attempts from
launching, as Bikas noted, and that's not desirable.

I agree that adding a new state seems unnecessary. I've always interpreted the
reboot directive to mean the AM is in a bad state and needs to get out, fast.
As such, I'd rather keep this simple. If the attempt isn't the last, have the
AM log receipt of the reboot and crash without doing any filesystem damage.
If it is the last attempt, then we can do something like what we do today,
e.g., clean up the staging directory and generate history with an error
status.
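To make the proposal concrete, here is a minimal sketch of the decision in plain Java. It is not the actual MRAppMaster code: the class, the `Action` enum, and the `attemptNumber`/`maxAttempts` parameters are hypothetical names chosen for illustration, and the sketch only plans the steps rather than touching a filesystem. Note that the plan never includes unregistering from the RM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real MRAppMaster implementation.
public class RebootDirectiveSketch {
    // Illustrative steps an AM might take after a reboot directive (assumed names).
    enum Action { LOG_REBOOT, CLEAN_STAGING_DIR, WRITE_HISTORY_WITH_ERROR, EXIT }

    // Assumption: attempts are numbered from 1 and maxAttempts is the configured limit.
    static boolean isLastAttempt(int attemptNumber, int maxAttempts) {
        return attemptNumber >= maxAttempts;
    }

    // Plan what to do when the RM sends a reboot directive.
    static List<Action> planOnReboot(int attemptNumber, int maxAttempts) {
        List<Action> plan = new ArrayList<>();
        plan.add(Action.LOG_REBOOT);
        if (isLastAttempt(attemptNumber, maxAttempts)) {
            // No later attempt will need the staging dir, so clean up as today.
            plan.add(Action.CLEAN_STAGING_DIR);
            plan.add(Action.WRITE_HISTORY_WITH_ERROR);
        }
        // Otherwise leave the filesystem untouched: another attempt may already be running.
        plan.add(Action.EXIT);
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(planOnReboot(1, 2)); // prints [LOG_REBOOT, EXIT]
        System.out.println(planOnReboot(2, 2)); // prints [LOG_REBOOT, CLEAN_STAGING_DIR, WRITE_HISTORY_WITH_ERROR, EXIT]
    }
}
```

The only branch is on whether this is the last attempt, which keeps the behavior simple and avoids adding a new state.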
> MR app master deletes staging dir when sent a reboot command from the RM
> ------------------------------------------------------------------------
>
> Key: YARN-472
> URL: https://issues.apache.org/jira/browse/YARN-472
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: jian he
> Assignee: jian he
> Attachments: YARN-472.1.patch
>
>
> If the RM is restarted while an MR job is running, it sends a reboot
> command to the job. The job ends up deleting the staging dir, which causes
> the next attempt to fail.