[ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462141#comment-13462141 ]
Robert Joseph Evans commented on YARN-128:
------------------------------------------
bq. AMs should not finish themselves while the RM is down or recovering. They
should just spin.
+1 for that. If we let the MR AM finish, and the RM then comes up and tries to
restart it, the restarted AM will get confused: it will not find the job history
log where it expects to see it, so it will start the job over, and it is likely
to find the output directory already populated with data, which could cause the
job to fail. What is worse, it may not fail, because I think the output
committer will ignore those errors. The first AM could already have informed
Oozie through a callback that the job finished, so a second job may be launched
and be reading the data at the same time that the restarted first job is trying
to write it, which could cause inconsistent results or cause the second job to
fail somewhat randomly.
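A minimal sketch of the "just spin" behaviour, to make the idea concrete. The
callables `rm_is_available` and `unregister` are hypothetical placeholders, not
real YARN APIs: the AM simply holds its final unregistration until the RM
answers again, instead of exiting while the RM is down.

```python
import time


def finish_when_rm_is_back(rm_is_available, unregister,
                           poll_seconds=1.0, sleep=time.sleep):
    """Wait until the RM is reachable, then unregister.

    rm_is_available and unregister are hypothetical callables supplied by
    the caller. Returns the number of polls spent waiting.
    """
    waits = 0
    while not rm_is_available():
        # RM is down or still recovering: do not finish, just wait.
        sleep(poll_seconds)
        waits += 1
    unregister()
    return waits
```

A real AM would presumably also want backoff and logging here; the sketch only
shows that finishing is gated on the RM being reachable.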
bq. An upper bound (time) on recovery?
This is a bit difficult to determine because the RM is responsible for renewing
tokens. Right now it renews a token when only about 10% of its lifetime is left
before it expires. So the bound depends on how long the shortest-lived token
you have in flight is valid for before it needs to be renewed. In general all
of the tokens I have seen are valid for 24 hours, so you would have about 2.4
hours to bring the RM back up and have it read in and start renewing all of the
tokens, or risk tokens expiring.
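The arithmetic behind that estimate can be sketched as follows. This is only an
illustration under the assumptions stated above (renewal at roughly 10% of
lifetime remaining, and the RM going down just as the shortest-lived token
comes due); the function name and the constant are made up for the example.

```python
from datetime import timedelta

# Assumption taken from the comment above: a token is renewed when roughly
# 10% of its lifetime is left.
RENEW_FRACTION_LEFT = 0.10


def worst_case_recovery_window(token_lifetimes,
                               fraction_left=RENEW_FRACTION_LEFT):
    """Time left on the shortest-lived token at the moment it is due for renewal."""
    lifetimes = list(token_lifetimes)
    if not lifetimes:
        raise ValueError("need at least one token lifetime")
    return min(lifetimes) * fraction_left


# Hypothetical mix of tokens in flight: one valid for 24 hours, one for 7 days.
window = worst_case_recovery_window([timedelta(hours=24), timedelta(days=7)])
print(window)
```

With 24-hour tokens this comes out at about 2.4 hours. A token that was renewed
recently has more time left than that, so under these assumptions the figure is
the pessimistic end of the range.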
> Resurrect RM Restart
> ---------------------
>
> Key: YARN-128
> URL: https://issues.apache.org/jira/browse/YARN-128
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.0.0-alpha
> Reporter: Arun C Murthy
> Assignee: Bikas Saha
> Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM
> refactor.