[ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462141#comment-13462141
 ] 

Robert Joseph Evans commented on YARN-128:
------------------------------------------

bq. AMs should not finish themselves while the RM is down or recovering. They 
should just spin.

+1 for that.  If we let the MR AM finish, and then the RM comes up and tries to
restart it, it will get confused because it will not find the job history log
where it expects to see it, which will cause it to restart, and it is likely to
find the output directory already populated with data, which could cause the
job to fail.  What is worse, it may not fail, because I think the output
committer will ignore those errors.  The first AM could have informed Oozie
through a callback that the job finished, so a second job may already be
launched and reading the data at the same time that the restarted first job is
trying to write that data, which could cause inconsistent results or cause the
second job to fail somewhat randomly.
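
As a rough illustration of the "just spin" behaviour, here is a minimal Python
sketch, not actual YARN or MR AM code.  The names RMUnavailable,
finish_when_rm_ready and unregister are hypothetical, and the capped
exponential backoff is an assumption added for illustration; the point is only
that the AM keeps retrying instead of finishing while the RM is unreachable.

```python
import time


class RMUnavailable(Exception):
    """Hypothetical error raised while the RM is down or still recovering."""


def finish_when_rm_ready(unregister, sleep=time.sleep, delay_s=1.0, max_delay_s=60.0):
    """Spin until the RM accepts the AM's final status; never give up.

    `unregister` is a hypothetical callable that reports completion to the RM
    and raises RMUnavailable while the RM cannot be reached.
    Returns the number of failed attempts before the call succeeded.
    """
    failures = 0
    while True:
        try:
            unregister()
            return failures
        except RMUnavailable:
            failures += 1
            sleep(delay_s)
            # Capped exponential backoff (an assumption, not from the issue).
            delay_s = min(delay_s * 2, max_delay_s)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.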

bq. An upper bound (time) on recovery?

This is a bit difficult to determine because the RM is responsible for renewing
tokens.  Right now it renews them when they have only about 10% of their
lifetime left before they expire.  So the bound depends on how long the
shortest-lived token you have in flight is valid for before it needs to be
renewed.  In general all of the tokens I have seen are valid for 24 hours, so
you would have about 2.4 hours (10% of 24) to bring the RM back up and read
in/start renewing all of the tokens, or risk tokens expiring.
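
The arithmetic above can be sketched as follows.  This is only a
back-of-the-envelope helper: recovery_window_hours and its parameters are
hypothetical names, and it assumes every token is renewed once roughly 10% of
its lifetime remains, so the shortest-lived token sets the bound.

```python
def recovery_window_hours(token_lifetimes_hours, renew_fraction_left=0.10):
    """Rough time available to restart the RM before a token may expire.

    Assumes each token is renewed once only `renew_fraction_left` of its
    lifetime remains, so in the worst case a token was just about to be
    renewed when the RM went down.  The shortest-lived token sets the bound.
    """
    if not token_lifetimes_hours:
        raise ValueError("no tokens in flight")
    return min(token_lifetimes_hours) * renew_fraction_left


# All tokens valid for 24 hours: roughly 2.4 hours to recover.
window = recovery_window_hours([24, 24, 24])
```

A single shorter-lived token shrinks the window for the whole cluster, which is
why a fixed upper bound is hard to state.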
                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
