[jira] [Commented] (YARN-128) Resurrect RM Restart

Vinod Kumar Vavilapalli (JIRA) Mon, 24 Sep 2012 13:18:11 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462077#comment-13462077
 ]


Vinod Kumar Vavilapalli commented on YARN-128:
----------------------------------------------

+1 for most of your points. Some specific comments:

bq. What about AM's that completed during restart. Re-running them should be a 
no-op.
AMs should not finish themselves while the RM is down or recovering. They 
should just spin.

bq. How to handle container releases messages that were lost when RM was down? 
Will AM's get delivery failure and continue to resend indefinitely?
You mean release requests from AM? Like above, if AMs just spin, we don't have 
an issue.

bq. Need new AM-RM API to resend asks from AM to RM.
See AMResponse.getReboot(). That can be used to tell AMs to resend all 
details.
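
A minimal sketch of how an AM could react to such a flag, assuming it keeps a 
record of every outstanding ask. SketchResponse and AskTracker are hypothetical 
stand-ins for illustration, not the actual YARN classes; only the idea of a 
reboot flag on the response comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the allocate response; only the reboot flag matters here.
final class SketchResponse {
    private final boolean reboot;

    SketchResponse(boolean reboot) {
        this.reboot = reboot;
    }

    boolean getReboot() {
        return reboot;
    }
}

// Hypothetical AM-side bookkeeping: remembers every outstanding ask so it can be replayed.
final class AskTracker {
    private final List<String> outstanding = new ArrayList<>();
    private final List<String> pendingSend = new ArrayList<>();

    // Record a new ask and queue it for the next heartbeat.
    void addAsk(String ask) {
        outstanding.add(ask);
        pendingSend.add(ask);
    }

    // Returns the asks to send on the next heartbeat and clears the pending list.
    List<String> drainForHeartbeat() {
        List<String> toSend = new ArrayList<>(pendingSend);
        pendingSend.clear();
        return toSend;
    }

    // If the RM signals that it lost state, queue every outstanding ask again.
    void onResponse(SketchResponse response) {
        if (response.getReboot()) {
            pendingSend.clear();
            pendingSend.addAll(outstanding);
        }
    }
}
```

In normal operation only new asks go out on each heartbeat; after a response 
with the flag set, the next heartbeat carries the full outstanding set.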

bq. What information about keys and tokens to persist across restart so that 
existing secure containers continue to run with new RM and new containers.
We already noted this as Java comments in the code. It needs to go into proper 
documentation.

bq. ZK nodes themselves should be secure.
Good point. In the worst case, if ZK doesn't support security, we can rely on 
an RM-specific ZK instance and firewall rules.

More requirements:
 - An upper bound (time) on recovery?
 - Writing to ZK shouldn't add more than x% (< 1-2%) to app latency?

More state to save:
 - New app submissions should be persisted/accepted but not acted upon during 
recovery.
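
A rough sketch of the "persisted/accepted but not acted upon" behaviour, 
assuming an in-memory list stands in for the persistent store (ZK or 
otherwise). RecoveringSubmitter and all of its members are hypothetical names 
used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical RM-side submission path that defers work while recovery is in progress.
final class RecoveringSubmitter {
    // Stand-in for the persistent store; a real RM would write to durable storage.
    private final List<String> store = new ArrayList<>();
    // Apps accepted during recovery, waiting to be handed to the scheduler.
    private final Queue<String> deferred = new ArrayDeque<>();
    // Apps that have been handed to the scheduler.
    private final List<String> scheduled = new ArrayList<>();
    private boolean recovering = true;

    void submit(String appId) {
        // Persist first, so the acceptance survives another restart.
        store.add(appId);
        if (recovering) {
            deferred.add(appId);
        } else {
            scheduled.add(appId);
        }
    }

    // Called once recovery completes: release deferred apps in submission order.
    void finishRecovery() {
        recovering = false;
        while (!deferred.isEmpty()) {
            scheduled.add(deferred.poll());
        }
    }

    List<String> persisted() {
        return new ArrayList<>(store);
    }

    List<String> scheduled() {
        return new ArrayList<>(scheduled);
    }
}
```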

Miscellaneous points:
 - I think we should add a new ServiceState called Recovering and use the same 
in RM.
 - Overall, clients, AMs and NMs should spin while the RM is down or doing 
recovery. We also need to handle RM fail-over, but that should be done as part 
of a separate ticket.
 - When is recovery officially finished? When all running AMs sync up? I 
suppose so; that would give an upper bound equal to the AM-expiry interval.
 - Need to think about how the RM-NM shared secret roll-over is affected if the 
RM is down for a significant amount of time.
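
To make the Recovering state and the "finished when all running AMs sync up" 
rule concrete, here is a small sketch. The SketchState enum and RecoveryTracker 
class are hypothetical, not the existing service states, and the rule that 
recovery also ends once the AM-expiry interval has elapsed is my reading of the 
upper bound mentioned above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical lifecycle states with an extra RECOVERING state.
enum SketchState { NOTINITED, INITED, RECOVERING, STARTED, STOPPED }

// Hypothetical tracker: recovery ends when every known AM has re-synced,
// or when the AM-expiry interval has passed, whichever comes first.
final class RecoveryTracker {
    private final Set<String> awaiting;
    private final long deadlineMillis;
    private SketchState state = SketchState.RECOVERING;

    RecoveryTracker(Set<String> runningAms, long startMillis, long amExpiryMillis) {
        this.awaiting = new HashSet<>(runningAms);
        this.deadlineMillis = startMillis + amExpiryMillis;
    }

    // An AM has re-registered or resent its state.
    void amSynced(String amId, long nowMillis) {
        awaiting.remove(amId);
        update(nowMillis);
    }

    // Periodic check so that the deadline is honoured even if no AM reports in.
    void tick(long nowMillis) {
        update(nowMillis);
    }

    private void update(long nowMillis) {
        boolean done = awaiting.isEmpty() || nowMillis >= deadlineMillis;
        if (state == SketchState.RECOVERING && done) {
            state = SketchState.STARTED;
        }
    }

    SketchState state() {
        return state;
    }
}
```

AMs that never sync before the deadline would then be treated as expired, which 
is what bounds recovery time by the AM-expiry interval.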

                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM 
> refactor.

