[ 
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995213#comment-13995213
 ] 

Tsuyoshi OZAWA commented on YARN-2001:
--------------------------------------

[~leftnoteasy], my idea is to create a ClusterID space under the 
epoch (cluster-timestamp), like {{Map<Epoch, List<ClusterID>>}}.

* Epoch (saved in ZKRMStateStore and RM's memory), just an integer value.
* ClusterID (saved in RM's memory), same as current code.

A rough sketch is as follows:

* When a new active RM starts up, the Epoch in RMStateStore is incremented and 
the RM adopts the new value. ClusterID is reset to zero. 
* Heartbeats between NM and RM include the Epoch, so the RM can distinguish old 
cluster-timestamps from the new one when an NM registers. If the NM's Epoch is 
older than the RM expects, the RM can kill the containers via the NM.
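
To make the sketch above concrete, here is a minimal, self-contained Java illustration. It is not taken from any patch: {{EpochSketch}}, {{StateStore}}, {{ResourceManager}}, {{Action}} and all method names are hypothetical stand-ins, the store is a plain in-memory counter rather than ZooKeeper, and treating ClusterID as a per-RM counter is my own reading of the proposal.

```java
/** Hypothetical sketch of the epoch idea; none of these types are real YARN classes. */
final class EpochSketch {

    /** Stand-in for the persistent store (ZKRMStateStore in the proposal). */
    static final class StateStore {
        private int epoch = 0;

        /** Increments the persisted epoch and returns the new value. */
        synchronized int incrementAndGetEpoch() {
            return ++epoch;
        }
    }

    /** What the RM asks an NM to do in response to a heartbeat. */
    enum Action { CONTINUE, KILL_CONTAINERS }

    /** Stand-in for an RM that has just become active. */
    static final class ResourceManager {
        private final int epoch;        // copied from the store at startup
        private int nextClusterId = 0;  // in-memory only, restarts at zero

        ResourceManager(StateStore store) {
            // Becoming active bumps the stored epoch; the RM keeps the new value.
            this.epoch = store.incrementAndGetEpoch();
        }

        int epoch() {
            return epoch;
        }

        /** Returns an id of the form "<epoch>-<clusterId>", unique across failovers. */
        String allocateId() {
            return epoch + "-" + (nextClusterId++);
        }

        /**
         * Compares the epoch carried by an NM heartbeat with the RM's own.
         * An older epoch means the NM's containers predate this RM.
         * A newer epoch is not covered by the proposal; it is treated as
         * CONTINUE here purely for simplicity.
         */
        Action onHeartbeat(int nmEpoch) {
            return nmEpoch < epoch ? Action.KILL_CONTAINERS : Action.CONTINUE;
        }
    }
}
```

With one {{StateStore}} shared by two successive {{ResourceManager}} instances, the first RM gets epoch 1 and the second gets epoch 2. An NM that still heartbeats with epoch 1 after the failover is answered with {{KILL_CONTAINERS}}, while ids allocated by the second RM restart at "2-0" without colliding with the first RM's "1-0".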

Please correct me if I'm wrong.

> Threshold for RM to accept requests from AM after failover
> ----------------------------------------------------------
>
>                 Key: YARN-2001
>                 URL: https://issues.apache.org/jira/browse/YARN-2001
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> After failover, RM may require a certain threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain number of nodes, i.e. RM waits 
> until a certain number of nodes have joined before accepting new container 
> requests. Or it could simply be a timeout: RM accepts new requests only after 
> the timeout expires. 
> NMs that join after the threshold can be treated as new NMs and instructed to 
> kill all their containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
