[
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995213#comment-13995213
]
Tsuyoshi OZAWA commented on YARN-2001:
--------------------------------------
[~leftnoteasy], my idea is to create a ClusterID space under each
epoch (cluster-timestamp), like {{Map<Epoch, List<ClusterID>>}}.
* Epoch (saved in ZKRMStateStore and in the RM's memory): just an integer value.
* ClusterID (saved in the RM's memory): same as in the current code.
A rough sketch is as follows:
* When a new active RM starts up, it increments the Epoch in RMStateStore and
adopts the new value as its own Epoch. ClusterID is reset to zero.
* Heartbeats between NM and RM include the Epoch, so the RM can distinguish old
cluster-timestamps from the new one when an NM registers. If the NM's Epoch is
older than the RM expects, the RM can kill the containers via the NM.
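The two steps above could be illustrated roughly as follows. This is only a sketch under stated assumptions: {{EpochSketch}}, {{StateStore}}, {{ResourceManager}}, {{NodeAction}} and their methods are hypothetical names, not existing YARN classes, the store is an in-memory stand-in for ZKRMStateStore, and ClusterID is modelled as a plain counter.

```java
// Hypothetical illustration of the epoch idea; none of these types exist in YARN.
final class EpochSketch {

    /** In-memory stand-in for the persistent store (ZKRMStateStore in the proposal). */
    static final class StateStore {
        private int epoch = 0;

        /** Increments the persisted epoch and returns the new value. */
        synchronized int incrementAndGetEpoch() {
            epoch += 1;
            return epoch;
        }
    }

    /** What the RM tells an NM after comparing epochs. */
    enum NodeAction { NORMAL, KILL_CONTAINERS }

    /** A newly active RM: bumps the stored epoch and restarts ClusterID at zero. */
    static final class ResourceManager {
        private final int epoch;
        private int nextClusterId = 0; // assumption: ClusterID is a simple counter

        ResourceManager(StateStore store) {
            this.epoch = store.incrementAndGetEpoch();
        }

        int getEpoch() {
            return epoch;
        }

        /** Hands out the next ClusterID within this RM's epoch. */
        int allocateClusterId() {
            return nextClusterId++;
        }

        /** An NM reporting an older epoch still runs containers from a previous RM. */
        NodeAction onHeartbeat(int nmEpoch) {
            return nmEpoch < epoch ? NodeAction.KILL_CONTAINERS : NodeAction.NORMAL;
        }
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        ResourceManager first = new ResourceManager(store);
        int staleEpoch = first.getEpoch();

        // Failover: a second RM becomes active against the same store.
        ResourceManager second = new ResourceManager(store);

        System.out.println("stale NM: " + second.onHeartbeat(staleEpoch));
        System.out.println("current NM: " + second.onHeartbeat(second.getEpoch()));
    }
}
```

In this sketch the comparison is a plain integer check, so an NM that last registered with the previous RM is detected on its first heartbeat to the new one.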
Please correct me if I'm wrong.
> Threshold for RM to accept requests from AM after failover
> ----------------------------------------------------------
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
>
> After failover, RM may require a certain threshold to determine whether it’s
> safe to make scheduling decisions and start accepting new container requests
> from AMs. The threshold could be a certain number of nodes, i.e., RM waits
> until a certain number of nodes have joined before accepting new container
> requests. Or it could simply be a timeout: only after the timeout does RM
> accept new requests.
> NMs that join after the threshold can be treated as new NMs and instructed to
> kill all their containers.