[
https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991138#comment-13991138
]
Karthik Kambatla commented on YARN-2001:
----------------------------------------
bq. Take a simple case in which an application is granted 50% of the cluster
resources. The cluster has 2 nodes. The application has used up its entire
resource quota and launched all of its containers on node1. The RM fails over
and node2 is the first to re-sync with the RM.
Only the RM went down, not the AM. The AM still knows that all of its
containers are running on node1, and places requests only for additional
resources. No?
bq. Another example: the RM needs to generate new container IDs for the new
containers requested by the AM. If the RM accepts new requests from the AM
before the nodes sync back, the new container IDs may overlap with the IDs of
the recovered containers.
I am not sure if [~adhoot]'s prototype includes this, but I think we should
include the RM's "cluster" timestamp in container IDs as well, so that we know
precisely which RM instance authorized a particular container.
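To make the idea concrete, here is a minimal sketch in Python (not the actual YARN implementation). It assumes each RM incarnation stamps the IDs it issues with its own start timestamp; the names ContainerId, ContainerIdAllocator and rm_timestamp, the ID string layout, and the timestamp values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ContainerId:
    """Illustrative container ID; the field names are assumptions."""
    rm_timestamp: int  # start time (ms) of the RM incarnation that issued the ID
    sequence: int      # counter local to that RM incarnation

    def __str__(self) -> str:
        return f"container_{self.rm_timestamp}_{self.sequence:06d}"


class ContainerIdAllocator:
    """Hypothetical allocator; one instance per RM incarnation."""

    def __init__(self, rm_timestamp: int) -> None:
        self.rm_timestamp = rm_timestamp
        self._next = count(1)  # the counter restarts with every incarnation

    def allocate(self) -> ContainerId:
        return ContainerId(self.rm_timestamp, next(self._next))


# Both incarnations start counting from 1, yet their IDs cannot collide
# because the timestamps differ.
before_failover = ContainerIdAllocator(rm_timestamp=1399400000000)
after_failover = ContainerIdAllocator(rm_timestamp=1399400360000)
old_id = before_failover.allocate()
new_id = after_failover.allocate()
```

With such a scheme the restarted RM could hand out new IDs before all nodes have re-synced, since a recovered container's ID always carries the older timestamp.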
> Threshold for RM to accept requests from AM after failover
> ----------------------------------------------------------
>
> Key: YARN-2001
> URL: https://issues.apache.org/jira/browse/YARN-2001
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
>
> After failover, the RM may need a threshold to determine whether it is safe
> to make scheduling decisions and start accepting new container requests from
> AMs. The threshold could be a number of nodes, i.e. the RM waits until a
> certain number of nodes have joined before accepting new container requests.
> Or it could simply be a timeout: the RM accepts new requests only after the
> timeout expires.
> NMs that join after the threshold can be treated as new NMs and instructed to
> kill all their containers.
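The two threshold options in the description could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed patch: SchedulingGate, node_registered, accepting_requests and the parameters min_nodes and timeout_s are hypothetical names, and the gate is assumed to open when either condition is met and to stay open afterwards.

```python
import time
from typing import Callable, Set


class SchedulingGate:
    """Hypothetical gate deciding when a restarted RM accepts AM requests."""

    def __init__(self, min_nodes: int, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_nodes = min_nodes
        self._timeout_s = timeout_s
        self._clock = clock
        self._started = clock()      # moment the RM became active
        self._nodes: Set[str] = set()
        self._open = False

    def accepting_requests(self) -> bool:
        """True once enough nodes have re-synced or the timeout has elapsed."""
        if not self._open:
            enough_nodes = len(self._nodes) >= self._min_nodes
            timed_out = self._clock() - self._started >= self._timeout_s
            self._open = enough_nodes or timed_out
        return self._open

    def node_registered(self, node_id: str) -> bool:
        """Return True if the node's containers should be recovered, or
        False if it joined too late and should be treated as a new node
        (and told to kill its containers)."""
        if self.accepting_requests():
            return False
        self._nodes.add(node_id)
        return True


# Example with a fake clock so the behaviour is deterministic.
now = [0.0]
gate = SchedulingGate(min_nodes=2, timeout_s=30.0, clock=lambda: now[0])
recovered_first = gate.node_registered("node1")
recovered_second = gate.node_registered("node2")
late_joiner = gate.node_registered("node3")
```

Latching the gate open is a design choice made here for simplicity: once the RM has started scheduling, a node leaving again does not close the gate.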
--
This message was sent by Atlassian JIRA
(v6.2#6252)