[jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover

Karthik Kambatla (JIRA) Tue, 06 May 2014 14:37:51 -0700

    [ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991138#comment-13991138 ]


Karthik Kambatla commented on YARN-2001:
----------------------------------------

bq. Consider a simple case where an application is granted 50% of the cluster 
resources. The cluster has 2 nodes. The application used up all its resource 
quota and launched all containers on node1. The RM fails over and node2 is the 
first to re-sync with the RM. 
Only the RM went down, not the AM. The AM still knows that all of its 
containers are running on node1, so it only places requests for additional 
resources. No? 

bq. Another example: the RM needs to generate new container IDs for the new 
containers requested by the AM. If the RM accepts new requests from the AM 
before nodes sync back, the new container IDs may overlap with the IDs of the 
recovered containers.
I am not sure if [~adhoot]'s prototype includes this, but I think we should 
include the RM's "cluster" timestamp in container IDs as well, so we know 
precisely which RM authorized the creation of a particular container.
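
To make the timestamp idea concrete, here is a rough sketch. It is not YARN's 
actual ContainerId code; the class names, fields and string layout below are 
all hypothetical. The point is only that two RM instances with different start 
times cannot hand out equal IDs, even if their per-RM counters both restart 
from 1.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, NOT YARN's real ContainerId: an id that carries the
// start time ("cluster timestamp") of the RM instance that issued it.
final class StampedContainerId {
    final long clusterTimestamp; // start time of the issuing RM instance
    final int appId;
    final int attemptId;
    final long sequence;         // per-RM-instance counter value

    StampedContainerId(long clusterTimestamp, int appId, int attemptId, long sequence) {
        this.clusterTimestamp = clusterTimestamp;
        this.appId = appId;
        this.attemptId = attemptId;
        this.sequence = sequence;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StampedContainerId)) {
            return false;
        }
        StampedContainerId other = (StampedContainerId) o;
        return clusterTimestamp == other.clusterTimestamp
            && appId == other.appId
            && attemptId == other.attemptId
            && sequence == other.sequence;
    }

    @Override
    public int hashCode() {
        return Objects.hash(clusterTimestamp, appId, attemptId, sequence);
    }

    @Override
    public String toString() {
        // Illustrative layout only; the real format is up to the implementation.
        return "container_" + clusterTimestamp + "_" + appId + "_" + attemptId + "_" + sequence;
    }
}

// Hypothetical allocator owned by a single RM instance.
final class ContainerIdAllocator {
    private final long clusterTimestamp;
    private final AtomicLong next = new AtomicLong();

    ContainerIdAllocator(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    StampedContainerId allocate(int appId, int attemptId) {
        // The counter restarts after failover, but the timestamp differs,
        // so ids from the old and the new RM never compare equal.
        return new StampedContainerId(clusterTimestamp, appId, attemptId, next.incrementAndGet());
    }
}
```

With this shape, a recovered container reported by an NM also tells us which 
RM instance created it, simply by comparing the timestamp in its ID with the 
current RM's start time.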

> Threshold for RM to accept requests from AM after failover
> ----------------------------------------------------------
>
>                 Key: YARN-2001
>                 URL: https://issues.apache.org/jira/browse/YARN-2001
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> After failover, the RM may need a threshold to determine whether it’s 
> safe to make scheduling decisions and start accepting new container requests 
> from AMs. The threshold could be a certain number of nodes, i.e. the RM waits 
> until that many nodes have rejoined before accepting new container 
> requests. Or it could simply be a timeout: the RM accepts new requests only 
> after the timeout expires. 
> NMs that join after the threshold can be treated as new NMs and instructed to 
> kill all their containers.
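
The threshold described above could be sketched roughly as follows. This is 
only an illustration, not code from any patch on this issue: the class and 
method names are hypothetical, and it assumes the node-count and timeout 
conditions are combined so that whichever is met first opens the gate. The 
description offers them as alternatives.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a post-failover admission gate. The gate opens once
// enough nodes have re-synced OR the timeout has elapsed (assumed combination).
final class FailoverAdmissionGate {
    private final int requiredNodes;
    private final long timeoutMillis;
    private final long failoverAtMillis;
    private final Set<String> resyncedNodes = new HashSet<>();
    private boolean open;

    FailoverAdmissionGate(int requiredNodes, long timeoutMillis, long failoverAtMillis) {
        this.requiredNodes = requiredNodes;
        this.timeoutMillis = timeoutMillis;
        this.failoverAtMillis = failoverAtMillis;
    }

    /**
     * Called when an NM re-syncs. Returns true if its containers should be
     * recovered, false if it joined after the threshold and should be treated
     * as a new NM (i.e. told to kill its containers).
     */
    synchronized boolean onNodeResync(String nodeId, long nowMillis) {
        refresh(nowMillis);
        if (resyncedNodes.contains(nodeId)) {
            return true;  // already counted before the gate opened
        }
        if (open) {
            return false; // late joiner
        }
        resyncedNodes.add(nodeId);
        refresh(nowMillis); // this node may be the one that meets the threshold
        return true;
    }

    /** True once the RM may accept new container requests from AMs. */
    synchronized boolean acceptsRequests(long nowMillis) {
        refresh(nowMillis);
        return open;
    }

    private void refresh(long nowMillis) {
        boolean enoughNodes = resyncedNodes.size() >= requiredNodes;
        boolean timedOut = nowMillis - failoverAtMillis >= timeoutMillis;
        if (!open && (enoughNodes || timedOut)) {
            open = true;
        }
    }
}
```

Time is passed in explicitly so the behavior is easy to test; a real 
implementation would presumably use the RM's clock and configuration values.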



--
This message was sent by Atlassian JIRA
(v6.2#6252)
