[ 
https://issues.apache.org/jira/browse/YARN-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223541#comment-15223541
 ] 

Karthik Kambatla commented on YARN-4679:
----------------------------------------

{quote}
Unlike the RM, the NM is unaware of any resource allocations assigned to it 
until the AM gets around to launching the container. If the NM decides to lower 
its resources, it could easily receive container launch requests afterwards 
that would violate its new total allocation.
{quote}
Good point. But, with distributed scheduling (YARN-2877), the same would apply 
to the RM resizing the NM without knowing what other work has been scheduled on 
the RM. Clearly, we should handle the NM resize (especially shrink) very 
carefully. The design on YARN-291 doesn't talk about this; maybe the details 
are in the code. /cc [~djp]

> When work-preserving restart is enabled, the scheduler should wait for the 
> earlier of recovery completion and configured wait time
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4679
>                 URL: https://issues.apache.org/jira/browse/YARN-4679
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Karthik Kambatla
>
> When work-preserving restart is enabled, it appears the restart (or failover) 
> is unconditionally blocked for the configured delay even if the recovery 
> itself finishes sooner than this. This should be updated to wait for the 
> earlier of the two conditions. Also, it would be nice to allow setting the 
> config to -1 to indicate waiting as long as needed for the recovery to 
> complete. 
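
For illustration only, the "earlier of the two conditions" behavior requested in the description could be sketched roughly as below. This is not the RM's actual code: the class name RecoveryWait, the method awaitRecovery, and the use of a CountDownLatch to signal recovery completion are all hypothetical, and a negative wait time stands in for the proposed -1 setting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of YARN: models "wait for the earlier of
// recovery completion and the configured wait time".
public class RecoveryWait {

    /**
     * Blocks until the latch reaches zero (recovery finished) or maxWaitMs
     * elapses, whichever happens first. A negative maxWaitMs (e.g. -1) means
     * wait for recovery with no upper bound.
     *
     * @return true if recovery completed, false if the wait timed out first
     */
    public static boolean awaitRecovery(CountDownLatch recoveryDone, long maxWaitMs)
            throws InterruptedException {
        if (maxWaitMs < 0) {
            // Assumed semantics for -1: no timeout at all.
            recoveryDone.await();
            return true;
        }
        // Returns as soon as the latch hits zero, even if that is well before
        // the timeout, so the scheduler is not blocked for the full delay.
        return recoveryDone.await(maxWaitMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch recoveryDone = new CountDownLatch(1);
        recoveryDone.countDown(); // pretend recovery has already finished
        boolean recovered = awaitRecovery(recoveryDone, 10_000);
        System.out.println("recovered=" + recovered);
    }
}
```

In this sketch the caller returns immediately once recovery is signalled, and only falls back to the configured delay when recovery is slow.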



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
