[ https://issues.apache.org/jira/browse/YARN-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223541#comment-15223541 ]
Karthik Kambatla commented on YARN-4679:
----------------------------------------
{quote}
Unlike the RM, the NM is unaware of any resource allocations assigned to it
until the AM gets around to launching the container. If the NM decides to lower
its resources, it could easily receive container launch requests afterwards
that would violate its new total allocation.
{quote}
Good point. But, with distributed scheduling (YARN-2877), the same would apply
to the RM resizing the NM without knowing what other work has been scheduled on
the RM. Clearly, we should handle the NM resize (especially shrink) very
carefully. The design on YARN-291 doesn't talk about this; maybe the details
are in the code. /cc [~djp]
> When work-preserving restart is enabled, the scheduler should wait for the
> earlier of recovery completion and configured wait time
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-4679
> URL: https://issues.apache.org/jira/browse/YARN-4679
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: Karthik Kambatla
>
> When work-preserving restart is enabled, it appears the restart (or failover)
> is unconditionally blocked for the configured delay even if the recovery
> itself finishes sooner than this. This should be updated to wait for the
> earlier of the two conditions. Also, it would be nice to allow setting the
> config to -1 to indicate waiting as long as needed for the recovery to
> complete.
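
The "earlier of the two conditions" behaviour described above could be
sketched roughly as follows. This is only an illustration, not the actual
ResourceManager code: the class name RecoveryGate and its methods are
hypothetical, and a negative wait time is assumed to mean "wait until recovery
completes", as proposed for -1.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of YARN: blocks until recovery completes
// or the configured wait time elapses, whichever happens first.
final class RecoveryGate {
    private final CountDownLatch recoveryDone = new CountDownLatch(1);

    // Called by the (assumed) recovery path once all state has been recovered.
    void markRecoveryComplete() {
        recoveryDone.countDown();
    }

    // Returns true if recovery completed, false if the wait timed out
    // or the waiting thread was interrupted.
    boolean awaitRecovery(long maxWaitMs) {
        try {
            if (maxWaitMs < 0) {
                // Negative value (e.g. -1): wait as long as recovery needs.
                recoveryDone.await();
                return true;
            }
            // Returns as soon as recovery completes, or after maxWaitMs.
            return recoveryDone.await(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With a latch, a caller that arrives after recovery has already finished
returns immediately instead of sleeping for the full configured delay.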