[ https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101149#comment-15101149 ]

Chris Douglas commented on YARN-4597:
-------------------------------------

The {{ContainerLaunchContext}} (CLC) specifies the prerequisites for starting a 
container on a node. These include setting up user/application directories and 
downloading dependencies to the NM cache (localization). The NM assumes that an 
authenticated {{startContainer}} request has not overbooked resources on the 
node, so resources are only reserved/enforced during the container launch and 
execution.

This JIRA proposes to add a phase between localization and container launch to 
manage a collection of runnable containers. Similar to the localizer stage, a 
container will launch only after all the resources from its CLC are assigned by 
a _local scheduler_. The local scheduler will select containers to run based on 
priority, declared requirements, and by monitoring utilization on the node 
(YARN-1011).
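
As a rough illustration of the proposed stage, the following sketch shows a local scheduler that admits {{LOCALIZED}} containers before launch: guaranteed containers are checked against reserved capacity, while optimistic ones are checked against measured utilization. All class and method names here are hypothetical, not actual YARN APIs.

```java
import java.util.*;

// Hypothetical sketch of the proposed SCHEDULING stage; not actual YARN code.
public class LocalScheduler {
    enum ExecutionType { GUARANTEED, OPTIMISTIC }

    static class Container {
        final String id;
        final ExecutionType type;
        final long memMb;                  // memory declared in the CLC
        Container(String id, ExecutionType type, long memMb) {
            this.id = id; this.type = type; this.memMb = memMb;
        }
    }

    private final long nodeMemMb;          // node capacity
    private long guaranteedMb = 0;         // reserved by GUARANTEED containers
    private final Queue<Container> runnable = new ArrayDeque<>(); // LOCALIZED, waiting

    LocalScheduler(long nodeMemMb) { this.nodeMemMb = nodeMemMb; }

    void onLocalized(Container c) { runnable.add(c); }

    /** Launch containers whose CLC resources fit; OPTIMISTIC containers are
     *  admitted against observed (not reserved) utilization, per YARN-1011. */
    List<Container> schedule(long measuredUsedMb) {
        List<Container> launched = new ArrayList<>();
        Iterator<Container> it = runnable.iterator();
        while (it.hasNext()) {
            Container c = it.next();
            if (c.type == ExecutionType.GUARANTEED) {
                if (guaranteedMb + c.memMb <= nodeMemMb) {
                    guaranteedMb += c.memMb;
                    launched.add(c);
                    it.remove();
                }
            } else if (measuredUsedMb + c.memMb <= nodeMemMb) {
                measuredUsedMb += c.memMb; // oversubscribe against measurements
                launched.add(c);
                it.remove();
            }
        }
        return launched;
    }
}
```

A real implementation would also order the queue by priority and declared requirements; the sketch only shows the admission decision.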

A few future and in-progress features motivate this change.

*Preemption* Instead of sending a kill when the RM selects a victim container, 
it could instead convert it from a {{GUARANTEED}} to an {{OPTIMISTIC}} 
container (YARN-4335). This has two benefits. First, the downgraded container 
can continue to run until a guaranteed container arrives _and_ finishes 
localizing its dependencies, so the downgraded container has an opportunity to 
complete or checkpoint. When the guaranteed container moves from {{LOCALIZED}} 
to {{SCHEDULING}}, the local scheduler may select the victim (formerly 
guaranteed) container to be killed. \[1\] Second, the NM may elect to kill the 
victim container to run _different_ optimistic containers, particularly 
short-running tasks.
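
A minimal sketch of the downgrade-instead-of-kill flow, including the point made in \[1\] that the container actually killed may differ from the RM's original victim; the names are illustrative only.

```java
import java.util.*;

// Illustrative sketch of preemption by demotion; not actual YARN classes.
public class PreemptionSketch {
    static class Opt {
        final String id; final long memMb;
        Opt(String id, long memMb) { this.id = id; this.memMb = memMb; }
    }

    // OPTIMISTIC containers currently running, including downgraded victims.
    private final List<Opt> optimistic = new ArrayList<>();

    /** RM selected a victim: demote it to OPTIMISTIC instead of killing it,
     *  so it keeps running while the guaranteed container localizes. */
    void downgrade(String id, long memMb) {
        optimistic.add(new Opt(id, memMb));
    }

    /** When a GUARANTEED container reaches SCHEDULING and needs memMb, kill
     *  the smallest OPTIMISTIC container that frees enough -- possibly not
     *  the container the RM originally selected (see footnote [1]). */
    Optional<String> selectKill(long neededMb) {
        return optimistic.stream()
                .filter(o -> o.memMb >= neededMb)
                .min(Comparator.comparingLong(o -> o.memMb))
                .map(o -> { optimistic.remove(o); return o.id; });
    }
}
```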

*Optimistic scheduling and overprovisioning* To support distributed scheduling 
(YARN-2877) and resource-aware scheduling (YARN-1011), the NM needs a component 
to select containers that are ready to run. The local scheduler can not only 
select tasks to run based on monitoring, it can also make offers to running 
containers using durations attached to leases \[2\]. Based on recent 
observations, it may start containers that oversubscribe the node, or delay 
starting containers if a lease is close to expiring (i.e., the container is 
likely to complete).
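
The lease-expiry heuristic above could look something like the following; the duration metadata is hypothetical, in the spirit of the YARN-1039 discussion rather than any existing API.

```java
// Hedged sketch of lease-aware admission: delay an oversubscribing launch
// when a running container's lease is about to expire.
public class LeaseSketch {
    /** Returns true if the earliest-expiring lease ends within graceMs of
     *  now -- i.e., a running container is likely to complete soon and
     *  release its resources, so a new launch should wait. */
    static boolean shouldDelayStart(long earliestLeaseExpiryMs,
                                    long nowMs, long graceMs) {
        return earliestLeaseExpiryMs - nowMs <= graceMs;
    }
}
```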

*Long-running services* By separating out the local scheduler, both it _and_ 
the localizer could be opened up as services provided by the NM. 
The localizer could also be extended to prioritize downloads among 
{{OPTIMISTIC}} containers (possibly preemptable by {{GUARANTEED}}), and to group 
containers based on their dependencies (e.g., avoid downloading a large dep for 
fewer than N optimistic containers). By exposing these services, the NM can 
assist with the following:

# Resource spikes. If a service container needs to spike temporarily, it may 
not need guaranteed resources (YARN-1197). Containers requiring low-latency 
elasticity could request optimistic resources instead of peak provisioning, 
resizing, or using workarounds like [Llama|http://cloudera.github.io/llama/]. 
If the local scheduler is addressable by local containers, then the lease could 
be logical (i.e., not start a process). Resources assigned to a {{RUNNING}} 
container could be published rather than triggering a launch. One could also 
imagine service workers marking some resources as unused, while retaining the 
authority to spike into them ("subleasing" them to opportunistic containers) by 
reclaiming them through the local scheduler.
# Upgrades. If the container needs to pull new dependencies, it could use the 
NM Localizer rather than coordinating the download itself.
# Maintenance tasks. Services often need to clean up, compact, scrub, and 
checkpoint local data. Right now, each service needs to independently monitor 
resource utilization to back off saturated resources (particularly disks). 
Coordination between services is difficult. In contrast, one could schedule 
tasks like block scrubbing as optimistic tasks in the NM to avoid interrupting 
services that are spiking. This is similar in spirit to distributed scheduling 
insofar as it does not involve the RM and targets a single host (i.e., the host 
the container is running on).
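
The "logical" lease in item 1 might be sketched as follows: resources are granted to an already-{{RUNNING}} container and published to it rather than triggering a launch, and can be reclaimed through the local scheduler. Everything here is speculative naming.

```java
import java.util.HashMap;
import java.util.Map;

// Speculative sketch of logical leases for resource spikes; no process is
// launched when a spike is granted.
public class LogicalLeaseSketch {
    private final long nodeMemMb;
    private long committedMb = 0;
    // extra memory currently leased to each running container, by id
    private final Map<String, Long> spikes = new HashMap<>();

    LogicalLeaseSketch(long nodeMemMb) { this.nodeMemMb = nodeMemMb; }

    /** Grant a temporary spike to a running container without a launch;
     *  the assignment is published to the container, not exec'd. */
    boolean grantSpike(String containerId, long memMb) {
        if (committedMb + memMb > nodeMemMb) return false;
        committedMb += memMb;
        spikes.merge(containerId, memMb, Long::sum);
        return true;
    }

    /** Reclaim a spike through the local scheduler ("sublease" returns). */
    void reclaim(String containerId) {
        Long mb = spikes.remove(containerId);
        if (mb != null) committedMb -= mb;
    }
}
```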

\[1\] Though it was selected as a victim by the RM, the local scheduler may 
decide to kill a different {{OPTIMISTIC}} container when the guaranteed 
container requests resources. For example, if a container completes on the node 
after the RM selected the victim, then the NM may elect to kill a smaller 
optimistic process if it is sufficient to satisfy the guarantee.
\[2\] Discussion on duration in YARN-1039 was part of a broader conversation on 
support for long-running services (YARN-896).


> Add SCHEDULE to NM container lifecycle
> --------------------------------------
>
>                 Key: YARN-4597
>                 URL: https://issues.apache.org/jira/browse/YARN-4597
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chris Douglas
>
> Currently, the NM immediately launches containers after resource 
> localization. Several features could be more cleanly implemented if the NM 
> included a separate stage for reserving resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
