[
https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101149#comment-15101149
]
Chris Douglas commented on YARN-4597:
-------------------------------------
The {{ContainerLaunchContext}} (CLC) specifies the prerequisites for starting a
container on a node. These include setting up user/application directories and
downloading dependencies to the NM cache (localization). The NM assumes that an
authenticated {{startContainer}} request has not overbooked resources on the
node, so resources are only reserved/enforced during the container launch and
execution.
This JIRA proposes to add a phase between localization and container launch to
manage a collection of runnable containers. Similar to the localizer stage, a
container will launch only after all the resources from its CLC are assigned by
a _local scheduler_. The local scheduler will select containers to run based on
priority, declared requirements, and by monitoring utilization on the node
(YARN-1011).
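As a rough illustration of the proposed phase, here is a minimal sketch of a queue that holds localized containers until their declared resources can be assigned. Everything in it is hypothetical: {{LocalSchedulerSketch}}, {{onLocalized}}, {{launchRunnable}}, and {{onCompleted}} are not existing NM classes or methods, memory is the only resource tracked, lower priority values are assumed to be more urgent, and utilization monitoring (YARN-1011) is left out entirely.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names are existing NM classes.
class LocalSchedulerSketch {
    // SCHEDULING is the proposed phase between LOCALIZED and RUNNING.
    enum State { LOCALIZED, SCHEDULING, RUNNING, DONE }
    enum ExecType { GUARANTEED, OPTIMISTIC }

    static final class Container {
        final String id;
        final ExecType type;
        final int priority; // assumption: lower value is more urgent
        final int memMb;    // declared requirement taken from the CLC
        State state = State.LOCALIZED;

        Container(String id, ExecType type, int priority, int memMb) {
            this.id = id;
            this.type = type;
            this.priority = priority;
            this.memMb = memMb;
        }
    }

    private final int capacityMb;
    private int usedMb = 0;

    // GUARANTEED sorts before OPTIMISTIC (enum order), then by priority.
    private final PriorityQueue<Container> runnable = new PriorityQueue<>(
        Comparator.comparing((Container c) -> c.type)
                  .thenComparingInt(c -> c.priority));

    LocalSchedulerSketch(int capacityMb) {
        this.capacityMb = capacityMb;
    }

    // Called when localization finishes: the container waits here
    // instead of launching immediately.
    void onLocalized(Container c) {
        c.state = State.SCHEDULING;
        runnable.add(c);
    }

    // Launch queued containers in order while the head of the queue fits.
    // Simplification: stops at the first container that does not fit.
    List<Container> launchRunnable() {
        List<Container> launched = new ArrayList<>();
        while (!runnable.isEmpty()
                && usedMb + runnable.peek().memMb <= capacityMb) {
            Container c = runnable.poll();
            usedMb += c.memMb;
            c.state = State.RUNNING;
            launched.add(c);
        }
        return launched;
    }

    // Release the resources of a finished container.
    void onCompleted(Container c) {
        usedMb -= c.memMb;
        c.state = State.DONE;
    }
}
```

A real implementation would presumably be driven by NM events and would consult observed utilization rather than declared memory alone; the sketch only shows where the wait between localization and launch would sit.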
A few future and in-progress features motivate this change.
*Preemption* Rather than sending a kill when it selects a victim container,
the RM could convert it from a {{GUARANTEED}} to an {{OPTIMISTIC}}
container (YARN-4335). This has two benefits. First, the downgraded container
can continue to run until a guaranteed container arrives _and_ finishes
localizing its dependencies, so the downgraded container has an opportunity to
complete or checkpoint. When the guaranteed container moves from {{LOCALIZED}}
to {{SCHEDULING}}, the local scheduler may select the victim (formerly
guaranteed) container to be killed. \[1\] Second, the NM may elect to kill the
victim container to run _different_ optimistic containers, particularly
short-running tasks.
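To make the victim choice concrete, here is one possible policy, sketched under loud assumptions: the NM tracks only memory, it prefers the single smallest optimistic container that covers the shortfall, and it otherwise kills the largest optimistic containers first. {{VictimSelectionSketch}}, {{Running}}, and {{selectVictims}} are hypothetical names, and this policy is an illustration rather than anything YARN-4597 specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of NM-side victim selection; not an existing NM API.
class VictimSelectionSketch {
    static final class Running {
        final String id;
        final boolean optimistic; // true for OPTIMISTIC (or downgraded) containers
        final int memMb;

        Running(String id, boolean optimistic, int memMb) {
            this.id = id;
            this.optimistic = optimistic;
            this.memMb = memMb;
        }
    }

    // Returns the ids of optimistic containers to kill so that a guaranteed
    // container needing neededMb can start, given freeMb currently unused.
    static List<String> selectVictims(List<Running> running, int freeMb, int neededMb) {
        List<String> victims = new ArrayList<>();
        int shortfall = neededMb - freeMb;
        if (shortfall <= 0) {
            return victims; // enough room already, nothing to kill
        }

        // Only optimistic containers are candidates; sort smallest first.
        List<Running> candidates = new ArrayList<>();
        for (Running r : running) {
            if (r.optimistic) {
                candidates.add(r);
            }
        }
        candidates.sort(Comparator.comparingInt((Running r) -> r.memMb));

        // Prefer the single smallest container that covers the shortfall.
        for (Running r : candidates) {
            if (r.memMb >= shortfall) {
                victims.add(r.id);
                return victims;
            }
        }

        // Otherwise kill the largest containers first until the guarantee fits.
        for (int i = candidates.size() - 1; i >= 0 && shortfall > 0; i--) {
            victims.add(candidates.get(i).id);
            shortfall -= candidates.get(i).memMb;
        }
        if (shortfall > 0) {
            // Assumption from the comment: the RM has not overbooked guarantees.
            throw new IllegalStateException("guaranteed request cannot be satisfied");
        }
        return victims;
    }
}
```

With a downgraded 4096 MB container and a 1024 MB optimistic container both running, a guaranteed request that is 1024 MB short would kill only the smaller one under this policy, leaving the container the RM originally selected free to continue.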
*Optimistic scheduling and overprovisioning* To support distributed scheduling
(YARN-2877) and resource-aware scheduling (YARN-1011), the NM needs a component
to select containers that are ready to run. The local scheduler can not only
select tasks to run based on monitoring, but also make offers to running
containers using durations attached to leases \[2\]. Based on recent
observations, it may start containers that oversubscribe the node, or delay
starting containers if a lease is close to expiring (i.e., the container is
likely to complete).
*Long-running services*. Note that by separating the local scheduler, both that
module _and_ the localizer could be opened up as services provided by the NM.
The localizer could also be extended to prioritize downloads among
{{OPTIMISTIC}} containers (possibly preemptable by {{GUARANTEED}}), and to group
containers based on their dependencies (e.g., avoid downloading a large dep for
fewer than N optimistic containers). By exposing these services, the NM can
assist with the following:
# Resource spikes. If a service container needs to spike temporarily, it may
not need guaranteed resources (YARN-1197). Containers requiring low-latency
elasticity could request optimistic resources instead of peak provisioning,
resizing, or using workarounds like [Llama|http://cloudera.github.io/llama/].
If the local scheduler is addressable by local containers, then the lease could
be logical (i.e., not start a process). Resources assigned to a {{RUNNING}}
container could be published rather than triggering a launch. One could also
imagine service workers marking some resources as unused, while retaining the
authority to spike into them ("subleasing" them to opportunistic containers) by
reclaiming them through the local scheduler.
# Upgrades. If the container needs to pull new dependencies, it could use the
NM Localizer rather than coordinating the download itself.
# Maintenance tasks. Services often need to clean up, compact, scrub, and
checkpoint local data. Right now, each service needs to independently monitor
resource utilization to back off from saturated resources (particularly disks).
Coordination between services is difficult. In contrast, one could schedule
tasks like block scrubbing as optimistic tasks in the NM to avoid interrupting
services that are spiking. This is similar in spirit to distributed scheduling
insofar as it does not involve the RM and targets a single host (i.e., the host
the container is running on).
\[1\] Though it was selected as a victim by the RM, the local scheduler may
decide to kill a different {{OPTIMISTIC}} container when the guaranteed
container requests resources. For example, if a container completes on the node
after the RM selected the victim, then the NM may elect to kill a smaller
optimistic process if it is sufficient to satisfy the guarantee.
\[2\] Discussion on duration in YARN-1039 was part of a broader conversation on
support for long-running services (YARN-896).
> Add SCHEDULE to NM container lifecycle
> --------------------------------------
>
> Key: YARN-4597
> URL: https://issues.apache.org/jira/browse/YARN-4597
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Chris Douglas
>
> Currently, the NM immediately launches containers after resource
> localization. Several features could be more cleanly implemented if the NM
> included a separate stage for reserving resources.