[ https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101149#comment-15101149 ]
Chris Douglas commented on YARN-4597:
-------------------------------------

The {{ContainerLaunchContext}} (CLC) specifies the prerequisites for starting a container on a node. These include setting up user/application directories and downloading dependencies to the NM cache (localization). The NM assumes that an authenticated {{startContainer}} request has not overbooked resources on the node, so resources are only reserved/enforced during the container launch and execution.

This JIRA proposes to add a phase between localization and container launch to manage a collection of runnable containers. Similar to the localizer stage, a container will launch only after all the resources from its CLC are assigned by a _local scheduler_. The local scheduler will select containers to run based on priority, declared requirements, and by monitoring utilization on the node (YARN-1011).

A few future and in-progress features motivate this change.

*Preemption* Instead of sending a kill when the RM selects a victim container, the RM could convert it from a {{GUARANTEED}} to an {{OPTIMISTIC}} container (YARN-4335). This has two benefits. First, the downgraded container can continue to run until a guaranteed container arrives _and_ finishes localizing its dependencies, so the downgraded container has an opportunity to complete or checkpoint. When the guaranteed container moves from {{LOCALIZED}} to {{SCHEDULING}}, the local scheduler may select the victim (formerly guaranteed) container to be killed. \[1\] Second, the NM may elect to kill the victim container to run _different_ optimistic containers, particularly short-running tasks.

*Optimistic scheduling and overprovisioning* To support distributed scheduling (YARN-2877) and resource-aware scheduling (YARN-1011), the NM needs a component to select containers that are ready to run.
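To make the selection step concrete, here is a minimal sketch of what such a component might look like. It is not NM code: {{LocalSchedulerSketch}}, {{Container}}, {{schedule}}, and {{pickVictims}} are hypothetical names, memory is assumed to be the only resource, the {{KILLED}} state is an assumption, and the victim policy (prefer the smallest optimistic container that covers the deficit) is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch only: none of these types exist in the NodeManager.
class LocalSchedulerSketch {
    enum ExecType { GUARANTEED, OPTIMISTIC }
    // LOCALIZED, SCHEDULING, RUNNING follow the comment; KILLED is an assumption.
    enum State { LOCALIZED, SCHEDULING, RUNNING, KILLED }

    static final class Container {
        final String id;
        final ExecType type;
        final int memMb;               // declared requirement (memory only, by assumption)
        State state = State.LOCALIZED;

        Container(String id, ExecType type, int memMb) {
            this.id = id;
            this.type = type;
            this.memMb = memMb;
        }
    }

    private final int capacityMb;
    private final List<Container> running = new ArrayList<>();
    private final List<Container> queued = new ArrayList<>();

    LocalSchedulerSketch(int capacityMb) {
        this.capacityMb = capacityMb;
    }

    int usedMb() {
        int used = 0;
        for (Container c : running) {
            used += c.memMb;
        }
        return used;
    }

    /** Called once localization completes; returns true if the container was launched. */
    boolean schedule(Container c) {
        c.state = State.SCHEDULING;
        int deficit = c.memMb - (capacityMb - usedMb());
        if (deficit > 0 && c.type == ExecType.GUARANTEED) {
            // Only guaranteed containers may displace optimistic ones.
            for (Container victim : pickVictims(deficit)) {
                victim.state = State.KILLED;
                running.remove(victim);
                deficit -= victim.memMb;
            }
        }
        if (deficit > 0) {
            queued.add(c);             // stays in SCHEDULING until resources free up
            return false;
        }
        c.state = State.RUNNING;
        running.add(c);
        return true;
    }

    /** Picks optimistic containers to kill; returns an empty list if they cannot cover the deficit. */
    private List<Container> pickVictims(int deficitMb) {
        List<Container> optimistic = new ArrayList<>();
        for (Container r : running) {
            if (r.type == ExecType.OPTIMISTIC) {
                optimistic.add(r);
            }
        }
        optimistic.sort(Comparator.comparingInt((Container x) -> x.memMb));
        // Prefer the smallest single container that is sufficient on its own.
        for (Container r : optimistic) {
            if (r.memMb >= deficitMb) {
                return Collections.singletonList(r);
            }
        }
        // Otherwise accumulate victims, largest first, until the deficit is covered.
        List<Container> victims = new ArrayList<>();
        int freed = 0;
        for (int i = optimistic.size() - 1; i >= 0 && freed < deficitMb; i--) {
            victims.add(optimistic.get(i));
            freed += optimistic.get(i).memMb;
        }
        return freed >= deficitMb ? victims : Collections.<Container>emptyList();
    }

    public static void main(String[] args) {
        LocalSchedulerSketch nm = new LocalSchedulerSketch(4096);
        Container a = new Container("opt-a", ExecType.OPTIMISTIC, 1024);
        Container b = new Container("opt-b", ExecType.OPTIMISTIC, 2048);
        Container g = new Container("guar-g", ExecType.GUARANTEED, 2048);
        nm.schedule(a);
        nm.schedule(b);
        nm.schedule(g);
        for (Container c : new Container[] {a, b, g}) {
            System.out.println(c.id + " " + c.state);
        }
    }
}
```

In the {{main}} example, the guaranteed container is 1024 MB short when it reaches {{SCHEDULING}}, so the sketch kills {{opt-a}} (1024 MB) and leaves the larger {{opt-b}} running. A real implementation would presumably also consult observed utilization and lease durations rather than declared sizes alone.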
The local scheduler can not only select tasks to run based on monitoring, but can also make offers to running containers using durations attached to leases \[2\]. Based on recent observations, it may start containers that oversubscribe the node, or delay starting containers if a lease is close to expiring (i.e., the container is likely to complete).

*Long-running services*. Note that by separating the local scheduler, both that module _and_ the localizer could be opened up as services provided by the NM. The localizer could also be extended to prioritize downloads among {{OPTIMISTIC}} containers (possibly preemptable by {{GUARANTEED}}), and to group containers based on their dependencies (e.g., avoid downloading a large dep for fewer than N optimistic containers).

By exposing these services, the NM can assist with the following:
# Resource spikes. If a service container needs to spike temporarily, it may not need guaranteed resources (YARN-1197). Containers requiring low-latency elasticity could request optimistic resources instead of peak provisioning, resizing, or using workarounds like [Llama|http://cloudera.github.io/llama/]. If the local scheduler is addressable by local containers, then the lease could be logical (i.e., not start a process). Resources assigned to a {{RUNNING}} container could be published rather than triggering a launch. One could also imagine service workers marking some resources as unused, while retaining the authority to spike into them ("subleasing" them to opportunistic containers) by reclaiming them through the local scheduler.
# Upgrades. If the container needs to pull new dependencies, it could use the NM Localizer rather than coordinating the download itself.
# Maintenance tasks. Services often need to clean up, compact, scrub, and checkpoint local data. Right now, each service needs to independently monitor resource utilization to back off saturated resources (particularly disks). Coordination between services is difficult.
In contrast, one could schedule tasks like block scrubbing as optimistic tasks in the NM to avoid interrupting services that are spiking.

This is similar in spirit to distributed scheduling insofar as it does not involve the RM and targets a single host (i.e., the host the container is running on).

\[1\] Though it was selected as a victim by the RM, the local scheduler may decide to kill a different {{OPTIMISTIC}} container when the guaranteed container requests resources. For example, if a container completes on the node after the RM selected the victim, then the NM may elect to kill a smaller optimistic process if it is sufficient to satisfy the guarantee.
\[2\] Discussion on duration in YARN-1039 was part of a broader conversation on support for long-running services (YARN-896).

> Add SCHEDULE to NM container lifecycle
> --------------------------------------
>
>                 Key: YARN-4597
>                 URL: https://issues.apache.org/jira/browse/YARN-4597
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chris Douglas
>
> Currently, the NM immediately launches containers after resource
> localization. Several features could be more cleanly implemented if the NM
> included a separate stage for reserving resources.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)