[
https://issues.apache.org/jira/browse/YARN-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844344#comment-13844344
]
Arun C Murthy commented on YARN-1404:
-------------------------------------
I've spent time thinking about this in the context of running a myriad of
external systems in YARN, such as Impala, HDFS Caching (HDFS-4949) and others.
The overarching goal is to allow YARN to act as a ResourceManager for the
overall cluster *and* a Workload Manager for external systems, i.e. Impala or
HDFS can rely on YARN's queues for workload management, SLAs via preemption,
etc.
Is that a good characterization of the problem at hand?
I think it's a good goal to support - this will allow other external systems to
leverage YARN's capabilities for both resource sharing and workload management.
Now, if we all agree on this, we can figure out the best way to support this in
a first-class manner.
----
Ok, the core requirement is for an external system (Impala, HDFS, others) to
leverage YARN's workload management capabilities (queues etc.) to acquire
resources (cpu, memory) *on behalf* of a particular entity (user, queue) for
completing a user's request (run a query, cache a dataset in RAM).
The *key* is that these external systems need to acquire resources on behalf of
the user and ensure that the chargeback is applied to the correct user, queue
etc.
This is a *brand new requirement* for YARN... so far, we have assumed that the
entity acquiring the resource would also actually utilize the resource by
launching a container etc.
Here, it's clear that the requirement is that the entity acquiring the resource
would like to *delegate* the resource to an external framework. For example:
# A user query would like to acquire cpu, memory etc. for appropriate
accounting chargeback and then delegate it to Impala.
# A user request for caching data would like to acquire memory for appropriate
accounting chargeback and then delegate it to the DataNode.
----
In this scenario, I think explicitly allowing for *delegation* of a container
would solve the problem in a first-class manner.
We should add a new API to the NodeManager which would allow an application to
*delegate* a container's resources to a different container:
{code:title=ContainerManagementProtocol.java|borderStyle=solid}
public interface ContainerManagementProtocol {
  // ...
  public DelegateContainerResponse delegateContainer(DelegateContainerRequest request);
  // ...
}
{code}
{code:title=DelegateContainerRequest.java|borderStyle=solid}
public abstract class DelegateContainerRequest {
  // ...
  // launch context of the source container whose resources are being delegated
  public abstract ContainerLaunchContext getSourceContainer();
  // the running recipient container (e.g. impalad, DataNode) receiving the resources
  public abstract ContainerId getTargetContainer();
  // ...
}
{code}
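To make the intended flow concrete, here's a rough usage sketch (not working
code - the delegation API above is only a proposal, and newInstance() plus the
method/variable names are assumptions for illustration):
{code:title=DelegationUsageSketch.java|borderStyle=solid}
// Hypothetical sketch: how an AM (e.g. Llama acting for Impala) might use the
// proposed API. The allocation itself follows the normal AMRMClient flow, so
// the cpu/memory is charged back to the correct user/queue.
void delegateToRecipient(ContainerManagementProtocol cmProxy,
    ContainerLaunchContext sourceCtx,   // context of the container allocated on behalf of the user
    ContainerId recipientContainerId)   // long-running recipient container (impalad, DataNode)
    throws IOException, YarnException {
  // Instead of startContainer(), hand the allocated resources over to the
  // recipient container; no new process is launched for the source container.
  DelegateContainerRequest request =
      DelegateContainerRequest.newInstance(sourceCtx, recipientContainerId);
  cmProxy.delegateContainer(request);
  // The NodeManager would then grow the recipient container's cgroup by the
  // delegated amount.
}
{code}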
The implementation of this API would notify the NodeManager to change its
monitoring of the recipient container, i.e. Impala or the DataNode, by
modifying the cgroup of the recipient container.
Similarly, the NodeManager could be instructed by the ResourceManager to
preempt the resources of the source container in order to continue serving the
global SLAs of the queues - again, this is implemented by modifying the cgroup
of the recipient container. This allows the ResourceManager/NodeManager to stay
explicitly in control of resources, even in the face of misbehaving AMs etc.
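For illustration, the NodeManager-side enforcement could boil down to adjusting
the recipient container's cgroup limits (a minimal sketch only - the real
cgroup handling in the NodeManager is more involved; this assumes the cgroup v1
cpu and memory controllers, and the method/parameter names are illustrative):
{code:title=RecipientCgroupSketch.java|borderStyle=solid}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: grow the recipient container's cgroup on delegation, or shrink
// it when the RM preempts the source container (pass negative deltas).
public class RecipientCgroupSketch {
  static void adjustRecipientCgroup(String cpuCgroupPath, String memCgroupPath,
      long deltaCpuShares, long deltaMemoryBytes) throws IOException {
    // cpu.shares under the cpu controller: add/remove the delegated CPU weight
    Path cpuShares = Paths.get(cpuCgroupPath, "cpu.shares");
    long shares = Long.parseLong(new String(Files.readAllBytes(cpuShares)).trim());
    Files.write(cpuShares, Long.toString(shares + deltaCpuShares).getBytes());

    // memory.limit_in_bytes under the memory controller: same idea for memory
    Path memLimit = Paths.get(memCgroupPath, "memory.limit_in_bytes");
    long limit = Long.parseLong(new String(Files.readAllBytes(memLimit)).trim());
    Files.write(memLimit, Long.toString(limit + deltaMemoryBytes).getBytes());
  }
}
{code}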
----
The result of the above proposal is very similar to what is already being
discussed, the only difference being that this is explicit (the NodeManager
knows the source and recipient containers), which allows all existing features
such as preemption, over-allocation of resources to YARN queues etc. to
continue to work as they do today.
----
Thoughts?
> Enable external systems/frameworks to share resources with Hadoop leveraging
> Yarn resource scheduling
> -----------------------------------------------------------------------------------------------------
>
> Key: YARN-1404
> URL: https://issues.apache.org/jira/browse/YARN-1404
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Affects Versions: 2.2.0
> Reporter: Alejandro Abdelnur
> Assignee: Alejandro Abdelnur
> Attachments: YARN-1404.patch
>
>
> Currently Hadoop Yarn expects to manage the lifecycle of the processes in
> which its applications run their workload. External frameworks/systems could
> benefit from sharing resources with other Yarn applications while running
> their workload within long-running processes owned by the external framework
> (in other words, running their workload outside of the context of a Yarn
> container process).
> Because Yarn provides robust and scalable resource management, it is
> desirable for some external systems to leverage the resource governance
> capabilities of Yarn (queues, capacities, scheduling, access control) while
> supplying their own resource enforcement.
> Impala is an example of such system. Impala uses Llama
> (http://cloudera.github.io/llama/) to request resources from Yarn.
> Impala runs an impalad process on every node of the cluster. When a user
> submits a query, the processing is broken into 'query fragments', which are
> run in multiple impalad processes leveraging data locality (similar to
> Map-Reduce Mappers processing a collocated HDFS block of input data).
> The execution of a 'query fragment' requires an amount of CPU and memory in
> the impalad, and the impalad shares its host with other services (HDFS
> DataNode, Yarn NodeManager, HBase Region Server) and Yarn applications
> (MapReduce tasks).
> To ensure cluster utilization follows the Yarn scheduler policies and the
> cluster nodes are not overloaded, Impala requests the required amount of CPU
> and memory from Yarn before running a 'query fragment' on a node. Once the
> requested CPU and memory have been allocated, Impala starts running the
> 'query fragment', taking care that the 'query fragment' does not use more
> resources than the ones that have been allocated. Memory is bookkept per
> 'query fragment' and the threads used for processing the 'query fragment'
> are placed under a cgroup to contain CPU utilization.
> Today, for any resources that have been requested from the Yarn RM, a
> (container) process must be started via the corresponding NodeManager.
> Failing to do this results in the cancellation of the container allocation,
> relinquishing the acquired resource capacity back to the pool of available
> resources. To avoid this, Impala starts a dummy container process running
> 'sleep 10y'.
> Using a dummy container process has its drawbacks:
> * the dummy container process is in a cgroup with a given number of CPU
> shares that are not used, and Impala re-issues those CPU shares to another
> cgroup for the threads running the 'query fragment'. The cgroup CPU
> enforcement works correctly because of the CPU controller implementation (but
> the formally specified behavior is actually undefined).
> * Impala may ask for CPU and memory independently of each other. Some
> requests may be memory only with no CPU, or vice versa. Because a container
> requires a process, complete absence of memory or CPU is not possible: even
> if the dummy process is 'sleep', a minimal amount of memory and CPU is
> required for the dummy process.
> Because of this, it is desirable to be able to have a container without a
> backing process.
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)