[
https://issues.apache.org/jira/browse/YARN-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantinos Karanasos updated YARN-1011:
-----------------------------------------
Attachment: patch-for-yarn-1011.patch
Hi guys, as I mentioned to [~kasha] in an offline discussion that also included
[~asuresh], I recently made some very similar changes to the RM and the
CapacityScheduler, which could be of use here.
We implemented a system, called Yaq, that allows queuing of tasks at the NMs
(using some bits from YARN-2883). It will be presented at EuroSys in a couple of
months, so I will soon be able to share the paper as well. If you consider
queuing as a way of over-committing resources, the setting has a lot in common
with the current JIRA.
The basic idea is that each NM advertises a number of queue slots (i.e.,
containers that are allowed to be queued at that NM), so that some tasks are
ready to start immediately once resources become available. This way we can
mask the allocation latency of waiting for the RM to assign and send new tasks
to the NM.
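To make this concrete, here is a minimal sketch of the per-node state such an advertisement could carry; all class and field names are hypothetical, not the ones used in the attached patch:

{code:java}
// Hypothetical sketch: per-node queue state reported to the scheduler.
// Names are illustrative only; the actual patch extends existing YARN classes.
public class NodeQueueStatus {
  private final int queueSlotCapacity; // max containers the NM will queue
  private final int queuedContainers;  // containers currently waiting at the NM
  private final long estimatedWaitMs;  // expected time until a queued container starts

  public NodeQueueStatus(int queueSlotCapacity, int queuedContainers,
      long estimatedWaitMs) {
    this.queueSlotCapacity = queueSlotCapacity;
    this.queuedContainers = queuedContainers;
    this.estimatedWaitMs = estimatedWaitMs;
  }

  /** Queue slots still available for new placements. */
  public int availableQueueSlots() {
    return queueSlotCapacity - queuedContainers;
  }

  public long getEstimatedWaitMs() {
    return estimatedWaitMs;
  }
}
{code}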
Then the central scheduler (the CapacityScheduler in our case) performs
placement in the following way (a rough code sketch of this control flow
follows the list):
# While there are still available resources (I hard-coded the threshold at <95%
utilization; Karthik mentioned 80% above), scheduling is heartbeat-driven, as
it is today.
# Above 95% utilization, an additional asynchronous thread orders the nodes (in
the current implementation, by the expected wait time at each node) and then
starts filling up their queue slots. The reason for having this thread is
twofold: (1) we don't want a container to be given a queue slot if a run slot
is available, and (2) we want the global view of all nodes so that we can favor
the ones whose queues are less loaded.
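Here is the promised sketch of that two-mode control flow, reusing the hypothetical {{NodeQueueStatus}} from the previous snippet; the helper methods are illustrative stubs, not the patch's API:

{code:java}
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the two-mode placement described above.
public class QueueAwarePlacement {
  // Hard-coded in our prototype; Karthik suggested 80% above.
  private static final double QUEUE_MODE_THRESHOLD = 0.95;

  /** Heartbeat path: below the threshold, behave exactly as today. */
  void onNodeHeartbeat(NodeQueueStatus node, double clusterUtilization) {
    if (clusterUtilization < QUEUE_MODE_THRESHOLD) {
      assignToRunSlots(node); // regular heartbeat-driven scheduling
    }
    // Above the threshold, placement is left to the asynchronous thread
    // below, which has the global view needed to compare queues across nodes.
  }

  /** Asynchronous thread body: fill queue slots, least-loaded nodes first. */
  void fillQueueSlots(List<NodeQueueStatus> nodes) {
    // Favor less-loaded queues by sorting on each node's expected wait time.
    nodes.sort(Comparator.comparingLong(NodeQueueStatus::getEstimatedWaitMs));
    for (NodeQueueStatus node : nodes) {
      int free = node.availableQueueSlots();
      while (free > 0 && hasPendingContainers()) {
        queueContainerAt(node);
        free--;
      }
    }
  }

  // --- illustrative stubs ---
  void assignToRunSlots(NodeQueueStatus node) { /* existing heartbeat path */ }
  boolean hasPendingContainers() { return false; }
  void queueContainerAt(NodeQueueStatus node) { /* ship container to the NM queue */ }
}
{code}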
As you can see, the scheduling logic is very similar to what [~kasha] described
above, so I think we can reuse those bits.
What is also similar to the over-commitment described in this JIRA is that we
extended the {{SchedulerNode}} so that it accounts for two types of resources,
namely run slots (which were already there) and queue slots.
Although in the current implementation we do not over-commit resources (queued
tasks start only after allocated resources free up), this is purely a matter of
how the NM decides to treat the additional tasks it receives, and it could
easily be changed to actually over-commit. As far as the RM/scheduler is
concerned, the logic is the same either way: you have the allocated resources
(run slots), plus some additional resources advertised by the nodes that can be
used either for queuing containers (in Yaq's case) or for starting additional
tasks based on actual resource utilization (in the present JIRA's case).
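To illustrate that accounting change on its own (again with made-up names, not the patch's), the node-side bookkeeping boils down to tracking two slot pools and always preferring run slots:

{code:java}
// Hypothetical sketch of the two-resource accounting; in the actual patch
// this logic lives in the extended SchedulerNode.
public class DualSlotAccounting {
  private final int runSlotCapacity;   // resources the node can actually run
  private final int queueSlotCapacity; // extra slots advertised for queuing
  private int usedRunSlots;
  private int usedQueueSlots;

  public DualSlotAccounting(int runSlotCapacity, int queueSlotCapacity) {
    this.runSlotCapacity = runSlotCapacity;
    this.queueSlotCapacity = queueSlotCapacity;
  }

  /** Prefer a run slot; fall back to a queue slot only when none is free. */
  public boolean tryPlace() {
    if (usedRunSlots < runSlotCapacity) {
      usedRunSlots++;
      return true;
    }
    if (usedQueueSlots < queueSlotCapacity) {
      usedQueueSlots++;
      return true;
    }
    return false; // node is full: no run or queue slot available
  }

  /** When a running container finishes, promote a queued one if present. */
  public void onContainerFinished() {
    usedRunSlots--;
    if (usedQueueSlots > 0) {
      usedQueueSlots--;
      usedRunSlots++; // the queued container starts immediately
    }
  }
}
{code}

Whether the "extra" slots mean queued containers or opportunistically over-committed ones is decided entirely at the NM; the RM-side accounting above is identical in both cases.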
I am attaching an initial patch that contains the classes related to the RM. It
is against branch-2.7.1. Since I left out many classes that were specific to
Yaq, the patch will not compile, but you should be able to see all the changes
I made to the {{SchedulerNode}}, the {{CapacityScheduler}}, the {{*Queue}}
classes, etc. I would be happy to share more code if needed.
> [Umbrella] Schedule containers based on utilization of currently allocated
> containers
> -------------------------------------------------------------------------------------
>
> Key: YARN-1011
> URL: https://issues.apache.org/jira/browse/YARN-1011
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Arun C Murthy
> Attachments: patch-for-yarn-1011.patch, yarn-1011-design-v0.pdf,
> yarn-1011-design-v1.pdf, yarn-1011-design-v2.pdf
>
>
> Currently RM allocates containers and assumes resources allocated are
> utilized.
> RM can, and should, get to a point where it measures utilization of allocated
> containers and, if appropriate, allocate more (speculative?) containers.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)