[
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730856#comment-14730856
]
Jason Lowe commented on YARN-4059:
----------------------------------
I think at a high level that can work, since we can use the reservation to
track the time per request. However, there are some details about the way
reservations currently work that will cause problems. Here's an extreme
example:
The cluster is big and almost completely empty. Rack R is completely full, but
all the other racks are completely empty. Lots of apps are trying to be
scheduled on the cluster. App A is at the front of the scheduling queue, wants
nothing but a lot of containers on rack R, and has a very big user limit. When
a node shows up that isn't in rack R, we'll place a reservation on it that only
asks for a small fraction of the node's overall capability. However, since a
node can only hold one reservation at a time, nothing else can be scheduled on
that node even though it has plenty of free space. If the app has enough user
limit to put a reservation on each node not in rack R, then we've locked out
the whole cluster for app A's node-local-wait duration. Even if we don't lock
out the whole cluster, app A is essentially locking out an entire node for each
reservation it makes until it finds locality or the locality wait period ends.
That's going to slow down scheduling in general.
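To make the lockout concrete, here is a tiny model of that behavior (the class
and method names are made up for illustration, not actual scheduler code): a
node carrying a single small reservation turns away every other allocation, no
matter how much free space remains.
{code:java}
import java.util.Objects;

class SchedulerNodeModel {
  private final int totalMemMb;
  private int usedMemMb;
  private String reservedForApp;  // at most one reservation per node

  SchedulerNodeModel(int totalMemMb) {
    this.totalMemMb = totalMemMb;
  }

  /** App A parks a reservation here; only one can exist at a time. */
  boolean tryReserve(String appId) {
    if (reservedForApp != null) {
      return false;
    }
    reservedForApp = appId;
    return true;
  }

  /** Every other app is refused while the reservation is outstanding. */
  boolean tryAllocate(String appId, int memMb) {
    if (reservedForApp != null && !Objects.equals(reservedForApp, appId)) {
      return false;  // node is "locked" even though free space remains
    }
    if (usedMemMb + memMb > totalMemMb) {
      return false;
    }
    usedMemMb += memMb;
    return true;
  }
}
{code}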
The problem is that a reservation assumes the node is full, hence there is only
ever one reservation per node. So we would either need to support handling
multiple reservations on a node or modify the algorithm to use a combination of
containers and reservations. We could use reservations when the node is not big
enough to allocate the container we want to place, but we would use a container
allocation to "reserve" space on a node if the node actually has space. We
would _not_ give the container to the app until the node-local-wait expired,
and we would kill the container and re-allocate on a node with locality if one
shows up within the wait period. That would allow other apps to schedule on the
node if it still has space after we have placed all the "reserved while waiting
for locality" containers.
I think we also need to refine the algorithm a bit so it will move
reservations/containers as locality improves. For example, an app needs host A,
which is totally full, but the rest of the nodes on that rack are totally
empty. The app initially reserves on an off-rack node since that's the first
node that heartbeated. Again, peephole scheduling isn't helping here. It would
be unfortunate to have the app wait around for a node-local allocation only to
give up and use an off-rack allocation because that's where it happened to
initially reserve. If we initially reserve off-rack but later find a rack-local
placement, then we should migrate the reservation so the fallback allocation is
better if we never get node-local.
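A minimal sketch of that migration rule, assuming a simple best-to-worst
ordering of locality levels (the names below are illustrative, not references
to existing classes):
{code:java}
// Locality levels ordered best to worst; a smaller ordinal is better.
enum LocalityLevel { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

class PlacementMigrator {
  /**
   * Move the held reservation/container only when the candidate node offers
   * strictly better locality than where we currently hold space, so the
   * eventual fallback allocation is as local as we can make it.
   */
  static boolean shouldMigrate(LocalityLevel current, LocalityLevel candidate) {
    return candidate.ordinal() < current.ordinal();
  }
}
{code}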
> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Chang Li
> Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other
> queues to fully consume the resources that were freed up, due to delays
> waiting for better locality, etc. Those delays can cause the resources to be
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or
> time, to avoid granting containers to a queue that was recently preempted.
> The delay should be sufficient to cover the cycles of the preemption monitor,
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all
> the other queues want no locality. No locality means only one container is
> assigned per heartbeat, so we need to wait for the entire cluster to heartbeat
> in, multiplied by the number of containers that could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption
> monitor cycle time or the amount of time it takes to allocate the cluster
> with one container per heartbeat. Guessing this will be somewhere around 2
> minutes.
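A back-of-the-envelope sketch of that penalty-time formula; the sample values
below are assumptions picked only to show how a roughly two-minute figure could
fall out, not actual cluster defaults.
{code:java}
import java.time.Duration;

class PenaltyTime {
  static Duration compute(Duration preemptionMonitorCycle,
                          Duration nodeHeartbeatInterval,
                          int containersPerNode) {
    // Worst case from the description above: one container granted per
    // heartbeat, so it takes a full round of cluster heartbeats per container
    // that fits on a single node.
    Duration allocationTime = nodeHeartbeatInterval.multipliedBy(containersPerNode);
    return preemptionMonitorCycle.compareTo(allocationTime) > 0
        ? preemptionMonitorCycle
        : allocationTime;
  }

  public static void main(String[] args) {
    // Assumed values: 15s monitor cycle, 3s heartbeats, 40 containers per
    // node -> max(15s, 120s) = 120s, in line with the ~2 minute guess above.
    System.out.println(compute(Duration.ofSeconds(15), Duration.ofSeconds(3), 40));
  }
}
{code}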
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)