Wangda Tan commented on YARN-4059:
Thanks for the explanation; now I can better understand why YARN chose a
count-based wait instead of a time-based wait at the beginning.
bq. If the app only wants a small portion of the cluster then it already scales
down the amount of time it will wait in getLocalityWaitFactor.
I think this is another thing we need to fix. Currently it uses
#requested-containers * localityWaitFactor as the minimum wait threshold for
off-switch. If an app asks for #containers >> (#hosts-per-rack) (say it asks
for 10k containers in a cluster with 4 racks of 5 nodes each), and the
expected racks are not available, the app may need to wait 10+ minutes to get
one off-switch container. I would like to make this more deterministic: use a
fixed delay, such as waiting 5 seconds at rack-local before going off-switch.
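To make the scaling problem above concrete, here is a minimal sketch of the count-based threshold. The names (localityWaitFactor, the threshold formula) mirror the CapacityScheduler concepts discussed here, but the class, the heartbeat interval, and the exact arithmetic are illustrative assumptions, not the actual YARN code:

```java
// Sketch: why a large #requested-containers inflates the off-switch wait.
public class OffSwitchDelaySketch {
    // Roughly: scale the wait by the fraction of the cluster requested, capped at 1.0.
    static float localityWaitFactor(int requestedContainers, int clusterNodes) {
        return Math.min((float) requestedContainers / clusterNodes, 1.0f);
    }

    public static void main(String[] args) {
        int requestedContainers = 10_000;
        int clusterNodes = 4 * 5;          // 4 racks x 5 nodes, as in the example
        long heartbeatIntervalMs = 1_000;  // assumed node heartbeat interval

        // Minimum missed node heartbeats before off-switch assignment is allowed.
        float threshold = requestedContainers
            * localityWaitFactor(requestedContainers, clusterNodes);

        // Each node heartbeat contributes one missed opportunity, so the cluster
        // produces clusterNodes opportunities per heartbeat interval.
        double waitSeconds = threshold / clusterNodes * heartbeatIntervalMs / 1000.0;
        System.out.printf("threshold=%.0f missed heartbeats, ~%.0f seconds%n",
            threshold, waitSeconds);
    }
}
```

With these assumed numbers the threshold is 10,000 missed heartbeats, on the order of hundreds of seconds of waiting, which is the non-determinism the fixed delay would remove.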
bq. The problem I think we're going to run into with a time-based approach is
that we don't know what time an individual request arrived since we only store
the aggregation of requests for a particular priority.
I totally agree with this. Here's one solution we have in mind that may solve
the problem; I discussed it offline with [~vinodkv], and it seems to work
end-to-end:
- When an app is able to allocate a container on a node but prefers to wait,
it will reserve the node. (The current behavior is that reservation happens
only after the app accumulates enough missed opportunities.)
- The benefits of reserving before missed-opportunity are: 1) the application
officially declares "this is my node", so the rest of the applications will be
skipped; 2) we already have a mechanism to avoid excessive reservations, so one
high-priority app cannot block the whole cluster if it only asks for a few
containers.
- Redefine locality-delay to be: the amount of time an app is willing to wait
to *allocate a single container for a given app/priority*. This is much more
deterministic than the existing count-based delay.
- We will start the waiting-timer once we reserve a container on a node. The
waiting-timer is a property of the reserved RMContainer, so if we choose to
move the reservation, the wait-timer will be kept.
- This solution supports per-app/per-priority locality-delay, and it is not
affected by how many nodes/racks are in the cluster.
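The bullet points above can be sketched as follows. Every class and method name here is hypothetical (the real change would live in the RMContainer/scheduler reservation code paths); the sketch only shows the two properties being proposed, a timer that starts at reservation time and survives a reservation move:

```java
// Sketch of the proposed reservation waiting-timer (hypothetical names).
public class ReservationTimerSketch {

    /** A reserved container that carries its waiting-timer start time. */
    static class ReservedContainer {
        final long reservedAtMs;  // timer starts when the reservation is made

        ReservedContainer(long nowMs) {
            this.reservedAtMs = nowMs;
        }

        /** The timer is a property of the reservation, so moving the
         *  reservation to another node keeps the original start time. */
        ReservedContainer moveToOtherNode() {
            return new ReservedContainer(this.reservedAtMs);
        }

        /** Per-app/per-priority locality-delay: once the app has waited long
         *  enough, relax to the next locality level (e.g. off-switch). */
        boolean canRelaxLocality(long nowMs, long localityDelayMs) {
            return nowMs - reservedAtMs >= localityDelayMs;
        }
    }

    public static void main(String[] args) {
        long localityDelayMs = 5_000;  // e.g. wait 5s before going off-switch
        ReservedContainer rc = new ReservedContainer(0);
        System.out.println(rc.canRelaxLocality(3_000, localityDelayMs));                   // false
        System.out.println(rc.moveToOtherNode().canRelaxLocality(6_000, localityDelayMs)); // true
    }
}
```

Note the key property: the delay depends only on wall-clock time per app/priority, not on the number of nodes or racks heartbeating in.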
> Preemption should delay assignments back to the preempted queue
> Key: YARN-4059
> URL: https://issues.apache.org/jira/browse/YARN-4059
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Chang Li
> Assignee: Chang Li
> Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
> When preempting containers from a queue it can take a while for the other
> queues to fully consume the resources that were freed up, due to delays
> waiting for better locality, etc. Those delays can cause the resources to be
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or
> time, to avoid granting containers to a queue that was recently preempted.
> The delay should be sufficient to cover the cycles of the preemption monitor,
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all
> the other queues want no locality. No locality means only one container is
> assigned per heartbeat, so we need to wait for the entire cluster to
> heartbeat in, times the number of containers that could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption
> monitor cycle time or the amount of time it takes to allocate the cluster
> with one container per heartbeat. Guessing this will be somewhere around 2