Wangda Tan commented on YARN-4059:

Hi [~jlowe],

Thanks for the explanation, now I can better understand why YARN chose a 
count-based wait instead of a time-based wait in the beginning.

bq. If the app only wants a small portion of the cluster then it already scales 
down the amount of time it will wait in getLocalityWaitFactor.
I think this is another thing we need to fix. Currently it uses 
#requested-containers * localityWaitFactor as the minimum wait threshold for 
off-switch allocation. If an app asks for #containers >> #hosts-per-rack 
(let's say it asks for 10k containers in a cluster with 4 racks of 5 nodes 
each), and the expected racks are not available, the app may need to wait 
10+ minutes to get one off-switch container. I would like to make this more 
deterministic: it should be a fixed number, such as waiting 5 secs for 
rack-local before falling back to off-switch.
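To make the problem concrete, here is a simplified model (assumed names and numbers for illustration, not the actual CapacityScheduler code) of how the count-based off-switch threshold scales with the size of the request rather than the size of the cluster:

```java
// Simplified model of the count-based off-switch delay: the number of
// missed scheduling opportunities required before off-switch allocation
// is allowed grows with #requested-containers, so a 10k-container ask
// on a 20-node cluster implies hundreds of full cluster heartbeat sweeps.
public class OffSwitchDelayModel {
    // Hypothetical helper mirroring the idea of getLocalityWaitFactor:
    // scale by the fraction of the cluster the app asked for, capped at 1.
    static double localityWaitFactor(int uniqueRequestedNodes, int clusterNodes) {
        return Math.min((double) uniqueRequestedNodes / clusterNodes, 1.0);
    }

    // Count-based threshold: missed opportunities needed before an
    // off-switch assignment is permitted.
    static long offSwitchThreshold(int requestedContainers, double waitFactor) {
        return (long) (requestedContainers * waitFactor);
    }

    public static void main(String[] args) {
        int clusterNodes = 20;   // 4 racks * 5 nodes, as in the example above
        // The app requested containers on every node, so the factor is 1.0
        // and the threshold equals the full request size.
        double wf = localityWaitFactor(clusterNodes, clusterNodes);
        long threshold = offSwitchThreshold(10_000, wf);
        // With one missed opportunity per node heartbeat, 10k missed
        // opportunities on a 20-node cluster means ~500 cluster sweeps,
        // i.e. many minutes of waiting before the first off-switch container.
        System.out.println("missed-opportunity threshold = " + threshold);
    }
}
```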

bq. The problem I think we're going to run into with a time-based approach is 
that we don't know what time an individual request arrived since we only store 
the aggregation of requests for a particular priority.
I totally agree with this. Here's one solution we have in mind that may solve 
the problem; I discussed this offline with [~vinodkv], and it seems to work 
end-to-end:

- When an app is able to allocate a container on a node but prefers to wait, 
it will reserve on that node. (Current behavior is that reservation happens 
only after the app accumulates enough missed opportunities.)
- Benefits of doing this before missed-opportunity are: 1) the application 
officially declares "this is my node", so the rest of the applications will be 
skipped. 2) We already have a mechanism to avoid excessive reservation, so one 
high-priority app cannot block a whole cluster if it only asks for a few 
containers.
- Redefine locality-delay to be: the amount of time that an app is willing to 
wait to *allocate a single container for a given app/priority*. This is very 
deterministic to me (much more deterministic than the existing count-based 
delay).
- We will start the waiting-timer once we have reserved a container on a node. 
The waiting-timer is a property of the reserved RMContainer; if we choose to 
move the reservation, the wait-timer will be kept.
- This solution supports per-app/per-priority locality-delay, and it isn't 
affected by how many nodes/racks are in the cluster.


> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>                 Key: YARN-4059
>                 URL: https://issues.apache.org/jira/browse/YARN-4059
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to heartbeat 
> in, multiplied by the number of containers that could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
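The worst-case arithmetic described in the issue can be sketched as follows (illustrative numbers, not from the issue itself):

```java
// Back-of-envelope model of the "penalty time" from the issue
// description: the max of the preemption monitor cycle time and the
// time to drain the freed capacity at one container per heartbeat.
public class PenaltyTime {
    static long penaltyMs(long preemptionCycleMs,
                          long heartbeatIntervalMs,
                          int containersPerNode) {
        // One container is assigned per node heartbeat, so draining the
        // freed capacity takes one full cluster heartbeat sweep per
        // container slot on a node.
        long drainMs = heartbeatIntervalMs * containersPerNode;
        return Math.max(preemptionCycleMs, drainMs);
    }

    public static void main(String[] args) {
        // Example: 15s preemption cycle, 3s node heartbeat, 40 containers
        // per node -> the drain time dominates at 120s (~2 minutes),
        // consistent with the rough guess in the issue.
        System.out.println("penalty = " + penaltyMs(15_000, 3_000, 40) + " ms");
    }
}
```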
