Wangda Tan commented on YARN-4059:

Hi [~jlowe],

Thanks for the explanation, now I can better understand why YARN chose a 
count-based wait instead of a time-based wait in the beginning.

bq. If the app only wants a small portion of the cluster then it already scales 
down the amount of time it will wait in getLocalityWaitFactor.
I think this is another thing we need to fix. Currently it uses 
#requested-containers * localityWaitFactor as the minimum wait threshold for 
off-switch allocation. If an app asks for #containers >> #hosts-per-rack 
(let's say it asks for 10k containers in a cluster with 4 racks of 5 nodes 
each), and the expected racks are not available, the app may need to wait 
10+ minutes to get one off-switch container. I would like to make this more 
deterministic: it should be a fixed number, such as waiting 5 secs for 
rack-local before falling back to off-switch.
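To make the problem concrete, here is a simplified model (assumed names and numbers for illustration, not the actual CapacityScheduler code) of how the count-based off-switch threshold scales with the size of the request rather than the size of the cluster:

```java
// Simplified model of the count-based off-switch delay: the number of
// missed scheduling opportunities required before off-switch allocation
// is allowed grows with #requested-containers, so a 10k-container ask
// on a 20-node cluster implies hundreds of full cluster heartbeat sweeps.
public class OffSwitchDelayModel {
    // Hypothetical helper mirroring the idea of getLocalityWaitFactor:
    // scale by the fraction of the cluster the app asked for, capped at 1.
    static double localityWaitFactor(int uniqueRequestedNodes, int clusterNodes) {
        return Math.min((double) uniqueRequestedNodes / clusterNodes, 1.0);
    }

    // Count-based threshold: missed opportunities needed before an
    // off-switch assignment is permitted.
    static long offSwitchThreshold(int requestedContainers, double waitFactor) {
        return (long) (requestedContainers * waitFactor);
    }

    public static void main(String[] args) {
        int clusterNodes = 20;   // 4 racks * 5 nodes, as in the example above
        // The app requested containers on every node, so the factor is 1.0
        // and the threshold equals the full request size.
        double wf = localityWaitFactor(clusterNodes, clusterNodes);
        long threshold = offSwitchThreshold(10_000, wf);
        // With one missed opportunity per node heartbeat, 10k missed
        // opportunities on a 20-node cluster means ~500 cluster sweeps,
        // i.e. many minutes of waiting before the first off-switch container.
        System.out.println("missed-opportunity threshold = " + threshold);
    }
}
```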

bq. The problem I think we're going to run into with a time-based approach is 
that we don't know what time an individual request arrived since we only store 
the aggregation of requests for a particular priority.
I totally agree with this. Here's one solution we have in mind that may solve 
the problem; I discussed this offline with [~vinodkv], and it seems to work 
end-to-end:

- When an app is able to allocate a container on a node but prefers to wait, 
it will reserve on that node. (Current behavior is that reservation happens 
only after the app accumulates enough missed opportunities.)
- Benefits of doing this before missed-opportunity are: 1) the application 
officially declares "this is my node", so the rest of the applications will be 
skipped. 2) We already have a mechanism to avoid excessive reservation, so one 
high-priority app cannot block a whole cluster if it only asks for a few 
containers.
- Redefine locality-delay to be: the amount of time that an app is willing to 
wait to *allocate a single container for a given app/priority*. This is very 
deterministic to me (much more deterministic than the existing count-based 
delay).
- We will start the waiting-timer once we have reserved a container on a node. 
The waiting-timer is a property of the reserved RMContainer; if we choose to 
move the reservation, the wait-timer will be kept.
- This solution supports per-app/per-priority locality-delay, and it isn't 
affected by how many nodes/racks are in the cluster.


> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>                 Key: YARN-4059
>                 URL: https://issues.apache.org/jira/browse/YARN-4059
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to heartbeat 
> in, multiplied by the number of containers that could run on a single node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
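The worst-case arithmetic described in the issue can be sketched as follows (illustrative numbers, not from the issue itself):

```java
// Back-of-envelope model of the "penalty time" from the issue
// description: the max of the preemption monitor cycle time and the
// time to drain the freed capacity at one container per heartbeat.
public class PenaltyTime {
    static long penaltyMs(long preemptionCycleMs,
                          long heartbeatIntervalMs,
                          int containersPerNode) {
        // One container is assigned per node heartbeat, so draining the
        // freed capacity takes one full cluster heartbeat sweep per
        // container slot on a node.
        long drainMs = heartbeatIntervalMs * containersPerNode;
        return Math.max(preemptionCycleMs, drainMs);
    }

    public static void main(String[] args) {
        // Example: 15s preemption cycle, 3s node heartbeat, 40 containers
        // per node -> the drain time dominates at 120s (~2 minutes),
        // consistent with the rough guess in the issue.
        System.out.println("penalty = " + penaltyMs(15_000, 3_000, 40) + " ms");
    }
}
```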
