Jason Lowe commented on YARN-4059:

Thanks for the patch, Chang!

I'm wondering if this delay is the right approach.  I think it will cause 
problems when we start supporting in-queue preemption, where we're preempting 
from one user to give the resources to another within the same queue.  This 
would artificially delay giving the resources back to the queue and they could 
be stolen by another, lower-priority queue in the interim.

Hopefully, by the time we decide to preempt, the pending requests have already 
moved past the waiting-for-scheduling-opportunities phase and are in the "at 
this point I'll take anything" stage.  If that's the case then we shouldn't 
need this delay.  If the delay helps, that implies we're preempting 
resources for asks that haven't waited very long, since they are still willing 
to wait for better locality.

I think we need to do a better job of connecting preemption requests with the 
asks that triggered the need to preempt.   I believe [~leftnoteasy] was 
proposing this earlier in another JIRA, and I think the scheduler could make 
better decisions as a result without imposing delays.  For example, we also 
have a fragmentation problem where we can preempt a bunch of small containers 
across nodes to try to fill a single large request.  That doesn't make a lot of 
sense as none of the nodes will have enough space to satisfy the request after 
the containers are killed, so we end up giving the resources right back to the 
apps that were preempted, and the cycle continues because those containers are 
once again prime preemption targets.

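One way to avoid that fragmentation cycle would be a per-node fit check before preempting for a large ask. This is only a hypothetical sketch, not actual CapacityScheduler logic; the method and parameter names are invented for illustration:

```java
// Hypothetical sketch: before preempting a bunch of small containers for a
// large request, verify that at least one node could actually host the
// request once its preemptible containers are killed.
public class PreemptionFitCheck {
    static boolean anyNodeCanSatisfy(long requestMb,
                                     long[] nodeFreeMb,
                                     long[] nodePreemptibleMb) {
        for (int i = 0; i < nodeFreeMb.length; i++) {
            // A node can satisfy the request only if its current free space
            // plus what preemption would free on that node covers the ask.
            if (nodeFreeMb[i] + nodePreemptibleMb[i] >= requestMb) {
                return true;
            }
        }
        return false;
    }
}
```

If no single node passes this check, killing small containers scattered across nodes cannot satisfy the request and only triggers the give-back cycle described above.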
> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>                 Key: YARN-4059
>                 URL: https://issues.apache.org/jira/browse/YARN-4059
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per node heartbeat, so we need to wait for the entire cluster to 
> heartbeat in, multiplied by the number of containers that could run on a 
> single node.
> So the "penalty time" for a queue should be the max of either the preemption 
> monitor cycle time or the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
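The penalty-time estimate in the description could be sketched as follows. This is a hypothetical illustration; the method name is invented, and the monitor cycle, heartbeat interval, and per-node container count used in the example are assumed values, not measured ones:

```java
// Hypothetical sketch of the "penalty time" estimate: in the worst case
// (no locality), each node accepts one container per heartbeat, so filling
// a node takes (containers per node) heartbeat intervals.
public class PreemptionPenalty {
    static long penaltyMillis(long monitorCycleMs,
                              long heartbeatIntervalMs,
                              int containersPerNode) {
        // Worst-case time for other queues to consume the freed resources.
        long worstCaseFillMs = heartbeatIntervalMs * (long) containersPerNode;
        // Penalty is the max of the preemption monitor cycle and that time.
        return Math.max(monitorCycleMs, worstCaseFillMs);
    }
}
```

For example, with a 1-second heartbeat and ~120 containers per node, the worst-case fill time dominates a 15-second monitor cycle and lands at 120 seconds, in the ballpark of the 2-minute guess above.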

This message was sent by Atlassian JIRA