[ https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729308#comment-14729308 ]

Jason Lowe commented on YARN-4059:
----------------------------------

bq. do you think it would be acceptable to you to add an option to CS to choose
between heartbeat-based counting and time-based counting?

Time-based counting would be a bit better, since at least it would help resolve
the current bug where we do not count full nodes as scheduling
opportunities. However, we could also approach this problem from a different
angle. Even if we switch to time-based counting, there's this problematic
scenario:

- The highest-priority app is asking for a lot of containers on a busy cluster.
- We free up a bunch of resources via preemption.
- One of the first resources we allocate to that app is very likely to have
locality (since it's asking for so many).
- We then reset the scheduling opportunities or timer, and the app will wait
quite a while before it will take anything that isn't perfect locality.
- In the meantime, all those free resources end up going to other,
lower-priority apps, possibly the ones we originally preempted.

I'm wondering if we should scale down the allocation delay based on how full
the cluster appears to be. If the cluster is nearly full, then we probably
don't want to be particularly picky about which containers we get. If we
do delay, it's likely the scarce resource will be snarfed up by another app
that isn't very picky, and then we're back to waiting for anything again.
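
One way to picture the scaling idea: a minimal sketch, assuming a simple linear rule in which the locality delay shrinks in proportion to cluster utilization. The function `scaled_locality_delay` and its parameters are hypothetical and not part of the CapacityScheduler; the delay can be read as either scheduling opportunities or seconds.

```python
def scaled_locality_delay(base_delay: float, used: float, capacity: float) -> float:
    """Hypothetical helper: shrink the locality delay as the cluster fills up.

    base_delay is the delay applied on an idle cluster; used / capacity is the
    fraction of cluster resources currently allocated. The linear rule is an
    illustrative assumption, not existing YARN behavior.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Clamp utilization to [0, 1] so over- or under-reporting can't invert the delay.
    utilization = min(max(used / capacity, 0.0), 1.0)
    # Empty cluster -> full delay; completely full cluster -> no delay at all.
    return base_delay * (1.0 - utilization)
```

With a base delay of 40, a half-full cluster would wait 20 and a full cluster would not wait at all. A linear rule is only one choice; a threshold (for example, no delay above some utilization level) would be another.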

> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>
>                 Key: YARN-4059
>                 URL: https://issues.apache.org/jira/browse/YARN-4059
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be long enough to cover the cycles of the preemption monitor,
> so we won't try to assign containers in between preemption events for a queue.
> The worst-case scenario for assigning freed resources to other queues is when
> all the other queues want no locality. No locality means only one container is
> assigned per node heartbeat, so we need to wait for every node in the cluster
> to heartbeat in as many times as the number of containers that can run on a
> single node.
> So the "penalty time" for a queue should be the maximum of the preemption
> monitor cycle time and the time it takes to fill the cluster at one container
> per heartbeat. My guess is this will be somewhere around 2 minutes.
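
The penalty-time rule in the description works out as follows. This is a minimal sketch: `penalty_time` and its parameters are hypothetical names, and the numbers in the usage note are illustrative assumptions rather than YARN defaults.

```python
def penalty_time(monitor_interval_s: float, heartbeat_interval_s: float,
                 containers_per_node: int) -> float:
    """Hypothetical helper: seconds to keep a recently preempted queue from
    receiving containers, per the max() rule in the issue description."""
    # Worst case: no locality, so each node is given one container per heartbeat.
    # Nodes heartbeat in parallel, so filling every node takes containers_per_node
    # heartbeat intervals, independent of how many nodes the cluster has.
    fill_time_s = heartbeat_interval_s * containers_per_node
    # The penalty must also span at least one preemption monitor cycle.
    return max(monitor_interval_s, fill_time_s)
```

For example, assuming a 3 s monitor interval, a 1 s heartbeat interval, and room for 120 containers per node, `penalty_time(3.0, 1.0, 120)` returns `120.0`: the fill time dominates and the penalty comes to 2 minutes.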



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
