[ 
https://issues.apache.org/jira/browse/YARN-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729798#comment-14729798
 ] 

Jason Lowe commented on YARN-4059:
----------------------------------

bq. I think this cannot handle the case if an app wants only a small proportion 
of cluster.
If the app only wants a small portion of the cluster then the scheduler 
already scales down the amount of time it will wait, via 
getLocalityWaitFactor, so there needs to be a substantial request to get a 
substantial wait.
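
To make the scaling concrete, here is a minimal sketch of the idea, not the Hadoop source. The class name, the method signatures, and the exact comparison are assumptions chosen for illustration; only the name getLocalityWaitFactor and the "small request, small wait" behavior come from the discussion above.

```java
/** Hypothetical, simplified model of scaling the locality wait by request size. */
final class LocalityWaitSketch {

    /**
     * Fraction of the cluster the request spans, capped at 1.0.
     * uniqueLocations is an assumed input: the number of distinct
     * hosts/racks the app asked for at this priority.
     */
    static float localityWaitFactor(int uniqueLocations, int clusterNodes) {
        if (clusterNodes <= 0) {
            return 0.0f; // no nodes, so there is nothing to wait for
        }
        return Math.min((float) Math.max(uniqueLocations, 0) / clusterNodes, 1.0f);
    }

    /**
     * True once the app has skipped more scheduling opportunities than
     * its scaled wait budget, so locality may be relaxed.
     */
    static boolean canRelaxLocality(int requiredContainers, int uniqueLocations,
                                    int clusterNodes, long missedOpportunities) {
        float factor = localityWaitFactor(uniqueLocations, clusterNodes);
        return requiredContainers * factor < missedOpportunities;
    }
}
```

Under these assumptions, a request touching 10 locations on a 1000-node cluster gets a factor of about 0.01, so it gives up on locality after very few missed opportunities, while a request spanning the whole cluster waits the full amount.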

The problem I think we're going to run into with a time-based approach is that 
we don't know what time an individual request arrived since we only store the 
aggregation of requests for a particular priority.  I think it might be tricky 
to also track when a request becomes "eligible" for allocation.  For example, 
if the app has been sitting behind other applications in the queue and user 
limits are why it isn't getting containers, then we do _not_ want to conclude 
that the app has already waited a long time for a local container.  It hasn't 
really waited at all from an opportunity perspective, because user limits 
prevented it from getting what it wanted.  The cluster could be almost 
completely empty, and when the limits finally allow the app to allocate it 
will appear so far behind time-wise that we'll schedule it very poorly.  
Similarly, we could have satisfied a portion of the request at a certain 
priority, then user limits kick in, and many minutes later when the containers 
exit it may look like we have been trying to find locality for all that time, 
which is incorrect.
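
One way to picture the bookkeeping this would need is a wait clock that only runs while the request is actually eligible for allocation. The sketch below is purely hypothetical: EligibleWaitClock and its methods do not exist in YARN, and it assumes the scheduler can tell when user limits start and stop blocking a request.

```java
/** Hypothetical bookkeeping: count locality wait only while the app could be scheduled. */
final class EligibleWaitClock {
    private long accumulatedMs = 0L;    // wait that counts toward relaxing locality
    private long eligibleSinceMs = -1L; // -1 means the request is currently blocked

    /** Call when limits (user limit, queue position) stop blocking the request. */
    void markEligible(long nowMs) {
        if (eligibleSinceMs < 0) {
            eligibleSinceMs = nowMs;
        }
    }

    /** Call when limits start blocking the request; freezes the clock. */
    void markBlocked(long nowMs) {
        if (eligibleSinceMs >= 0) {
            accumulatedMs += nowMs - eligibleSinceMs;
            eligibleSinceMs = -1L;
        }
    }

    /** Call after a container is allocated so the next one starts a fresh wait. */
    void reset(long nowMs) {
        accumulatedMs = 0L;
        if (eligibleSinceMs >= 0) {
            eligibleSinceMs = nowMs;
        }
    }

    /** Time spent waiting while eligible, excluding every blocked interval. */
    long waitedMs(long nowMs) {
        long running = eligibleSinceMs >= 0 ? nowMs - eligibleSinceMs : 0L;
        return accumulatedMs + running;
    }
}
```

With a clock like this, an app that sat behind user limits for ten minutes would still report zero wait when it first becomes eligible, which is the behavior argued for above.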

If we can find a way to get the time bookkeeping right, I think it could sort 
of work.  However, as cluster usage approaches capacity we get into priority 
inversion problems when apps at the front of the queue pass up containers due 
to locality and the apps behind them readily take them.  That can severely 
prolong the time it takes the apps to get what they are asking for, hence the 
thought that we may want to consider total cluster load when weighing how long 
we should be trying.
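
As a rough illustration of weighing the wait by cluster load, the sketch below shrinks the locality wait linearly as utilization rises. The linear formula, the class name, and the parameters are all assumptions made for the example; nothing above fixes a particular formula.

```java
/** Hypothetical: shorten the locality wait as the cluster fills up. */
final class LoadAwareLocalityWait {

    /**
     * baseWaitMs is the wait used on an idle cluster; clusterUtilization is
     * the fraction of capacity in use, clamped to the range 0.0 to 1.0.
     */
    static long effectiveWaitMs(long baseWaitMs, double clusterUtilization) {
        double u = Math.max(0.0, Math.min(1.0, clusterUtilization));
        // At full load the wait drops to zero, so apps at the front of the
        // queue stop passing up containers that apps behind them would take.
        return Math.round(baseWaitMs * (1.0 - u));
    }
}
```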

> Preemption should delay assignments back to the preempted queue
> ---------------------------------------------------------------
>
>                 Key: YARN-4059
>                 URL: https://issues.apache.org/jira/browse/YARN-4059
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: YARN-4059.2.patch, YARN-4059.3.patch, YARN-4059.patch
>
>
> When preempting containers from a queue it can take a while for the other 
> queues to fully consume the resources that were freed up, due to delays 
> waiting for better locality, etc. Those delays can cause the resources to be 
> assigned back to the preempted queue, and then the preemption cycle continues.
> We should consider adding a delay, either based on node heartbeat counts or 
> time, to avoid granting containers to a queue that was recently preempted. 
> The delay should be sufficient to cover the cycles of the preemption monitor, 
> so we won't try to assign containers in-between preemption events for a queue.
> Worst-case scenario for assigning freed resources to other queues is when all 
> the other queues want no locality. No locality means only one container is 
> assigned per heartbeat, so we need to wait for the entire cluster to 
> heartbeat in, times the number of containers that could run on a single 
> node.
> So the "penalty time" for a queue should be the max of the preemption 
> monitor cycle time and the amount of time it takes to allocate the cluster 
> with one container per heartbeat. Guessing this will be somewhere around 2 
> minutes.
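
The penalty-time rule in the description above can be sketched as a small calculation. This assumes that "the entire cluster heartbeating in" takes one node heartbeat interval; the class name, parameter names, and the numbers in the usage note are illustrative assumptions, not values from the issue.

```java
/** Hypothetical calculation of the preemption "penalty time" for a queue. */
final class PreemptionPenaltySketch {

    /**
     * Worst case with no locality: one container per node per heartbeat,
     * so filling every node takes containersPerNode heartbeat intervals.
     */
    static long worstCaseFillMs(long heartbeatIntervalMs, int containersPerNode) {
        return heartbeatIntervalMs * Math.max(containersPerNode, 0);
    }

    /** The larger of the preemption monitor cycle and the worst-case fill time. */
    static long penaltyMs(long monitorCycleMs, long heartbeatIntervalMs,
                          int containersPerNode) {
        return Math.max(monitorCycleMs,
                worstCaseFillMs(heartbeatIntervalMs, containersPerNode));
    }
}
```

For example, with an assumed 1000 ms heartbeat, 120 containers per node, and a 3000 ms monitor cycle, penaltyMs(3000, 1000, 120) returns 120000 ms, which is in the neighborhood of the two-minute guess.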



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
