Eric Payne updated YARN-3769:
    Attachment: YARN-3769.001.branch-2.8.patch

One thing I've thought about for a while is adding a "lazy preemption" 
mechanism: when a container has been marked preempted and has waited for 
max_wait_before_time, it becomes a "can_be_killed" container. If another 
queue can allocate on a node with a "can_be_killed" container, that 
container will be killed immediately to make room for the new containers.
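
To make the proposed state transitions concrete, here is a minimal, self-contained sketch of the "lazy preemption" idea as described above. It is not taken from any patch on this JIRA. The class, enum, and method names ({{LazyPreemptionSketch}}, {{State}}, {{tick}}, {{onAllocationAttempt}}) are hypothetical, and {{maxWaitMs}} simply stands in for the max_wait_before_time setting mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are CapacityScheduler APIs.
final class LazyPreemptionSketch {
    enum State { RUNNING, MARKED_PREEMPTED, CAN_BE_KILLED, KILLED }

    static final class Container {
        final String id;
        final String node;
        State state = State.RUNNING;
        long markedAtMs = -1;

        Container(String id, String node) {
            this.id = id;
            this.node = node;
        }
    }

    private final long maxWaitMs; // assumed stand-in for max_wait_before_time
    private final List<Container> containers = new ArrayList<>();

    LazyPreemptionSketch(long maxWaitMs) {
        this.maxWaitMs = maxWaitMs;
    }

    Container add(String id, String node) {
        Container c = new Container(id, node);
        containers.add(c);
        return c;
    }

    // The preemption monitor marks a running container; nothing is killed yet.
    void markPreempted(Container c, long nowMs) {
        if (c.state == State.RUNNING) {
            c.state = State.MARKED_PREEMPTED;
            c.markedAtMs = nowMs;
        }
    }

    // Promote marked containers whose wait has elapsed. Still no kill here.
    void tick(long nowMs) {
        for (Container c : containers) {
            if (c.state == State.MARKED_PREEMPTED && nowMs - c.markedAtMs >= maxWaitMs) {
                c.state = State.CAN_BE_KILLED;
            }
        }
    }

    // Called when another queue could allocate on this node: only now are the
    // "can_be_killed" containers on that node actually killed.
    List<String> onAllocationAttempt(String node) {
        List<String> killed = new ArrayList<>();
        for (Container c : containers) {
            if (c.node.equals(node) && c.state == State.CAN_BE_KILLED) {
                c.state = State.KILLED;
                killed.add(c.id);
            }
        }
        return killed;
    }
}
```

The point of the sketch is that expiry of the wait only changes the container's state; the kill itself is deferred until some other queue can actually use the space on that node.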

I will upload a design doc shortly for review.

[~leftnoteasy], because it's been a couple of months since the last activity on 
this JIRA, would it be better to use this JIRA for the purpose of making the 
preemption monitor "user-limit" aware and open a separate JIRA to address a 

Towards that end, I am uploading a couple of patches:
- {{YARN-3769.001.branch-2.7.patch}} is a patch to 2.7 (and also 2.6) which we 
have been using internally. This fix has dramatically reduced the instances of 
"ping-pong"-ing as I outlined in [the comment 
- {{YARN-3769.001.branch-2.8.patch}} is similar to the fix made in 2.7, but it 
also takes into consideration node label partitions.
Thanks for your help and please let me know what you think.

> Preemption occurring unnecessarily because preemption doesn't consider user 
> limit
> ---------------------------------------------------------------------------------
>                 Key: YARN-3769
>                 URL: https://issues.apache.org/jira/browse/YARN-3769
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0, 2.7.0, 2.8.0
>            Reporter: Eric Payne
>            Assignee: Wangda Tan
>         Attachments: YARN-3769.001.branch-2.7.patch, 
> YARN-3769.001.branch-2.8.patch
> We are seeing the preemption monitor preempting containers from queue A and 
> then seeing the capacity scheduler giving them immediately back to queue A. 
> This happens quite often and causes a lot of churn.
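
As a rough illustration of what "user-limit aware" could mean here, the sketch below caps each user's pending ask at that user's remaining headroom under the user limit before the total is used to justify preemption. This is an assumption about the general approach, not a copy of the attached patches. The class and method names ({{UserLimitAwareDemandSketch}}, {{usablePending}}) are hypothetical, and resources are reduced to a single number for simplicity.

```java
import java.util.Map;

// Hypothetical illustration only; not code from the YARN-3769 patches.
final class UserLimitAwareDemandSketch {
    /**
     * Returns the portion of a queue's pending demand that its users could
     * actually receive, given a per-user limit. Demand beyond a user's
     * headroom is ignored, since resources preempted on its behalf could not
     * be assigned to that user anyway.
     */
    static long usablePending(Map<String, Long> usedByUser,
                              Map<String, Long> pendingByUser,
                              long userLimit) {
        long total = 0;
        for (Map.Entry<String, Long> e : pendingByUser.entrySet()) {
            long used = usedByUser.getOrDefault(e.getKey(), 0L);
            long headroom = Math.max(0L, userLimit - used);
            total += Math.min(e.getValue(), headroom);
        }
        return total;
    }
}
```

If the preemption monitor sized its requests from this capped figure rather than from raw pending demand, it would not free resources for a user who is already at the limit, which is the situation where freed containers would otherwise flow straight back to the queue they were taken from.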

This message was sent by Atlassian JIRA
