Carlo Curino commented on YARN-2022:


The problem with AM_CONTAINER_PRIORITY is that it is just a shortcut for 
setting Priority = 0. The user can easily do the same from their own code, and 
unless there are explicit checks that prevent ResourceRequests from assigning 
priority = 0 to all of their containers, we have no defense against user abuse. 
The two options I see are:
 * we track which container is the AM not via Priority and protect the AM 
container from preemption whenever possible 
 * we assign a "quota" of protected-from-preemption containers, and save 
whichever containers have the lowest priority and fit within the "quota". This 
way the user can specify multiple containers at Priority=0 (think a 
replicated AM or some other critical service for the job) and we will save as 
many of those as fit within the quota.
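The quota option could be sketched roughly as follows (a toy illustration, not actual YARN code; the class and method names here are all made up for the example): given the containers a user marked critical and a protection quota, keep the ones with the lowest priority value.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "quota" option; names and shapes are
// illustrative, not actual YARN classes.
public class AmProtectionSketch {
    record ContainerInfo(String id, int priority) {}

    // Protect the containers with the lowest priority value
    // (priority 0 = most critical), up to the given quota.
    static List<String> protectedContainers(List<ContainerInfo> containers, int quota) {
        return containers.stream()
                .sorted(Comparator.comparingInt(ContainerInfo::priority))
                .limit(quota)
                .map(ContainerInfo::id)
                .toList();
    }

    public static void main(String[] args) {
        List<ContainerInfo> running = List.of(
                new ContainerInfo("am-1", 0),
                new ContainerInfo("am-2", 0),   // e.g. a replicated AM
                new ContainerInfo("map-1", 5),
                new ContainerInfo("map-2", 5));
        // With quota = 2, both Priority=0 containers are saved from preemption,
        // while the two map containers remain preemptable.
        System.out.println(protectedContainers(running, 2)); // prints [am-1, am-2]
    }
}
```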

I think we are agreeing on max-am-percentage... the final goal is to make sure 
that after preemption the max-am-resource-percent is respected (i.e., no more 
than a certain amount of the queue is dedicated to AMs).
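The invariant we want preemption to re-establish can be stated as a simple check (a simplified sketch with made-up names; the real CS accounting is per-user and per-partition, and uses Resource objects rather than raw MB):

```java
// Simplified sketch of the max-am-resource-percent invariant; field and
// method names are illustrative, not the actual CapacityScheduler code.
public class AmPercentSketch {
    // Invariant: total memory held by AM containers in the queue must not
    // exceed maxAMResourcePercent of the queue's capacity.
    static boolean amInvariantHolds(long amMemoryMB, long queueCapacityMB,
                                    double maxAMResourcePercent) {
        return amMemoryMB <= queueCapacityMB * maxAMResourcePercent;
    }

    public static void main(String[] args) {
        long queueCapacityMB = 16_384; // e.g. the 16GB cluster in the report
        double maxAMPercent = 0.1;     // default max-am-resource-percent

        System.out.println(amInvariantHolds(1_024, queueCapacityMB, maxAMPercent)); // true
        // Three 2GB AMs all saved from preemption would violate the invariant:
        System.out.println(amInvariantHolds(6_144, queueCapacityMB, maxAMPercent)); // false
    }
}
```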

The problem with user-limit-factor goes like this:
 * Given a queue A with capacity = 10%, max-capacity = 50%, and user-limit-factor 
= 2 (i.e., a single user can go up to 20% of total resources)
 * Only one user is active in this queue and it gets 20% of resources (this 
also requires low activity in other queues)
 * The overall cluster capacity is reduced (e.g., a failing rack), or a refresh 
of the queues has reduced this queue's capacity
 * The LeafQueue scheduler keeps "skipping" the scheduling for this user (since 
the user is now over their user-limit-factor) although no other user in the 
cluster is asking for resources
 * If we ever get to this situation with the user holding only AMs, the system 
is completely wedged, with the AMs waiting for more containers and the system 
systematically skipping this user (as the user is above their user-limit-factor).
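The arithmetic of this scenario can be made concrete (a deliberately simplified toy model; the real CS user-limit computation also accounts for active users, minimum allocation, and other factors):

```java
// Toy model of the user limit in the scenario above; a deliberate
// simplification, not the real CapacityScheduler formula.
public class UserLimitSketch {
    static long userLimitMB(long clusterMB, double queueCapacityFraction,
                            double userLimitFactor) {
        return (long) (clusterMB * queueCapacityFraction * userLimitFactor);
    }

    public static void main(String[] args) {
        // Queue A: capacity 10%, user-limit-factor 2 -> user may reach 20% of cluster.
        long before = userLimitMB(100_000, 0.10, 2.0); // 20,000 MB
        long userUsage = 20_000;                       // user fills up to that limit

        // A rack fails: the cluster halves, and the limit shrinks with it.
        long after = userLimitMB(50_000, 0.10, 2.0);   // 10,000 MB

        // The user is now above the (recomputed) limit, so the LeafQueue keeps
        // skipping them, even if nobody else is asking for resources.
        System.out.println(userUsage > after); // prints true
    }
}
```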
If preemption proceeds by systematically killing resources *including* AMs, the 
chances of this happening are rather low (the "head" of the queue is only AMs, 
while the tail contains AMs and other containers), but as we "save" AMs from 
preemption, this bad corner case becomes a little more likely to happen. 

What I am trying to convey with my comments is that as we evolve preemption 
further, we should look at all the invariants of a queue, and try to make sure 
that our preemption policy can re-establish not only the capacity invariant but 
also all the other invariants. The CS relies on those invariants heavily, and 
misbehaves if they are violated. An example of this is YARN-1957, where we 
introduce better handling for max-capacity and zero-size queues.

The changes you are proposing are not "creating" the problem, just making it 
more likely to happen in practice. A well-tuned CS and reasonable load are 
unlikely to trigger this, but we should build for robustness as much as 
possible, since we cannot rely on users to understand these internals and tune 
the CS defensively.

[~acmurthy] any thoughts on this?

> Preempting an Application Master container can be kept as least priority when 
> multiple applications are marked for preemption by 
> ProportionalCapacityPreemptionPolicy
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: YARN-2022
>                 URL: https://issues.apache.org/jira/browse/YARN-2022
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>         Attachments: Yarn-2022.1.patch
> Cluster Size = 16GB [2NM's]
> Queue A Capacity = 50%
> Queue B Capacity = 50%
> Consider there are 3 applications running in Queue A which has taken the full 
> cluster capacity. 
> J1 = 2GB AM + 1GB * 4 Maps
> J2 = 2GB AM + 1GB * 4 Maps
> J3 = 2GB AM + 1GB * 2 Maps
> Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ].
> Currently in this scenario, Job J3 will get killed including its AM.
> It is better if AM can be given least priority among multiple applications. 
> In this same scenario, map tasks from J3 and J2 can be preempted.
> Later when cluster is free, maps can be allocated to these Jobs.

This message was sent by Atlassian JIRA
