Rohith Sharma K S commented on YARN-3849:

Below is the log trace for the issue.

In our cluster there are 3 NodeManagers, each with resource 
{{<memory:327680, vCores:35>}}. The total cluster resource is 
{{clusterResource: <memory:983040, vCores:105>}}, and the CapacityScheduler is 
configured with two queues named *default* and *QueueA*.

# Application app-1 is submitted to queue default and starts running with 10 
containers, each with {{resource: <memory:1024, vCores:10>}}, so the total used 
is {{usedResources=<memory:10240, vCores:91>}}.
default user=spark used=<memory:10240, vCores:91> numContainers=10 headroom = 
<memory:1024, vCores:10> user-resources=<memory:10240, vCores:91>
Re-sorting assigned queue: root.default stats: default: capacity=0.5, 
absoluteCapacity=0.5, usedResources=<memory:10240, vCores:91>, 
usedCapacity=1.7333333, absoluteUsedCapacity=0.8666667, numApps=1, 
*NOTE : Resource allocation is by CPU DOMINANT*
After the 10 containers are running, the available NodeManager resources are:
linux-174, available: <memory:323584, vCores:4>
linux-175, available: <memory:324608, vCores:5>
linux-223, available: <memory:324608, vCores:5>
# Application app-2 is submitted to QueueA. Its ApplicationMaster container 
starts running, after which the NodeManager has {{available: <memory:322560, vCores:3>}}.
Assigned container container_1435072598099_0002_01_000001 of capacity 
<memory:1024, vCores:1> on host linux-174:26009, which has 5 containers, 
<memory:5120, vCores:32> used and <memory:322560, vCores:3> available after 
allocation | SchedulerNode.java:154
linux-174, available: <memory:322560, vCores:3>
# The preemption policy then does the calculation below.
2015-06-23 23:20:51,127 NAME: QueueA CUR: <memory:0, vCores:0> PEN: <memory:0, 
vCores:0> GAR: <memory:491520, vCores:52> NORM: NaN IDEAL_ASSIGNED: <memory:0, 
vCores:0> IDEAL_PREEMPT: <memory:0, vCores:0> ACTUAL_PREEMPT: <memory:0, 
vCores:0> UNTOUCHABLE: <memory:0, vCores:0> PREEMPTABLE: <memory:0, vCores:0>
2015-06-23 23:20:51,128 NAME: default CUR: <memory:851968, vCores:91> PEN: 
<memory:0, vCores:0> GAR: <memory:491520, vCores:52> NORM: 1.0 IDEAL_ASSIGNED: 
<memory:851968, vCores:91> IDEAL_PREEMPT: <memory:0, vCores:0> ACTUAL_PREEMPT: 
<memory:0, vCores:0> UNTOUCHABLE: <memory:0, vCores:0> PREEMPTABLE: 
<memory:360448, vCores:39>
In the above log, observe that for queue default *CUR is <memory:851968, 
vCores:91>*, whereas the actual usage is *usedResources=<memory:10240, vCores:91>*. 
Only CPU matches; MEMORY does not. CUR is calculated with the formulas below:
#* CUR = {{clusterResource: <memory:983040, vCores:105>}} * 
{{absoluteUsedCapacity(0.8666667)}} = {{<memory:851968, vCores:91>}}
#* GAR = {{clusterResource: <memory:983040, vCores:105>}} * 
{{absoluteCapacity(0.5)}} = {{<memory:491520, vCores:52>}}
#* PREEMPTABLE = CUR - GAR = {{<memory:360448, vCores:39>}}
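The arithmetic behind CUR, GAR and PREEMPTABLE can be reproduced with a small Python sketch. This is only an illustration of the numbers in the log, not the policy's actual code: the {{Resource}} class and {{scale}} helper are hypothetical, and truncating to whole units is an assumption chosen because it matches the logged values.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a YARN resource vector."""
    memory: int  # MB
    vcores: int


def scale(r, fraction):
    # Assumption: components are truncated to whole units after scaling.
    return Resource(int(r.memory * fraction), int(r.vcores * fraction))


cluster = Resource(983040, 105)
used = Resource(10240, 91)  # what queue "default" really holds

# Under dominant-resource accounting, the used fraction is the larger share.
memory_share = Fraction(used.memory, cluster.memory)   # about 0.0104
vcores_share = Fraction(used.vcores, cluster.vcores)   # about 0.8667
absolute_used_capacity = max(memory_share, vcores_share)

# Scaling the whole cluster vector by one fraction inflates memory to ~832 GB.
cur = scale(cluster, absolute_used_capacity)
gar = scale(cluster, Fraction(1, 2))
preemptable = Resource(cur.memory - gar.memory, cur.vcores - gar.vcores)

print(cur, gar, preemptable)
```

The vCores component of {{cur}} agrees with the real usage, while the memory component is a by-product of applying the CPU share to the cluster's memory.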
# App-2 requests containers with {{resource: <memory:1024, vCores:10>}}, so 
the preemption cycle calculates how much resource is to be preempted:
2015-06-23 23:21:03,131 | DEBUG | SchedulingMonitor 
(ProportionalCapacityPreemptionPolicy) | 1435072863131:  NAME: default CUR: 
<memory:851968, vCores:91> PEN: <memory:0, vCores:0> GAR: <memory:491520, 
vCores:52> NORM: NaN IDEAL_ASSIGNED: <memory:491520, vCores:52> IDEAL_PREEMPT: 
<memory:97043, vCores:10> ACTUAL_PREEMPT: <memory:0, vCores:0> UNTOUCHABLE: 
<memory:0, vCores:0> PREEMPTABLE: <memory:360448, vCores:39>
Observe *IDEAL_PREEMPT: <memory:97043, vCores:10>*. App-2 in queue QueueA needs 
only 10 vCores to be preempted, yet the policy also wants to preempt 97043 MB of 
memory even though sufficient memory is available.
The calculations that produce IDEAL_PREEMPT are below:
#* totalPreemptionAllowed = clusterResource: <memory:983040, vCores:105> *  0.1 
= <memory:98304, vCores:10.5>
#* totPreemptionNeeded = CUR - IDEAL_ASSIGNED = CUR: <memory:851968, vCores:91>
#* scalingFactor = Resources.divide(drc, <memory:491520, vCores:52>, 
<memory:98304, vCores:10.5>, <memory:851968, vCores:91>);
scalingFactor = 0.114285715
#* toBePreempted = CUR: <memory:851968, vCores:91> *  
scalingFactor(0.1139045128455529) = <memory:97368, vCores:10>
{{resource-to-obtain = <memory:97043, vCores:10>}}
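
The scaling-factor arithmetic can be sketched as follows. This is a simplified model, assuming that a dominant-resource divide is the ratio of the two operands' largest per-dimension shares against a reference resource; the helper names are hypothetical. It also assumes vCores are held as the integer 10 rather than 10.5, and that results are rounded to the nearest unit, which reproduces the 0.114285715 and <memory:97368, vCores:10> values quoted above (the logged resource-to-obtain of 97043 corresponds to the slightly different factor also quoted above).

```python
from fractions import Fraction


def dominant_share(resource, reference):
    """Largest per-dimension share of `resource` relative to `reference`."""
    mem, vcores = resource
    ref_mem, ref_vcores = reference
    return max(Fraction(mem, ref_mem), Fraction(vcores, ref_vcores))


def drc_divide(reference, numerator, denominator):
    # Assumed model of a dominant-resource divide: ratio of dominant shares.
    return dominant_share(numerator, reference) / dominant_share(denominator, reference)


reference = (491520, 52)                # first argument in the call quoted above
total_preemption_allowed = (98304, 10)  # assumption: vCores held as an integer
cur = (851968, 91)

scaling_factor = drc_divide(reference, total_preemption_allowed, cur)

# Assumption: each component is rounded to the nearest whole unit.
to_be_preempted = tuple(round(x * scaling_factor) for x in cur)

# Memory the queue really holds above its guarantee: none.
actual_memory_over_guarantee = max(0, 10240 - 491520)

print(float(scaling_factor), to_be_preempted, actual_memory_over_guarantee)
```

Under these assumptions the single scalar factor is applied to both dimensions of CUR, so roughly 95 GB of memory is scheduled for preemption although the queue's real memory usage is far below its guarantee.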

*So the problem is in either of the below steps*
# As [~sunilg] said, usedResources=<memory:10240, vCores:91>, but the preemption 
policy wrongly calculates the current used capacity as {{<memory:851968, 
vCores:91>}}. This is mainly because the preemption policy derives current usage 
from a capacity fraction (absoluteUsedCapacity in the formula above), which 
always gives a wrong result for one of the resources when 
DominantResourceCalculator is used. I think a fraction should not be used, since 
it causes problems with DRC (multi-dimensional resources); instead we should 
use usedResources from CSQueue.
# Even setting aside step 1, toBePreempted is calculated as resource-to-obtain: 
<memory:97043, vCores:10>. When a container is marked for preemption, the 
preemption policy subtracts the marked container's resources. I.e. in the above 
log, resource-to-obtain becomes *<memory:96019, vCores:0>*, since each container 
is <memory:1024, vCores:10>. On the next container marking, MEMORY has become 
DOMINANT and the policy tries to fulfil the memory, i.e. about 96 GB, even 
though CPU is already fulfilled. The dominant resource changes: the scheduler 
allocates containers with CPU dominant, but the preemption policy goes for 
MEMORY dominant, which causes the problem. This ends up killing all the 
non-AM containers.
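
The effect described in step 2 can be simulated with a short Python sketch. It is a deliberately simplified model, not the code of ProportionalCapacityPreemptionPolicy: the container list (9 task containers of <memory:1024, vCores:10>) is an assumption, and both stopping rules are hypothetical, included only to contrast "keep going while any dimension is unmet" with "stop once some dimension is met".

```python
def mark_for_preemption(to_obtain, containers, keep_going):
    """Mark containers until `keep_going` says the remaining demand is met.

    `to_obtain` and each container are (memory_mb, vcores) tuples.
    Returns the marked containers and what is still left to obtain.
    """
    memory, vcores = to_obtain
    marked = []
    for c_mem, c_vcores in containers:
        if not keep_going(memory, vcores):
            break
        marked.append((c_mem, c_vcores))
        # Subtract the marked container, never going below zero.
        memory = max(0, memory - c_mem)
        vcores = max(0, vcores - c_vcores)
    return marked, (memory, vcores)


# Assumption: app-1 has 9 non-AM containers of <memory:1024, vCores:10>.
task_containers = [(1024, 10)] * 9
to_obtain = (97043, 10)


def any_positive(memory, vcores):
    # Hypothetical rule: continue while either dimension is still unmet.
    return memory > 0 or vcores > 0


def all_positive(memory, vcores):
    # Hypothetical rule: stop as soon as one dimension is fully met.
    return memory > 0 and vcores > 0


marked_all, left_all = mark_for_preemption(to_obtain, task_containers, any_positive)
marked_one, left_one = mark_for_preemption(to_obtain, task_containers, all_positive)

print(len(marked_all), left_all)
print(len(marked_one), left_one)
```

With the first rule the leftover memory demand keeps the loop running after the 10 vCores are already covered, so every task container is marked; with the second rule a single container is enough.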

*And don't think the problem is only that all the non-AM containers are killed; 
it also continues in a loop :-( i.e. when app-2 starts running containers in 
QueueA, app-1 asks for containers again, which makes the preemption policy kill 
all the non-AM containers of app-2. This repeats forever: the two applications 
keep killing each other's tasks in a loop, and neither application ever completes.*

> Too much of preemption activity causing continuos killing of containers 
> across queues
> -------------------------------------------------------------------------------------
>                 Key: YARN-3849
>                 URL: https://issues.apache.org/jira/browse/YARN-3849
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>            Priority: Critical
> Two queues are used. Each queue is given a capacity of 0.5. The Dominant 
> Resource policy is used.
> 1. An app is submitted to QueueA and consumes the full cluster capacity.
> 2. After an app is submitted to QueueB, there is some demand, which invokes 
> preemption in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we 
> observed that all containers other than the AM are killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free 
> space. But there is updated demand from the app in QueueA, which lost 
> its containers earlier, and preemption now kicks in on QueueB.
> The scenario in steps 3 and 4 repeats in a loop, so none of the 
> apps complete.
