[ 
https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063010#comment-14063010
 ] 

Wangda Tan commented on YARN-2297:
----------------------------------

Took a look at this issue; there are two issues that cause preemption to hang:

*1) "maximum-am-resource-percent" is used as a "ratio" instead of "percent"*
I found yarn.scheduler.capacity.maximum-am-resource-percent is set to 100 in the 
configuration file Tassapol sent offline.
I've checked how this value is used in the capacity scheduler and the preemption 
policy; it should be a value in the range \[0, 1\]. So the name 
maximum-am-resource-percent is inconsistent with how it is used.

*Solution of issue#1*
a. Remove the "percent" from the configuration name, and limit its range to \[0, 1\].
b. Change the current logic of the code to divide this "percent" value by 100 where 
it is used.
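
If we go with option b, a minimal sketch of the normalization could look like the 
following. The class and method names are hypothetical, for illustration only; this 
is not the real CapacitySchedulerConfiguration API.
{code}
// Sketch of option b: accept a configured "percent" such as 100 and
// normalize it into the [0, 1] ratio the scheduler actually expects.
public class AmResourceLimit {
  static float normalizeAmResourcePercent(float configured) {
    // treat values greater than 1 as percentages, then clamp to [0, 1]
    float ratio = configured > 1.0f ? configured / 100.0f : configured;
    return Math.max(0.0f, Math.min(1.0f, ratio));
  }

  public static void main(String[] args) {
    System.out.println(normalizeAmResourcePercent(100f)); // 1.0
    System.out.println(normalizeAmResourcePercent(0.1f)); // 0.1
  }
}
{code}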

*2) After I configure it to an expected value (within \[0,1\]), like 0.1, a new 
issue emerges -- jitter happens, still in the environment Tassapol provided.*
*a.*
qA used 2G (about 5000% of its capacity), 2G pending
qB used 2G (about 49% of its capacity), 2G pending
*b.*
The current preemption policy will take 1 container from qA because of the 
following logic (pseudo code):
{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0):
        preempt container
        toBePreempt -= container.resource
      else:
        break
{code}
Then the used resource of qA will drop to 0% because there's only 1 container in 
qA.
*c.*
The current capacity scheduler allocates containers to the least-used queue when 
there are multiple queues under a parent queue. Usage of qA is 0%, so it will 
first try to allocate a container in qA.
*d.*
After a container is allocated in qA, it goes back to *a.* and an infinite loop 
happens: the AM container in qA will be preempted again and again, but qB cannot 
allocate a new container because the usage of qA after preemption is always less 
than the usage of qB (a minimal simulation of this loop is sketched below).
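
To make the loop concrete, here is a tiny, purely illustrative simulation. The 
queue guarantees follow the 1%/99% split on a 4G node from this JIRA; none of 
this is real scheduler code.
{code}
// Illustrative simulation of the jitter in (a)-(d): preemption empties qA,
// the scheduler then sees qA as the least-used queue and fills it again.
public class JitterLoop {
  public static void main(String[] args) {
    double guaranteedA = 0.04, guaranteedB = 3.96; // GB: 1% / 99% of a 4 GB node
    double usedA = 2.0, usedB = 2.0;               // GB: one 2 GB AM in each queue
    for (int round = 1; round <= 3; round++) {
      usedA = 0.0; // preemption policy takes qA's only container (its AM)
      // scheduler picks the least-used queue under the parent -- now always qA
      String next = (usedA / guaranteedA) < (usedB / guaranteedB) ? "qA" : "qB";
      System.out.println("round " + round + ": preempted qA's AM, next allocation -> " + next);
      if (next.equals("qA")) {
        usedA = 2.0; // qA's AM comes back, and we are at state (a) again
      }
    }
  }
}
{code}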

*Solution of #2*
We can change the following two places.

Currently the capacity scheduler allocates containers to the least-used queue when 
there are multiple queues under a parent queue.
We can change it to allocate containers to the queue lacking the most resource 
(guaranteed minus used) when there are multiple queues under a parent queue.
For example:
{code}
qA has guaranteed resource = 100MB, it used 0MB, its usage is 0%, and its 
lacking resource is 100MB.
qB has guaranteed resource = 1024MB, it used 500MB, its usage is about 50%, and 
its lacking resource is 524MB.
{code}
In the existing capacity scheduler, qA would be allocated first. After the change, 
qB would be allocated first.
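
A small sketch of that ordering change, using the qA/qB numbers above (Queue here 
is a hypothetical stand-in, not the real CSQueue class):
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Compares the existing ordering (least usage first) with the proposed
// ordering (most lacking resource first) on the qA/qB example above.
public class QueueOrdering {
  record Queue(String name, int guaranteedMB, int usedMB) {
    double usage()     { return (double) usedMB / guaranteedMB; }
    int    lackingMB() { return guaranteedMB - usedMB; }
  }

  public static void main(String[] args) {
    List<Queue> queues = new ArrayList<>(List.of(
        new Queue("qA", 100, 0), new Queue("qB", 1024, 500)));

    // existing behaviour: least usage first -> picks qA (0% used)
    queues.sort(Comparator.comparingDouble(Queue::usage));
    System.out.println("by usage:   " + queues.get(0).name());

    // proposed behaviour: most lacking resource first -> picks qB (524MB short)
    queues.sort(Comparator.comparingInt(Queue::lackingMB).reversed());
    System.out.println("by lacking: " + queues.get(0).name());
  }
}
{code}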

Only preempt a container if its resource is less than double of toBePreempt; we 
can change the logic in the preemption policy to:
{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0) and (container.resource < toBePreempt * 2):  # new condition
        preempt container
        toBePreempt -= container.resource
      else:
        break
{code}

After we change these 2 places, 2 containers will be running in qB, and qA 
cannot preempt containers from qB.

Any thoughts? [~vinodkv], [~jianhe], [~curino]?

> Preemption can hang in corner case by not allowing any task container to 
> proceed.
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-2297
>                 URL: https://issues.apache.org/jira/browse/YARN-2297
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.5.0
>            Reporter: Tassapol Athiapinya
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Preemption can cause hang issue in single-node cluster. Only AMs run. No task 
> container can run.
> h3. queue configuration
> Queue A/B has 1% and 99% respectively. 
> No max capacity.
> h3. scenario
> Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps. Use 
> 1 user.
> Submit app 1 to queue A. AM needs 2 GB. There is 1 task that needs 2 GB. 
> Occupy entire cluster.
> Submit app 2 to queue B. AM needs 2 GB. There are 3 tasks that need 2 GB each.
> Instead of entire app 1 preempted, app 1 AM will stay. App 2 AM will launch. 
> No task of either app can proceed. 
> h3. commands
> /usr/lib/hadoop/bin/hadoop jar 
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter 
> "-Dmapreduce.map.memory.mb=2000" 
> "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" 
> "-Dmapreduce.randomtextwriter.bytespermap=2147483648" 
> "-Dmapreduce.job.queuename=A" "-Dmapreduce.map.maxattempts=100" 
> "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" 
> "-Dmapreduce.map.java.opts=-Xmx1800M" 
> "-Dmapreduce.randomtextwriter.mapsperhost=1" 
> "-Dmapreduce.randomtextwriter.totalbytes=2147483648" dir1
> /usr/lib/hadoop/bin/hadoop jar 
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep 
> "-Dmapreduce.map.memory.mb=2000" 
> "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M" 
> "-Dmapreduce.job.queuename=B" "-Dmapreduce.map.maxattempts=100" 
> "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000" 
> "-Dmapreduce.map.java.opts=-Xmx1800M" -m 1 -r 0 -mt 4000  -rt 0



--
This message was sent by Atlassian JIRA
(v6.2#6252)
