[
https://issues.apache.org/jira/browse/YARN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063010#comment-14063010
]
Wangda Tan commented on YARN-2297:
----------------------------------
Took a look at this issue; there are two problems causing preemption to hang:
*1) "maximum-am-resource-percent" is used as a "ratio" instead of "percent"*
I found yarn.scheduler.capacity.maximum-am-resource-percent is set to 100 in
the configuration file Tassapol sent me offline.
I've checked how this value is used in the capacity scheduler and the preemption
policy: it should be a value in the range \[0, 1\]. So the name
maximum-am-resource-percent is inconsistent with how it is used.
*Solution for issue #1*
a. Remove the "percent" from the configuration name, and limit its value to the
range \[0, 1\].
b. Change the current logic to divide this "percent" value by 100 where it is
used.
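Either way, with the current code the value has to be given as a ratio in
\[0, 1\]. For reference, a capacity-scheduler.xml fragment like the following
(0.1, i.e. 10%, is just an example value) behaves as intended:
{code}
<!-- capacity-scheduler.xml: the value is read as a ratio in [0, 1], not a percentage -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.1</value>
</property>
{code}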
*2) After I configure it to an expected value (within \[0,1\]), like 0.1, a new
issue emerges -- jitter happens, still in the environment Tassapol provided.*
*a.*
qA used 2G (about 5000% of its guarantee used), 2G pending
qB used 2G (about 49% of its guarantee used), 2G pending
*b.*
The current preemption policy will take 1 container from qA because of the
following logic (pseudocode):
{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0):
        preempt container
        toBePreempt -= container.resource
      else:
        break
{code}
Then the used resource of qA drops to 0%, because there's only 1 container in
qA.
*c.*
The current capacity scheduler allocates containers starting from the queue with
the least usage when there are multiple queues under a parent queue. The usage
of qA is 0%, so it will first try to allocate a container in qA.
*d.*
After a container allocated in qA, it goes back to *a.*, a infinite loop will
happen: AM container in qA will be preempted many many times, but qB cannot
allocate new container because usage of qA after preempted is always less than
usage of qA.
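To make the cycle concrete, here is a toy, self-contained simulation (plain Java
with made-up numbers -- a 4096MB node, qA guaranteed 1%, qB guaranteed 99%,
2048MB AM containers -- not actual CapacityScheduler code):
{code}
public class JitterDemo {
  public static void main(String[] args) {
    double clusterMB = 4096;
    double qAGuaranteed = 0.01 * clusterMB;   // ~41MB
    double qBGuaranteed = 0.99 * clusterMB;   // ~4055MB
    double qAUsed = 2048, qBUsed = 2048;      // both AM containers are running

    for (int round = 1; round <= 3; round++) {
      // b. preemption policy: qA is far over its guarantee, so its only
      //    container (the AM) is preempted and qA's usage drops to 0%
      qAUsed = 0;

      // c. scheduler: among children of a parent queue, the least relative
      //    usage goes first, so qA (0%) is tried before qB (~50%)
      boolean qAFirst = (qAUsed / qAGuaranteed) < (qBUsed / qBGuaranteed);
      System.out.println("round " + round + ": next container goes to "
          + (qAFirst ? "qA" : "qB"));

      // d. qA's AM container is launched again and we are back at a.
      qAUsed = 2048;
    }
  }
}
{code}
Every round prints qA, i.e. the freed 2G goes straight back to the queue that
was just preempted.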
*Solution for #2*
We can change the following two places.
First, the current capacity scheduler allocates containers starting from the
queue with the least usage when there are multiple queues under a parent queue.
We can change it to allocate starting from the queue that lacks the most
resource when there are multiple queues under a parent queue.
For example:
{code}
qA has guaranteed resource = 100MB and has used 0MB; its usage is 0% and its
lacking resource is 100MB.
qB has guaranteed resource = 1024MB and has used 500MB; its usage is about 50%
and its lacking resource is 524MB.
{code}
With the existing capacity scheduler, qA will be allocated first. After the
change, qB will be allocated first.
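A minimal sketch of the two orderings on that example (plain Java with a made-up
Q class, not the real queue comparator):
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class QueueOrdering {
  // Toy model of a queue: guaranteed and used memory in MB.
  static class Q {
    final String name;
    final long guaranteedMB, usedMB;
    Q(String name, long guaranteedMB, long usedMB) {
      this.name = name;
      this.guaranteedMB = guaranteedMB;
      this.usedMB = usedMB;
    }
    double usage()   { return (double) usedMB / guaranteedMB; } // existing ordering key
    long   lacking() { return guaranteedMB - usedMB; }          // proposed ordering key
  }

  public static void main(String[] args) {
    List<Q> queues = new ArrayList<>();
    queues.add(new Q("qA", 100, 0));     // 0% used, lacking 100MB
    queues.add(new Q("qB", 1024, 500));  // ~50% used, lacking 524MB

    // Existing behavior: least relative usage goes first -> qA wins.
    queues.sort(Comparator.comparingDouble(Q::usage));
    System.out.println("least usage first:  " + queues.get(0).name);

    // Proposed behavior: largest absolute lack of resource first -> qB wins.
    queues.sort(Comparator.comparingLong(Q::lacking).reversed());
    System.out.println("most lacking first: " + queues.get(0).name);
  }
}
{code}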
Second, only preempt a container if its resource usage is less than double of
toBePreempt. We can change the logic in the preemption policy to:
{code}
while (toBePreempt > 0):
  foreach application:
    foreach container:
      if (toBePreempt > 0) and (container.resource < toBePreempt * 2):  # new condition
        preempt container
        toBePreempt -= container.resource
      else:
        break
{code}
After we change the previous 2 places, 2 containers will be running in qB, and
qA cannot preempt containers from qB.
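A minimal sketch of the guarded selection on those numbers (plain Java; the 41MB
figure is just an assumed 1% of a 4096MB node, and this is not the actual
preemption-policy code):
{code}
import java.util.Arrays;
import java.util.List;

public class GuardedPreemption {
  // Walk candidate containers, mirroring the pseudocode above: stop as soon as
  // nothing is left to reclaim or the next container is at least twice the
  // amount still needed, so a big container (e.g. an AM) is not killed just to
  // get a small amount of resource back.
  static long select(List<Long> containerMB, long toBePreemptMB) {
    for (long c : containerMB) {
      if (toBePreemptMB > 0 && c < toBePreemptMB * 2) {   // the new condition
        System.out.println("preempt " + c + "MB container");
        toBePreemptMB -= c;
      } else {
        break;
      }
    }
    return toBePreemptMB;
  }

  public static void main(String[] args) {
    // Once qB runs two 2048MB containers, qA can only claim back up to its own
    // ~41MB guarantee, and neither 2048MB container passes the guard.
    long remaining = select(Arrays.asList(2048L, 2048L), 41L);
    System.out.println("left unreclaimed: " + remaining + "MB");
  }
}
{code}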
Any thoughts? [~vinodkv], [~jianhe], [~curino]?
> Preemption can hang in corner case by not allowing any task container to
> proceed.
> ---------------------------------------------------------------------------------
>
> Key: YARN-2297
> URL: https://issues.apache.org/jira/browse/YARN-2297
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.5.0
> Reporter: Tassapol Athiapinya
> Assignee: Wangda Tan
> Priority: Critical
>
> Preemption can cause a hang in a single-node cluster. Only AMs run. No task
> container can run.
> h3. queue configuration
> Queues A and B have 1% and 99% capacity, respectively.
> No max capacity.
> h3. scenario
> Turn on preemption. Configure 1 NM with 4 GB of memory. Use only 2 apps. Use
> 1 user.
> Submit app 1 to queue A. Its AM needs 2 GB. There is 1 task that needs 2 GB,
> occupying the entire cluster.
> Submit app 2 to queue B. Its AM needs 2 GB. There are 3 tasks that need 2 GB
> each.
> Instead of app 1 being preempted entirely, app 1's AM will stay. App 2's AM
> will launch. No task of either app can proceed.
> h3. commands
> /usr/lib/hadoop/bin/hadoop jar
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar randomtextwriter
> "-Dmapreduce.map.memory.mb=2000"
> "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M"
> "-Dmapreduce.randomtextwriter.bytespermap=2147483648"
> "-Dmapreduce.job.queuename=A" "-Dmapreduce.map.maxattempts=100"
> "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000"
> "-Dmapreduce.map.java.opts=-Xmx1800M"
> "-Dmapreduce.randomtextwriter.mapsperhost=1"
> "-Dmapreduce.randomtextwriter.totalbytes=2147483648" dir1
> /usr/lib/hadoop/bin/hadoop jar
> /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep
> "-Dmapreduce.map.memory.mb=2000"
> "-Dyarn.app.mapreduce.am.command-opts=-Xmx1800M"
> "-Dmapreduce.job.queuename=B" "-Dmapreduce.map.maxattempts=100"
> "-Dmapreduce.am.max-attempts=1" "-Dyarn.app.mapreduce.am.resource.mb=2000"
> "-Dmapreduce.map.java.opts=-Xmx1800M" -m 1 -r 0 -mt 4000 -rt 0