Yu Wang created YARN-10112:
------------------------------
Summary: Livelock (Runnable FairScheduler.getAppWeight) in
Resource Manager when used with Fair Scheduler size based weights enabled
Key: YARN-10112
URL: https://issues.apache.org/jira/browse/YARN-10112
Project: Hadoop YARN
Issue Type: Bug
Components: fairscheduler
Affects Versions: 2.8.5
Reporter: Yu Wang
The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight set to
true. In the JStack thread dump that the support engineers attached to the
ticket, we could see that the getAppWeight method of the FairScheduler class
was holding the FairScheduler object monitor continuously. As a result,
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
was always waiting to enter the same object monitor, which resulted in the
livelock.
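The contention pattern described above can be reproduced in isolation. The
following is a minimal sketch, not FairScheduler code: the class
MonitorContentionDemo and its methods update(), nodeUpdate() and observe() are
hypothetical stand-ins. One thread keeps computing inside a synchronized method
while a second thread tries to enter another synchronized method on the same
object, which yields the same RUNNABLE / BLOCKED pair of thread states that the
dump shows.

```java
import java.util.concurrent.CountDownLatch;

public class MonitorContentionDemo {
    private volatile boolean keepSpinning = true;
    private final CountDownLatch holderInside = new CountDownLatch(1);

    // Hypothetical stand-in for the update path: it holds this object's
    // monitor for as long as it keeps computing.
    synchronized void update() {
        holderInside.countDown();
        double sink = 0;
        while (keepSpinning) {
            sink += Math.log1p(sink + 1.0);
        }
    }

    // Hypothetical stand-in for nodeUpdate(): it needs the same monitor.
    synchronized void nodeUpdate() {
    }

    // Returns { state of the monitor holder, state of the waiting thread }.
    static Thread.State[] observe() throws InterruptedException {
        MonitorContentionDemo scheduler = new MonitorContentionDemo();
        Thread updater = new Thread(scheduler::update, "UpdateThread");
        Thread processor = new Thread(scheduler::nodeUpdate, "EventProcessor");

        updater.start();
        scheduler.holderInside.await();   // the updater now owns the monitor
        processor.start();

        // Wait (for at most 5 seconds) until the processor is parked on the monitor.
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (processor.getState() != Thread.State.BLOCKED
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State[] states = { updater.getState(), processor.getState() };

        // Release the monitor so both threads can finish.
        scheduler.keepSpinning = false;
        updater.join();
        processor.join();
        return states;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread.State[] states = observe();
        System.out.println("updater=" + states[0] + ", processor=" + states[1]);
    }
}
```

In this sketch the waiting thread makes progress only once the holder leaves
the synchronized method. If the holder never leaves it, the waiter stays
BLOCKED indefinitely.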
The issue occurs very infrequently, and we have not yet found a way to
reproduce it consistently. It resembles what YARN-1458 reports, but the fix
for that issue appears to have been in effect since 2.6.
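For context on the YARN-1458 style of failure, here is a minimal sketch under
stated assumptions. It is not the Hadoop implementation and it is not a
confirmed root cause for this ticket. It assumes that the size-based weight is
log2(1 + memory demand), and that the fair share computation doubles a
weight-to-resource ratio until the computed usage reaches the total resource.
The names ZeroWeightSketch, sizeBasedWeight, resourceUsed and doublingsNeeded
are hypothetical, and the loop is capped so that the example terminates.

```java
public class ZeroWeightSketch {
    // Assumed size-based weight: log2(1 + memory demand).
    // A demand of 0 gives a weight of 0.
    static double sizeBasedWeight(long demandMb) {
        return Math.log1p(demandMb) / Math.log(2);
    }

    // Simplified usage for a given ratio: each schedulable receives
    // weight * ratio, limited by its max share.
    static long resourceUsed(double ratio, double[] weights, long[] maxShares) {
        long used = 0;
        for (int i = 0; i < weights.length; i++) {
            used += (long) Math.min(weights[i] * ratio, (double) maxShares[i]);
        }
        return used;
    }

    // Doubles the ratio until the usage reaches totalResource.
    // Returns the number of doublings, or -1 if the cap is hit first.
    static int doublingsNeeded(double[] weights, long[] maxShares,
                               long totalResource, int cap) {
        double rMax = 1.0;
        int doublings = 0;
        while (resourceUsed(rMax, weights, maxShares) < totalResource) {
            if (doublings == cap) {
                return -1;   // without the cap, this loop would not end
            }
            rMax *= 2.0;
            doublings++;
        }
        return doublings;
    }

    public static void main(String[] args) {
        long[] unlimited = { Long.MAX_VALUE, Long.MAX_VALUE };
        double[] zeroDemand = { sizeBasedWeight(0), sizeBasedWeight(0) };
        double[] someDemand = { sizeBasedWeight(1), sizeBasedWeight(1023) };

        System.out.println("all-zero weights: "
                + doublingsNeeded(zeroDemand, unlimited, 1024, 64));
        System.out.println("non-zero weights: "
                + doublingsNeeded(someDemand, unlimited, 1024, 64));
    }
}
```

When every weight is zero, the usage stays at zero for any ratio, so in this
simplified model the doubling loop can only stop because of the cap.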
{code:java}
"ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x00007fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x00007fbcbcd5e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
        - waiting to lock <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
        at java.lang.Thread.run(Thread.java:748)

"FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x00007fbceea0e800 nid=0x2ea2 runnable [0x00007fbcbcf60000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.StrictMath.log1p(Native Method)
        at java.lang.Math.log1p(Math.java:1747)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
        - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
        - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
{code}