[
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu Wang resolved YARN-10112.
----------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
The fix needs to be backported from the 3.0 commits.
> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used
> with Fair Scheduler size based weights enabled
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.5
> Reporter: Yu Wang
> Assignee: Wilfred Spiegelenburg
> Priority: Minor
> Fix For: 3.0.0
>
>
> The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight set
> to true. In the JStack thread dump from the support ticket (below), the
> getAppWeight method of the FairScheduler class holds the FairScheduler object
> monitor continuously, so
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
> waits indefinitely to enter the same object monitor, resulting in the
> livelock.
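For context, the Fair Scheduler documentation describes the size-based weight as the natural logarithm of one plus the app's total requested memory, divided by the natural logarithm of 2, which matches the Math.log1p call visible in the thread dump. The following is only a minimal sketch of that formula; the class and method names are hypothetical and this is not the actual FairScheduler.getAppWeight code.

```java
public class SizeBasedWeightSketch {

    // Hypothetical helper (not the real FairScheduler.getAppWeight):
    // weight = ln(1 + requested memory) / ln(2), as described in the
    // Fair Scheduler documentation for yarn.scheduler.fair.sizebasedweight.
    static double sizeBasedWeight(long requestedMemory) {
        return Math.log1p(requestedMemory) / Math.log(2);
    }

    public static void main(String[] args) {
        // An app with no outstanding demand gets weight 0.0.
        System.out.println(sizeBasedWeight(0));
        // Weight grows logarithmically: roughly 10 for 1023 units of memory.
        System.out.println(sizeBasedWeight(1023));
    }
}
```

The computation itself is cheap and bounded, so the sketch suggests the time is more likely spent in how often the method is called under the lock than in any single call.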
>
> The issue occurs very infrequently, and we have not yet found a way to
> reproduce it consistently. It resembles what YARN-1458 reports, but the fix
> for that issue appears to have been in effect since 2.6.
>
>
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x00007fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x00007fbcbcd5e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>     - waiting to lock <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>     at java.lang.Thread.run(Thread.java:748)
>
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x00007fbceea0e800 nid=0x2ea2 runnable [0x00007fbcbcf60000]
>    java.lang.Thread.State: RUNNABLE
>     at java.lang.StrictMath.log1p(Native Method)
>     at java.lang.Math.log1p(Math.java:1747)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>     - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>     - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
> {code}
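The thread states in the dump above (one thread RUNNABLE while holding the monitor, another BLOCKED waiting for the same monitor) can be reproduced in isolation. The following is a minimal sketch under stated assumptions: the class, its fields, and the method bodies are hypothetical stand-ins, and only the names longUpdate/nodeUpdate loosely mirror the roles of FairScheduler.update and FairScheduler.nodeUpdate. It does not reproduce the actual bug.

```java
import java.util.concurrent.CountDownLatch;

public class MonitorContentionSketch {
    private final CountDownLatch lockHeld = new CountDownLatch(1);
    private volatile boolean release = false;

    // Stand-in for the update thread: stays RUNNABLE inside a synchronized
    // method, so it keeps the object monitor for as long as the loop runs.
    synchronized void longUpdate() {
        lockHeld.countDown();
        while (!release) {
            Math.log1p(42.0); // placeholder work, echoing the log1p frame in the dump
        }
    }

    // Stand-in for nodeUpdate: needs the same monitor, so it cannot start
    // until longUpdate returns.
    synchronized void nodeUpdate() {
    }

    // Starts both threads and returns the state observed for the waiter
    // while the updater still holds the monitor.
    static Thread.State observeWaiterState() throws InterruptedException {
        MonitorContentionSketch s = new MonitorContentionSketch();
        Thread updater = new Thread(s::longUpdate, "updater");
        Thread waiter = new Thread(s::nodeUpdate, "waiter");
        updater.start();
        s.lockHeld.await();            // the updater now owns the monitor
        waiter.start();

        // Poll (with a 5 second cap) until the waiter is parked on the monitor.
        Thread.State state = waiter.getState();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(1);
            state = waiter.getState();
        }

        s.release = true;              // let the updater finish so both threads exit
        updater.join();
        waiter.join();
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter state: " + observeWaiterState());
    }
}
```

In the sketch the waiter makes progress as soon as the loop ends. In the reported issue the update thread apparently never leaves the share computation, so the event processor stays blocked.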