[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled
[ https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026740#comment-17026740 ]

Yu Wang commented on YARN-10112:
--------------------------------

Thank you Wilfred for pointing out the resolution!

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used
> with Fair Scheduler size based weights enabled
> ---
>
>                 Key: YARN-10112
>                 URL: https://issues.apache.org/jira/browse/YARN-10112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.8.5
>            Reporter: Yu Wang
>            Assignee: Wilfred Spiegelenburg
>            Priority: Minor
>
> The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight set
> to true. From the JStack thread dump attached to the support ticket, we could
> see that the getAppWeight method of FairScheduler (below) was continuously
> holding the FairScheduler object monitor, which left
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
> perpetually waiting to enter the same monitor, resulting in the livelock.
>
> The issue occurs very infrequently and we have not yet found a way to
> reproduce it consistently. It resembles what YARN-1458 reports, but that code
> fix appears to have been in effect since 2.6.
>
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>         - waiting to lock <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>         at java.lang.Thread.run(Thread.java:748)
>
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.StrictMath.log1p(Native Method)
>         at java.lang.Math.log1p(Math.java:1747)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>         - locked <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>         - locked <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
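For readers on the 2.8.x line: the hot path visible in the dump above can be sketched roughly as follows. This is a simplified illustration of the sizebasedweight code path, not the actual FairScheduler source; the class name and method signature here are made up for the sketch. The key point is that the weight computation (the Math.log1p call that shows up RUNNABLE in the dump) runs inside a method synchronized on the scheduler instance, so the update thread re-enters the scheduler monitor for every app on every update pass while the event-processor thread waits on the same monitor in nodeUpdate.

```java
// Simplified sketch of the Hadoop 2.8-era pattern behind the dump above.
// Not the real FairScheduler source; names and shapes are illustrative.
public class SizeBasedWeightSketch {
    private final boolean sizeBasedWeight = true;

    // Roughly where FairScheduler.getAppWeight sat in 2.8.5: synchronized
    // on the scheduler instance, which is why the dump shows the update
    // thread as "locked <...> (a ...FairScheduler)" while in Math.log1p.
    public synchronized double getAppWeight(long demandMemoryMb) {
        double weight = 1.0;
        if (sizeBasedWeight) {
            // Weight grows with the log of the app's resource demand,
            // which is what yarn.scheduler.fair.sizebasedweight enables.
            weight = Math.log1p(demandMemoryMb) / Math.log(2);
        }
        return weight;
    }

    public static void main(String[] args) {
        SizeBasedWeightSketch s = new SizeBasedWeightSketch();
        // Weight grows logarithmically with demand (roughly 10 for 1 GiB).
        System.out.println(s.getAppWeight(1024));
    }
}
```

With many running apps, the update thread repeatedly acquires this monitor for long stretches, which matches the BLOCKED nodeUpdate thread in the dump.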
[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled
[ https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026400#comment-17026400 ]

Wilfred Spiegelenburg commented on YARN-10112:
----------------------------------------------

This does not happen in the current releases of YARN anymore. In YARN-7414 we
moved {{getAppWeight}} out of the scheduler into the {{FSAppAttempt}}. That did
not solve the locking issue but was the right thing to do. In the follow-up
YARN-7513 I removed the lock from the new call.

I would say that this is thus a duplicate of the combination of YARN-7414 and
YARN-7513. Both are fixed in Hadoop 3.0.1 and 3.1. Backporting this change is
possible.
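To illustrate the direction described in Wilfred's comment (YARN-7414 moving the weight computation into FSAppAttempt, YARN-7513 dropping the lock from the new call), here is a rough sketch. It is not the actual patch; the class, field, and constructor here are assumptions made for the illustration.

```java
// Hedged sketch of the post-YARN-7414/YARN-7513 shape: the app attempt
// computes its own weight from its own demand, so the Math.log1p call no
// longer runs while holding the scheduler-wide monitor. Not the real patch.
public class FSAppAttemptSketch {
    private final boolean sizeBasedWeight;
    private volatile long demandMemoryMb; // updated elsewhere by the scheduler

    public FSAppAttemptSketch(boolean sizeBasedWeight, long demandMemoryMb) {
        this.sizeBasedWeight = sizeBasedWeight;
        this.demandMemoryMb = demandMemoryMb;
    }

    // No synchronized keyword: a single volatile read is enough here, so
    // the update thread never pins the FairScheduler monitor in this call.
    public double getWeight() {
        if (!sizeBasedWeight) {
            return 1.0;
        }
        return Math.log1p(demandMemoryMb) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(new FSAppAttemptSketch(true, 1024).getWeight());
    }
}
```

The design point is that the result was already per-app state, so serializing the computation through the scheduler monitor bought nothing and starved nodeUpdate.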