[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-30 Thread Yu Wang (Jira)


[ https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026740#comment-17026740 ]

Yu Wang commented on YARN-10112:


Thank you, Wilfred, for pointing out the resolution!

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>
> The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight set 
> to true. From the JStack thread dump attached to the support ticket, we could 
> see that the getAppWeight method of FairScheduler (shown below) was 
> continuously holding the FairScheduler object monitor, which kept 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  waiting to enter the same object monitor, resulting in the livelock.
>  
> The issue occurs very infrequently and we are still unable to find a way to 
> reproduce it consistently. It resembles what YARN-1458 reports, but that fix 
> appears to have been in effect since 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
> 	- waiting to lock <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
> 	at java.lang.Thread.run(Thread.java:748)
> 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6]
>    java.lang.Thread.State: RUNNABLE
> 	at java.lang.StrictMath.log1p(Native Method)
> 	at java.lang.Math.log1p(Math.java:1747)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
> 	- locked <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
> 	- locked <0x0006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
> {code}
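The hot frame in the RUNNABLE thread of the dump above is the size-based weight math inside getAppWeight (the StrictMath.log1p call, executed while holding the scheduler monitor). A rough sketch of that computation follows; the class and method names are illustrative, and the formula is an assumption about what the size-based weight roughly does, not copied from Hadoop source.

```java
// Hypothetical sketch of the size-based weight computation seen in the
// hot frame (log1p under the scheduler monitor). Names and the exact
// formula are illustrative assumptions, not the real Hadoop source.
public class SizeBasedWeight {
    // With yarn.scheduler.fair.sizebasedweight=true the app's weight
    // grows with the log of its resource demand, so larger apps get a
    // sub-linearly larger fair share.
    static double appWeight(long demandMb) {
        return Math.log1p(demandMb) / Math.log(2); // log2(demand + 1)
    }

    public static void main(String[] args) {
        // In the reported hang, this kind of per-app computation ran for
        // every app on every update cycle while the scheduler lock was held.
        System.out.println(appWeight(1023)); // ≈ 10.0 (log2 of 1024)
    }
}
```

The point is not the formula itself but that it is pure floating-point work repeated per app, which is cheap once but expensive when done for thousands of apps inside a scheduler-wide critical section.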



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Wilfred Spiegelenburg (Jira)


[ https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026400#comment-17026400 ]

Wilfred Spiegelenburg commented on YARN-10112:
--

This no longer happens in current releases of YARN.

In YARN-7414 we moved {{getAppWeight}} out of the scheduler into 
{{FSAppAttempt}}. That did not solve the locking issue by itself but was the 
right thing to do. In the follow-up YARN-7513 I removed the lock from the new 
call. I would therefore say this is a duplicate of the combination of 
YARN-7414 and YARN-7513.

Both are fixed in Hadoop 3.0.1 and 3.1. Backporting the change is possible.
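The before/after locking pattern described here can be sketched as follows. This is a minimal illustration under simplified, hypothetical names, not the actual Hadoop code: the "before" variant computes each app's weight inside the scheduler-wide monitor (starving nodeUpdate, which needs the same monitor), while the "after" variant performs the same computation without that lock.

```java
// Minimal sketch of the contention pattern and the direction of the fix
// (YARN-7414 moved the weight computation out of the scheduler;
// YARN-7513 dropped the lock). All names here are illustrative.
public class LockContentionSketch {

    static class Scheduler {
        // Before the fix: each per-app weight computation entered the
        // scheduler-wide monitor, so a long update cycle over many apps
        // kept nodeUpdate() blocked on the same monitor.
        synchronized double getAppWeightLocked(long demandMb) {
            return Math.log1p(demandMb) / Math.log(2);
        }

        // After the fix: the computation only reads per-app state, so it
        // needs no scheduler-wide lock and nodeUpdate() can interleave.
        double getAppWeightLockFree(long demandMb) {
            return Math.log1p(demandMb) / Math.log(2);
        }
    }

    public static void main(String[] args) {
        Scheduler s = new Scheduler();
        // Same result either way; only the locking discipline differs.
        System.out.println(s.getAppWeightLocked(2048) == s.getAppWeightLockFree(2048)); // prints "true"
    }
}
```

The design point is that removing the lock is safe only because the moved computation no longer touches shared scheduler state, which is why YARN-7414 (the move) had to precede YARN-7513 (the lock removal).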



