[jira] [Resolved] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-30 Thread Yu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang resolved YARN-10112.

Fix Version/s: 3.0.0
   Resolution: Fixed

Need to backport from the commits of 3.0

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
> Fix For: 3.0.0
>
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the ticket JStack thread dump from the support engineers, we 
> could see that the method getAppWeight below in the class of FairScheduler 
> was occupying the FairScheduler object monitor always, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  always await of entering the same object monitor, thus resulting in the the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles to what the Jira 
> YARN-1458 reports, but it seems that code fix has taken into effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-30 Thread Yu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026740#comment-17026740
 ] 

Yu Wang commented on YARN-10112:


Thank you Wilfred for pointing out the resolution! 

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the ticket JStack thread dump from the support engineers, we 
> could see that the method getAppWeight below in the class of FairScheduler 
> was occupying the FairScheduler object monitor always, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  always await of entering the same object monitor, thus resulting in the the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles to what the Jira 
> YARN-1458 reports, but it seems that code fix has taken into effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Yu Wang (Jira)
Yu Wang created YARN-10112:
--

 Summary: Livelock (Runnable FairScheduler.getAppWeight) in 
Resource Manager when used with Fair Scheduler size based weights enabled
 Key: YARN-10112
 URL: https://issues.apache.org/jira/browse/YARN-10112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.8.5
Reporter: Yu Wang


The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is set 
true. From the ticket JStack thread dump from the support engineers, we could 
see that the method getAppWeight below in the class of FairScheduler was 
occupying the FairScheduler object monitor always, which made 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
 always await of entering the same object monitor, thus resulting in the the 
livelock.

 

The issue occurs very infrequently and we are still unable to figure out a way 
to consistently reproduce the issue. The issue resembles to what the Jira 
YARN-1458 reports, but it seems that code fix has taken into effect since 2.6. 

 

 
{code:java}
"ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
java.lang.Thread.State: BLOCKED (on object monitor) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
 - waiting to lock <0x0006eb816b18> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
 at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
 at java.lang.Thread.run(Thread.java:748) 





"FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x7fbceea0e800 
nid=0x2ea2 runnable [0x7fbcbcf6] java.lang.Thread.State: RUNNABLE at 
java.lang.StrictMath.log1p(Native Method) at 
java.lang.Math.log1p(Math.java:1747) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
 - locked <0x0006eb816b18> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
 at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
 - locked <0x0006eb816b18> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org