[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210097#comment-15210097 ]
David Watzke commented on YARN-1458:
------------------------------------
It seems to me that this is not fixed in 2.6.0. We've hit this bug with CDH 5.4.4,
which ships with a patched Hadoop 2.6.0:
{noformat}
"FairSchedulerUpdateThread" daemon prio=10 tid=0x00007f550c0f5800 nid=0x1155 runnable [0x00007f54fdf4d000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.StrictMath.log1p(Native Method)
    at java.lang.Math.log1p(Math.java:1236)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:510)
    - locked <0x00000007a58d9430> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:749)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:191)
{noformat}
Disabling size-based weights helped immediately (apps were running again).
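
For anyone trying to picture how a zero weight could keep the update thread spinning, here is a minimal sketch. It is not the Hadoop code: the class and method names are hypothetical, the log2(1 + demand) weight is an assumption suggested by the log1p frame in the trace above, and min/max shares are ignored. It models a search that doubles a weight-to-resource ratio until the cluster is filled, with an iteration cap so the sketch itself terminates.

```java
public class ZeroWeightSketch {
    // Assumed form of a size-based weight: log2(1 + demand).
    // A demand of 0 gives a weight of exactly 0.
    static double sizeBasedWeight(long demandMb) {
        return Math.log1p(demandMb) / Math.log(2);
    }

    // Resources handed out at a given weight-to-resource ratio
    // (hypothetical model; min and max shares are ignored).
    static double resourceUsed(double ratio, double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w * ratio;
        }
        return total;
    }

    // Doubles the ratio until the cluster is filled. Returns the number of
    // doublings needed, or -1 if the cap is hit first. Without the cap, the
    // -1 case would be a loop that never exits.
    static int doublingsToFill(double[] weights, double totalResource, int maxIterations) {
        double ratio = 1.0;
        for (int i = 0; i < maxIterations; i++) {
            if (resourceUsed(ratio, weights) >= totalResource) {
                return i;
            }
            ratio *= 2.0;
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] nonZero = { sizeBasedWeight(1024), sizeBasedWeight(2048) };
        double[] allZero = { sizeBasedWeight(0), sizeBasedWeight(0) };
        System.out.println("non-zero weights: " + doublingsToFill(nonZero, 8192, 1000));
        System.out.println("all-zero weights: " + doublingsToFill(allZero, 8192, 1000));
    }
}
```

With every weight at zero, the amount handed out stays at zero no matter how far the ratio grows, so the fill condition is never met. That would be consistent with the workaround above, since turning off size-based weights (the yarn.scheduler.fair.sizebasedweight property) avoids the demand-derived weight altogether.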
> FairScheduler: Zero weight can lead to livelock
> -----------------------------------------------
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
> Reporter: qingwu.fu
> Assignee: zhihai xu
> Labels: patch
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch,
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch,
> YARN-1458.alternative0.patch, YARN-1458.alternative1.patch,
> YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch,
> yarn-1458-7.patch, yarn-1458-8.patch
>
> Original Estimate: 408h
> Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocks when
> clients submit lots of jobs. It is not easy to reproduce; we ran the test
> cluster for days to reproduce it. The output of the jstack command on the
> ResourceManager pid:
> {code}
> "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>     - waiting to lock <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>     at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 runnable [0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>     - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>     - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>     at java.lang.Thread.run(Thread.java:744)
> {code}