zhengchenyu created YARN-7560:
---------------------------------
Summary: ResourceManager hangs when resourceUsedWithWeightToResourceRatio returns an overflowed value
Key: YARN-7560
URL: https://issues.apache.org/jira/browse/YARN-7560
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.1, 3.0.0
Reporter: zhengchenyu
Fix For: 2.7.5
In our cluster, we changed the configuration and then ran refreshQueues, after
which the ResourceManager hung. The ResourceManager also could not restart
successfully. The jstack output looks like this:
{code}
"main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
    - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
    - locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x00007f8c4a76ac48> (a java.lang.Object)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x00007f8c49254268> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    - locked <0x00007f8c467495e0> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
{code}
While debugging the cluster, we found that resourceUsedWithWeightToResourceRatio
returns a negative value, so the loop never exits. We also found that in our
cluster all minRes is over Integer.MAX_VALUE.
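
The following is a minimal sketch of how a negative return value can keep such a loop from exiting. It assumes the total is accumulated in an int and that the caller keeps doubling a ratio until the total reaches the cluster resource. The names OverflowSketch, sumAsInt, sumClamped and doublings are hypothetical and are not the actual ComputeFairShares code; the iteration cap exists only so that the demonstration terminates.

```java
public class OverflowSketch {
    // Hypothetical stand-in for the summation: accumulates in an int,
    // so a total above Integer.MAX_VALUE wraps around to a negative number.
    static int sumAsInt(int[] minShares, double ratio) {
        int total = 0;
        for (int min : minShares) {
            total += Math.max(min, (int) ratio); // may wrap silently
        }
        return total;
    }

    // Possible mitigation (an assumption, not the committed fix):
    // accumulate in a long and clamp the result to Integer.MAX_VALUE.
    static int sumClamped(int[] minShares, double ratio) {
        long total = 0;
        for (int min : minShares) {
            total += Math.max(min, (int) ratio);
        }
        return (int) Math.min(total, Integer.MAX_VALUE);
    }

    // Doubles the ratio until the summed usage reaches totalResource.
    // Returns the number of doublings, or -1 if the cap is hit first.
    static int doublings(int[] minShares, int totalResource,
                         boolean clamp, int cap) {
        double ratio = 1.0;
        for (int i = 0; i < cap; i++) {
            int used = clamp ? sumClamped(minShares, ratio)
                             : sumAsInt(minShares, ratio);
            if (used >= totalResource) {
                return i;
            }
            ratio *= 2.0;
        }
        return -1; // an uncapped loop would spin forever here
    }

    public static void main(String[] args) {
        // Two illustrative min shares whose sum exceeds Integer.MAX_VALUE.
        int[] minShares = {1_500_000_000, 1_500_000_000};
        System.out.println(sumAsInt(minShares, 1.0));                   // -1294967296
        System.out.println(doublings(minShares, 1_000_000, false, 64)); // -1
        System.out.println(doublings(minShares, 1_000_000, true, 64));  // 0
    }
}
```

With the wrapping sum, the total stays negative for every ratio, so the exit condition is never met. With the clamped sum, the condition holds on the first check.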