[ 
https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266273#comment-16266273
 ] 

Wilfred Spiegelenburg commented on YARN-7560:
---------------------------------------------

Thank you [~zhengchenyu] for the patch.
Some comments on the patch:
* Can you please remove the unneeded casts to long that are left in 
{{computeSharesInternal}} and {{handleFixedFairShares}}:
{code}
127      totalMaxShare = Math.min(maxShare + (long)totalMaxShare,
128          Long.MAX_VALUE);
...
169      target.setResourceValue(type, (long)computeShare(sched, right, type));
{code}
and
{code}
224        totalResource = Math.min((long)totalResource + (long)fixedShare,
225            Long.MAX_VALUE);
{code}
* In {{resourceUsedWithWeightToResourceRatio}} we should not need the 
temporary variable {{share}} and could simply do:
{code}
  resourcesTaken += computeShare(sched, w2rRatio, type);
{code}
* In {{computeShare}} we should move the cast from double to long to the point 
where we calculate the share, instead of leaving it until after the min and 
max checks, and remove the cast at the end of the method. That will speed up 
the calculations slightly and won't change the outcome:
{code}
192    long share = (long)(sched.getWeight() * w2rRatio);
{code}
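For illustration, the suggested cast placement could look like the sketch below (a hypothetical standalone version; the real method lives in ComputeFairShares and takes a Schedulable, so the parameter names here are assumptions):

```java
// Sketch of the suggested cast placement: cast the double product to
// long immediately, then do the min/max clamping in long arithmetic,
// so no cast is needed on the return value.
public class ComputeShareSketch {
  static long computeShare(double weight, double w2rRatio,
      long minShare, long maxShare) {
    long share = (long) (weight * w2rRatio); // cast happens here, once
    share = Math.max(share, minShare);       // clamp in long arithmetic
    share = Math.min(share, maxShare);
    return share;                            // no trailing cast needed
  }

  public static void main(String[] args) {
    System.out.println(computeShare(2.0, 1.5, 1L, 100L)); // prints 3
  }
}
```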

> Resourcemanager hangs when  resourceUsedWithWeightToResourceRatio return a 
> overflow value 
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-7560
>                 URL: https://issues.apache.org/jira/browse/YARN-7560
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>             Fix For: 3.0.0
>
>         Attachments: YARN-7560.000.patch
>
>
> In our cluster, we changed the configuration and then called refreshQueues, 
> after which the ResourceManager hung. The ResourceManager also can't restart 
> successfully. The jstack output always shows this:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable 
> [0x00007f98eed9a000]
>    java.lang.Thread.State: RUNNABLE
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
>         - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
>         - locked <0x00007f8c4a7eb2e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c4a76ac48> (a java.lang.Object)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c49254268> (a java.lang.Object)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c467495e0> (a java.lang.Object)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that 
> resourceUsedWithWeightToResourceRatio returns a negative value, so the loop 
> can't exit. In our cluster, the sum of all minRes exceeds Integer.MAX_VALUE, 
> which makes resourceUsedWithWeightToResourceRatio overflow. Below is the 
> loop. Because totalResource is a long, it is always positive, but 
> resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so big 
> that the sum overflows to a negative value, so the loop never breaks.
> {code}
>     while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>         < totalResource) {
>       rMax *= 2.0;
>     }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
