[ https://issues.apache.org/jira/browse/YARN-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687606#comment-16687606 ]
ASF GitHub Bot commented on YARN-8833:
--------------------------------------

GitHub user yoelee opened a pull request:

    https://github.com/apache/hadoop/pull/439

    YARN-8833 fix compute shares may lock the scheduling process

    When computing fair shares, an integer overflow can occur and send the
    computation into an infinite loop, which blocks the scheduling process.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yoelee/hadoop trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/439.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #439

----
commit 39a6f7cab193be910bfb34265ceb696ddbd78da5
Author: liyakun.hit <liyakun.hit@...>
Date:   2018-11-15T07:28:34Z

    YARN-8833 fix compute shares may lock the scheduling process

----

> compute shares may lock the scheduling process
> -----------------------------------------------
>
>                 Key: YARN-8833
>                 URL: https://issues.apache.org/jira/browse/YARN-8833
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: liyakun
>            Assignee: liyakun
>            Priority: Major
>
> When w2rRatio is used to compute the fair share, an int overflow can occur
> and send the computation into an infinite loop.
> Because the thread that computes shares holds the writeLock, this can block
> the scheduling thread.
> This issue occurred in a production environment with 8500 nodes, and we have
> already fixed it.
>
> added 2018-10-29: elaborate the problem
>
> /**
>  * Compute the resources that would be used given a weight-to-resource ratio
>  * w2rRatio, for use in the computeFairShares algorithm as described in #
>  */
> private static int resourceUsedWithWeightToResourceRatio(double w2rRatio,
>     Collection<? extends Schedulable> schedulables, String type) {
>   int resourcesTaken = 0;
>   for (Schedulable sched : schedulables) {
>     int share = computeShare(sched, w2rRatio, type);
>     resourcesTaken += share;
>   }
>   return resourcesTaken;
> }
>
> The variable resourcesTaken is an int. It accumulates the results of
> computeShare(Schedulable sched, double w2rRatio, String type), each of which
> is a value between the min share and the max share of a queue.
> For example, when several queues each have min share = max share =
> Integer.MAX_VALUE, the sum exceeds the Integer range and can wrap around to a
> negative number.
> When resourceUsedWithWeightToResourceRatio(double w2rRatio, Collection<?
> extends Schedulable> schedulables, String type) returns a negative number,
> the loop in computeSharesInternal(), which runs while the scheduler lock is
> held, may never exit.
>
> // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>     < totalResource) {
>   rMax *= 2.0;
> }
>
> This can block the scheduling thread.
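
The overflow described in the issue can be reproduced in isolation. The following is a minimal sketch, not the code from the pull request: the class name ShareOverflowSketch and the methods sumAsInt and sumSaturating are hypothetical, and the saturating variant is only one assumed way to keep the sum from wrapping. It also shows that the wrapped result depends on how many shares are added: two values of Integer.MAX_VALUE wrap to -2, while three wrap back to a positive number.

```java
// Illustrative sketch only. All names are hypothetical and are not taken
// from the Hadoop source or from the patch in the pull request.
final class ShareOverflowSketch {

  // Mirrors the int accumulation described in the issue: the sum silently
  // wraps around once it passes Integer.MAX_VALUE.
  static int sumAsInt(int[] shares) {
    int taken = 0;
    for (int share : shares) {
      taken += share;
    }
    return taken;
  }

  // Assumed alternative: accumulate in a long and clamp the result to
  // Integer.MAX_VALUE. Assumes every share is non-negative.
  static int sumSaturating(int[] shares) {
    long taken = 0;
    for (int share : shares) {
      taken += share;
      if (taken >= Integer.MAX_VALUE) {
        return Integer.MAX_VALUE;
      }
    }
    return (int) taken;
  }

  public static void main(String[] args) {
    int[] two = {Integer.MAX_VALUE, Integer.MAX_VALUE};
    int[] three = {Integer.MAX_VALUE, Integer.MAX_VALUE, Integer.MAX_VALUE};

    System.out.println(sumAsInt(two));        // prints -2
    System.out.println(sumAsInt(three));      // prints 2147483645
    System.out.println(sumSaturating(two));   // prints 2147483647
    System.out.println(sumSaturating(three)); // prints 2147483647
  }
}
```

With min share = max share, each share stays the same no matter how large rMax grows, so a wrapped sum of -2 remains below any positive totalResource and the doubling loop never meets its exit condition. A clamped sum avoids that as long as totalResource itself fits in an int.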