[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108519#comment-14108519
 ] 

zhihai xu commented on YARN-1458:
---------------------------------

I just found another corner case, which can cause loop forever, this corner 
case is if weight is 0 but minShare is not 0.
For example, we have tow active queues:queueA and queueB,
queueA 's weight is 0, queueA's minShare is 1.
queueB 's weight is 0, queueB's minShare is 1.
and total resource is 1024.
computeShare for both queueA and queueB will return 1.
So resourceUsedWithWeightToResourceRatio will always return 2, no matter what  
w2rRatio will be.
Then it will loop forever. Check zero("break when zero") won't be enough.
In this case, Modifying the first solution is very difficult to fix this case,
it will make more sense to modify the alternative solution.

> In Fair Scheduler, size based weight can cause update thread to hold lock 
> indefinitely
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-1458
>                 URL: https://issues.apache.org/jira/browse/YARN-1458
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.2.0
>         Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>            Reporter: qingwu.fu
>            Assignee: zhihai xu
>              Labels: patch
>             Fix For: 2.2.1
>
>         Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, 
> YARN-1458.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submit lots jobs, it is not easy to reapear. We run the test cluster 
> for days to reapear it. The output of  jstack command on resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>         - waiting to lock <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 
> runnable [0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>         - locked <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>         - locked <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>         at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to