[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-1458:
-----------------------------

    Description: 
The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
clients submit lots jobs, it is not easy to reapear. We run the test cluster 
for days to reapear it. The output of  jstack command on resourcemanager pid:
{code}
 "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 
waiting for monitor entry [0x0000000043aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
        - waiting to lock <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
        at java.lang.Thread.run(Thread.java:744)
……
"FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 
runnable [0x00000000433a2000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
        - locked <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
        - locked <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
        at java.lang.Thread.run(Thread.java:744)
{code}


  was:
The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
clients submit lots jobs, it is not easy to reapear. We run the test cluster 
for days to reapear it. The output of  jstack command on resourcemanager pid:
 "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 
waiting for monitor entry [0x0000000043aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
        - waiting to lock <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
        at java.lang.Thread.run(Thread.java:744)
……
"FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 
runnable [0x00000000433a2000]
   java.lang.Thread.State: RUNNABLE
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
        - locked <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
        - locked <0x000000070026b6e0> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
        at java.lang.Thread.run(Thread.java:744)


> In Fair Scheduler, size based weight can cause update thread to hold lock 
> indefinitely
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-1458
>                 URL: https://issues.apache.org/jira/browse/YARN-1458
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.2.0
>         Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>            Reporter: qingwu.fu
>              Labels: patch
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
> clients submit lots jobs, it is not easy to reapear. We run the test cluster 
> for days to reapear it. The output of  jstack command on resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>         - waiting to lock <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 
> runnable [0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>         - locked <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>         - locked <0x000000070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>         at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to