[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602323#comment-16602323
 ] 

Wangda Tan commented on YARN-8513:
----------------------------------

Interesting. This must be caused by CS allocation not fully considering the 
queue maximum resource in some cases. I looked at the related code but haven't 
figured out the root cause yet. 

The CS allocation phase relies on the ResourceLimits passed down by the 
upper-level component (the parent for queues, the queue for apps, etc.). In some 
corner cases, the ResourceLimits passed in could be larger than the actual limit. 
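
To illustrate the hypothesis above, here is a minimal, self-contained sketch. It 
is not the actual CapacityScheduler code: the class ResourceLimitsSketch, its 
Queue type, and the methods accurateLimit, propose, commit and rejectedAttempts 
are all hypothetical names invented for this example. It assumes an allocation 
step that trusts the limit it is handed, and a commit step that re-checks the 
queue's real maximum. If the limit handed down is too large, the same proposal 
can be produced and rejected repeatedly without any state changing.

```java
public class ResourceLimitsSketch {
    /** Hypothetical stand-in for a queue; NOT the real CSQueue API. */
    public static final class Queue {
        final String name;
        final long maxMemory;  // configured maximum resource, in MB
        long usedMemory;       // currently used resource, in MB

        public Queue(String name, long maxMemory, long usedMemory) {
            this.name = name;
            this.maxMemory = maxMemory;
            this.usedMemory = usedMemory;
        }
    }

    /** Limit an accurate parent would pass down: never above the child's maximum. */
    public static long accurateLimit(long parentLimit, Queue child) {
        return Math.min(parentLimit, child.maxMemory);
    }

    /** Allocation phase (assumed): trusts whatever limit it was handed. */
    public static boolean propose(long limit, Queue q, long request) {
        return q.usedMemory + request <= limit;
    }

    /** Commit phase (assumed): re-checks against the queue's real maximum. */
    public static boolean commit(Queue q, long request) {
        if (q.usedMemory + request > q.maxMemory) {
            return false;  // proposal rejected, queue state unchanged
        }
        q.usedMemory += request;
        return true;
    }

    /** Counts proposals rejected at commit time, capped at maxAttempts. */
    public static int rejectedAttempts(long limit, Queue q, long request, int maxAttempts) {
        int rejected = 0;
        for (int i = 0; i < maxAttempts; i++) {
            if (!propose(limit, q, request)) {
                break;  // nothing proposed, so the scheduler can move on
            }
            if (commit(q, request)) {
                break;  // accepted, so the state changed
            }
            rejected++;  // rejected with no state change: same proposal next round
        }
        return rejected;
    }

    public static void main(String[] args) {
        Queue q = new Queue("near-full", 1000, 998);
        long parentLimit = 1200;  // larger than the queue's own maximum
        // Too-large limit: every attempt is proposed, then rejected.
        System.out.println(rejectedAttempts(parentLimit, q, 4, 5));                   // prints 5
        // Accurate limit: the request is never proposed in the first place.
        System.out.println(rejectedAttempts(accurateLimit(parentLimit, q), q, 4, 5)); // prints 0
    }
}
```

The cap on attempts exists only so the sketch terminates; without such a cap, 
the first case would repeat indefinitely, which would be consistent with the 
repeated "Failed to accept allocation proposal" messages in the report.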

[~Card], could you enable DEBUG logging for 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity and rerun the 
test? (Or you can click "dump DEBUG log" in the CS web UI.) A few seconds of 
DEBUG log captured while the infinite loop is happening would be very helpful 
for our troubleshooting.
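
As a sketch of one way to do this, assuming the ResourceManager reads a log4j 
1.x style log4j.properties file (the file location depends on your deployment, 
and the RM typically needs a restart to pick up the change), the following line 
raises the log level for that package:

```properties
# Assumption: log4j 1.x properties syntax, as used by Hadoop 2.9.x / 3.1.x.
# Enables DEBUG output for all CapacityScheduler classes in this package.
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity=DEBUG
```

Given the log volume reported during the loop, it is worth reverting this 
setting once the DEBUG log has been captured.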

> CapacityScheduler infinite loop when queue is near fully utilized
> -----------------------------------------------------------------
>
>                 Key: YARN-8513
>                 URL: https://issues.apache.org/jira/browse/YARN-8513
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, yarn
>    Affects Versions: 3.1.0, 2.9.1
>         Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>            Reporter: Chen Yufei
>            Priority: Major
>         Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager sometimes stops responding to any request when a queue is 
> nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. 
> After an RM restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used=<memory:16170624, vCores:1577> 
> cluster=<memory:29441544, vCores:5792>}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_000001 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource=<memory:29441544, vCores:5792> type=NODE_LOCAL 
> requestedPartition=}}
>  
> I have encountered this problem several times since upgrading to YARN 2.9.1, 
> while the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler; I'm not sure whether this 
> is a similar problem.
>  



