[ https://issues.apache.org/jira/browse/YARN-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623035#comment-16623035 ]
Tao Yang edited comment on YARN-8804 at 9/21/18 4:17 AM:
---------------------------------------------------------
Thanks [~jlowe], [~leftnoteasy] for your review and reply. Regarding the volatile keyword: it was a mistake introduced when I copied the headroom field from ResourceLimits; I should have removed it afterwards. The resourceLimits in the scheduling process is thread-safe because it isn't shared by multiple scheduling threads: every scheduling thread creates a ResourceLimits instance at the beginning of the scheduling process, in CapacityScheduler#allocateOrReserveNewContainers or CapacityScheduler#allocateContainerOnSingleNode, and then passes it on.

{quote}
I think it would be cleaner if a queue could return an assignment result that not only indicated the allocation was skipped due to queue limits but also how much needs to be reserved as a result of that skipped assignment.
{quote}

Now we can get the reserved resource from {{childLimits.getHeadroom()}} for the leaf queue, then add it into the blockedHeadroom of the leaf/parent queue, so that later queues can get the correct net limits through {{limit - blockedHeadroom}}. I think that is enough to solve this problem. Thoughts?

{quote}
The result would be less overhead for the normal scheduler loop, as we would only be adjusting when necessary rather than every time.
{quote}

Thanks for mentioning this. I will improve the calculation to avoid doing it every time by adding ResourceLimits#getNetLimit; this method will do the calculation only when necessary rather than every time.

{quote}
From my analysis of YARN-8513, scheduler tries to allocate containers to queue when it will go beyond max capacity (used + allocating > max). But resource committer will reject such proposal.
{quote}

YARN-8513 is not the same problem as this issue, per the earlier comments from [~jlowe]; it seems similar to YARN-8771, whose problem is caused by a wrong calculation of needUnreservedResource in RegularContainerAllocator#assignContainer when the cluster has an empty resource type.
But I am not sure they are the same problem.
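To make the blockedHeadroom/getNetLimit idea above concrete, here is a minimal sketch. It is not the actual YARN code: Resource is simplified to (memory, vcores), and only the names blockedHeadroom and getNetLimit follow the comment; everything else is assumed for illustration.

```java
// Minimal sketch of the lazy net-limit idea from the comment above.
// Resource is simplified to (memoryMB, vcores); the real YARN Resource
// and ResourceLimits classes are richer. Only blockedHeadroom and
// getNetLimit() follow the comment; the rest is assumed.
final class Resource {
    final long memoryMB;
    final int vcores;

    Resource(long memoryMB, int vcores) {
        this.memoryMB = memoryMB;
        this.vcores = vcores;
    }

    Resource add(Resource other) {
        return new Resource(memoryMB + other.memoryMB, vcores + other.vcores);
    }

    Resource subtract(Resource other) {
        return new Resource(memoryMB - other.memoryMB, vcores - other.vcores);
    }

    boolean isZero() {
        return memoryMB == 0 && vcores == 0;
    }
}

final class ResourceLimits {
    private final Resource limit;
    // Headroom of child queues skipped due to QUEUE_LIMIT, accumulated
    // as the scheduler walks the queue hierarchy.
    private Resource blockedHeadroom = new Resource(0, 0);

    ResourceLimits(Resource limit) {
        this.limit = limit;
    }

    void addBlockedHeadroom(Resource headroom) {
        blockedHeadroom = blockedHeadroom.add(headroom);
    }

    // Lazy: subtract only when some headroom was actually blocked,
    // so the normal scheduling path pays no extra cost.
    Resource getNetLimit() {
        if (blockedHeadroom.isZero()) {
            return limit;
        }
        return limit.subtract(blockedHeadroom);
    }
}
```

The point of the lazy check is that in the common case (no child queue blocked by its limit) getNetLimit returns the original limit without any arithmetic, matching the "only when necessary rather than every time" concern above.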
> resourceLimits may be wrongly calculated when leaf-queue is blocked in
> cluster with 3+ level queues
> ---------------------------------------------------------------------------------------------------
>
> Key: YARN-8804
> URL: https://issues.apache.org/jira/browse/YARN-8804
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.2.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Critical
> Attachments: YARN-8804.001.patch, YARN-8804.002.patch
>
> This problem is due to YARN-4280: the parent queue deducts a child queue's headroom when the child queue has reached its resource limit and the skipped type is QUEUE_LIMIT. The resource limits of the deepest parent queue are correctly calculated, but for a non-deepest parent queue, its headroom may be much more than the sum of the reached-limit child queues' headroom, so the resource limit of a non-deepest parent may be much less than its true value and block allocation for later queues.
> To reproduce this problem with a UT:
> (1) The cluster has two nodes whose node resources are both <10GB, 10core>, with 3-level queues as below; among them the max-capacity of "c1" is 10 and the others are all 100, so the max-capacity of queue "c1" is <2GB, 2core>
> {noformat}
>           Root
>          /  |  \
>         a   b   c
>        10  20  70
>                 |  \
>                c1   c2
>         10(max=10)  90
> {noformat}
> (2) Submit app1 to queue "c1" and launch am1 (resource=<1GB, 1core>) on nm1
> (3) Submit app2 to queue "b" and launch am2 (resource=<1GB, 1core>) on nm1
> (4) app1 and app2 both ask for one <2GB, 1core> container.
> (5) nm1 does 1 heartbeat
> Now queue "c" has a lower capacity percentage than queue "b", so the allocation sequence will be "a" -> "c" -> "b".
> Queue "c1" has reached its queue limit, so requests of app1 should be pending.
> Headroom of queue "c1" is <1GB, 1core> (= max-capacity - used).
> Headroom of queue "c" is <18GB, 18core> (= max-capacity - used).
> After allocation for queue "c", the resource limit of queue "b" will be wrongly calculated as <2GB, 2core>,
> and the headroom of queue "b" will be <1GB, 1core> (= resource-limit - used),
> so the scheduler won't allocate one container for app2 on nm1.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
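The numbers in the reproduction steps can be checked with a small arithmetic sketch. This is not the scheduler code, only the headroom subtraction implied by the description: the cluster is 2 nodes x <10GB, 10core> = <20GB, 20core>, and the bug deducts queue "c"'s whole headroom instead of only blocked child "c1"'s headroom.

```java
// Arithmetic from the reproduction steps above (memory in GB).
// Only the subtraction is checked; the queue hierarchy walk is omitted.
public class Yarn8804Numbers {
    public static void main(String[] args) {
        int clusterMemGB = 20, clusterCores = 20;  // 2 nodes x <10GB, 10core>

        // Buggy behaviour: deduct the whole headroom of queue "c"
        // (<18GB, 18core>) even though only its child "c1" was blocked.
        int wrongLimitMemGB = clusterMemGB - 18;   // 2
        int wrongLimitCores = clusterCores - 18;   // 2

        // Intended behaviour: deduct only the blocked child c1's
        // headroom, <1GB, 1core>.
        int netLimitMemGB = clusterMemGB - 1;      // 19
        int netLimitCores = clusterCores - 1;      // 19

        System.out.println("wrong limit for b: <" + wrongLimitMemGB + "GB, "
            + wrongLimitCores + "core>");
        System.out.println("net limit for b:   <" + netLimitMemGB + "GB, "
            + netLimitCores + "core>");
    }
}
```

With the wrong limit of <2GB, 2core> and <1GB, 1core> already used, queue "b"'s headroom is <1GB, 1core>, which cannot fit app2's <2GB, 1core> request; with the net limit of <19GB, 19core> it can.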