[ https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248084#comment-17248084 ]
Wangda Tan commented on YARN-10530:
-----------------------------------

cc: [~sunilg], [~epayne]

> CapacityScheduler ResourceLimits doesn't handle node partition well
> -------------------------------------------------------------------
>
>                 Key: YARN-10530
>                 URL: https://issues.apache.org/jira/browse/YARN-10530
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>            Reporter: Wangda Tan
>            Priority: Blocker
>
> This is a serious bug that may impact all releases. I need to do further
> checking, but I want to log the JIRA so we will not forget:
> ResourceLimits objects serve two purposes:
> 1) When cluster resources change (for example, a new node is added, or the
> scheduler config is reinitialized), we pass ResourceLimits to the queues
> via updateClusterResource.
> 2) When allocating a container, we pass the parent's available resource
> down to the child to make sure the child's allocation won't violate the
> parent's max resource. For example:
> {code}
> queue       used   max
> --------------------------------------
> root         10     20
> root.a        8     10
> root.a.a1     2     10
> root.a.a2     6     10
> {code}
> Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can
> allocate at most 2 resources to a1, because root.a's limit is hit first.
> This information is passed down from parent queue to child queue during
> the assignContainers call via ResourceLimits.
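[Editor's note: the headroom arithmetic in the example above can be sketched as a standalone snippet. The `Queue` record and `effectiveHeadroom` helper below are hypothetical illustrative names, not YARN APIs.]

```java
// Standalone sketch of the headroom arithmetic described above.
// Queue and effectiveHeadroom are hypothetical names, not YARN APIs.
public class HeadroomSketch {
    record Queue(String name, int used, int max) {}

    // A child's effective headroom is capped by every ancestor's remaining room.
    static int effectiveHeadroom(Queue child, Queue... ancestors) {
        int headroom = child.max() - child.used();
        for (Queue q : ancestors) {
            headroom = Math.min(headroom, q.max() - q.used());
        }
        return headroom;
    }

    public static void main(String[] args) {
        Queue root = new Queue("root", 10, 20);
        Queue a = new Queue("root.a", 8, 10);
        Queue a1 = new Queue("root.a.a1", 2, 10);

        // a1's own headroom is 8, but root.a only has 2 left, so 2 wins.
        System.out.println(effectiveHeadroom(a1, a, root)); // prints 2
    }
}
```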
> However, we pass only one ResourceLimits from the top. For queue
> initialization, we pass in:
> {code}
> root.updateClusterResource(clusterResource, new ResourceLimits(
>     clusterResource));
> {code}
> And when we update cluster resources, we only consider the default
> partition:
> {code}
> // Update all children
> for (CSQueue childQueue : childQueues) {
>   // Get ResourceLimits of child queue before assign containers
>   ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
>       clusterResource, resourceLimits,
>       RMNodeLabelsManager.NO_LABEL, false);
>   childQueue.updateClusterResource(clusterResource, childLimits);
> }
> {code}
> The same is true for the allocation logic, where we pass in the following.
> (I actually found a TODO item I added 5 years ago.)
> {code}
> // Try to use NON_EXCLUSIVE
> assignment = getRootQueue().assignContainers(getClusterResource(),
>     candidates,
>     // TODO, now we only consider limits for parent for non-labeled
>     // resources, should consider labeled resources as well.
>     new ResourceLimits(labelManager
>         .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
>             getClusterResource())),
>     SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code}
> The good thing is that in the assignContainers call, we calculate the
> child limit based on the partition:
> {code}
> ResourceLimits childLimits =
>     getResourceLimitsOfChild(childQueue, cluster, limits,
>         candidates.getPartition(), true);
> {code}
> So I think the problem is that when a named partition has more resources
> than the default partition, the effective min/max resource of each queue
> could be wrong.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
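[Editor's note: a hedged sketch of the per-partition limit propagation the comment says is missing. All names (`ChildQueue`, `childLimits`, the partition keys) are illustrative assumptions, not the actual YARN classes or the eventual fix; the idea is only that the child's limit must be computed separately for each partition rather than once for the default partition.]

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: compute a separate child limit per node partition,
// instead of a single default-partition limit. Not YARN APIs.
public class PartitionLimitsSketch {
    record ChildQueue(String name, Map<String, Integer> maxByPartition) {}

    // For each partition, the child's limit is the smaller of its own max
    // for that partition and the parent's remaining resource there.
    static Map<String, Integer> childLimits(ChildQueue child,
                                            Map<String, Integer> parentRemaining) {
        return parentRemaining.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> Math.min(
                    child.maxByPartition().getOrDefault(e.getKey(), 0),
                    e.getValue())));
    }

    public static void main(String[] args) {
        // A named partition "gpu" with more resource than the default "" one.
        Map<String, Integer> parentRemaining = Map.of("", 2, "gpu", 50);
        ChildQueue a1 = new ChildQueue("root.a.a1", Map.of("", 10, "gpu", 40));

        Map<String, Integer> limits = childLimits(a1, parentRemaining);
        System.out.println(limits.get(""));    // prints 2
        System.out.println(limits.get("gpu")); // prints 40
    }
}
```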