[jira] [Commented] (YARN-10863) CGroupElasticMemoryController is not work
[ https://issues.apache.org/jira/browse/YARN-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470163#comment-17470163 ]

Adam Binford commented on YARN-10863:
-------------------------------------

Just hit this as well; you are forced to use strict memory control if you are using elastic memory control.

> CGroupElasticMemoryController is not work
> -----------------------------------------
>
>                 Key: YARN-10863
>                 URL: https://issues.apache.org/jira/browse/YARN-10863
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.3.1
>            Reporter: LuoGe
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10863.001-1.patch, YARN-10863.002.patch, YARN-10863.004.patch, YARN-10863.005.patch, YARN-10863.006.patch, YARN-10863.007.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When following the [documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCGroupsMemory.html] to configure elastic memory resource control (yarn.nodemanager.elastic-memory-control.enabled set to true, yarn.nodemanager.resource.memory.enforced set to false, yarn.nodemanager.pmem-check-enabled set to true, and yarn.nodemanager.resource.memory.enabled set to true so that cgroups control memory), elastic memory control does not work.
> Looking at ContainersMonitorImpl.java, the skip logic in the checkLimit function has a problem. The early return is taken only when strictMemoryEnforcement is true and elasticMemoryEnforcement is false. So, when elastic memory control is configured as the documentation describes, the check logic continues, and a container whose memory usage exceeds its limit will be killed by checkLimit.
> {code:java}
> if (strictMemoryEnforcement && !elasticMemoryEnforcement) {
>   // When cgroup-based strict memory enforcement is used alone without
>   // elastic memory control, the oom-kill would take care of it.
>   // However, when elastic memory control is also enabled, the oom killer
>   // would be disabled at the root yarn container cgroup level (all child
>   // cgroups would inherit that setting). Hence, we fall back to the
>   // polling-based mechanism.
>   return;
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
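The effect of the quoted condition can be sketched in a few lines. The class and method names below are illustrative stand-ins, not actual ContainersMonitorImpl members; only the boolean condition itself is taken from the quoted code.

```java
// Minimal stand-in for the skip logic quoted above: the polling-based
// limit check is skipped only when strict cgroup enforcement is on AND
// elastic memory control is off. In every other flag combination the
// poller keeps running and may kill over-limit containers.
public class SkipLogicSketch {

    /** true = checkLimit returns early and defers to the cgroup OOM killer. */
    static boolean skipsPollingCheck(boolean strictMemoryEnforcement,
                                     boolean elasticMemoryEnforcement) {
        return strictMemoryEnforcement && !elasticMemoryEnforcement;
    }

    public static void main(String[] args) {
        // Documented elastic setup: strict off, elastic on.
        // The early return does NOT fire, so polling continues (the bug).
        System.out.println(skipsPollingCheck(false, true));
        // Strict-only setup: polling is skipped, cgroup OOM handles limits.
        System.out.println(skipsPollingCheck(true, false));
    }
}
```

With the documented elastic configuration the early return never fires, so the polling check keeps killing over-limit containers, which matches the behavior reported in this issue.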
[jira] [Assigned] (YARN-11017) Unify node label access in queues
[ https://issues.apache.org/jira/browse/YARN-11017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Gyori reassigned YARN-11017:
-----------------------------------

Assignee: Andras Gyori

> Unify node label access in queues
> ---------------------------------
>
>                 Key: YARN-11017
>                 URL: https://issues.apache.org/jira/browse/YARN-11017
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Major
>
> Currently there are a handful of ways in which queues are able to access node labels. A non-exhaustive list:
> # configuredNodeLabels
> # getNodeLabelsForQueue()
> # QueueCapacities#getNodePartitionsSet()
> # ResourceUsage#getNodePartitionsSet()
> # accessibleNodeLabels
> It is worth revisiting, as there is already a bug that was implicitly caused by this inconsistency (YARN-11016).
[jira] [Updated] (YARN-10944) AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity
[ https://issues.apache.org/jira/browse/YARN-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-10944:
----------------------------------

Labels: pull-request-available  (was: )

> AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-10944
>                 URL: https://issues.apache.org/jira/browse/YARN-10944
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Andras Gyori
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The methods are:
> - AbstractCSQueue#setMaxCapacity(float)
> - AbstractCSQueue#setMaxCapacity(java.lang.String, float)
[jira] [Assigned] (YARN-10944) AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity
[ https://issues.apache.org/jira/browse/YARN-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Gyori reassigned YARN-10944:
-----------------------------------

Assignee: Andras Gyori

> AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-10944
>                 URL: https://issues.apache.org/jira/browse/YARN-10944
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Andras Gyori
>            Priority: Minor
>
> The methods are:
> - AbstractCSQueue#setMaxCapacity(float)
> - AbstractCSQueue#setMaxCapacity(java.lang.String, float)
[jira] [Assigned] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss
[ https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Gyori reassigned YARN-10590:
-----------------------------------

Assignee: Andras Gyori  (was: Qi Zhu)

> Fix legacy auto queue creation absolute resource calculation loss
> -----------------------------------------------------------------
>
>                 Key: YARN-10590
>                 URL: https://issues.apache.org/jira/browse/YARN-10590
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Andras Gyori
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> "Because, as we discussed in YARN-10504, the initialization of auto created queues from template was changed (see the comments there)."
> 1. As discussed in the comments, the effective resource (cores) differs (the gap), because the effective-resource update overrides the absolute auto created leaf queue.
> 2. Actually, the override logic introduced in YARN-10504 is correct; the difference is caused by the test case, which does not account for the calculation loss across multiple resource types. The cap/absolute values are all calculated from a single resource type: memory in DefaultResourceCalculator, the dominant type in DominantResourceCalculator. As noted in the comments, the absolute auto created leaf queue merges the effective resource from the cap/absolute calculation result, which causes the gap.
> 3. In the other (non-absolute) cases of the auto created leaf queue, the merge does not cause a gap, since the effective-resource update override also uses the single-type calculation result.
> 4. So this Jira just fixes the test; the calculation result is already correct.
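The "calculation loss" described above can be illustrated with a hedged numeric sketch. The numbers, class, and method names are hypothetical (not taken from the patch or from CapacityScheduler internals); the point is only that deriving capacity from memory alone and re-applying that single percentage to vCores loses an independently configured vCore value.

```java
// Hypothetical illustration: with a memory-only calculator, capacity is a
// single percentage derived from memory. Re-deriving the effective vCores
// from that one percentage diverges from the configured absolute vCores,
// producing the "gap" described in the issue.
public class CalcLossSketch {

    /** Effective value of one resource type from a single shared percentage. */
    static long effectiveFromSinglePct(long clusterValue, double pct) {
        return Math.round(clusterValue * pct);
    }

    public static void main(String[] args) {
        long clusterMemMb = 102400, clusterVcores = 100;   // hypothetical cluster
        long confMemMb = 10240, confVcores = 20;           // hypothetical absolute config

        // Memory-only percentage: 10240 / 102400 = 0.1
        double pct = (double) confMemMb / clusterMemMb;

        // Re-deriving vCores from the memory percentage yields 10,
        // not the independently configured 20 vCores.
        long derivedVcores = effectiveFromSinglePct(clusterVcores, pct);
        System.out.println(derivedVcores + " vs configured " + confVcores);
    }
}
```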
[jira] [Updated] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss
[ https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-10590:
----------------------------------

Labels: pull-request-available  (was: )

> Fix legacy auto queue creation absolute resource calculation loss
> -----------------------------------------------------------------
>
>                 Key: YARN-10590
>                 URL: https://issues.apache.org/jira/browse/YARN-10590
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10590.001.patch, YARN-10590.002.patch
[jira] [Updated] (YARN-10947) Simplify AbstractCSQueue#initializeQueueState
[ https://issues.apache.org/jira/browse/YARN-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-10947:
----------------------------------

Labels: pull-request-available  (was: )

> Simplify AbstractCSQueue#initializeQueueState
> ---------------------------------------------
>
>                 Key: YARN-10947
>                 URL: https://issues.apache.org/jira/browse/YARN-10947
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Andras Gyori
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Updated] (YARN-11015) Decouple queue capacity with ability to run OPPORTUNISTIC container
[ https://issues.apache.org/jira/browse/YARN-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11015:
----------------------------------

Labels: pull-request-available  (was: )

> Decouple queue capacity with ability to run OPPORTUNISTIC container
> -------------------------------------------------------------------
>
>                 Key: YARN-11015
>                 URL: https://issues.apache.org/jira/browse/YARN-11015
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: container-queuing, resourcemanager
>            Reporter: Andrew Chung
>            Assignee: Andrew Chung
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Motivation:
> With YARN-11005, we will be able to schedule OContainers on nodes based on resource availability. That said, we should be able to allow nodes with 0 queue capacity to run OContainers, as these containers should be started immediately if resources are available, even if they are put on a "queue" first.
> However, with the current implementation, if we set the queue length of NMs to 0, the RM assumes infinite queue capacity, while the NM disables the running of any OContainers and kills OContainers that arrive directly.
> This issue addresses the above with the {{QUEUE_LENGTH_THEN_RESOURCES}} allocator. It does not aim to change the existing behavior of the {{QUEUE_LENGTH}} allocator.
> Proposed design:
> Add a new {{NodeManager}} config, {{opportunistic-containers-queue-policy}}, which allows specifying the queueing policy at the NM.
> We will start with {{BY_RESOURCES}} and {{BY_QUEUE_LEN}}. If {{BY_RESOURCES}} is specified, the NM will queue a container as long as it has enough resources to run all pending and running containers; otherwise, it will reject the {{OPPORTUNISTIC}} container. If {{BY_QUEUE_LEN}} is specified, the NM will only accept as many containers as its configured queue capacity.
> Thus, if {{BY_QUEUE_LEN}} is specified and the NM's queue capacity is configured to be 0, the NM will reject all incoming {{OPPORTUNISTIC}} containers (today's behavior).
> Note that this configuration *does not affect how the RM behaves*.
> At the RM, if the queue capacity reported by the node is 0 *and* the allocation policy is set to {{QUEUE_LENGTH_THEN_RESOURCES}}, the RM assumes that the node can still run {{OPPORTUNISTIC}} containers if it has available resources; otherwise it skips the node.
> If the queue capacity reported by the node is 0 *and* the allocation policy is set to {{QUEUE_LENGTH}}, the RM still assumes that the node can run infinitely many {{OPPORTUNISTIC}} containers, and it will be on the NM to reject them (today's behavior).
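The proposed NM-side admission decision described above can be sketched as follows. The enum and method names are illustrative, not the committed YARN API, and the resource check is simplified to a single memory dimension.

```java
// Hypothetical sketch of the NM-side decision for an incoming
// OPPORTUNISTIC container under the two proposed queueing policies.
public class OppAdmissionSketch {

    enum QueuePolicy { BY_QUEUE_LEN, BY_RESOURCES }

    static boolean acceptOpportunistic(QueuePolicy policy,
                                       int queuedCount, int maxQueueLen,
                                       long freeMemMb, long requestMemMb) {
        switch (policy) {
            case BY_QUEUE_LEN:
                // Today's behavior: a 0-length queue rejects everything.
                return queuedCount < maxQueueLen;
            case BY_RESOURCES:
                // Proposed: accept while the request still fits alongside
                // pending + running containers (simplified to memory).
                return requestMemMb <= freeMemMb;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // maxQueueLen = 0: BY_QUEUE_LEN rejects even with free memory...
        System.out.println(acceptOpportunistic(
            OppAdmissionSketch.QueuePolicy.BY_QUEUE_LEN, 0, 0, 4096, 1024));
        // ...while BY_RESOURCES admits the container.
        System.out.println(acceptOpportunistic(
            OppAdmissionSketch.QueuePolicy.BY_RESOURCES, 0, 0, 4096, 1024));
    }
}
```

This is why the issue decouples the two: with a 0-length queue, only the resource-based policy lets the node run OContainers at all.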
[jira] [Commented] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss
[ https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469794#comment-17469794 ]

Qi Zhu commented on YARN-10590:
-------------------------------

Thanks [~gandras] for taking this, feel free to take it.

> Fix legacy auto queue creation absolute resource calculation loss
> -----------------------------------------------------------------
>
>                 Key: YARN-10590
>                 URL: https://issues.apache.org/jira/browse/YARN-10590
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10590.001.patch, YARN-10590.002.patch
[jira] [Assigned] (YARN-10947) Simplify AbstractCSQueue#initializeQueueState
[ https://issues.apache.org/jira/browse/YARN-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Gyori reassigned YARN-10947:
-----------------------------------

Assignee: Andras Gyori

> Simplify AbstractCSQueue#initializeQueueState
> ---------------------------------------------
>
>                 Key: YARN-10947
>                 URL: https://issues.apache.org/jira/browse/YARN-10947
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Andras Gyori
>            Priority: Minor
[jira] [Commented] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss
[ https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469780#comment-17469780 ]

Andras Gyori commented on YARN-10590:
-------------------------------------

[~zhuqi] I would like to work on this if you do not mind!

> Fix legacy auto queue creation absolute resource calculation loss
> -----------------------------------------------------------------
>
>                 Key: YARN-10590
>                 URL: https://issues.apache.org/jira/browse/YARN-10590
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10590.001.patch, YARN-10590.002.patch
[jira] [Updated] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss
[ https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Gyori updated YARN-10590:
--------------------------------

Summary: Fix legacy auto queue creation absolute resource calculation loss  (was: Fix TestCapacitySchedulerAutoCreatedQueueBase with related absolute calculation loss)

> Fix legacy auto queue creation absolute resource calculation loss
> -----------------------------------------------------------------
>
>                 Key: YARN-10590
>                 URL: https://issues.apache.org/jira/browse/YARN-10590
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10590.001.patch, YARN-10590.002.patch
[jira] [Created] (YARN-11059) Investigate whether legacy Auto Queue Creation in absolute mode works seamlessly when calling updateClusterResource
Andras Gyori created YARN-11059:
-----------------------------------

             Summary: Investigate whether legacy Auto Queue Creation in absolute mode works seamlessly when calling updateClusterResource
                 Key: YARN-11059
                 URL: https://issues.apache.org/jira/browse/YARN-11059
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
            Reporter: Andras Gyori
            Assignee: Andras Gyori

Due to this check in ParentQueue#getCapacityConfigurationTypeForQueues:
{code:java}
if (queues.iterator().hasNext() &&
    !queues.iterator().next().getQueuePath().equals(
        CapacitySchedulerConfiguration.ROOT) &&
    (percentageIsSet ? 1 : 0) + (weightIsSet ? 1 : 0)
        + (absoluteMinResSet ? 1 : 0) > 1) {
  throw new IOException("Parent queue '" + getQueuePath()
      + "' have children queue used mixed of "
      + " weight mode, percentage and absolute mode, it is not allowed, please "
      + "double check, details:" + diagMsg.toString());
}
{code}
I was unable to call updateClusterResource on a ManagedParentQueue when its children are in absolute mode. updateClusterResource is called whenever a node is updated etc., therefore it could break at any time.
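The guard quoted above rejects a parent whose children mix capacity modes; a minimal stand-in for just the mode-counting condition shows when it trips (the class and parameter names are simplified stand-ins for the CapacityScheduler internals):

```java
// Minimal reproduction of the mode-counting part of the quoted guard:
// the parent throws when more than one of the three capacity modes
// (percentage, weight, absolute) is set among its children.
public class MixedModeCheckSketch {

    /** true = the mixed-mode IOException in the quoted code would be thrown. */
    static boolean isMixedMode(boolean percentageIsSet, boolean weightIsSet,
                               boolean absoluteMinResSet) {
        return (percentageIsSet ? 1 : 0)
             + (weightIsSet ? 1 : 0)
             + (absoluteMinResSet ? 1 : 0) > 1;
    }

    public static void main(String[] args) {
        // Percentage + absolute among children: mixed, would throw.
        System.out.println(isMixedMode(true, false, true));
        // Absolute only: a single mode, passes the check.
        System.out.println(isMixedMode(false, false, true));
    }
}
```

So any code path through updateClusterResource that makes a managed parent's children look like they use more than one mode, even transiently, would hit this exception, which is what the investigation is about.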