[jira] [Commented] (YARN-10863) CGroupElasticMemoryController is not work

2022-01-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470163#comment-17470163
 ] 

Adam Binford commented on YARN-10863:
-

Just hit this as well: you're forced to use strict memory control if you're 
using elastic memory control.

> CGroupElasticMemoryController is not work
> -
>
> Key: YARN-10863
> URL: https://issues.apache.org/jira/browse/YARN-10863
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.3.1
>Reporter: LuoGe
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10863.001-1.patch, YARN-10863.002.patch, 
> YARN-10863.004.patch, YARN-10863.005.patch, YARN-10863.006.patch, 
> YARN-10863.007.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When following the 
> [documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCGroupsMemory.html]
>  to configure elastic memory resource control, with 
> yarn.nodemanager.elastic-memory-control.enabled set to true, 
> yarn.nodemanager.resource.memory.enforced set to false, 
> yarn.nodemanager.pmem-check-enabled set to true, and 
> yarn.nodemanager.resource.memory.enabled set to true to use cgroups to control 
> memory, elastic memory control does not work.
> Looking at ContainersMonitorImpl.java, the skip logic in the checkLimit 
> function has a problem. It returns early only when strictMemoryEnforcement is 
> true and elasticMemoryEnforcement is false. So, with the documented settings 
> for elastic memory control, the check continues, and a container whose 
> memory usage exceeds its limit is killed by checkLimit. 
> {code:java}
> if (strictMemoryEnforcement && !elasticMemoryEnforcement) {
>   // When cgroup-based strict memory enforcement is used alone without
>   // elastic memory control, the oom-kill would take care of it.
>   // However, when elastic memory control is also enabled, the oom killer
>   // would be disabled at the root yarn container cgroup level (all child
>   // cgroups would inherit that setting). Hence, we fall back to the
>   // polling-based mechanism.
>   return;
> }
> {code}
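
To make the skip condition concrete, here is a minimal Java sketch that evaluates the quoted expression for every combination of the two flags. The class and method names are hypothetical. It also assumes that strictMemoryEnforcement follows yarn.nodemanager.resource.memory.enforced and that elasticMemoryEnforcement follows yarn.nodemanager.elastic-memory-control.enabled, which the report implies but does not show.

```java
public class CheckLimitSkipSketch {
    // Same boolean expression as the quoted early return in checkLimit.
    // Returns true when the polling-based limit check would be skipped.
    // The flag-to-property mapping is an assumption, not taken from the source.
    public static boolean skipsPollingCheck(boolean strictMemoryEnforcement,
                                            boolean elasticMemoryEnforcement) {
        return strictMemoryEnforcement && !elasticMemoryEnforcement;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        // Print the outcome for all four flag combinations.
        for (boolean strict : values) {
            for (boolean elastic : values) {
                System.out.println("strict=" + strict + " elastic=" + elastic
                    + " -> skip=" + skipsPollingCheck(strict, elastic));
            }
        }
    }
}
```

With the documented elastic setup (strict false, elastic true) the expression is false, so the polling check still runs, which matches the behavior described in the report.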



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11017) Unify node label access in queues

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-11017:
---

Assignee: Andras Gyori

> Unify node label access in queues
> -
>
> Key: YARN-11017
> URL: https://issues.apache.org/jira/browse/YARN-11017
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Currently there are a handful of ways in which queues can access node 
> labels. A non-exhaustive list:
>  # configuredNodeLabels
>  # getNodeLabelsForQueue()
>  # QueueCapacities#getNodePartitionsSet()
>  # ResourceUsage#getNodePartitionsSet()
>  # accessibleNodeLabels
> This is worth revisiting, as this inconsistency has already implicitly 
> caused a bug (YARN-11016).






[jira] [Updated] (YARN-10944) AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity

2022-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10944:
--
Labels: pull-request-available  (was: )

> AbstractCSQueue: Eliminate code duplication in overloaded versions of 
> setMaxCapacity
> 
>
> Key: YARN-10944
> URL: https://issues.apache.org/jira/browse/YARN-10944
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Methods are:
> - AbstractCSQueue#setMaxCapacity(float)
> - AbstractCSQueue#setMaxCapacity(java.lang.String, float)
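
One common way to remove this kind of duplication is to have the narrower overload delegate to the wider one. The sketch below is only an illustration of that pattern, not the actual patch: the class, the NO_LABEL constant, the range check, and the map-backed storage are all hypothetical, and it assumes the String parameter is a node label.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxCapacitySketch {
    // Hypothetical stand-in for the default (no-label) partition.
    public static final String NO_LABEL = "";

    private final Map<String, Float> maxCapacityByLabel = new HashMap<>();

    // The single-argument overload only delegates, so validation and
    // storage logic live in one place.
    public void setMaxCapacity(float maximumCapacity) {
        setMaxCapacity(NO_LABEL, maximumCapacity);
    }

    public void setMaxCapacity(String nodeLabel, float maximumCapacity) {
        // Hypothetical validation; the real checks in AbstractCSQueue differ.
        if (maximumCapacity < 0.0f || maximumCapacity > 1.0f) {
            throw new IllegalArgumentException(
                "Illegal maximum capacity: " + maximumCapacity);
        }
        maxCapacityByLabel.put(nodeLabel, maximumCapacity);
    }

    // Returns the stored value, or 1.0 when nothing was set for the label.
    public float getMaxCapacity(String nodeLabel) {
        return maxCapacityByLabel.getOrDefault(nodeLabel, 1.0f);
    }
}
```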






[jira] [Assigned] (YARN-10944) AbstractCSQueue: Eliminate code duplication in overloaded versions of setMaxCapacity

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10944:
---

Assignee: Andras Gyori

> AbstractCSQueue: Eliminate code duplication in overloaded versions of 
> setMaxCapacity
> 
>
> Key: YARN-10944
> URL: https://issues.apache.org/jira/browse/YARN-10944
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>
> Methods are:
> - AbstractCSQueue#setMaxCapacity(float)
> - AbstractCSQueue#setMaxCapacity(java.lang.String, float)






[jira] [Assigned] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10590:
---

Assignee: Andras Gyori  (was: Qi Zhu)

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> "As we discussed in YARN-10504, the initialization of auto-created 
> queues from the template was changed (see comment and comment)."
> 1. As discussed in the comment, we found that the effective core differs (the 
> gap), because the effective resource update overrides the absolute auto-created 
> leaf queue.
> 2. However, the new override logic in YARN-10504 is correct. The 
> difference is caused by the test case, which does not consider the calculation 
> loss across multiple resource types: the cap/absolute values are all calculated 
> from one type, (memory) in DefaultResourceCalculator and (dominant type) in 
> DominantResourceCalculator. As noted in the comment, the absolute auto-created 
> leaf queue merges the effective resource from the cap/absolute 
> calculated result, and this caused the gap.
> 3. In the other cases (non-absolute) for auto-created leaf queues, the merge 
> does not cause the gap, since the override in the effective resource update 
> also uses the one-type calculated result. 
> 4. So this Jira only makes the test correct; the calculation result is already 
> correct.






[jira] [Updated] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10590:
--
Labels: pull-request-available  (was: )

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (YARN-10947) Simplify AbstractCSQueue#initializeQueueState

2022-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10947:
--
Labels: pull-request-available  (was: )

> Simplify AbstractCSQueue#initializeQueueState
> -
>
> Key: YARN-10947
> URL: https://issues.apache.org/jira/browse/YARN-10947
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Updated] (YARN-11015) Decouple queue capacity with ability to run OPPORTUNISTIC container

2022-01-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11015:
--
Labels: pull-request-available  (was: )

> Decouple queue capacity with ability to run OPPORTUNISTIC container
> ---
>
> Key: YARN-11015
> URL: https://issues.apache.org/jira/browse/YARN-11015
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: container-queuing, resourcemanager
>Reporter: Andrew Chung
>Assignee: Andrew Chung
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Motivation:
> With YARN-11005, we will be able to schedule OContainers on nodes based on 
> resource availability. Given that, nodes with 0 queue capacity should be 
> allowed to run OContainers (these containers should be started immediately 
> if resources are available, even if they are put on a "queue" first).
> However, with the current implementation, if we set the queue length of NMs 
> to 0, the RM assumes infinite queue capacity while the NM disables the 
> running of any OContainers, killing OContainers that arrive directly.
> This issue addresses the above problems for the 
> {{QUEUE_LENGTH_THEN_RESOURCES}} allocator.
> It does not aim to change the existing behavior of the 
> {{QUEUE_LENGTH}} allocator.
> Proposed design:
> Add a new {{NodeManager}} config, 
> {{opportunistic-containers-queue-policy}}, which specifies the queueing 
> policy at the NM.
> We will start with {{BY_RESOURCES}} and {{BY_QUEUE_LEN}}. If 
> {{BY_RESOURCES}} is specified, the NM will queue as long as it has enough 
> resources to run all pending + running containers; otherwise, it will reject 
> the {{OPPORTUNISTIC}} container.
> On the other hand, if {{BY_QUEUE_LEN}} is specified, the NM will accept only 
> as many containers as its configured queue capacity allows.
> Thus, if {{BY_QUEUE_LEN}} is specified and the NM's queue capacity is 
> configured to be 0, the NM will reject all incoming {{OPPORTUNISTIC}} 
> containers (today's behavior).
> Note that this configuration *does not affect how the RM behaves*.
> At the RM, if the queue capacity reported by the node is 0 *and* the 
> allocation policy is set to {{QUEUE_LENGTH_THEN_RESOURCES}}, the RM assumes 
> that the node can still run {{OPPORTUNISTIC}} containers if it has available 
> resources; otherwise it skips the node.
> If the queue capacity reported by the node is 0 *and* the 
> allocation policy is set to {{QUEUE_LENGTH}}, the RM still assumes that the 
> node can run infinitely many {{OPPORTUNISTIC}} containers, and it is up to the 
> NM to reject these containers (today's behavior).
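
The NM-side decision in the proposed design could be sketched roughly as follows. This is not the actual NodeManager code: the class, method, and parameter names are hypothetical, resources are reduced to a single memory figure, and counting the incoming container against the node's capacity under BY_RESOURCES is one interpretation of "enough resources to run all pending + running containers".

```java
public class OpportunisticQueuePolicySketch {
    // The two policy values named in the proposal.
    public enum QueuePolicy { BY_RESOURCES, BY_QUEUE_LEN }

    // Decides whether the NM would accept one incoming OPPORTUNISTIC container.
    // committedMemoryMb: memory of all pending + running containers (assumed unit).
    public static boolean acceptsContainer(QueuePolicy policy,
                                           int queuedContainers,
                                           int queueCapacity,
                                           long committedMemoryMb,
                                           long requestedMemoryMb,
                                           long nodeMemoryMb) {
        switch (policy) {
            case BY_RESOURCES:
                // Queue only while the node could run everything it has taken on.
                return committedMemoryMb + requestedMemoryMb <= nodeMemoryMb;
            case BY_QUEUE_LEN:
                // Accept only up to the configured queue capacity;
                // a capacity of 0 therefore rejects every container.
                return queuedContainers < queueCapacity;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }
}
```

Under these assumptions, a node with queue capacity 0 still accepts containers with BY_RESOURCES as long as memory is available, while BY_QUEUE_LEN keeps today's behavior of rejecting them.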






[jira] [Commented] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469794#comment-17469794
 ] 

Qi Zhu commented on YARN-10590:
---

Thanks [~gandras] for taking this, feel free to take it. 

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>






[jira] [Assigned] (YARN-10947) Simplify AbstractCSQueue#initializeQueueState

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reassigned YARN-10947:
---

Assignee: Andras Gyori

> Simplify AbstractCSQueue#initializeQueueState
> -
>
> Key: YARN-10947
> URL: https://issues.apache.org/jira/browse/YARN-10947
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Andras Gyori
>Priority: Minor
>







[jira] [Commented] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469780#comment-17469780
 ] 

Andras Gyori commented on YARN-10590:
-

[~zhuqi] I would like to work on this if you do not mind!

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>






[jira] [Updated] (YARN-10590) Fix legacy auto queue creation absolute resource calculation loss

2022-01-06 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10590:

Summary: Fix legacy auto queue creation absolute resource calculation loss  
(was: Fix TestCapacitySchedulerAutoCreatedQueueBase with related absolute 
calculation loss)

> Fix legacy auto queue creation absolute resource calculation loss
> -
>
> Key: YARN-10590
> URL: https://issues.apache.org/jira/browse/YARN-10590
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10590.001.patch, YARN-10590.002.patch
>
>






[jira] [Created] (YARN-11059) Investigate whether legacy Auto Queue Creation in absolute mode works seamlessly when calling updateClusterResource

2022-01-06 Thread Andras Gyori (Jira)
Andras Gyori created YARN-11059:
---

 Summary: Investigate whether legacy Auto Queue Creation in 
absolute mode works seamlessly when calling updateClusterResource
 Key: YARN-11059
 URL: https://issues.apache.org/jira/browse/YARN-11059
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Andras Gyori
Assignee: Andras Gyori


Due to this check in ParentQueue#getCapacityConfigurationTypeForQueues:
{code:java}
if (queues.iterator().hasNext() &&
    !queues.iterator().next().getQueuePath().equals(
        CapacitySchedulerConfiguration.ROOT) &&
    (percentageIsSet ? 1 : 0) + (weightIsSet ? 1 : 0)
        + (absoluteMinResSet ? 1 : 0) > 1) {
  throw new IOException("Parent queue '" + getQueuePath()
      + "' have children queue used mixed of "
      + " weight mode, percentage and absolute mode, it is not allowed, please "
      + "double check, details:" + diagMsg.toString());
}
{code}
I was unable to call updateClusterResource on a ManagedParentQueue when its 
children are in absolute mode. updateClusterResource is called whenever a node 
is updated, among other events, so this could break at any time.
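
The arithmetic in the quoted check can be isolated in a small Java sketch. The class and method names are hypothetical, and the guard on the first child's queue path is left out. The expression only exceeds 1 when at least two of the three flags are set, so one plausible reading is that children of a ManagedParentQueue in absolute mode end up flagged for more than one mode; that is an assumption, not something the report states.

```java
public class MixedModeCheckSketch {
    // Counts how many capacity modes are flagged, as in the quoted expression.
    public static int modesInUse(boolean percentageIsSet,
                                 boolean weightIsSet,
                                 boolean absoluteMinResSet) {
        return (percentageIsSet ? 1 : 0)
            + (weightIsSet ? 1 : 0)
            + (absoluteMinResSet ? 1 : 0);
    }

    // True when the quoted check would reach its throw statement
    // (ignoring the ROOT queue path guard, which this sketch omits).
    public static boolean mixesModes(boolean percentageIsSet,
                                     boolean weightIsSet,
                                     boolean absoluteMinResSet) {
        return modesInUse(percentageIsSet, weightIsSet, absoluteMinResSet) > 1;
    }

    public static void main(String[] args) {
        System.out.println("absolute only: " + mixesModes(false, false, true));
        System.out.println("percentage + absolute: " + mixesModes(true, false, true));
    }
}
```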


