[jira] [Commented] (YARN-11536) [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration Of Queues.

2023-08-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750576#comment-17750576
 ] 

ASF GitHub Bot commented on YARN-11536:
---

slfan1989 commented on code in PR #5862:
URL: https://github.com/apache/hadoop/pull/5862#discussion_r1282658635


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/FederationQueueWeight.java:
##
@@ -166,4 +179,32 @@ public static void checkHeadRoomAlphaValid(String headRoomAlpha) throws YarnExce
   protected static boolean isNumeric(String value) {
     return NumberUtils.isCreatable(value);
   }
+
+  @Public
+  @Unstable
+  public abstract String getQueue();
+
+  @Public
+  @Unstable
+  public abstract void setQueue(String queue);
+
+  @Public
+  @Unstable
+  public abstract String getPolicyManagerClassName();
+
+  @Public
+  @Unstable
+  public abstract void setPolicyManagerClassName(String policyManagerClassName);
+
+  @Override
+  public String toString() {
+    StringBuilder builder = new StringBuilder();
+    builder.append("FederationQueueWeight{");
+    builder.append("Queue:" + getQueue() + ", ");

Review Comment:
   Thank you very much for reviewing the code! I will fix it.





> [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration 
> Of Queues.
> 
>
>                 Key: YARN-11536
>                 URL: https://issues.apache.org/jira/browse/YARN-11536
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: federation
>    Affects Versions: 3.4.0
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>              Labels: pull-request-available
>
> In this jira, we will support batch saving of the SubCluster Policy 
> Configuration of queues. Users can provide an XML configuration file, and we 
> will initialize the SubCluster Policy Configuration according to that file.
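
For readers following along, here is a minimal sketch of what reading such a batch file might look like. The XML layout (`federationPolicies`, `queue`, the `name` attribute) and the class names `QueuePolicyEntry` and `BatchPolicyXmlSketch` are assumptions made for illustration; the actual file format used by PR #5862 is not shown in this thread. Only the two fields, queue and policy manager class name, come from the accessors in the diff above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical value holder: one queue and its policy manager class name. */
final class QueuePolicyEntry {
  final String queue;
  final String policyManagerClassName;

  QueuePolicyEntry(String queue, String policyManagerClassName) {
    this.queue = queue;
    this.policyManagerClassName = policyManagerClassName;
  }
}

/** Hypothetical parser for an assumed XML layout; not the Router CLI code. */
final class BatchPolicyXmlSketch {
  private BatchPolicyXmlSketch() { }

  /** Parses every queue element into an entry, in document order. */
  static List<QueuePolicyEntry> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Enable the JAXP secure-processing limits before parsing user input.
      factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
      Document doc = factory.newDocumentBuilder().parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      NodeList queues = doc.getElementsByTagName("queue");
      List<QueuePolicyEntry> entries = new ArrayList<>();
      for (int i = 0; i < queues.getLength(); i++) {
        Element queue = (Element) queues.item(i);
        Node manager =
            queue.getElementsByTagName("policyManagerClassName").item(0);
        if (manager == null) {
          throw new IllegalArgumentException(
              "queue without policyManagerClassName: "
                  + queue.getAttribute("name"));
        }
        entries.add(new QueuePolicyEntry(
            queue.getAttribute("name"), manager.getTextContent().trim()));
      }
      return entries;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("invalid policy file", e);
    }
  }
}
```

Collecting all entries before saving any of them lets a malformed file be rejected as a whole, which is one reasonable choice for a batch operation; whether the real CLI behaves this way is not stated here.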



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3660) [GPG] Federation Global Policy Generator (service hook only)

2023-08-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750574#comment-17750574
 ] 

ASF GitHub Bot commented on YARN-3660:
--

slfan1989 commented on PR #5903:
URL: https://github.com/apache/hadoop/pull/5903#issuecomment-1663307525

   @goiri Thank you very much for your help in reviewing the code!




> [GPG] Federation Global Policy Generator (service hook only)
> 
>
>                 Key: YARN-3660
>                 URL: https://issues.apache.org/jira/browse/YARN-3660
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Carlo Curino
>            Assignee: Botong Huang
>            Priority: Major
>              Labels: federation, gpg, pull-request-available
>         Attachments: YARN-3660-YARN-7402.v1.patch, 
> YARN-3660-YARN-7402.v2.patch, YARN-3660-YARN-7402.v3.patch, 
> YARN-3660-YARN-7402.v3.patch, YARN-3660-YARN-7402.v3.patch, 
> YARN-3660-YARN-7402.v4.patch
>
>
> In a federated environment, local impairments of one sub-cluster might 
> unfairly affect users/queues that are mapped to that sub-cluster. A 
> centralized component (GPG) runs out-of-band and edits the policies governing 
> how users/queues are allocated to sub-clusters. This allows us to enforce 
> global invariants (by dynamically updating locally-enforced invariants).






[jira] [Commented] (YARN-11041) Replace all occurences of queuePath with the new QueuePath class - followup

2023-08-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750534#comment-17750534
 ] 

ASF GitHub Bot commented on YARN-11041:
---

szilard-nemeth commented on PR #5332:
URL: https://github.com/apache/hadoop/pull/5332#issuecomment-1663216534

   > Thank you @szilard-nemeth for your thorough review on this change. I updated it addressing all your comments and questions. May I ask you to take another look on it please, if you'll have some time for it?
   
   Hi @p-szucs ,
   Sorry, totally missed your comment.
   I will try to find some time for the final round of review soon :) 
   In the meantime, would you mind checking the conflicts?
   Thanks.




> Replace all occurences of queuePath with the new QueuePath class - followup
> ---
>
>                 Key: YARN-11041
>                 URL: https://issues.apache.org/jira/browse/YARN-11041
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>            Reporter: Tibor Kovács
>            Assignee: Peter Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>
> The QueuePath class was introduced in YARN-10897; however, so far it has been 
> adopted only in code changed after that JIRA. We need to adopt it 
> retroactively.
>  
> Many changes were introduced via ticket YARN-10982. The replacement should 
> continue by addressing the following review comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [~gandras] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class, e.g.:
>  * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest creating a static class, called "QueuePrefixes" or something like 
> that, and adding some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath, i.e. 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> 
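
The two suggestions quoted above (a `createChild` factory method and a "QueuePrefixes" helper) can be sketched roughly as follows. This is an illustration only: `SketchQueuePath` is a hypothetical stand-in, not the Hadoop `QueuePath` class, and the key prefix `yarn.scheduler.capacity.` and the `accessible-node-labels` segment are assumed to mirror what `CapacitySchedulerConfiguration` builds with string concatenation today.

```java
import java.util.Objects;

/** Hypothetical stand-in for QueuePath; not the Hadoop class. */
final class SketchQueuePath {
  private final String parent; // null for the root queue
  private final String leaf;

  SketchQueuePath(String parent, String leaf) {
    this.parent = parent;
    this.leaf = Objects.requireNonNull(leaf, "leaf");
  }

  /** Factory method: the child's parent is this queue's full path. */
  SketchQueuePath createChild(String childName) {
    return new SketchQueuePath(getFullPath(), childName);
  }

  String getParent() {
    return parent;
  }

  String getLeaf() {
    return leaf;
  }

  String getFullPath() {
    return parent == null ? leaf : parent + "." + leaf;
  }
}

/** Hypothetical helper that turns a path into configuration key prefixes. */
final class QueuePrefixes {
  // Assumption: mirrors the capacity scheduler's configuration key prefix.
  static final String PREFIX = "yarn.scheduler.capacity.";

  private QueuePrefixes() { }

  static String getQueuePrefix(SketchQueuePath path) {
    return PREFIX + path.getFullPath() + ".";
  }

  /** An empty or null label falls back to the plain queue prefix. */
  static String getNodeLabelPrefix(SketchQueuePath path, String label) {
    if (label == null || label.isEmpty()) {
      return getQueuePrefix(path);
    }
    return getQueuePrefix(path) + "accessible-node-labels." + label + ".";
  }
}
```

With this shape, callers would write `root.createChild("a")` rather than concatenating `"root" + "." + "a"`, and all knowledge of the key layout would live in one helper class instead of being spread across call sites.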

[jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node

2023-08-02 Thread Riya Khandelwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riya Khandelwal reassigned YARN-9430:
-

Assignee: Riya Khandelwal  (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
> 
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Riya Khandelwal
>            Priority: Critical
>
> I have a test case that checks that, if some GPU devices have gone offline 
> and recovery happens, only the containers that fit into the node's resources 
> are recovered. Unfortunately, this is not the case: the RM does not check 
> available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:* 
>  1. There are 2 nodes running NodeManagers
>  2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices 
> per node, initially. This means 4 GPU devices in the cluster altogether.
>  3. RM / NM recovery is enabled
>  4. The test starts off a sleep job, requesting 4 containers, 1 GPU device 
> for each (AM does not request GPUs)
>  5. Before restart, the fake bash script is adjusted to report 1 GPU device 
> per node (2 in the cluster) after restart.
>  6. Restart is initiated.
>  
> *Expected behavior:* 
>  After restart, only the AM and 2 normal containers should have been started, 
> as there are only 2 GPU devices in the cluster.
>  
> *Actual behaviour:* 
>  AM + 4 containers are allocated, i.e. all the containers originally started 
> in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>  
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Processing event for appattempt_1553977186701_0001_01 of type RECOVER
> 2019-03-30 13:22:30,366 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Added Application Attempt appattempt_1553977186701_0001_01 to scheduler 
> from user: systest
>  2019-03-30 13:22:30,366 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> appattempt_1553977186701_0001_01 is recovering. Skipping notifying 
> ATTEMPT_ADDED
>  2019-03-30 13:22:30,367 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on 
> event = RECOVER
> 2019-03-30 13:22:33,257 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_01, 
> CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_04, 
> CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_04 of capacity 
>  on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers,  vCores:2, yarn.io/gpu: 1> used and  available after 
> allocation
> 2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_05, 
> CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: Priority: 0]
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> Processing container_e84_1553977186701_0001_01_05 of type RECOVER
>  2019-03-30 13:22:33,276 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to 
> RUNNING
>  2019-03-30 13:22:33,276 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: 
> Assigned container container_e84_1553977186701_0001_01_05 of capacity 
>  on host 
> snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers,  vCores:3, yarn.io/gpu: 2> used and  
> available after allocation
> 2019-03-30 13:22:33,279 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Recovering container [container_e84_1553977186701_0001_01_03, 
> CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: 
> , Diagnostics: , ExitStatus: -1000, 
> NodeLabelExpression: 

[jira] [Commented] (YARN-11536) [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration Of Queues.

2023-08-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750129#comment-17750129
 ] 

ASF GitHub Bot commented on YARN-11536:
---

hadoop-yetus commented on PR #5862:
URL: https://github.com/apache/hadoop/pull/5862#issuecomment-1661586627

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 29s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +0 :ok: |  buf  |   0m  1s |  |  buf was not available.  |
   | +0 :ok: |  buf  |   0m  1s |  |  buf was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 7 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 27s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 35s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   5m  2s |  |  trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   4m 21s |  |  trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   1m 11s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   4m 21s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   4m 24s |  |  trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   4m 10s |  |  trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   7m  5s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 16s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 25s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 21s |  |  the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  cc  |   4m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   4m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 20s |  |  the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  cc  |   4m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   4m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 57s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   3m 54s |  |  the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   3m 44s |  |  the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   7m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 53s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |   0m 57s |  |  hadoop-yarn-api in the patch passed.  |
   | +1 :green_heart: |  unit  |   4m 49s |  |  hadoop-yarn-common in the patch passed.  |
   | +1 :green_heart: |  unit  |   2m 48s |  |  hadoop-yarn-server-common in the patch passed.  |
   | +1 :green_heart: |  unit  |  86m  8s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  unit  |  26m  6s |  |  hadoop-yarn-client in the patch passed.  |
   | +1 :green_heart: |  unit  |   0m 37s |  |  hadoop-yarn-server-router in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 50s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 268m  1s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5862/15/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5862 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets cc buflint bufcompat xmllint |
   | uname | Linux 80a2f97556cd 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 
13:30:12