[jira] [Commented] (YARN-11536) [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration Of Queues.
[ https://issues.apache.org/jira/browse/YARN-11536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750576#comment-17750576 ]

ASF GitHub Bot commented on YARN-11536:
---------------------------------------

slfan1989 commented on code in PR #5862:
URL: https://github.com/apache/hadoop/pull/5862#discussion_r1282658635

##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/FederationQueueWeight.java:
##########
@@ -166,4 +179,32 @@ public static void checkHeadRoomAlphaValid(String headRoomAlpha) throws YarnExce
   protected static boolean isNumeric(String value) {
     return NumberUtils.isCreatable(value);
   }
+
+  @Public
+  @Unstable
+  public abstract String getQueue();
+
+  @Public
+  @Unstable
+  public abstract void setQueue(String queue);
+
+  @Public
+  @Unstable
+  public abstract String getPolicyManagerClassName();
+
+  @Public
+  @Unstable
+  public abstract void setPolicyManagerClassName(String policyManagerClassName);
+
+  @Override
+  public String toString() {
+    StringBuilder builder = new StringBuilder();
+    builder.append("FederationQueueWeight{");
+    builder.append("Queue:" + getQueue() + ", ");

Review Comment:
   Thank you very much for reviewing the code! I will fix it.

> [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration
> Of Queues.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-11536
>                 URL: https://issues.apache.org/jira/browse/YARN-11536
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>      Components: federation
>    Affects Versions: 3.4.0
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
>              Labels: pull-request-available
>
> In this jira, we will support batch saving of the SubCluster Policy
> Configuration of each queue. Users can provide an XML configuration file,
> and the SubCluster Policy Configuration will be initialized according to
> that file.
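The review comment above flags the `builder.append("Queue:" + getQueue() + ", ")` line, which mixes string concatenation into a StringBuilder call. A minimal standalone sketch of the chained-append form such a review typically asks for; this `FederationQueueWeight` is a simplified concrete stand-in for illustration only, not the real abstract record class in hadoop-yarn-api:

```java
// Sketch: chain append() calls instead of concatenating inside a single
// append(), so no intermediate String objects are created per field.
// Simplified stand-in class; the real FederationQueueWeight is abstract.
public class FederationQueueWeight {
    private String queue;
    private String policyManagerClassName;

    public String getQueue() { return queue; }
    public void setQueue(String queue) { this.queue = queue; }
    public String getPolicyManagerClassName() { return policyManagerClassName; }
    public void setPolicyManagerClassName(String name) { this.policyManagerClassName = name; }

    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder();
        builder.append("FederationQueueWeight{")
            .append("Queue: ").append(getQueue()).append(", ")
            .append("PolicyManagerClassName: ").append(getPolicyManagerClassName())
            .append('}');
        return builder.toString();
    }

    public static void main(String[] args) {
        FederationQueueWeight w = new FederationQueueWeight();
        w.setQueue("root.a");
        w.setPolicyManagerClassName("WeightedLocalityPolicyManager");
        System.out.println(w);
    }
}
```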
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3660) [GPG] Federation Global Policy Generator (service hook only)
[ https://issues.apache.org/jira/browse/YARN-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750574#comment-17750574 ]

ASF GitHub Bot commented on YARN-3660:
--------------------------------------

slfan1989 commented on PR #5903:
URL: https://github.com/apache/hadoop/pull/5903#issuecomment-1663307525

   @goiri Thank you very much for your help in reviewing the code!

> [GPG] Federation Global Policy Generator (service hook only)
> ------------------------------------------------------------
>
>                 Key: YARN-3660
>                 URL: https://issues.apache.org/jira/browse/YARN-3660
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>      Components: nodemanager, resourcemanager
>            Reporter: Carlo Curino
>            Assignee: Botong Huang
>            Priority: Major
>              Labels: federation, gpg, pull-request-available
>     Attachments: YARN-3660-YARN-7402.v1.patch,
> YARN-3660-YARN-7402.v2.patch, YARN-3660-YARN-7402.v3.patch,
> YARN-3660-YARN-7402.v3.patch, YARN-3660-YARN-7402.v3.patch,
> YARN-3660-YARN-7402.v4.patch
>
> In a federated environment, local impairments of one sub-cluster might
> unfairly affect users/queues that are mapped to that sub-cluster. A
> centralized component (GPG) runs out-of-band and edits the policies governing
> how users/queues are allocated to sub-clusters. This allows us to enforce
> global invariants (by dynamically updating locally-enforced invariants).
[jira] [Commented] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750534#comment-17750534 ]

ASF GitHub Bot commented on YARN-11041:
---------------------------------------

szilard-nemeth commented on PR #5332:
URL: https://github.com/apache/hadoop/pull/5332#issuecomment-1663216534

   > Thank you @szilard-nemeth for your thorough review on this change. I updated
   > it, addressing all your comments and questions. May I ask you to take another
   > look at it, please, if you have some time for it?

   Hi @p-szucs,
   Sorry, I totally missed your comment. I will try to find some time for the
   final round of review soon :) In the meantime, would you mind checking the
   conflicts? Thanks.

> Replace all occurrences of queuePath with the new QueuePath class - followup
> ----------------------------------------------------------------------------
>
>                 Key: YARN-11041
>                 URL: https://issues.apache.org/jira/browse/YARN-11041
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>      Components: capacity scheduler
>            Reporter: Tibor Kovács
>            Assignee: Peter Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
> The QueuePath class was introduced in YARN-10897; however, it was adopted
> only for code changes made after that JIRA. We need to adopt it
> retrospectively.
>
> Many of the changes were introduced via YARN-10982. The replacement should
> continue by addressing the following review comments:
>
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> [~snemeth] https://github.com/apache/hadoop/pull/3660#discussion_r765012937
> I think this could also be refactored in a follow-up jira, so the string magic
> could probably be replaced with a more elegant solution. Though I think this
> would be too much in this patch, hence I suggest the follow-up jira.
> [~snemeth] https://github.com/apache/hadoop/pull/3660#discussion_r765013096
> [~bteke] [~gandras] Thoughts?
> [~bteke] https://github.com/apache/hadoop/pull/3660#discussion_r765110750
> +1, even the QueuePath object could have some kind of support for this.
> [~gandras] https://github.com/apache/hadoop/pull/3660#discussion_r765131244
> Agreed, let's handle it in a followup!
>
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> [~snemeth] https://github.com/apache/hadoop/pull/3660#discussion_r765023717
> There are many string operations in this class, e.g.:
> * getQueuePrefix, which works with the full queue path
> * getNodeLabelPrefix, which also works with the full queue path
> I suggest creating a static class called "QueuePrefixes" or something like
> that, and adding some static methods there to convert the QueuePath object to
> the various queue prefix strings that are ultimately keys in the
> Configuration object.
>
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> [~snemeth] https://github.com/apache/hadoop/pull/3660#discussion_r765026119
> This seems hacky, judging by the constructor parameter names of QueuePath:
> parent, leaf. The AQC template prefix is obviously not the leaf.
> Could we somehow circumvent this?
> [~bteke] https://github.com/apache/hadoop/pull/3660#discussion_r765126207
> Maybe a factory method could be created which returns a new QueuePath with
> the parent set to the original queuePath, i.e.
> rootQueuePath.createChild(String childName) -> this could return a new
> QueuePath object with the root.childName path and rootQueuePath as parent.
> [~snemeth] https://github.com/apache/hadoop/pull/3660#discussion_r765039033
> Looking at this getQueues method, I realized almost all the callers use
> some kind of string magic that should be addressed with this patch.
> For example, take a look at:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of
> Strings.
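The `createChild` factory method suggested in the discussion above can be sketched as follows. This is a hypothetical illustration of the idea, not the actual QueuePath class from the capacity scheduler; the constructor shape (parent, leaf) follows the parameter names mentioned in the review:

```java
// Hypothetical sketch of the suggested factory method: deriving a child
// QueuePath from an existing one instead of concatenating path strings
// at every call site. Simplified stand-in, not the real hadoop class.
public class QueuePath {
    private final String parent; // null for the root queue
    private final String leaf;

    public QueuePath(String parent, String leaf) {
        this.parent = parent;
        this.leaf = leaf;
    }

    public QueuePath(String leaf) {
        this(null, leaf);
    }

    public String getFullPath() {
        return parent == null ? leaf : parent + "." + leaf;
    }

    // rootQueuePath.createChild("a") returns a QueuePath whose full path is
    // "root.a" and whose parent is the original path.
    public QueuePath createChild(String childName) {
        return new QueuePath(getFullPath(), childName);
    }

    public static void main(String[] args) {
        QueuePath root = new QueuePath("root");
        System.out.println(root.createChild("a").createChild("b").getFullPath());
    }
}
```

This keeps the "string magic" in one place: callers compose paths through the object instead of repeating prefix concatenation.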
[jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node
[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Riya Khandelwal reassigned YARN-9430:
-------------------------------------

    Assignee: Riya Khandelwal  (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
> ----------------------------------------------------------------
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Riya Khandelwal
>            Priority: Critical
>
> I have a testcase that checks that if some GPU devices go offline and
> recovery happens, only the containers that fit into the node's resources are
> recovered. Unfortunately, this is not the case: the RM does not check
> available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:*
> 1. There are 2 nodes running NodeManagers.
> 2. nvidia-smi is replaced with a fake bash script that initially reports 2
> GPU devices per node. This means 4 GPU devices in the cluster altogether.
> 3. RM / NM recovery is enabled.
> 4. The test starts a sleep job, requesting 4 containers with 1 GPU device
> each (the AM does not request GPUs).
> 5. Before restart, the fake bash script is adjusted to report 1 GPU device
> per node (2 in the cluster) after restart.
> 6. Restart is initiated.
>
> *Expected behavior:*
> After restart, only the AM and 2 normal containers should be started, as
> there are only 2 GPU devices in the cluster.
>
> *Actual behaviour:*
> The AM + 4 containers are allocated, i.e. all containers originally started
> in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_01 of type RECOVER
> 2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_01 to scheduler from user: systest
> 2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_01 is recovering. Skipping notifying ATTEMPT_ADDED
> 2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on event = RECOVER
> 2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_01, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_04, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_04 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, vCores:2, yarn.io/gpu: 1> used and available after allocation
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_05, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_05 of type RECOVER
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_05 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, vCores:3, yarn.io/gpu: 2> used and available after allocation
> 2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_03, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression:
> {code}
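The expected behavior the reporter describes, recovering only containers that still fit on the node, can be sketched as below. This is a minimal illustration of the missing fit check, using simplified stand-in types (`Resource`, `recoverContainers`), not the actual YARN scheduler classes:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a YARN resource vector (memory, vcores, GPUs).
class Resource {
    final long memoryMb;
    final int vCores;
    final int gpus;

    Resource(long memoryMb, int vCores, int gpus) {
        this.memoryMb = memoryMb;
        this.vCores = vCores;
        this.gpus = gpus;
    }

    // True if this request fits within the given available resources.
    boolean fitsIn(Resource available) {
        return memoryMb <= available.memoryMb
            && vCores <= available.vCores
            && gpus <= available.gpus;
    }

    Resource subtract(Resource other) {
        return new Resource(memoryMb - other.memoryMb,
            vCores - other.vCores, gpus - other.gpus);
    }
}

class RecoverySketch {
    // Recover only the containers that still fit on the node after its
    // resources shrank (e.g. a GPU went offline); skip the rest instead of
    // blindly re-attaching everything.
    static List<Resource> recoverContainers(List<Resource> containers,
            Resource nodeAvailable) {
        List<Resource> recovered = new ArrayList<>();
        for (Resource c : containers) {
            if (c.fitsIn(nodeAvailable)) {
                nodeAvailable = nodeAvailable.subtract(c);
                recovered.add(c);
            }
        }
        return recovered;
    }
}
```

With a node that now exposes only 1 GPU, two 1-GPU containers cannot both be recovered: the second one fails the fit check, which is exactly the check the bug report says is missing from the RM's recovery path.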
[jira] [Commented] (YARN-11536) [Federation] Router CLI Supports Batch Save the SubClusterPolicyConfiguration Of Queues.
[ https://issues.apache.org/jira/browse/YARN-11536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750129#comment-17750129 ]

ASF GitHub Bot commented on YARN-11536:
---------------------------------------

hadoop-yetus commented on PR #5862:
URL: https://github.com/apache/hadoop/pull/5862#issuecomment-1661586627

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 29s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 1s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
   | +0 :ok: | buf | 0m 1s | | buf was not available. |
   | +0 :ok: | buf | 0m 1s | | buf was not available. |
   | +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 7 new or modified test files. |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: | mvndep | 14m 27s | | Maven dependency ordering for branch |
   | +1 :green_heart: | mvninstall | 20m 35s | | trunk passed |
   | +1 :green_heart: | compile | 5m 2s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
   | +1 :green_heart: | compile | 4m 21s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed |
   | +1 :green_heart: | mvnsite | 4m 21s | | trunk passed |
   | +1 :green_heart: | javadoc | 4m 24s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
   | +1 :green_heart: | javadoc | 4m 10s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | +1 :green_heart: | spotbugs | 7m 5s | | trunk passed |
   | +1 :green_heart: | shadedclient | 21m 16s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: | mvndep | 0m 25s | | Maven dependency ordering for patch |
   | +1 :green_heart: | mvninstall | 2m 30s | | the patch passed |
   | +1 :green_heart: | compile | 4m 21s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
   | +1 :green_heart: | cc | 4m 21s | | the patch passed |
   | +1 :green_heart: | javac | 4m 21s | | the patch passed |
   | +1 :green_heart: | compile | 4m 20s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | +1 :green_heart: | cc | 4m 20s | | the patch passed |
   | +1 :green_heart: | javac | 4m 20s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 1m 5s | | the patch passed |
   | +1 :green_heart: | mvnsite | 3m 57s | | the patch passed |
   | +1 :green_heart: | javadoc | 3m 54s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
   | +1 :green_heart: | javadoc | 3m 44s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | +1 :green_heart: | spotbugs | 7m 26s | | the patch passed |
   | +1 :green_heart: | shadedclient | 21m 53s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 0m 57s | | hadoop-yarn-api in the patch passed. |
   | +1 :green_heart: | unit | 4m 49s | | hadoop-yarn-common in the patch passed. |
   | +1 :green_heart: | unit | 2m 48s | | hadoop-yarn-server-common in the patch passed. |
   | +1 :green_heart: | unit | 86m 8s | | hadoop-yarn-server-resourcemanager in the patch passed. |
   | +1 :green_heart: | unit | 26m 6s | | hadoop-yarn-client in the patch passed. |
   | +1 :green_heart: | unit | 0m 37s | | hadoop-yarn-server-router in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 50s | | The patch does not generate ASF License warnings. |
   | | | 268m 1s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5862/15/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5862 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets cc buflint bufcompat xmllint |
   | uname | Linux 80a2f97556cd 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12