[jira] [Reopened] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko reopened YARN-9615:
--------------------------------

> Add dispatcher metrics to RM
> ----------------------------
>
>                 Key: YARN-9615
>                 URL: https://issues.apache.org/jira/browse/YARN-9615
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jonathan Hung
>            Assignee: Qi Zhu
>            Priority: Major
>             Fix For: 3.4.0
>
>         Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch,
> YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch,
> YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch,
> YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch,
> YARN-9615.011.patch, YARN-9615.poc.patch,
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png,
> screenshot-1.png
>
> It'd be good to have counts/processing times for each event type in the RM
> async dispatcher and the scheduler async dispatcher.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YARN-9615:
-------------------------------
    Attachment: YARN-9615-branch-3.3-001.patch
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342462#comment-17342462 ]

Peter Bacsko commented on YARN-10571:
-------------------------------------

Finally, no javac issues! [~gandras], please check the test failure.

> Refactor dynamic queue handling logic
> -------------------------------------
>
>                 Key: YARN-10571
>                 URL: https://issues.apache.org/jira/browse/YARN-10571
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Minor
>         Attachments: YARN-10571.001.patch, YARN-10571.002.patch,
> YARN-10571.003.patch, YARN-10571.004.patch
>
> As per YARN-10506, we have introduced another mode for auto queue creation
> and a new class which handles it. We should move the old managed-queue
> related logic to CSAutoQueueHandler as well, and do additional cleanup
> regarding queue management.
[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics
[ https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342360#comment-17342360 ]

Peter Bacsko commented on YARN-10763:
-------------------------------------

[~chaosju] some comments:

1. {{Timer}} / {{TimerTask}} are rather old constructs; I'd prefer {{ScheduledThreadPoolExecutor}}.
2. Another problem is that the {{Timer}} is not stopped in {{destroy()}}; this can definitely be a problem in the tests.
3. This is a singleton class, so the constructor should not be public; remove the modifier.

> add the speed of containers assigned metrics to ClusterMetrics
> --------------------------------------------------------------
>
>                 Key: YARN-10763
>                 URL: https://issues.apache.org/jira/browse/YARN-10763
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 3.2.1
>            Reporter: chaosju
>            Priority: Major
>         Attachments: YARN-10763.001.patch, screenshot-1.png
>
> It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for
> measuring cluster throughput.
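To make review points 1-3 concrete, here is a minimal sketch of what they could look like together. This is a hypothetical illustration, not the actual patch: the class name {{AssignRateMetrics}} and its methods are invented stand-ins for the ClusterMetrics change under review.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical singleton illustrating the three review points:
// a ScheduledThreadPoolExecutor instead of Timer/TimerTask (point 1),
// an explicit destroy() that stops the executor (point 2),
// and a non-public constructor (point 3).
public final class AssignRateMetrics {
  private static final AssignRateMetrics INSTANCE = new AssignRateMetrics();

  private final AtomicLong assignedInWindow = new AtomicLong();
  private volatile long assignedPerSecond;
  private final ScheduledExecutorService scheduler =
      new ScheduledThreadPoolExecutor(1);

  private AssignRateMetrics() {           // point 3: not public
    // point 1: periodic task via the executor, not Timer/TimerTask
    scheduler.scheduleAtFixedRate(
        () -> assignedPerSecond = assignedInWindow.getAndSet(0),
        1, 1, TimeUnit.SECONDS);
  }

  public static AssignRateMetrics getInstance() { return INSTANCE; }

  public void containerAssigned() { assignedInWindow.incrementAndGet(); }

  public long getAssignedPerSecond() { return assignedPerSecond; }

  public void destroy() {                 // point 2: stop the thread
    scheduler.shutdownNow();
  }
}
```

In a test, calling {{destroy()}} guarantees the scheduler thread does not leak across test cases, which is exactly the problem point 2 flags for the unstopped {{Timer}}.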
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340163#comment-17340163 ]

Peter Bacsko commented on YARN-9615:
------------------------------------

[~BilwaST] I'm currently on vacation; I can get back to this on Monday.
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334674#comment-17334674 ]

Peter Bacsko commented on YARN-10571:
-------------------------------------

Thanks [~gandras] for the patch. Do you know what's going on with the javac warnings? That code wasn't even touched. Maybe it has to do with the failing build ("Unable to create native thread"). I'll trigger a rebuild.
[jira] [Commented] (YARN-10739) GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time
[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333171#comment-17333171 ]

Peter Bacsko commented on YARN-10739:
-------------------------------------

+1. Thanks [~zhuqi] for the patch and [~gandras] / [~zhanqi.cai] for the review. Committed to trunk.

> GenericEventHandler.printEventQueueDetails causes RM recovery to take too
> much time
> --------------------------------------------------------------------------
>
>                 Key: YARN-10739
>                 URL: https://issues.apache.org/jira/browse/YARN-10739
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 3.4.0, 3.3.1, 3.2.3
>            Reporter: Zhanqi Cai
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-10739-001.patch, YARN-10739-002.patch,
> YARN-10739.003.patch, YARN-10739.004.patch, YARN-10739.005.patch,
> YARN-10739.006.patch
>
> Since YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails
> to AsyncDispatcher, printEventQueueDetails costs too much time when the event
> queue is very large, and the RM takes a long time to process events.
>
> For example: if we have 4K nodes in the cluster and 4K apps running, then on
> a switchover the node managers re-register with the RM, and the RM calls
> NodesListManager to emit RMAppNodeUpdateEvent, with code like below:
> {code:java}
> for (RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }{code}
> So the total event count is 4K * 4K = 16 million. During this window,
> printEventQueueDetails prints the event queue details and is called
> frequently; once the event queue size reaches 1 million+, iterating the
> queue in printEventQueueDetails becomes very slow:
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum<?>, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum<?> eventType = iterator.next().getType();
> {code}
> As a result, RM recovery costs too much time. From our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event record counter: 310836
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO event.AsyncDispatcher (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, Event record counter: 1
> {code}
> Between AsyncDispatcher.handle and printEventQueueDetails, more than one
> second is spent iterating.
>
> I uploaded a patch to ensure printEventQueueDetails is called only once per
> 30 seconds.
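The throttling idea from the description above can be sketched roughly as follows. This is an illustration of the technique, not the actual patch: {{ThrottledQueuePrinter}}, its {{Event}} interface, and the explicit {{nowMs}} clock parameter are invented stand-ins for YARN's AsyncDispatcher internals.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: only walk the (possibly huge) event queue if at
// least PRINT_INTERVAL_MS has passed since the last walk, so the O(n)
// iteration cannot run on every handle() call.
public class ThrottledQueuePrinter {
  // Stand-in for YARN's Event; only the type is needed for counting.
  interface Event { Enum<?> getType(); }

  static final long PRINT_INTERVAL_MS = 30_000;

  private final ConcurrentLinkedQueue<Event> eventQueue =
      new ConcurrentLinkedQueue<>();
  private final AtomicLong lastPrintMs = new AtomicLong(0);

  void offer(Event e) { eventQueue.add(e); }

  /** Returns the per-type counts if it printed, or null if throttled. */
  Map<Enum<?>, Long> maybePrintEventQueueDetails(long nowMs) {
    long last = lastPrintMs.get();
    if (nowMs - last < PRINT_INTERVAL_MS
        || !lastPrintMs.compareAndSet(last, nowMs)) {
      return null; // throttled: skip the expensive O(n) iteration
    }
    Map<Enum<?>, Long> counters = new HashMap<>();
    for (Event e : eventQueue) {          // the costly part, now rare
      counters.merge(e.getType(), 1L, Long::sum);
    }
    return counters;
  }
}
```

With a 30-second floor between walks, a queue of a million entries is iterated at most once per interval instead of on every burst of {{handle()}} calls, which is the behavior the log excerpt above shows going wrong.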
[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time
[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YARN-10739:
--------------------------------
    Summary: GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time  (was: GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time)
[jira] [Commented] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time
[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332076#comment-17332076 ]

Peter Bacsko commented on YARN-10739:
-------------------------------------

Thanks for the patch [~zhuqi]. I have some comments:

1. {{PrintEventDetailsService #%d}} - I think it's better to call it {{PrintEventDetailsThread #%d}}.
2. Variable {{printEventDetailsService}} - same here, {{printEventDetailsExecutor}} sounds better.
3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just one core thread. I think it's fine if we don't allow it to time out, so I suggest setting this to "false" (which is the default).
4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in {{serviceStop()}}, let's call {{shutdownNow()}}, which is safer. Don't wait for printing.
5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}
I know that this is for testing, but still, this affects production code. Trace level already floods the logs with everything. I don't think we should print this, even on TRACE. It's not a huge issue if it is not tested.
[jira] [Comment Edited] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time
[ https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332076#comment-17332076 ]

Peter Bacsko edited comment on YARN-10739 at 4/26/21, 12:14 PM:
----------------------------------------------------------------

Thanks for the patch [~zhuqi]. I have some comments:

1. {{PrintEventDetailsService #%d}} - I think it's better to call it {{PrintEventDetailsThread #%d}}.
2. Variable {{printEventDetailsService}} - same here, {{printEventDetailsExecutor}} sounds better.
3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just one core thread. I think it's fine if we don't allow it to time out, so I suggest setting this to "false" (which is the default).
4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in {{serviceStop()}}, let's call {{shutdownNow()}}, which is safer. Don't wait for printing.
5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}
I know that this is for testing, but still, this affects the production code. Trace level already floods the logs with everything. I don't think we should print this, even on TRACE. It's not a huge issue if it is not tested.
[jira] [Commented] (YARN-10637) fs2cs: add queue autorefresh policy during conversion
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332014#comment-17332014 ]

Peter Bacsko commented on YARN-10637:
-------------------------------------

+1. Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> fs2cs: add queue autorefresh policy during conversion
> -----------------------------------------------------
>
>                 Key: YARN-10637
>                 URL: https://issues.apache.org/jira/browse/YARN-10637
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: fs2cs
>         Attachments: YARN-10637.001.patch, YARN-10637.002.patch,
> YARN-10637.003.patch, YARN-10637.004.patch
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this in when YARN-10623 is finished.
[jira] [Updated] (YARN-10637) fs2cs: add queue autorefresh policy during conversion
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YARN-10637:
--------------------------------
    Summary: fs2cs: add queue autorefresh policy during conversion  (was: We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.)
[jira] [Updated] (YARN-10637) fs2cs: add queue autorefresh policy during conversion
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YARN-10637:
--------------------------------
    Labels: fs2cs  (was: )
[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization
[ https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330917#comment-17330917 ]

Peter Bacsko commented on YARN-10732:
-------------------------------------

[~BilwaST] thanks for your comment - I think this is a question that can be answered by [~gandras].

> Disallow restarting a queue while it is in DRAINING state on CS
> reinitialization
> ----------------------------------------------------------------
>
>                 Key: YARN-10732
>                 URL: https://issues.apache.org/jira/browse/YARN-10732
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Major
>         Attachments: YARN-10732.001.patch
>
> CSConfigValidator#validateQueueHierarchy does not check the state where the
> old queue is in DRAINING state but the new queue state is RUNNING. The user
> should wait until a queue is fully stopped.
[jira] [Commented] (YARN-10705) Misleading DEBUG log for container assignment needs to be removed when the container is actually reserved, not assigned in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330872#comment-17330872 ]

Peter Bacsko commented on YARN-10705:
-------------------------------------

Thanks for the patch [~sahuja], committed to trunk.

> Misleading DEBUG log for container assignment needs to be removed when the
> container is actually reserved, not assigned in FairScheduler
> --------------------------------------------------------------------------
>
>                 Key: YARN-10705
>                 URL: https://issues.apache.org/jira/browse/YARN-10705
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Minor
>         Attachments: YARN-10705.001.patch
>
> The following DEBUG logs are emitted if a container reservation is made when
> a node has been offered to the queue in FairScheduler:
> {code}
> 2021-02-10 07:33:55,049 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: application_1610442362681_2607's resource request is reserved.
> 2021-02-10 07:33:55,049 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: Assigned container in queue:root.pj_dc_pe container:
> {code}
> The latter log seems to indicate a container assignment with a resource
> allocation, whereas in fact it is a misleading log that shouldn't have been
> emitted in the first place.
> This log comes from [1], after an application attempt with an unmet demand
> is checked for container assignment/reservation. If the container for this
> app attempt is reserved on the node, then the method returns from [2].
> From [3]:
> {quote}
> * If an assignment was made, returns the resources allocated to the
> * container. If a reservation was made, returns
> * FairScheduler.CONTAINER_RESERVED. If no assignment or reservation was
> * made, returns an empty resource.
> {quote}
> We are checking for the empty resource at [4], but not for
> FairScheduler.CONTAINER_RESERVED, before logging a message about container
> assignment, which is incorrect.
>
> Instead of:
> {code}
> if (!assigned.equals(none())) {
>   LOG.debug("Assigned container in queue:{} container:{}",
>       getName(), assigned);
>   break;
> }
> {code}
> it should be:
> {code}
> // check if an assignment or a reservation was made.
> if (!assigned.equals(none())) {
>   // only log container assignment if there is
>   // an actual assignment, not a reservation.
>   if (!assigned.equals(FairScheduler.CONTAINER_RESERVED)
>       && LOG.isDebugEnabled()) {
>     LOG.debug("Assigned container in queue:" + getName() + " " +
>         "container:" + assigned);
>   }
>   break;
> }
> {code}
> [1] https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L356
> [2] https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L911
> [3] https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L842
> [4] https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L355
[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization
[ https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330865#comment-17330865 ]

Peter Bacsko commented on YARN-10732:
-------------------------------------

[~gandras] the old queue state comes from a {{CSQueueStore}}, which can be mocked, or a mock CSQueue can be added with a DRAINING state. The new queue can be set to RUNNING in the config. I think this scenario is testable.

It's also a bit regrettable that {{validateQueueHierarchy()}} is completely untested; at least there is no unit test for it in {{TestCapacitySchedulerConfigValidator}}. I think it could be a good idea to provide tests for it, if not in this JIRA, then maybe in a follow-up.
[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization
[ https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330859#comment-17330859 ] Peter Bacsko commented on YARN-10732: - I manually triggered a build and set the status to "Patch available". > Disallow restarting a queue while it is in DRAINING state on CS > reinitialization > > > Key: YARN-10732 > URL: https://issues.apache.org/jira/browse/YARN-10732 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10732.001.patch > > > CSConfigValidator#validateQueueHierarchy does not check a state where the old > queue is in DRAINING state but the new queue state is RUNNING. User should > wait until a queue is fully stopped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization
[ https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10732: --- Assignee: Andras Gyori (was: Peter Bacsko) > Disallow restarting a queue while it is in DRAINING state on CS > reinitialization > > > Key: YARN-10732 > URL: https://issues.apache.org/jira/browse/YARN-10732 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10732.001.patch > > > CSConfigValidator#validateQueueHierarchy does not check a state where the old > queue is in DRAINING state but the new queue state is RUNNING. User should > wait until a queue is fully stopped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization
[ https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10732: --- Assignee: Peter Bacsko (was: Andras Gyori) > Disallow restarting a queue while it is in DRAINING state on CS > reinitialization > > > Key: YARN-10732 > URL: https://issues.apache.org/jira/browse/YARN-10732 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Andras Gyori >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10732.001.patch > > > CSConfigValidator#validateQueueHierarchy does not check a state where the old > queue is in DRAINING state but the new queue state is RUNNING. User should > wait until a queue is fully stopped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10705) Misleading DEBUG log for container assignment needs to be removed when the container is actually reserved, not assigned in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330857#comment-17330857 ] Peter Bacsko commented on YARN-10705: - +1 LGTM. > Misleading DEBUG log for container assignment needs to be removed when the > container is actually reserved, not assigned in FairScheduler > > > Key: YARN-10705 > URL: https://issues.apache.org/jira/browse/YARN-10705 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10705.001.patch > > > The following DEBUG logs are logged if a container reservation is made when a > node has been offered to the queue in FairScheduler: > {code} > 2021-02-10 07:33:55,049 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: > application_1610442362681_2607's resource request is reserved. > 2021-02-10 07:33:55,049 DEBUG > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: > Assigned container in queue:root.pj_dc_pe container: > {code} > The latter log from above seems to indicate an actual container assignment with > resource allocation, whereas, in actuality, it is a misleading > log which shouldn't have been emitted in the first place. > This log comes from [1] after an application attempt with an unmet demand is > checked for container assignment/reservation. > If the container for this app attempt is reserved on the node, then, it > returns from [2]. > From [3]: > {quote} >* If an assignment was made, returns the resources allocated to the >* container. If a reservation was made, returns >* FairScheduler.CONTAINER_RESERVED. If no assignment or reservation > was >* made, returns an empty resource. > {quote} > We are checking for the empty resource at [4], but not > FairScheduler.CONTAINER_RESERVED before logging a message for container > assignment specifically, which is incorrect. 
> Instead of: > {code} > if (!assigned.equals(none())) { > LOG.debug("Assigned container in queue:{} container:{}", > getName(), assigned); > break; > } > {code} > it should be: > {code} > // check if an assignment or a reservation was made. > if (!assigned.equals(none())) { > // only log container assignment if there is > // an actual assignment, not a reservation. > if (!assigned.equals(FairScheduler.CONTAINER_RESERVED) > && LOG.isDebugEnabled()) { > LOG.debug("Assigned container in queue:" + getName() + " " + > "container:" + assigned); > } > break; > } > {code} > [1] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L356 > [2] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L911 > [3] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L842 > [4] > https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L355 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
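The essence of the fix quoted above is a sentinel check before logging. A minimal sketch of that pattern follows, with plain strings standing in for {{Resources.none()}} and {{FairScheduler.CONTAINER_RESERVED}} (the real code compares {{Resource}} objects, not strings):

```java
// Sketch of the sentinel-check pattern: the scheduler signals "reserved"
// with a special marker value, so the debug log must exclude both the
// empty result and the reservation marker. NONE and CONTAINER_RESERVED
// are simplified stand-ins for the actual scheduler constants.
public class AssignmentLogging {
    static final String NONE = "";                        // stand-in for Resources.none()
    static final String CONTAINER_RESERVED = "RESERVED";  // stand-in for FairScheduler.CONTAINER_RESERVED

    // Returns true only for a real assignment worth logging.
    static boolean shouldLogAssignment(String assigned) {
        return !assigned.equals(NONE) && !assigned.equals(CONTAINER_RESERVED);
    }

    public static void main(String[] args) {
        System.out.println(shouldLogAssignment("<memory:1024, vCores:1>")); // true
        System.out.println(shouldLogAssignment(CONTAINER_RESERVED));        // false
        System.out.println(shouldLogAssignment(NONE));                      // false
    }
}
```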
[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321128#comment-17321128 ] Peter Bacsko commented on YARN-10654: - [~snemeth] [~shuzirra] do you guys have some time to review this? It's the equivalent of what FS does. > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320954#comment-17320954 ] Peter Bacsko commented on YARN-10654: - Uploaded patch v1 which is probably the simplest approach to the '.' problem. > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
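The idea from the description can be sketched as a plain substitution step: since '.' is the queue-path separator, any dot inside a substituted variable value must be replaced before the value becomes part of a queue path. The replacement token used here ({{_dot_}}) is an assumption for illustration; the actual patch may use a different escape.

```java
// Sketch of dot-escaping for mapping-rule variable substitution: a value
// like "john.doe" would otherwise be interpreted as two queue-path levels.
// The _dot_ token is a hypothetical choice, not confirmed from the patch.
public class DotEscaping {
    static final String DOT_REPLACEMENT = "_dot_";

    // Replace path-separator dots inside a variable value (String.replace is literal, not regex).
    static String sanitizeVariable(String value) {
        return value.replace(".", DOT_REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(sanitizeVariable("john.doe"));            // john_dot_doe
        System.out.println("root.users." + sanitizeVariable("john.doe")); // root.users.john_dot_doe
    }
}
```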
[jira] [Updated] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10654: Attachment: YARN-10654-001.patch > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10654: --- Assignee: Peter Bacsko (was: Gergely Pollak) > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317080#comment-17317080 ] Peter Bacsko commented on YARN-10564: - +1 Committed to trunk. Thanks [~gandras] for the patch and [~zhuqi] for the review. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.006.patch, YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316309#comment-17316309 ] Peter Bacsko commented on YARN-10564: - Thanks [~gandras] I have the following suggestions: please add comments to the "for" loop which explain this. I don't want to dictate the wording; it could be several sentences. I think it's important. Also, maybe note that "supportedWildcardLevel" or MAX_WILDCARD_LEVEL might change in the future (just like me, people might realize that the range is [0-1], which could be confusing). Also, an overall comment like "collect all template settings based on prefix, then finally apply the collected settings to the newly created queue" might be useful. I'd put it somewhere before the "while" loop, but this is just an idea. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 12:16 PM: --- Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back a wildcard at each iteration. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? was (Author: pbacsko): Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back each wildcard at a time. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. 
> We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277 ] Peter Bacsko commented on YARN-10564: - Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back each wildcard at a time. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
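The lookup described in this comment can be sketched independently of the scheduler code: the explicit parent path is probed first, then trailing path parts are replaced with the wildcard one level at a time. {{candidatePrefixes}} is an illustrative helper, not the actual method, and the real code bounds the loop with {{MAX_WILDCARD_LEVEL}} as discussed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the wildcard prefix lookup: "root.a.parent" with one supported
// wildcard level is probed as "root.a.parent" and then "root.a.*", so a
// template set on the wildcard path applies to dynamically created children.
public class WildcardPrefixes {
    static final String WILDCARD_QUEUE = "*";

    static List<String> candidatePrefixes(String parentPath, int maxWildcardLevel) {
        List<String> prefixes = new ArrayList<>();
        List<String> parts = new ArrayList<>(Arrays.asList(parentPath.split("\\.")));
        prefixes.add(String.join(".", parts)); // the explicit parent path first
        // Step back one wildcard per iteration, mirroring the loop in the patch;
        // never wildcard the root itself, hence the size() - 1 bound.
        int supported = Math.min(parts.size() - 1, maxWildcardLevel);
        for (int i = 0; i < supported; ++i) {
            parts.set(parts.size() - 1 - i, WILDCARD_QUEUE);
            prefixes.add(String.join(".", parts));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        System.out.println(candidatePrefixes("root.a.parent", 1)); // [root.a.parent, root.a.*]
    }
}
```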
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 11:51 AM: --- [~gandras] thanks for the patch. From a coding POV it looks ok; this is more like a high level review. There are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I don't understand what it is meant to represent. was (Author: pbacsko): [~gandras] thanks for the patch. From a coding POV it looks ok; this is more like a high level review. There are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. 
There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 10:49 AM: --- [~gandras] thanks for the patch. From a coding POV it looks ok; this is more like a high level review. There are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. was (Author: pbacsko): [~gandras] thanks for the patch. From a coding POV it looks ok; this is more like a high level review. There are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. 
There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}} which seems to be that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko commented on YARN-10564: - [~gandras] thanks for the patch. From a coding POV it looks ok; this is more like a high level review. There are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}} which seems to be that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. 
Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313241#comment-17313241 ] Peter Bacsko commented on YARN-10726: - Ok, I strongly believe that the failing tests are flaky. [~zhuqi] could you verify it by running them locally a couple of times? > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.
[ https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313219#comment-17313219 ] Peter Bacsko commented on YARN-10693: - I'll review this as soon as I have some spare cycles. > Add document for YARN-10623 auto refresh queue conf in cs. > -- > > Key: YARN-10693 > URL: https://issues.apache.org/jira/browse/YARN-10693 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10693.001.patch, YARN-10693.002.patch, > YARN-10693.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313218#comment-17313218 ] Peter Bacsko commented on YARN-10637: - Thanks [~zhuqi] I think it's good then. [~gandras] do you have any comments? > We should support fs to cs support for auto refresh queues when conf changed, > after YARN-10623 finished. > > > Key: YARN-10637 > URL: https://issues.apache.org/jira/browse/YARN-10637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10637.001.patch, YARN-10637.002.patch, > YARN-10637.003.patch, YARN-10637.004.patch > > > cc [~pbacsko] [~gandras] [~bteke] > We should also fill this, when YARN-10623 finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313192#comment-17313192 ] Peter Bacsko commented on YARN-10726: - Ah, I already committed the change. Let's hope Jenkins comes back green :) +1 > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313189#comment-17313189 ] Peter Bacsko commented on YARN-10726: - "hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer" - this is unrelated I believe. This test case has been failing for a long time. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313184#comment-17313184 ] Peter Bacsko commented on YARN-10637: - Thanks [~zhuqi] this makes sense. Is this always enabled in Fair Scheduler? Because we should only add this policy if auto-refresh is enabled on the FS-side. > We should support fs to cs support for auto refresh queues when conf changed, > after YARN-10623 finished. > > > Key: YARN-10637 > URL: https://issues.apache.org/jira/browse/YARN-10637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10637.001.patch, YARN-10637.002.patch, > YARN-10637.003.patch, YARN-10637.004.patch > > > cc [~pbacsko] [~gandras] [~bteke] > We should also fill this, when YARN-10623 finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313138#comment-17313138 ] Peter Bacsko commented on YARN-10726: - This is from {{AsyncDispatcher}}: {noformat} if (qSize != 0 && qSize % 1000 == 0 && lastEventQueueSizeLogged != qSize) { lastEventQueueSizeLogged = qSize; LOG.info("Size of event-queue is " + qSize); } {noformat} Update the code with {{lastEventQueueSizeLogged}}. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
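The {{AsyncDispatcher}} snippet quoted above throttles logging in two ways: it only fires on multiples of 1000, and it suppresses a repeat of the last size it already logged. A minimal self-contained sketch of that guard, extracted from the quoted pattern so the same check could be reused for the DelegationTokenRenewer queue (the class and method names here are illustrative, not the actual Hadoop code):

```java
// Sketch of the AsyncDispatcher-style throttled queue-size logging.
// shouldLog() returns true only when the size is a non-zero multiple of
// the interval AND differs from the last size that was logged, so a
// queue hovering around a threshold does not flood the log.
class ThrottledQueueSizeLogger {
    private final int interval;
    private int lastEventQueueSizeLogged; // defaults to 0

    ThrottledQueueSizeLogger(int interval) {
        this.interval = interval;
    }

    /** Returns true (and records the size) when a log line should be emitted. */
    boolean onEnqueue(int qSize) {
        if (qSize != 0 && qSize % interval == 0
                && lastEventQueueSizeLogged != qSize) {
            lastEventQueueSizeLogged = qSize;
            return true; // caller would LOG.info("Size of event-queue is " + qSize)
        }
        return false;
    }
}
```

This mirrors the {{lastEventQueueSizeLogged}} update that the review asks to be carried over into the patch.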
[jira] [Comment Edited] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123 ] Peter Bacsko edited comment on YARN-10726 at 4/1/21, 12:01 PM: --- Thanks [~zhuqi]. I think it's a good idea. My only concern (which might not be valid) is that if we have too many events, this code can run too frequently. For example, if you go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, size goes back from 1000 to 990, then it prints the size again. I think we should limit how often we print this message. We shouldn't log it too often; I'm not sure how we do this in other parts of the code. I'll check what can be the best solution. was (Author: pbacsko): Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have too many events, this code can run too frequently. For example, if you go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, size goes back from 1000 to 990, then it prints the size again. I think we should limit how often we print this message. We shouldn't log it too often; I'm not sure how we do this in other parts of the code. I'll check what can be the best solution. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123 ] Peter Bacsko commented on YARN-10726: - Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have too many events, this code can run too frequently. For example, if you go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, size goes back from 1000 to 990, then it prints the size again. I think we should limit how often we print this message. We shouldn't log it too often; I'm not sure how we do this in other parts of the code. I'll check what can be the best solution. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
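One common way to bound the log frequency described in the comment above is a time-based guard on top of the size check: emit at most one queue-size message per minimum interval, no matter how often the size crosses the threshold. A sketch under that assumption (illustrative only, not the patch that was eventually committed):

```java
// Time-based log throttle: shouldLog() returns true at most once per
// minIntervalMs, regardless of how often the queue size crosses the
// logging threshold. The caller passes in the current time so the
// logic stays deterministic and testable.
class TimeThrottledLogGuard {
    private final long minIntervalMs;
    private long lastLogTimeMs;

    TimeThrottledLogGuard(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
        // Initialize so the very first call is allowed to log.
        this.lastLogTimeMs = -minIntervalMs;
    }

    /** Returns true if enough time has passed since the last emitted log. */
    boolean shouldLog(long nowMs) {
        if (nowMs - lastLogTimeMs >= minIntervalMs) {
            lastLogTimeMs = nowMs;
            return true;
        }
        return false;
    }
}
```

In production code the caller would pass {{System.currentTimeMillis()}} (or a monotonic clock) as {{nowMs}}.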
[jira] [Updated] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10726: Summary: Log the size of DelegationTokenRenewer event queue in case of too many pending events (was: We should log size of pending DelegationTokenRenewerEvent queue, when pending too many events.) > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodesListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313105#comment-17313105 ] Peter Bacsko commented on YARN-9618: Thanks for the patch [~zhuqi] and [~gandras] for the review, I committed this to trunk. > NodesListManager event improvement > -- > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9618) NodesListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9618: --- Summary: NodesListManager event improvement (was: NodeListManager event improvement) > NodesListManager event improvement > -- > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312989#comment-17312989 ] Peter Bacsko commented on YARN-9618: +1 LGTM [~gandras] are you OK with the patch? > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312945#comment-17312945 ] Peter Bacsko commented on YARN-10720: - +1 thanks [~zhuqi] for the patch, committed to trunk. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10720: Summary: YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging (was: YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.) > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312516#comment-17312516 ] Peter Bacsko commented on YARN-9618: Small things: 1. {noformat} //Is trigger RMAppNodeUpdateEvent private Boolean isRMAppEvent = false; //Is trigger NodesListManagerEvent private Boolean isNodesListEvent = false; {noformat} a) No need for comments b) use ordinary "boolean" instead of "Boolean" (also, init to "false" is not necessary, it is "false" by default because it's dictated by the JVM spec). 2. {noformat} Assert.assertFalse(getIsRMAppEvent()); Assert.assertTrue(getIsNodesListEvent()); {noformat} Add some assertion message here, like {noformat} Assert.assertFalse("Got unexpected RM app event", getIsRMAppEvent()); Assert.assertTrue("Received no NodesListManagerEvent", getIsNodesListEvent()); {noformat} 3. Return values of {{getIsNodesListEvent()}} and {{getIsRMAppEvent()}} should be just "boolean". > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
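Taken together, the three review points above (primitive {{boolean}} fields without redundant initializers or comments, and primitive return types on the getters) amount to something like the following sketch; the field and method names mirror the snippets quoted in the review, but the class itself is illustrative, not the actual test code:

```java
// Review suggestions applied: primitive booleans (false by default per
// the language spec, so no explicit initializer) and primitive-typed
// getters instead of the boxed Boolean originally used.
class NodesListEventTracker {
    private boolean isRMAppEvent;
    private boolean isNodesListEvent;

    boolean getIsRMAppEvent() { return isRMAppEvent; }
    boolean getIsNodesListEvent() { return isNodesListEvent; }

    void markRMAppEvent() { isRMAppEvent = true; }
    void markNodesListEvent() { isNodesListEvent = true; }
}
```

The assertions in the test would then carry messages, e.g. {{Assert.assertFalse("Got unexpected RM app event", tracker.getIsRMAppEvent())}}, as suggested.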
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312492#comment-17312492 ] Peter Bacsko commented on YARN-10720: - {noformat} } catch (InterruptedException e) { LOG.warn("doGet() interrupted", e); resp.setStatus(HttpServletResponse.SC_BAD_REQUEST); } resp.setStatus(HttpServletResponse.SC_OK); } {noformat} This is not good - you set the response status to {{SC_BAD_REQUEST}} only to override it with {{SC_OK}}. You need a "return". {noformat} try { servlet.init(config); } catch (ServletException e) { LOG.error(e.getMessage()); fail("Failed to init servlet"); } try { servlet.doGet(request, response); } catch (ServletException e) { LOG.error(e.getMessage()); fail("ServletException thrown during doGet."); } } {noformat} You can remove try-catch here and just add {{throws ServletException}}. If that happens for whatever reason, it will be a test error (which is desired - checking if the servlet can init is not the purpose of the test), not a test failure. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server hang. > --- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > image-2021-03-29-14-04-33-776.png, image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. 
And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
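The control-flow fix the review asks for can be shown in isolation: after handling the {{InterruptedException}}, the handler must return early so the later {{SC_OK}} does not overwrite {{SC_BAD_REQUEST}}. A minimal sketch (the {{HttpResponse}} interface below is a stand-in for {{javax.servlet.http.HttpServletResponse}}, used only so the snippet is self-contained):

```java
// Stand-in for javax.servlet.http.HttpServletResponse.setStatus(int).
interface HttpResponse {
    void setStatus(int sc);
}

class ProxyGetHandler {
    static final int SC_OK = 200;
    static final int SC_BAD_REQUEST = 400;

    // "interrupted" stands in for the catch (InterruptedException e) branch.
    static void doGet(HttpResponse resp, boolean interrupted) {
        if (interrupted) {
            resp.setStatus(SC_BAD_REQUEST);
            return; // the missing "return" the review points out
        }
        resp.setStatus(SC_OK);
    }
}
```

Without the {{return}}, both branches fall through to {{SC_OK}}, which is exactly the bug flagged in the comment.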
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312253#comment-17312253 ] Peter Bacsko commented on YARN-10720: - Thanks [~zhuqi] for the patch. 1. As you said {{ExpectedException.none()}} has been deprecated. Either use the new {{assertThrows()}} or {{@Test(expected = SocketTimeoutException.class)}}, I think using the second is easier. 2. {noformat} conf.setInt(YarnConfiguration.RM_PROXY_CONNECTION_TIMEOUT, 1 * 1000); {noformat} Just write "1000" instead of "1 * 1000". 3. {noformat} try { when(response.getOutputStream()).thenReturn(null); } catch (IOException e) { e.printStackTrace(); } {noformat} Unnecessary try-catch block. The method already has a {{throws}} clause. 4. {noformat} @Override protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException { try { Thread.sleep(10 * 1000); } catch (InterruptedException e) { e.printStackTrace(); } resp.setStatus(HttpServletResponse.SC_OK); } {noformat} Maybe a minor thing, but if you catch {{InterruptedException}}, don't just print the stack trace, log it with {{LOG.warn("doGet() interrupted", e)}}. In this case, I'd also return with {{HttpServletResponse.SC_BAD_REQUEST}}. 5. {{The web proxy connection timeout, default is 60s(60 * 1000ms).}} This already goes to {{yarn-default.xml}}, so you can omit the part "default is 60s(60 * 1000ms)" and just write "The web proxy connection timeout". > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server hang. 
> --- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
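Point 1 above references JUnit's {{assertThrows()}} as the replacement for the deprecated {{ExpectedException.none()}} rule. Its contract — run the body, succeed only if the expected exception type is thrown, fail otherwise — can be sketched in plain Java (illustrative; this is not the JUnit source):

```java
// Minimal model of JUnit's assertThrows contract: the callback must
// throw an instance of the expected type, which is then returned so
// the test can make further assertions on it.
class Asserts {
    static <T extends Throwable> T assertThrows(Class<T> expected, Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError(
                "Expected " + expected.getName() + " but got " + t, t);
        }
        throw new AssertionError(
            "Expected " + expected.getName() + " but nothing was thrown");
    }
}
```

The alternative the review mentions, {{@Test(expected = SocketTimeoutException.class)}}, expresses the same expectation declaratively but cannot inspect the thrown exception afterwards.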
[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312203#comment-17312203 ] Peter Bacsko commented on YARN-10718: - Committed to trunk. Closing. > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10718: Labels: resourcemanager (was: ) > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: resourcemanager > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10718: Labels: capacity-scheduler capacityscheduler (was: resourcemanager) > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312195#comment-17312195 ] Peter Bacsko commented on YARN-10718: - Thanks [~zhuqi], +1 LGTM. Will commit this soon. > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307605#comment-17307605 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk. > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307602#comment-17307602 ] Peter Bacsko commented on YARN-10674: - +1 LGTM. I'm going to commit this soon. > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10674: Summary: fs2cs should generate auto-created queue deletion properties (was: fs2cs: should support auto created queue deletion.) > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306240#comment-17306240 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] I had a discussion with [~gandras], he will post an update soon. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10645) Fix queue state related update for auto created queue.
[ https://issues.apache.org/jira/browse/YARN-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306203#comment-17306203 ] Peter Bacsko commented on YARN-10645: - [~zhuqi] [~gandras] is this patch still needed? Looking at Andras' comment, it is telling me that this ticket is a duplicate. Is it a dup? > Fix queue state related update for auto created queue. > -- > > Key: YARN-10645 > URL: https://issues.apache.org/jira/browse/YARN-10645 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10645.001.patch > > > Now the queue state in auto created queue can't be updated after refactor in > YARN-10504. > We should support fix the queue state related logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306157#comment-17306157 ] Peter Bacsko commented on YARN-10503: - The question is this part: {noformat} public enum AbsoluteResourceType { MEMORY, VCORES, GPUS, FPGAS } {noformat} Do we want to treat GPUs and FPGAs like that? In other parts of the code, we have mem/vcore as primary resources, then an array of other resources. For example, constructors from {{org.apache.hadoop.yarn.api.records.Resource}}: {noformat} @Public @Stable public static Resource newInstance(long memory, int vCores, Map others) { if (others != null) { return new LightWeightResource(memory, vCores, ResourceUtils.createResourceTypesArray(others)); } else { return newInstance(memory, vCores); } } @InterfaceAudience.Private @InterfaceStability.Unstable public static Resource newInstance(Resource resource) { Resource ret; int numberOfKnownResourceTypes = ResourceUtils .getNumberOfKnownResourceTypes(); if (numberOfKnownResourceTypes > 2) { ret = new LightWeightResource(resource.getMemorySize(), resource.getVirtualCores(), resource.getResources()); } else { ret = new LightWeightResource(resource.getMemorySize(), resource.getVirtualCores()); } return ret; } {noformat} But with this modification, we sort of promote GPU and FPGA to the level of vcore and memory, at least from the perspective of the code and it also becomes inconsistent with the existing code. This is just my opinion though. cc [~epayne] [~ebadger]. > Support queue capacity in terms of absolute resources with gpu resourceType. > > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch, > YARN-10503.003.patch > > > Now the absolute resources are memory and cores. > {code:java} > /** > * Different resource types supported. 
> */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling with different resourceType > absolute demands. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
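The alternative Peter argues for above — keeping memory and vcores first-class while GPU, FPGA, and other custom resources stay in a generic collection — can be sketched roughly as follows. This is an illustrative toy, not YARN's actual Resource/ResourceUtils API; the class and resource names ("memory-mb", "vcores", "yarn.io/gpu") are assumptions for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: memory and vcores are dedicated fields, everything
// else (GPU, FPGA, ...) is looked up by name in a generic map, mirroring
// the LightWeightResource constructors quoted in the comment above.
public class ResourceSketch {
    private final long memory;
    private final int vCores;
    private final Map<String, Long> others = new HashMap<>();

    public ResourceSketch(long memory, int vCores, Map<String, Long> others) {
        this.memory = memory;
        this.vCores = vCores;
        if (others != null) {
            this.others.putAll(others);
        }
    }

    // Custom resources are resolved by name rather than being promoted
    // to first-class enum constants like MEMORY and VCORES.
    public long getResourceValue(String name) {
        if ("memory-mb".equals(name)) {
            return memory;
        }
        if ("vcores".equals(name)) {
            return vCores;
        }
        return others.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> custom = new HashMap<>();
        custom.put("yarn.io/gpu", 4L);
        ResourceSketch r = new ResourceSketch(8192, 4, custom);
        System.out.println(r.getResourceValue("yarn.io/gpu"));
    }
}
```

Under this shape, adding GPUS/FPGAS to AbsoluteResourceType would be unnecessary: a new resource type is just a new map key, which keeps the enum consistent with the rest of the codebase.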
[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.
[ https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306154#comment-17306154 ] Peter Bacsko commented on YARN-10704: - Thanks [~zhuqi] I have some minor comments: 1. {noformat} sb.append(" The CS effective capacity for absolute mode in UI should support GPU and > other custom resources. > > > Key: YARN-10704 > URL: https://issues.apache.org/jira/browse/YARN-10704 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10704.001.patch, YARN-10704.002.patch, > image-2021-03-19-12-05-28-412.png, image-2021-03-19-12-08-35-273.png > > > Actually there are no information about the effective capacity about GPU in > UI for absolute resource mode. > !image-2021-03-19-12-05-28-412.png|width=873,height=136! > But we have this information in QueueMetrics: > !image-2021-03-19-12-08-35-273.png|width=613,height=268! > > It's very important for our GPU users to use in absolute mode, there still > have nothing to know GPU absolute information in CS Queue UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups
[ https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971 ] Peter Bacsko edited comment on YARN-10597 at 3/19/21, 3:35 PM: --- [~shuzirra] is it really that simple? You told me that there were bunch of unit test failures when you tried to change it months back. Anyway it's great news if the change is tiny. was (Author: pbacsko): [~shuzirra] is it really that simple? You told me that there were bunch of unit test failures. Anyway it's great news if the change is tiny. > CSMappingPlacementRule should not create new instance of Groups > --- > > Key: YARN-10597 > URL: https://issues.apache.org/jira/browse/YARN-10597 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10597.001.patch > > > As [~ahussein] pointed out in YARN-10425, no new Groups instance should be > created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups
[ https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971 ] Peter Bacsko commented on YARN-10597: - [~shuzirra] is it really that simple? You told me that there were a bunch of unit test failures. Anyway, it's great news if the change is tiny. > CSMappingPlacementRule should not create new instance of Groups > --- > > Key: YARN-10597 > URL: https://issues.apache.org/jira/browse/YARN-10597 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10597.001.patch > > > As [~ahussein] pointed out in YARN-10425, no new Groups instance should be > created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10641) Refactor the max app related update, and fix maxApplications update error when adding new queues.
[ https://issues.apache.org/jira/browse/YARN-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304117#comment-17304117 ] Peter Bacsko commented on YARN-10641: - +1 Thanks for the patch [~zhuqi] and [~gandras] for the review. Committed to trunk. > Refactor the max app related update, and fix maxApllications update error > when add new queues. > -- > > Key: YARN-10641 > URL: https://issues.apache.org/jira/browse/YARN-10641 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10641.001.patch, YARN-10641.002.patch, > YARN-10641.003.patch, YARN-10641.004.patch, YARN-10641.005.patch, > YARN-10641.006.patch, image-2021-02-20-15-49-58-677.png, > image-2021-02-20-15-53-51-099.png, image-2021-02-20-15-55-44-780.png, > image-2021-02-20-16-29-18-519.png, image-2021-02-20-16-31-13-714.png > > > When refactor the update logic in YARN-10504 . > The update max applications based abs/cap is wrong, this should be fixed, > because the max applications is key part to limit applications in CS. > For example: > When adding a dynamic queue, the other children's max app of parent queue are > not updated correctly: > !image-2021-02-20-15-53-51-099.png|width=639,height=509! > The new added queue's max app will updated correctly: > !image-2021-02-20-15-55-44-780.png|width=542,height=426! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304089#comment-17304089 ] Peter Bacsko commented on YARN-10692: - Thanks [~zhuqi] for the patch, committed to trunk. > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch, > YARN-10692.003.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304078#comment-17304078 ] Peter Bacsko commented on YARN-10692: - +1 LGTM. Committing this soon. > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch, > YARN-10692.003.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10685) Fix typos in AbstractCSQueue
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304041#comment-17304041 ] Peter Bacsko commented on YARN-10685: - +1 thanks [~zhuqi] for the patch, committed to trunk. > Fix typos in AbstractCSQueue > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch, YARN-10685.002.patch, > YARN-10685.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10685) Fix typos in AbstractCSQueue
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10685: Summary: Fix typos in AbstractCSQueue (was: Fixed some Typo in AbstractCSQueue.) > Fix typos in AbstractCSQueue > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch, YARN-10685.002.patch, > YARN-10685.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304027#comment-17304027 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] for the patch. I think we are very close. I still have some comments: 1. {noformat} private FSConfigToCSConfigConverterParams. PreemptionMode disablePreemption; private FSConfigToCSConfigConverterParams. PreemptionMode preemptionMode; {noformat} We don't need two enums. We need only one which covers all states (enabled / observeonly / nopolicy). You can extend {{PreemptionMode}} with a new variable which says whether it's enabled or disabled: {noformat} public enum PreemptionMode { ENABLE("enable", true), NO_POLICY("nopolicy", false), OBSERVE_ONLY("observeonly", false); private String cliOption; private boolean enabled; PreemptionMode(String cliOption, boolean enabled) { this.cliOption = cliOption; this.enabled = enabled; } public String getCliOption() { return cliOption; } public boolean isEnabled() { return enabled; } {noformat} So you just call {{preemptionMode.isEnabled()}} and don't need two variables just to hold the information whether it's enabled or not. 2. {{public static PreemptionMode fromString(String cliOption)}} --> this method never returns ENABLED, which is important (also, pls change "ENABLE" to "ENABLED", note the "D" at the end). cc [~gandras] please review patch v14. > fs2cs: should support auto created queue deletion. 
> -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
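The single-enum refactor suggested in points 1 and 2 above could look like the sketch below. It folds the enabled/disabled state into one PreemptionMode (with the constant renamed ENABLED as requested) and gives fromString a path that actually returns ENABLED. The CLI option strings are taken from the review comment; the committed patch may differ.

```java
// Sketch of the suggested PreemptionMode refactor: one enum carries both the
// CLI option string and the enabled flag, so a separate "disablePreemption"
// field is no longer needed.
public enum PreemptionMode {
    ENABLED("enable", true),
    NO_POLICY("nopolicy", false),
    OBSERVE_ONLY("observeonly", false);

    private final String cliOption;
    private final boolean enabled;

    PreemptionMode(String cliOption, boolean enabled) {
        this.cliOption = cliOption;
        this.enabled = enabled;
    }

    public String getCliOption() {
        return cliOption;
    }

    public boolean isEnabled() {
        return enabled;
    }

    // Unlike the version criticized in point 2, this lookup can return
    // ENABLED; unknown input is rejected rather than silently mapped.
    public static PreemptionMode fromString(String cliOption) {
        for (PreemptionMode mode : values()) {
            if (mode.cliOption.equalsIgnoreCase(cliOption)) {
                return mode;
            }
        }
        throw new IllegalArgumentException(
            "Unknown preemption mode: " + cliOption);
    }
}
```

Call sites then only need `preemptionMode.isEnabled()` instead of consulting two separate enum fields.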
[jira] [Comment Edited] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542 ] Peter Bacsko edited comment on YARN-10692 at 3/17/21, 4:11 PM: --- Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} Also, you should consider renaming "totalGpuUtilization" to "nodeGpuUtilization" so that it matches the method name. was (Author: pbacsko): Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. 
> cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542 ] Peter Bacsko commented on YARN-10692: - Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
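The stream-based averaging suggested in nit 2 above can be tried out in isolation. GpuStub below stands in for the real per-GPU information objects (only the overall-utilization accessor is modeled); the rest follows the snippet from the review, including the null/empty guard.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Self-contained sketch of the suggested getNodeGpuUtilization(): sum the
// per-GPU overall utilizations with a stream and divide by the GPU count.
public class NodeGpuUtilization {
    static class GpuStub {
        private final float overallGpuUtilization;
        GpuStub(float utilization) { this.overallGpuUtilization = utilization; }
        float getOverallGpuUtilization() { return overallGpuUtilization; }
    }

    static float getNodeGpuUtilization(List<GpuStub> gpuList) {
        // Matches the review's naming suggestion: the result is the
        // node-level average, so call it nodeGpuUtilization.
        float nodeGpuUtilization = 0;
        if (gpuList != null && !gpuList.isEmpty()) {
            nodeGpuUtilization = gpuList.stream()
                .map(GpuStub::getOverallGpuUtilization)
                .collect(Collectors.summingDouble(Float::floatValue))
                .floatValue() / gpuList.size();
        }
        return nodeGpuUtilization;
    }

    public static void main(String[] args) {
        List<GpuStub> gpus = Arrays.asList(new GpuStub(0.2f), new GpuStub(0.6f));
        // Prints the average utilization of the two stub GPUs.
        System.out.println(getNodeGpuUtilization(gpus));
    }
}
```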
[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10497: Labels: capacity-scheduler capacityscheduler (was: ) > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Fix For: 3.4.0 > > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... 
> {code} > Which causes match issue when we try to delete the queue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
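The failure mode in this ticket is easy to reproduce without Hadoop on the classpath: splitting `default, q1, q2` on commas leaves a leading space on each token, so a `contains("q1")` check misses. Trimming each token (Hadoop's Configuration offers getTrimmedStringCollection for this, if I recall the API correctly) makes the lookup succeed. The class below is an illustrative reproduction, not the patched code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal reproduction of the queue-deletion mismatch: untrimmed splitting
// produces tokens like " q1", so the sibling-queue membership check fails.
public class QueueListTrim {
    static List<String> splitRaw(String value) {
        return new ArrayList<>(Arrays.asList(value.split(",")));
    }

    static List<String> splitTrimmed(String value) {
        List<String> out = new ArrayList<>();
        for (String token : value.split(",")) {
            out.add(token.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Value as it appears in the capacity-scheduler.xml from the report.
        String queues = "default, q1, q2";
        System.out.println(splitRaw(queues).contains("q1"));     // false: token is " q1"
        System.out.println(splitTrimmed(queues).contains("q1")); // true
    }
}
```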
[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303365#comment-17303365 ] Peter Bacsko commented on YARN-10497: - +1 Thanks [~wangda] / [~zhuqi] for the patch and [~gandras], [~shuzirra] for the review. Committed to trunk. > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... 
> {code} > Which causes match issue when we try to delete the queue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303342#comment-17303342 ] Peter Bacsko commented on YARN-10674: - [~gandras] good suggestions, thanks! [~zhuqi] please apply the suggested modifications. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303245#comment-17303245 ] Peter Bacsko commented on YARN-10497: - I think it's good. Let's wait for Jenkins and I'll commit it. > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... > {code} > Which causes match issue when we try to delete the queue. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303222#comment-17303222 ] Peter Bacsko commented on YARN-10674: - [~gandras] do you have further comments? I think the patch is in good shape now. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878 ] Peter Bacsko edited comment on YARN-10370 at 3/16/21, 8:36 PM: --- [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that this feature is ready and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? was (Author: pbacsko): [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the scheduler more easy, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878 ] Peter Bacsko commented on YARN-10370: - [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the scheduler more easy, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302599#comment-17302599 ] Peter Bacsko commented on YARN-10686: - +1 Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk. > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10686: Summary: Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode (was: Fix testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode user error.) > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302567#comment-17302567 ] Peter Bacsko commented on YARN-10682: - +1 Thanks [~zhuqi] for the patch and [~gandras] for the review, committed to trunk. > The scheduler monitor policies conf should trim values separated by comma > - > > Key: YARN-10682 > URL: https://issues.apache.org/jira/browse/YARN-10682 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10682.001.patch > > > When I configured the scheduler monitor policies with spaces, the RM failed to start. > The conf should trim the values around "," : > "a,b,c" is supported now, but "a, b, c" is not; this jira just adds the trim. > > It happened when testing multiple policies: > > <property> > <name>yarn.resourcemanager.scheduler.monitor.policies</name> > <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy, > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy</value> > </property> > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10682: Summary: The scheduler monitor policies conf should trim values separated by comma (was: The scheduler monitor policies conf should support trim between ",".) > The scheduler monitor policies conf should trim values separated by comma > - > > Key: YARN-10682 > URL: https://issues.apache.org/jira/browse/YARN-10682 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10682.001.patch > > > When i configured scheduler monitor policies with space, the RM will start > with error. > The conf should support trim between "," , such as : > "a,b,c" is supported now, but "a, b, c" is not supported now, just add > trim in this jira. > > When tested multi policy, it happened. > > yarn.resourcemanager.scheduler.monitor.policies > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy, > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
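The YARN-10682 fix above boils down to splitting the configured value on commas and trimming each token before the policy classes are loaded (Hadoop's own {{Configuration.getTrimmedStrings}} does essentially this). A minimal standalone sketch of that trimming logic in plain Java — not the actual Hadoop API, just an illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class TrimmedConf {
    // Splits a comma-separated configuration value and trims each token,
    // so "a, b, c" yields the same policy list as "a,b,c".
    public static List<String> getTrimmedStrings(String value) {
        List<String> result = new ArrayList<>();
        if (value == null) {
            return result;
        }
        for (String token : value.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(getTrimmedStrings("a, b, c"));
    }
}
```

With this, a value such as "PolicyA, PolicyB" no longer produces a class name with a leading space, which was what made the RM fail to start.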
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302548#comment-17302548 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi], this definitely looks better. We're close to the final version. Some comments: 1. {noformat} Disable the preemption with nopolicy or observeonly mode, " + "default mode is nopolicy with no arg." + "When use nopolicy arg, it means to remove " + "ProportionalCapacityPreemptionPolicy for CS preemption, " + "When use observeonly arg, " + "it means to set " + "yarn.resourcemanager.monitor.capacity.preemption.observe_only " + "to true" {noformat} I'd like to slightly modify this text: {noformat} Disable the preemption with \"nopolicy\" or \"observeonly\" mode. Default is \"nopolicy\". \"nopolicy\" removes ProportionalCapacityPreemptionPolicy from the list of monitor policies. \"observeonly\" sets \"yarn.resourcemanager.monitor.capacity.preemption.observe_only\" to true. {noformat} 2. This definition: {{private String disablePreemptionMode;}} This should be a simple enum like: {noformat} public enum DisablePreemptionMode { OBSERVE_ONLY { @Override String getCliOption() { return "observeonly"; } }, NO_POLICY { @Override String getCliOption() { return "nopolicy"; } }; abstract String getCliOption(); } {noformat} So you can also use them here: {noformat} private static void checkDisablePreemption(CliOption cliOption, String disablePreemptionMode) { if (disablePreemptionMode == null || disablePreemptionMode.trim().isEmpty()) { // The default mode is nopolicy. return; } for (DisablePreemptionMode mode : DisablePreemptionMode.values()) { if (mode.getCliOption().equals(disablePreemptionMode)) { return; } } throw new PreconditionException( String.format("Specified disable-preemption option %s is illegal, " + "use \"nopolicy\" or \"observeonly\"", disablePreemptionMode)); } {noformat} "disablePreemptionMode" should be an enum everywhere. 3. 
{noformat} public void convertSiteProperties(Configuration conf, Configuration yarnSiteConfig, boolean drfUsed, boolean enableAsyncScheduler, boolean userPercentage, boolean disablePreemption, String disablePreemptionMode) { {noformat} Here "disablePreemptionMode" should also be an enum, and make sure that it always has a value. If it always has a value, this part becomes much simpler: {noformat} if (disablePreemption && disablePreemptionMode == DisablePreemptionMode.NO_POLICY) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ""); } } {noformat} 4. {{AutoCreatedQueueDeletionPolicy.class.getCanonicalName()}} This string is referenced very often in the tests. Instead, use a final String: {noformat} private static final String DELETION_POLICY_CLASS = AutoCreatedQueueDeletionPolicy.class.getCanonicalName(); {noformat} This makes the tests much more readable. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
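The enum-based CLI validation proposed in the review above can be sketched in isolation like this. It is a hypothetical, self-contained variant: the constants carry their CLI spelling via a constructor field instead of the abstract-method style in the comment, and {{IllegalArgumentException}} stands in for the converter's {{PreconditionException}}:

```java
public class DisablePreemptionCheck {
    // CLI modes proposed in the review: each constant knows its CLI spelling.
    public enum DisablePreemptionMode {
        OBSERVE_ONLY("observeonly"),
        NO_POLICY("nopolicy");

        private final String cliOption;

        DisablePreemptionMode(String cliOption) {
            this.cliOption = cliOption;
        }

        String getCliOption() {
            return cliOption;
        }

        // Maps a raw CLI argument to a mode; null/empty falls back to the
        // default ("nopolicy"), anything unrecognized is rejected.
        static DisablePreemptionMode fromCliOption(String arg) {
            if (arg == null || arg.trim().isEmpty()) {
                return NO_POLICY; // default mode
            }
            for (DisablePreemptionMode mode : values()) {
                if (mode.getCliOption().equals(arg.trim())) {
                    return mode;
                }
            }
            throw new IllegalArgumentException(String.format(
                "Specified disable-preemption option %s is illegal, "
                + "use \"nopolicy\" or \"observeonly\"", arg));
        }
    }

    public static void main(String[] args) {
        System.out.println(DisablePreemptionMode.fromCliOption("observeonly"));
    }
}
```

Parsing into the enum once, at the CLI boundary, is what lets the rest of the converter compare modes with `==` instead of re-checking raw strings.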
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319 ] Peter Bacsko edited comment on YARN-10674 at 3/12/21, 1:34 PM: --- [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignores the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. was (Author: pbacsko): [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko edited comment on YARN-10674 at 3/11/21, 3:25 PM: --- Ok, I did some research, I think we have 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? was (Author: pbacsko): Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. 
So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko edited comment on YARN-10674 at 3/11/21, 2:43 PM: --- Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? was (Author: pbacsko): Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. 
So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko commented on YARN-10674: - Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299466#comment-17299466 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] yes that's right. This is the default setting for policies: {noformat} <property> <description>The list of SchedulingEditPolicy classes that interact with the scheduler. A particular module may be incompatible with the scheduler, other policies, or a configuration of either.</description> <name>yarn.resourcemanager.scheduler.monitor.policies</name> <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value> </property> {noformat} This is from {{yarn-default.xml}}. So when we don't use preemption, we should remove this policy. But we actually have to think a little bit, because how we disable preemption affects our downstream Hadoop codebase. So let's wait until we figure out what is the best solution to turn off preemption. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299456#comment-17299456 ] Peter Bacsko commented on YARN-10674: - [~gandras] ah, that's true. I just overcomplicated the whole thing (not that preemption in general is easy to begin with). Yes, we don't need it if we don't have the policy. [~zhuqi] please hold off on the new patch. What Andras said is correct, but there might be other changes that I'll recommend. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299427#comment-17299427 ] Peter Bacsko commented on YARN-10674: - I'll do a deeper review today. [~gandras] you say: "Is setting observe only necessary here? This is an extremely subtle property.". I'm not sure how subtle it is, but it is mentioned in the upstream documentation: |{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}}|If true, run the policy but do not affect the cluster with preemption and kill events. Default value is false| However, if someone thinks that disabling preemption for "root" is a better solution, I'm not against that. We might need other folks to chime in and share their thoughts. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10685) Fixed some Typo in AbstractCSQueue.
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298874#comment-17298874 ] Peter Bacsko commented on YARN-10685: - Sure, I'll check it out. > Fixed some Typo in AbstractCSQueue. > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298861#comment-17298861 ] Peter Bacsko commented on YARN-10571: - [~gandras] thanks for the patch. I just have one question: the class {{CapacitySchedulerAutoQueueHandler}} was renamed to {{CapacitySchedulerQueueHandler}}. But the latter tells me that this is a class which handles all kinds of queues, not just auto-created queues. Wouldn't it make sense to keep the original name? Even the instance is called {{autoQueueHandler}}. Also, there's a Javadoc and a checkstyle problem. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch > > > As per YARN-10506 we have introduced another mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298842#comment-17298842 ] Peter Bacsko commented on YARN-10674: - Ok, here is what I found: 1. {{RM_SCHEDULER_ENABLE_MONITORS}} --> ok, this can be set to "true" in all cases. 2. If FS preemption is disabled --> there is a property which is better than configuring the "root" queue. If FS preemption is disabled ({{yarn.scheduler.fair.preemption}} = {{false}}), then we should generate {{yarn.resourcemanager.monitor.capacity.preemption.observe_only}} = {{true}}. This means that we have the monitor thread running but we don't do any preemption. So we don't need to set "root.disable_preemption". 3. As I mentioned, the {{Configuration}} object is empty. The problem is, in order to use the preemption, we need to set the preemption policy, which is missing right now. So, if FS preemption is enabled, this line must be added: {noformat} if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ProportionalCapacityPreemptionPolicy.class.getCanonicalName()); ... {noformat} So, the modified code should look like this: {noformat} yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ProportionalCapacityPreemptionPolicy.class.getCanonicalName()); ... } else { // no preemption yarnSiteConfig.setBoolean(CapacitySchedulerConfiguration.PREEMPTION_OBSERVE_ONLY, true); } // new code comes here if (!userPercentage) { String policies = yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES); if (policies == null) { ... {noformat} Please modify the test cases accordingly and fix the checkstyle issues as well. 
> fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
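The branching described in the comment above can be sketched with a plain {{Map}} standing in for Hadoop's {{Configuration}} (a simplification for illustration; the real converter works on {{Configuration}} objects, and later comments in this thread note that observe_only actually belongs in capacity-scheduler.xml rather than yarn-site.xml):

```java
import java.util.HashMap;
import java.util.Map;

public class SitePropsSketch {
    static final String ENABLE_MONITORS = "yarn.resourcemanager.scheduler.monitor.enable";
    static final String MONITOR_POLICIES = "yarn.resourcemanager.scheduler.monitor.policies";
    static final String OBSERVE_ONLY =
        "yarn.resourcemanager.monitor.capacity.preemption.observe_only";
    static final String PREEMPTION_POLICY =
        "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        + "ProportionalCapacityPreemptionPolicy";

    // Monitors are always enabled; if FS preemption was on, install the CS
    // preemption policy, otherwise keep the monitor in observe-only mode.
    public static Map<String, String> convert(boolean fsPreemptionEnabled) {
        Map<String, String> yarnSite = new HashMap<>();
        yarnSite.put(ENABLE_MONITORS, "true");
        if (fsPreemptionEnabled) {
            yarnSite.put(MONITOR_POLICIES, PREEMPTION_POLICY);
        } else {
            yarnSite.put(OBSERVE_ONLY, "true");
        }
        return yarnSite;
    }
}
```

The key point of the comment is the asymmetry: disabling FS preemption does not remove the monitor, it only neutralizes it, while enabling it requires explicitly adding the policy class that the empty {{Configuration}} object was missing.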
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298776#comment-17298776 ] Peter Bacsko edited comment on YARN-10674 at 3/10/21, 12:03 PM:

[~zhuqi] thanks for the patch. I found a new property which is probably good for us if preemption is completely disabled on the FS side. I have to check if it is really acceptable.

was (Author: pbacsko):
[~zhuqi] thanks for the patch. I found a new property which is probably good for us if preemption is completely disabled on the FS side. I have to check if it is good for us.
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298156#comment-17298156 ] Peter Bacsko commented on YARN-10674:

[~zhuqi] this is very interesting. If we enable the RM monitors, system-wide preemption is always considered enabled, too.

AbstractCSQueue:
{noformat}
private boolean isQueueHierarchyPreemptionDisabled(CSQueue q,
    CapacitySchedulerConfiguration configuration) {
  boolean systemWidePreemption = csContext.getConfiguration()
      .getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
          YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS);
  CSQueue parentQ = q.getParent();

  // If the system-wide preemption switch is turned off, all of the queues in
  // the qPath hierarchy have preemption disabled, so return true.
  if (!systemWidePreemption) return true;
{noformat}

However, you already added a policy in YARN-10623, so it looks like this property always has to be enabled in weight mode. But what if we convert an FS configuration in which preemption is completely disabled? I think the best we can do right now is disable preemption for "root", which propagates to all other parent queues. So I suggest the following approach:

1. In percentage conversion mode, do not enable RM monitors by default, because they're not needed.
2. In weight mode (which is the default now), we have to enable them. But if "yarn.scheduler.fair.preemption" is false, then "yarn.scheduler.capacity.root.disable_preemption" must be set to true, but only for "root". This can be done in {{FSQueueConverter}}.

cc [~bteke] [~gandras] [~snemeth], not sure if this is a good approach, but I can't see anything better.
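The two-level decision discussed in the comment above can be illustrated with a small standalone sketch: if the system-wide monitor switch is off, every queue behaves as preemption-disabled; otherwise a disable flag set on "root" propagates down the dotted queue path. This is a simplification for illustration (the real CapacityScheduler logic also lets children override their parent), and all names are assumptions, not the actual API.

```java
import java.util.Set;

// Hedged sketch of the preemption-disabled check described above.
public class PreemptionHierarchySketch {

    /**
     * Returns true if preemption is effectively disabled for the queue at
     * queuePath (e.g. "root.a.b"). Simplified: a disable flag anywhere on the
     * ancestor chain wins, mirroring "set it on root and it propagates".
     */
    static boolean isPreemptionDisabled(String queuePath,
                                        boolean systemWideMonitorsEnabled,
                                        Set<String> disabledQueues) {
        // Mirrors the first check in isQueueHierarchyPreemptionDisabled():
        // if the system-wide switch is off, preemption is off everywhere.
        if (!systemWideMonitorsEnabled) {
            return true;
        }
        // Walk up the dotted path: "root.a.b" -> "root.a" -> "root".
        String path = queuePath;
        while (true) {
            if (disabledQueues.contains(path)) {
                return true;
            }
            int dot = path.lastIndexOf('.');
            if (dot < 0) {
                return false;
            }
            path = path.substring(0, dot);
        }
    }
}
```

Under this model, adding only "root" to the disabled set is enough to turn preemption off for every descendant, which is the behavior the suggested fs2cs approach relies on.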
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298112#comment-17298112 ] Peter Bacsko edited comment on YARN-10674 at 3/9/21, 3:23 PM:

[~zhuqi] I have the following comments:

1. This change seems to always enable the "RM monitors":

{noformat}
// This should be always true to trigger dynamic queue auto deletion
// when expired.
yarnSiteConfig.setBoolean(
    YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{noformat}

But I don't think this is necessary. We need to enable it in two cases: preemption is enabled OR we're in weight mode. We don't have auto queue deletion in percentage mode (fs2cs can still convert to percentages with a command line switch). So I suggest that you pass an extra boolean "usePercentages".

Invocation from {{FSConfigToCSConfigConverter}}:
{noformat}
siteConverter.convertSiteProperties(inputYarnSiteConfig,
    convertedYarnSiteConfig, drfUsed,
    conversionOptions.isEnableAsyncScheduler(),
    usePercentages);  <-- last argument is new
{noformat}

Then in the site converter:
{noformat}
if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
    FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  preemptionEnabled = true;
  ...
}

if (!usePercentages) {
  // setting it again is OK
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  String policies =
      yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
  if (policies == null) {
    policies = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  } else {
    policies += "," + AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  }
  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      policies);

  // Set the expired queue deletion interval to 10s, consistent with FS.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}

If I think about it, {{yarnSiteConfig}} is the output config, so this cannot happen:

{noformat}
} else {
  policies += "," + AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
}
{noformat}

This {{Configuration}} object is created with no entries, so the {{else}} branch will never be taken. It can be simplified to:

{noformat}
if (!usePercentages) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  String policy = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, policy);

  // Set the expired queue deletion interval to 10s, consistent with FS.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}

2. This also means two separate test cases:
* When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false)
* When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false)

I recommend the following naming:
{{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario
{{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario
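The simplification argued in the review above (the append-with-comma branch is dead because the output config starts empty) can be shown with a small standalone sketch. `java.util.Properties` stands in for Hadoop's `Configuration`, and the key strings and policy name are illustrative placeholders for the real constants.

```java
import java.util.Properties;

// Hedged sketch of the simplified weight-mode branch from the review above.
public class WeightModeConversionSketch {

    // Illustrative stand-ins for the YarnConfiguration / CS constants.
    static final String ENABLE_MONITORS =
        "yarn.resourcemanager.scheduler.monitor.enable";
    static final String MONITOR_POLICIES =
        "yarn.resourcemanager.scheduler.monitor.policies";
    static final String EXPIRED_TIME_KEY =
        "auto-create-child-queue.expired-time";
    static final String DELETION_POLICY = "AutoCreatedQueueDeletionPolicy";

    static void applyWeightModeSettings(Properties out, boolean usePercentages) {
        if (!usePercentages) {
            out.setProperty(ENABLE_MONITORS, "true");
            // The output config is freshly created, so there is never an
            // existing policy list to append to: set the policy directly.
            out.setProperty(MONITOR_POLICIES, DELETION_POLICY);
            // Expire auto-created queues after 10s, consistent with FS.
            out.setProperty(EXPIRED_TIME_KEY, "10");
        }
        // Percentage mode: no monitors or deletion policy needed.
    }
}
```

The two test scenarios recommended above fall out directly: weight mode populates all three keys, percentage mode leaves the config untouched.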
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298092#comment-17298092 ] Peter Bacsko commented on YARN-10674:

Ok, thanks. I'll review this one soon.
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298074#comment-17298074 ] Peter Bacsko commented on YARN-10674:

[~zhuqi] am I right in thinking that this patch depends on YARN-10682? This change generates a config entry containing ",", which is not supported yet.
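The comment above concerns a config value holding a comma-separated policy list. Independent of whether the target Hadoop version accepts such an entry, a hedged sketch of how a list-valued property is typically assembled and split back apart (plain `String.join`/`split`, not any YARN API):

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: round-tripping a comma-separated policy list.
public class PolicyListSketch {

    /** Join policy class names into a single comma-separated config value. */
    static String join(List<String> policies) {
        return String.join(",", policies);
    }

    /** Split a config value back into policy names, tolerating whitespace. */
    static List<String> parse(String value) {
        return Arrays.asList(value.split("\\s*,\\s*"));
    }
}
```

If the consumer of the value cannot split on commas (the gap YARN-10682 addresses), a joined value like this is exactly what breaks.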