[jira] [Reopened] (YARN-9615) Add dispatcher metrics to RM

2021-05-11 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reopened YARN-9615:


> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, 
> YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, 
> YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, 
> YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, 
> YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, 
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.
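As an illustration of the idea above (not the actual patch; the real change presumably plugs into Hadoop's metrics2 system, and all class and method names below are made up), a dispatcher could track a count and cumulative processing time per event type roughly like this:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: per-event-type counters and processing times.
public class EventTypeMetrics {

  private final Map<Enum<?>, LongAdder> counts = new ConcurrentHashMap<>();
  private final Map<Enum<?>, LongAdder> totalTimeNs = new ConcurrentHashMap<>();

  // Wrap the dispatcher's call to handler.handle(event) with this method.
  public void record(Enum<?> eventType, Runnable dispatch) {
    long start = System.nanoTime();
    try {
      dispatch.run();
    } finally {
      long elapsed = System.nanoTime() - start;
      counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
      totalTimeNs.computeIfAbsent(eventType, k -> new LongAdder()).add(elapsed);
    }
  }

  public long getCount(Enum<?> eventType) {
    LongAdder c = counts.get(eventType);
    return c == null ? 0 : c.sum();
  }

  public long getTotalProcessingTimeNs(Enum<?> eventType) {
    LongAdder t = totalTimeNs.get(eventType);
    return t == null ? 0 : t.sum();
  }
}
{code}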






[jira] [Updated] (YARN-9615) Add dispatcher metrics to RM

2021-05-11 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9615:
---
Attachment: YARN-9615-branch-3.3-001.patch

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9615-branch-3.3-001.patch, YARN-9615.001.patch, 
> YARN-9615.002.patch, YARN-9615.003.patch, YARN-9615.004.patch, 
> YARN-9615.005.patch, YARN-9615.006.patch, YARN-9615.007.patch, 
> YARN-9615.008.patch, YARN-9615.009.patch, YARN-9615.010.patch, 
> YARN-9615.011.patch, YARN-9615.011.patch, YARN-9615.poc.patch, 
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic

2021-05-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342462#comment-17342462
 ] 

Peter Bacsko commented on YARN-10571:
-

Finally, no javac issues!

[~gandras] please check the test failure.

> Refactor dynamic queue handling logic
> -
>
> Key: YARN-10571
> URL: https://issues.apache.org/jira/browse/YARN-10571
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10571.001.patch, YARN-10571.002.patch, 
> YARN-10571.003.patch, YARN-10571.004.patch
>
>
> As per YARN-10506, we have introduced another mode for auto queue creation 
> and a new class which handles it. We should move the old managed-queue 
> related logic to CSAutoQueueHandler as well, and do additional cleanup 
> regarding queue management.






[jira] [Commented] (YARN-10763) add the speed of containers assigned metrics to ClusterMetrics

2021-05-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342360#comment-17342360
 ] 

Peter Bacsko commented on YARN-10763:
-

[~chaosju] some comments:

1. {{Timer}} / {{TimerTask}} are rather old constructs; I'd prefer 
{{ScheduledThreadPoolExecutor}}.
2. Another problem is that the {{Timer}} is not stopped in {{destroy()}}, which 
can definitely be a problem in the tests.
3. This is a singleton class, so the constructor should not be public; remove 
the modifier (see the sketch below).
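A minimal sketch of points 1-3, with made-up class and field names (this is not the actual patch): the periodic work runs on a single-threaded scheduled executor instead of a {{Timer}}, the constructor is private because the class is a singleton, and the executor is stopped in {{destroy()}} so it cannot leak threads into tests.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only; not the real ClusterMetrics code.
public final class ContainerAssignmentRate {

  private static final ContainerAssignmentRate INSTANCE = new ContainerAssignmentRate();

  private final AtomicLong assignedInWindow = new AtomicLong();
  private volatile long assignedPerSecond;

  // Single-threaded scheduled executor instead of Timer/TimerTask.
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Private constructor: callers must go through getInstance().
  private ContainerAssignmentRate() {
    scheduler.scheduleAtFixedRate(
        () -> assignedPerSecond = assignedInWindow.getAndSet(0), 1, 1, TimeUnit.SECONDS);
  }

  public static ContainerAssignmentRate getInstance() {
    return INSTANCE;
  }

  public void containerAssigned() {
    assignedInWindow.incrementAndGet();
  }

  public long getAssignedPerSecond() {
    return assignedPerSecond;
  }

  // Call this from the owning service's destroy()/stop path so the
  // scheduler thread does not leak into unit tests.
  public void destroy() {
    scheduler.shutdownNow();
  }
}
{code}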


> add  the speed of containers assigned metrics to ClusterMetrics
> ---
>
> Key: YARN-10763
> URL: https://issues.apache.org/jira/browse/YARN-10763
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.1
>Reporter: chaosju
>Priority: Major
> Attachments: YARN-10763.001.patch, screenshot-1.png
>
>
> It'd be good to have ContainerAssignedNum/Second in ClusterMetrics for 
> measuring cluster throughput.






[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-05-06 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340163#comment-17340163
 ] 

Peter Bacsko commented on YARN-9615:


[~BilwaST] I'm currently on vacation, I can get back to this on Monday. 

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, 
> YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, 
> image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic

2021-04-28 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334674#comment-17334674
 ] 

Peter Bacsko commented on YARN-10571:
-

Thanks [~gandras] for the patch. Do you know what's going on with the javac 
warnings? That code wasn't even touched. Maybe it has to do with the failing 
build ("Unable to create native thread").

I'll trigger a rebuild.

> Refactor dynamic queue handling logic
> -
>
> Key: YARN-10571
> URL: https://issues.apache.org/jira/browse/YARN-10571
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10571.001.patch, YARN-10571.002.patch, 
> YARN-10571.003.patch
>
>
> As per YARN-10506, we have introduced another mode for auto queue creation 
> and a new class which handles it. We should move the old managed-queue 
> related logic to CSAutoQueueHandler as well, and do additional cleanup 
> regarding queue management.






[jira] [Commented] (YARN-10739) GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time

2021-04-27 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333171#comment-17333171
 ] 

Peter Bacsko commented on YARN-10739:
-

+1

thanks [~zhuqi] for the patch and [~gandras] / [~zhanqi.cai] for the review. 
Committed to trunk.

> GenericEventHandler.printEventQueueDetails causes RM recovery to take too 
> much time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch, 
> YARN-10739.005.patch, YARN-10739.006.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue size is too large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If we have 4K nodes in the cluster and 4K apps running, then during an RM 
> switch every node manager re-registers with the RM, and the RM calls 
> NodesListManager to emit RMAppNodeUpdateEvents, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K*4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails prints the event queue details and 
> is called frequently; once the event queue size reaches 1 million+, iterating 
> over the queue in printEventQueueDetails becomes very slow, as shown below: 
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery will cost too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: 
> NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
> Event record counter: 1
> {code}
> Between AsyncDispatcher.handle and printEventQueueDetails, more than 1s is 
> spent on the iteration.
> I uploaded a patch to ensure that printEventQueueDetails is only called once 
> per 30s.
>  
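A minimal sketch of the throttling idea mentioned at the end of the description, using only a time-based guard (illustrative names, not the exact patch): the expensive iteration over the event queue is skipped unless at least 30 seconds have passed since the last print.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative rate limiter for the expensive queue-details printing.
public class QueueDetailsPrinter {

  private static final long PRINT_INTERVAL_MS = 30_000L;
  private final AtomicLong lastPrintTimeMs = new AtomicLong();

  // Returns true at most once per PRINT_INTERVAL_MS; callers skip the
  // expensive iteration over the event queue when it returns false.
  public boolean shouldPrintNow() {
    long now = System.currentTimeMillis();
    long last = lastPrintTimeMs.get();
    return now - last >= PRINT_INTERVAL_MS
        && lastPrintTimeMs.compareAndSet(last, now);
  }
}
{code}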






[jira] [Updated] (YARN-10739) GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time

2021-04-27 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10739:

Summary: GenericEventHandler.printEventQueueDetails causes RM recovery to 
take too much time  (was: GenericEventHandler.printEventQueueDetails cause RM 
recovery cost too much time)

> GenericEventHandler.printEventQueueDetails causes RM recovery to take too 
> much time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch, 
> YARN-10739.005.patch, YARN-10739.006.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue size is too large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If we have 4K nodes in the cluster and 4K apps running, then during an RM 
> switch every node manager re-registers with the RM, and the RM calls 
> NodesListManager to emit RMAppNodeUpdateEvents, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K*4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails prints the event queue details and 
> is called frequently; once the event queue size reaches 1 million+, iterating 
> over the queue in printEventQueueDetails becomes very slow, as shown below: 
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery will cost too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: 
> NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
> Event record counter: 1
> {code}
> Between AsyncDispatcher.handle and printEventQueueDetails, more than 1s is 
> spent on the iteration.
> I uploaded a patch to ensure that printEventQueueDetails is only called once 
> per 30s.
>  






[jira] [Commented] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332076#comment-17332076
 ] 

Peter Bacsko commented on YARN-10739:
-

Thanks for the patch [~zhuqi].

I have some comments:
1. {{PrintEventDetailsService #%d}} - I think it's better to call it 
{{PrintEventDetailsThread #%d}}.

2. Variable {{printEventDetailsService}} - same here, 
{{printEventDetailsExecutor}} sounds better.

3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just 
one core thread. I think it's fine if we don't allow it to time out, so I 
suggest setting this to "false" (which is the default).

4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in 
{{serviceStop()}}, let's call {{shutdownNow()}}, which is safer. Don't wait for 
printing (see the sketch after this list).

5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}

I know that this is for testing, but still, this affects production code. Trace 
level already floods the logs with everything. I don't think we should print 
this, even on TRACE. It's not a huge issue if it is not tested. 
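A small sketch of what points 1-4 could look like together, assuming a hypothetical service class (the names are illustrative, not the actual patch): a single-threaded executor with a descriptive thread name, no core-thread timeout, and {{shutdownNow()}} in {{serviceStop()}}.
{code:java}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: shows the executor lifecycle suggested in the review.
public class PrintEventDetailsSupport {

  private ScheduledThreadPoolExecutor printEventDetailsExecutor;

  protected void serviceStart() {
    ThreadFactory threadFactory = new ThreadFactory() {
      private final AtomicInteger count = new AtomicInteger();
      @Override
      public Thread newThread(Runnable r) {
        // Descriptive thread name instead of "PrintEventDetailsService #%d".
        return new Thread(r, "PrintEventDetailsThread #" + count.incrementAndGet());
      }
    };
    printEventDetailsExecutor = new ScheduledThreadPoolExecutor(1, threadFactory);
    // One core thread; keeping the default (no core-thread timeout) is fine.
    printEventDetailsExecutor.allowCoreThreadTimeOut(false);
  }

  protected void serviceStop() {
    if (printEventDetailsExecutor != null) {
      // Do not wait for a pending print; interrupt and discard queued tasks.
      printEventDetailsExecutor.shutdownNow();
    }
  }
}
{code}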



> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much 
> time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue size is too large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If we have 4K nodes in the cluster and 4K apps running, then during an RM 
> switch every node manager re-registers with the RM, and the RM calls 
> NodesListManager to emit RMAppNodeUpdateEvents, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K*4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails prints the event queue details and 
> is called frequently; once the event queue size reaches 1 million+, iterating 
> over the queue in printEventQueueDetails becomes very slow, as shown below: 
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery will cost too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: 
> NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
> Event record counter: 1
> {code}
> Between AsyncDispatcher.handle and printEventQueueDetails, more than 1s is 
> spent on the iteration.
> I uploaded a patch to ensure that printEventQueueDetails is only called once 
> per 30s.
>  






[jira] [Comment Edited] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

2021-04-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332076#comment-17332076
 ] 

Peter Bacsko edited comment on YARN-10739 at 4/26/21, 12:14 PM:


Thanks for the patch [~zhuqi].

I have some comments:
1. {{PrintEventDetailsService #%d}} - I think it's better to call it 
{{PrintEventDetailsThread #%d}}.

2. Variable {{printEventDetailsService}} - same here, 
{{printEventDetailsExecutor}} sounds better.

3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just 
one core thread. I think it's fine if we don't allow it to time out, so I 
suggest setting this to "false" (which is the default).

4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in 
{{serviceStop()}}, let's call {{shutdownNow()}} which is safer. Don't wait for 
printing.

5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}

I know that this is for testing, but still, this affects the production code. 
Trace level already floods the logs with everything. I don't think we should 
print this, even on TRACE. It's not a huge issue if it is not tested. 




was (Author: pbacsko):
Thanks for the patch [~zhuqi].

I have some comments:
1. {{PrintEventDetailsService #%d}} - I think it's better to call it 
{{PrintEventDetailsThread #%d}}.

2. Variable {{printEventDetailsService}} - same here, 
{{printEventDetailsExecutor}} sounds better.

3. {{printEventDetailsService.allowCoreThreadTimeOut(true);}} --> there is just 
one core thread. I think it's fine if we don't allow it to time out, so I 
suggest setting this to "false" (which is the default).

4. {{printEventDetailsService.shutdown();}} -- since we're shutting it down in 
{{serviceStop()}}, let's call {{shutdownNow()}} which is safer. Don't wait for 
printing.

5. Tracing log:
{noformat}
// For test
if (LOG.isTraceEnabled()) {
  LOG.trace("Event type: " + entry.getKey() + " printed.");
}
{noformat}

I know that this is for testing, but still, this affects production code. Trace 
level already floods the logs with everything. I don't think we should print 
this, even on TRACE. It's not a huge issue if it is not tested. 



> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much 
> time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue size is too large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If we have 4K nodes in the cluster and 4K apps running, then during an RM 
> switch every node manager re-registers with the RM, and the RM calls 
> NodesListManager to emit RMAppNodeUpdateEvents, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K*4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails prints the event queue details and 
> is called frequently; once the event queue size reaches 1 million+, iterating 
> over the queue in printEventQueueDetails becomes very slow, as shown below: 
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery will cost too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 

[jira] [Commented] (YARN-10637) fs2cs: add queue autorefresh policy during conversion

2021-04-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332014#comment-17332014
 ] 

Peter Bacsko commented on YARN-10637:
-

+1

Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> fs2cs: add queue autorefresh policy during conversion
> -
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this when YARN-10623 is finished.






[jira] [Updated] (YARN-10637) fs2cs: add queue autorefresh policy during conversion

2021-04-26 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10637:

Summary: fs2cs: add queue autorefresh policy during conversion  (was: We 
should support fs to cs support for auto refresh queues when conf changed, 
after YARN-10623 finished.)

> fs2cs: add queue autorefresh policy during conversion
> -
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this when YARN-10623 is finished.






[jira] [Updated] (YARN-10637) fs2cs: add queue autorefresh policy during conversion

2021-04-26 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10637:

Labels: fs2cs  (was: )

> fs2cs: add queue autorefresh policy during conversion
> -
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this when YARN-10623 is finished.






[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization

2021-04-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330917#comment-17330917
 ] 

Peter Bacsko commented on YARN-10732:
-

[~BilwaST] thanks for your comment - I think this is a question that can be 
answered by [~gandras].

> Disallow restarting a queue while it is in DRAINING state on CS 
> reinitialization
> 
>
> Key: YARN-10732
> URL: https://issues.apache.org/jira/browse/YARN-10732
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10732.001.patch
>
>
> CSConfigValidator#validateQueueHierarchy does not check a state where the old 
> queue is in DRAINING state but the new queue state is RUNNING. User should 
> wait until a queue is fully stopped.






[jira] [Commented] (YARN-10705) Misleading DEBUG log for container assignment needs to be removed when the container is actually reserved, not assigned in FairScheduler

2021-04-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330872#comment-17330872
 ] 

Peter Bacsko commented on YARN-10705:
-

Thanks for the patch [~sahuja], committed to trunk.

> Misleading DEBUG log for container assignment needs to be removed when the 
> container is actually reserved, not assigned in FairScheduler
> 
>
> Key: YARN-10705
> URL: https://issues.apache.org/jira/browse/YARN-10705
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10705.001.patch
>
>
> Following DEBUG logs are logged if a container reservation is made when a 
> node has been offered to the queue in FairScheduler:
> {code}
> 2021-02-10 07:33:55,049 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: 
> application_1610442362681_2607's resource request is reserved.
> 2021-02-10 07:33:55,049 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: 
> Assigned container in queue:root.pj_dc_pe container:
> {code}
> The latter log from above seems to indicate a bad container assignment with 
>  resource allocation, whereas in actuality it is a bad 
> log which shouldn't have been logged in the first place.
> This log comes from [1] after an application attempt with an unmet demand is 
> checked for container assignment/reservation.
> If the container for this app attempt is reserved on the node, then it 
> returns FairScheduler.CONTAINER_RESERVED from [2].
> From [3]:
> {quote}
>* If an assignment was made, returns the resources allocated to the
>* container.  If a reservation was made, returns
>* FairScheduler.CONTAINER_RESERVED.  If no assignment or reservation 
> was
>* made, returns an empty resource.
> {quote}
> We are checking for the empty resource at [4], but not 
> FairScheduler.CONTAINER_RESERVED before logging out a message for container 
> assignment specifically which is incorrect.
> Instead of:
> {code}
>   if (!assigned.equals(none())) {
> LOG.debug("Assigned container in queue:{} container:{}",
> getName(), assigned);
> break;
>   }
> {code}
> it should be:
> {code}
>   // check if an assignment or a reservation was made.
>   if (!assigned.equals(none())) {
> // only log container assignment if there is
> // an actual assignment, not a reservation.
> if (!assigned.equals(FairScheduler.CONTAINER_RESERVED)
> && LOG.isDebugEnabled()) {
>   LOG.debug("Assigned container in queue:" + getName() + " " +
> "container:" + assigned);
> }
> break;
>   }
> {code}
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L356
> [2] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L911
> [3] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L842
> [4] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L355






[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization

2021-04-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330865#comment-17330865
 ] 

Peter Bacsko commented on YARN-10732:
-

[~gandras] the old queue state comes from a {{CSQueueStore}}, which can be 
mocked, or a mock CSQueue in DRAINING state can be added to it. The new queue can 
be set to RUNNING in the config. I think this scenario is testable (see the 
rough sketch below).

It's also a bit regrettable that {{validateQueueHierarchy()}} is completely 
untested, at least there is no unit test for it in 
{{TestCapacitySchedulerConfigValidator}}. I think it could be a good idea to 
provide tests for it, if not in this JIRA, then maybe in a follow-up.
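A rough sketch of the mocking idea, assuming Mockito and the Capacity Scheduler queue API (illustrative only, not the actual test): a mock {{CSQueue}} reports DRAINING as the old state while the new configuration declares the same queue RUNNING, which is exactly the combination {{validateQueueHierarchy()}} should reject.
{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.QueueState;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue;

// Illustrative scaffolding for the DRAINING-vs-RUNNING validation scenario.
public class DrainingQueueValidationSketch {

  public void buildScenario() {
    // Old queue: mocked in DRAINING state, as it would appear in the CSQueueStore.
    CSQueue oldQueue = mock(CSQueue.class);
    when(oldQueue.getQueuePath()).thenReturn("root.a");
    when(oldQueue.getState()).thenReturn(QueueState.DRAINING);

    // New configuration: the same queue is declared RUNNING.
    Configuration newConf = new Configuration(false);
    newConf.set("yarn.scheduler.capacity.root.a.state", "RUNNING");

    // The mocked queue (via the queue store) and newConf would then be passed
    // to validateQueueHierarchy(), which should reject this combination.
  }
}
{code}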

> Disallow restarting a queue while it is in DRAINING state on CS 
> reinitialization
> 
>
> Key: YARN-10732
> URL: https://issues.apache.org/jira/browse/YARN-10732
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10732.001.patch
>
>
> CSConfigValidator#validateQueueHierarchy does not check a state where the old 
> queue is in DRAINING state but the new queue state is RUNNING. User should 
> wait until a queue is fully stopped.






[jira] [Commented] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization

2021-04-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330859#comment-17330859
 ] 

Peter Bacsko commented on YARN-10732:
-

I manually triggered a build and set the status to "Patch available".

> Disallow restarting a queue while it is in DRAINING state on CS 
> reinitialization
> 
>
> Key: YARN-10732
> URL: https://issues.apache.org/jira/browse/YARN-10732
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10732.001.patch
>
>
> CSConfigValidator#validateQueueHierarchy does not check a state where the old 
> queue is in DRAINING state but the new queue state is RUNNING. User should 
> wait until a queue is fully stopped.






[jira] [Assigned] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization

2021-04-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-10732:
---

Assignee: Andras Gyori  (was: Peter Bacsko)

> Disallow restarting a queue while it is in DRAINING state on CS 
> reinitialization
> 
>
> Key: YARN-10732
> URL: https://issues.apache.org/jira/browse/YARN-10732
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10732.001.patch
>
>
> CSConfigValidator#validateQueueHierarchy does not check a state where the old 
> queue is in DRAINING state but the new queue state is RUNNING. User should 
> wait until a queue is fully stopped.






[jira] [Assigned] (YARN-10732) Disallow restarting a queue while it is in DRAINING state on CS reinitialization

2021-04-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-10732:
---

Assignee: Peter Bacsko  (was: Andras Gyori)

> Disallow restarting a queue while it is in DRAINING state on CS 
> reinitialization
> 
>
> Key: YARN-10732
> URL: https://issues.apache.org/jira/browse/YARN-10732
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10732.001.patch
>
>
> CSConfigValidator#validateQueueHierarchy does not check a state where the old 
> queue is in DRAINING state but the new queue state is RUNNING. User should 
> wait until a queue is fully stopped.






[jira] [Commented] (YARN-10705) Misleading DEBUG log for container assignment needs to be removed when the container is actually reserved, not assigned in FairScheduler

2021-04-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330857#comment-17330857
 ] 

Peter Bacsko commented on YARN-10705:
-

+1 LGTM.

> Misleading DEBUG log for container assignment needs to be removed when the 
> container is actually reserved, not assigned in FairScheduler
> 
>
> Key: YARN-10705
> URL: https://issues.apache.org/jira/browse/YARN-10705
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10705.001.patch
>
>
> Following DEBUG logs are logged if a container reservation is made when a 
> node has been offered to the queue in FairScheduler:
> {code}
> 2021-02-10 07:33:55,049 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: 
> application_1610442362681_2607's resource request is reserved.
> 2021-02-10 07:33:55,049 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: 
> Assigned container in queue:root.pj_dc_pe container:
> {code}
> The latter log from above seems to indicate a bad container assignment with 
>  resource allocation, whereas in actuality it is a bad 
> log which shouldn't have been logged in the first place.
> This log comes from [1] after an application attempt with an unmet demand is 
> checked for container assignment/reservation.
> If the container for this app attempt is reserved on the node, then it 
> returns FairScheduler.CONTAINER_RESERVED from [2].
> From [3]:
> {quote}
>* If an assignment was made, returns the resources allocated to the
>* container.  If a reservation was made, returns
>* FairScheduler.CONTAINER_RESERVED.  If no assignment or reservation 
> was
>* made, returns an empty resource.
> {quote}
> We are checking for the empty resource at [4], but not 
> FairScheduler.CONTAINER_RESERVED before logging out a message for container 
> assignment specifically which is incorrect.
> Instead of:
> {code}
>   if (!assigned.equals(none())) {
> LOG.debug("Assigned container in queue:{} container:{}",
> getName(), assigned);
> break;
>   }
> {code}
> it should be:
> {code}
>   // check if an assignment or a reservation was made.
>   if (!assigned.equals(none())) {
> // only log container assignment if there is
> // an actual assignment, not a reservation.
> if (!assigned.equals(FairScheduler.CONTAINER_RESERVED)
> && LOG.isDebugEnabled()) {
>   LOG.debug("Assigned container in queue:" + getName() + " " +
> "container:" + assigned);
> }
> break;
>   }
> {code}
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L356
> [2] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L911
> [3] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L842
> [4] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L355






[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321128#comment-17321128
 ] 

Peter Bacsko commented on YARN-10654:
-

[~snemeth] [~shuzirra] do you guys have some time to review this? It's the 
equivalent of what FS does.

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320954#comment-17320954
 ] 

Peter Bacsko commented on YARN-10654:
-

Uploaded patch v1 which is probably the simplest approach to the '.' problem.

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Updated] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10654:

Attachment: YARN-10654-001.patch

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Assigned] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-10654:
---

Assignee: Peter Bacsko  (was: Gergely Pollak)

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317080#comment-17317080
 ] 

Peter Bacsko commented on YARN-10564:
-

+1

Committed to trunk. Thanks [~gandras] for the patch and [~zhuqi] for the review.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.006.patch, YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. The proposal is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean that the weight of every leaf of every parent under root is 
> set to 10.
> We should possibly take an approach that could support an arbitrary depth of 
> template configuration, because we might need to lift the limitation on auto 
> queue nesting.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316309#comment-17316309
 ] 

Peter Bacsko commented on YARN-10564:
-

Thanks [~gandras], I have the following suggestions: please add comments to the 
"for" loop explaining this. I don't want to dictate the wording; it could 
be more than one sentence. I think it's important. Also, maybe comment that 
"supportedWildcardLevel" or MAX_WILDCARD_LEVEL might change in the future (just 
like me, people might realize that the range is [0-1] and get 
confused).

Also, an overall comment like "collect all template settings based on prefix, 
then finally apply the collected settings to the newly created queue" might be 
useful. I'd put it somewhere before the "while" loop, but this is just an idea.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. The proposal is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean that the weight of every leaf of every parent under root is 
> set to 10.
> We should possibly take an approach that could support an arbitrary depth of 
> template configuration, because we might need to lift the limitation on auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 12:16 PM:
---

Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which 
modifies "queuePathParts". First we try to find the templates for the parent 
explicitly, then we step back a wildcard at each iteration. By changing 
"queuePathParts", the prefix changes so eventually we might find a parent which 
contains templates. 

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?


was (Author: pbacsko):
Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which 
modifies "queuePathParts". First we try to find the templates for the parent 
explicitly, then we step back each wildcard at a time. By changing 
"queuePathParts", the prefix changes so eventually we might find a parent which 
contains templates. 

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?
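To make the stepping concrete, here is a small self-contained sketch of the substitution the quoted "for" loop performs on a queue path (plain Java, illustrative only; how the real code builds and uses the resulting prefix may differ): for {{root.a.newparent.newchild}}, wildcard level 0 yields {{root.a.newparent.*}} and wildcard level 1 yields {{root.a.*.*}}.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WildcardTemplateLookupSketch {

  private static final String WILDCARD_QUEUE = "*";

  // Replaces the last (wildcardLevel + 1) path parts with "*",
  // mirroring the loop discussed above.
  static String toWildcardPrefix(String queuePath, int wildcardLevel) {
    List<String> queuePathParts =
        new ArrayList<>(Arrays.asList(queuePath.split("\\.")));
    for (int i = 0; i <= wildcardLevel; ++i) {
      queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
    }
    return String.join(".", queuePathParts);
  }

  public static void main(String[] args) {
    System.out.println(toWildcardPrefix("root.a.newparent.newchild", 0)); // root.a.newparent.*
    System.out.println(toWildcardPrefix("root.a.newparent.newchild", 1)); // root.a.*.*
  }
}
{code}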

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. The proposal is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean that the weight of every leaf of every parent under root is 
> set to 10.
> We should possibly take an approach that could support an arbitrary depth of 
> template configuration, because we might need to lift the limitation on auto 
> queue nesting.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277
 ] 

Peter Bacsko commented on YARN-10564:
-

Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which 
modifies "queuePathParts". First we try to find the templates for the parent 
explicitly, then we step back each wildcard at a time. By changing 
"queuePathParts", the prefix changes so eventually we might find a parent which 
contains templates. 

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. The proposal is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean that the weight of every leaf of every parent under root is 
> set to 10.
> We should possibly take an approach that could support an arbitrary depth of 
> template configuration, because we might need to lift the limitation on auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 11:51 AM:
---

[~gandras] thanks for the patch.
From a coding POV it looks OK; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue 
{{root.a.newparent.newchild}} gets created. How do the weight settings 
propagate to "newparent" and "newchild"? I kept looking at the code, but it's 
just not obvious. I can see that "root.a" will have an entry in 
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported" 
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 
1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because 
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I 
don't understand what it is meant to represent.


was (Author: pbacsko):
[~gandras] thanks for the patch.
From a coding POV it looks OK; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue 
{{root.a.newparent.newchild}} gets created. How do the weight settings 
propagate to "newparent" and "newchild"? I kept looking at the code, but it's 
just not obvious. I can see that "root.a" will have an entry in 
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported" 
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 
1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because 
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? 
Mentally I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. The proposal is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean that the weight of every leaf of every parent under root is 
> set to 10.
> We should possibly take an approach that could support an arbitrary depth of 
> template configuration, because we might need to lift the limitation on auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 10:49 AM:
---

[~gandras] thanks for the patch.
From a coding POV it looks OK; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape 
today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue 
{{root.a.newparent.newchild}} gets created. How do the weight settings 
propagate to "newparent" and "newchild"? I kept looking at the code, but it's 
just not obvious. I can see that "root.a" will have an entry in 
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported" 
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 
1, MAX_WILDCARD_LEVEL)}}. It seems to me that it is either 0 or 1, because 
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? 
Honestly, I don't understand what it is meant to represent.


was (Author: pbacsko):
[~gandras] thanks for the patch.
From a coding POV it looks OK; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape 
today).

1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue 
{{root.a.newparent.newchild}} gets created. How do the weight settings 
propagate to "newparent" and "newchild"? I kept looking at the code, but it's 
just not obvious. I can see that "root.a" will have an entry in 
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported" 
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 
1, MAX_WILDCARD_LEVEL)}}, which suggests that it is either 0 or 1, because 
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? 
Honestly, I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean: set the weight to 10 for every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko commented on YARN-10564:
-

[~gandras] thanks for the patch.
From a coding POV it looks OK; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape 
today).

1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue 
{{root.a.newparent.newchild}} gets created. How do the weight settings 
propagate to "newparent" and "newchild"? I kept looking at the code, but it's 
just not obvious. I can see that "root.a" will have an entry in 
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported" 
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 
1, MAX_WILDCARD_LEVEL)}}, which suggests that it is either 0 or 1, because 
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? 
Honestly, I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean: set the weight to 10 for every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313241#comment-17313241
 ] 

Peter Bacsko commented on YARN-10726:
-

Ok, I strongly believe that the failing tests are flaky.

[~zhuqi] could you verify it by running them locally a couple of times?

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313219#comment-17313219
 ] 

Peter Bacsko commented on YARN-10693:
-

I'll review this as soon as I have some spare cycles.

> Add document for YARN-10623 auto refresh queue conf in cs.
> --
>
> Key: YARN-10693
> URL: https://issues.apache.org/jira/browse/YARN-10693
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10693.001.patch, YARN-10693.002.patch, 
> YARN-10693.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313218#comment-17313218
 ] 

Peter Bacsko commented on YARN-10637:
-

Thanks [~zhuqi] I think it's good then.

[~gandras] do you have any comments?

> We should support fs to cs support for auto refresh queues when conf changed, 
> after YARN-10623 finished.
> 
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this, when  YARN-10623 finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313192#comment-17313192
 ] 

Peter Bacsko commented on YARN-10726:
-

Ah, I already committed the change. Let's hope Jenkins comes back green :)

+1

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313189#comment-17313189
 ] 

Peter Bacsko commented on YARN-10726:
-

"hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer" - this 
is unrelated I believe. This test case has been failing for a long time.

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313184#comment-17313184
 ] 

Peter Bacsko commented on YARN-10637:
-

Thanks [~zhuqi] this makes sense. Is this always enabled in Fair Scheduler? 
Because we should only add this policy if auto-refresh is enabled on the 
FS-side.

> We should support fs to cs support for auto refresh queues when conf changed, 
> after YARN-10623 finished.
> 
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this, when  YARN-10623 finished.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313138#comment-17313138
 ] 

Peter Bacsko commented on YARN-10726:
-

This is from {{AsyncDispatcher}}:

{noformat}
if (qSize != 0 && qSize % 1000 == 0
    && lastEventQueueSizeLogged != qSize) {
  lastEventQueueSizeLogged = qSize;
  LOG.info("Size of event-queue is " + qSize);
}
{noformat}

Please update the code to use a {{lastEventQueueSizeLogged}} guard in the same way.
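
Roughly like this on the DelegationTokenRenewer side (just a sketch; the method name and the way the queue size is obtained are made up, not the actual patch):
{noformat}
// Sketch only: qSize is assumed to be the current size of the renewer's pending event queue.
private int lastEventQueueSizeLogged = 0;

private void logPendingEventQueueSize(int qSize) {
  // Log only when the size hits a multiple of 1000 that has not been logged yet,
  // so an oscillating queue does not flood the log.
  if (qSize != 0 && qSize % 1000 == 0
      && lastEventQueueSizeLogged != qSize) {
    lastEventQueueSizeLogged = qSize;
    LOG.info("Size of DelegationTokenRenewer event queue is " + qSize);
  }
}
{noformat}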

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123
 ] 

Peter Bacsko edited comment on YARN-10726 at 4/1/21, 12:01 PM:
---

Thanks [~zhuqi]. I think it's a good idea. My only concern (which might not be 
valid) is that if we have too many events, this code can run too 
frequently. For example, if the size goes 998, 998, 999, 1000, 1001, 1002, it 
prints at 1000; then events get consumed, the size drops from 1000 to 990, and 
once it climbs back to 1000 it prints the size again.

I think we should limit how often we print this message. We shouldn't log it too 
often; I'm not sure how we do this in other parts of the code. I'll check what 
can be the best solution.
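
For example, a simple time-based guard could work (just a sketch; the interval and field names are made up):
{noformat}
// Log the queue size at most once per minute (interval chosen arbitrarily here).
private static final long QUEUE_SIZE_LOG_INTERVAL_MS = 60 * 1000;
private long lastQueueSizeLogTime = 0;

private void maybeLogQueueSize(int qSize) {
  long now = System.currentTimeMillis();
  if (qSize >= 1000 && now - lastQueueSizeLogTime >= QUEUE_SIZE_LOG_INTERVAL_MS) {
    lastQueueSizeLogTime = now;
    LOG.info("Size of DelegationTokenRenewer event queue is " + qSize);
  }
}
{noformat}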


was (Author: pbacsko):
Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have 
too many events, this code can run too frequently. For example, if the size 
goes 998, 998, 999, 1000, 1001, 1002, it prints at 1000; then events get 
consumed, the size drops from 1000 to 990, and once it climbs back to 1000 it 
prints the size again.

I think we should limit how often we print this message. We shouldn't log it too 
often; I'm not sure how we do this in other parts of the code. I'll check what 
can be the best solution.

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123
 ] 

Peter Bacsko commented on YARN-10726:
-

Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have 
too many events, this code can run too frequently. For example, if the size 
goes 998, 998, 999, 1000, 1001, 1002, it prints at 1000; then events get 
consumed, the size drops from 1000 to 990, and once it climbs back to 1000 it 
prints the size again.

I think we should limit how often we print this message. We shouldn't log it too 
often; I'm not sure how we do this in other parts of the code. I'll check what 
can be the best solution.

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10726:

Summary: Log the size of DelegationTokenRenewer event queue in case of too 
many pending events  (was: We should log size of pending 
DelegationTokenRenewerEvent queue, when pending too many events.)

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9618) NodesListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313105#comment-17313105
 ] 

Peter Bacsko commented on YARN-9618:


Thanks for the patch [~zhuqi] and [~gandras] for the review, I committed this 
to trunk.

> NodesListManager event improvement
> --
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current NodesListManager event implementation blocks the async dispatcher and 
> can cause RM crashes and slow down event processing.
> # Cluster restart with 1K running apps: each usable event will create 1K 
> events, so overall it could be 5k*1k events for a 5K-node cluster.
> # Event processing is blocked till new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9618) NodesListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9618:
---
Summary: NodesListManager event improvement  (was: NodeListManager event 
improvement)

> NodesListManager event improvement
> --
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current NodesListManager event implementation blocks the async dispatcher and 
> can cause RM crashes and slow down event processing.
> # Cluster restart with 1K running apps: each usable event will create 1K 
> events, so overall it could be 5k*1k events for a 5K-node cluster.
> # Event processing is blocked till new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312989#comment-17312989
 ] 

Peter Bacsko commented on YARN-9618:


+1 LGTM

[~gandras] are you OK with the patch?

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current NodesListManager event implementation blocks the async dispatcher and 
> can cause RM crashes and slow down event processing.
> # Cluster restart with 1K running apps: each usable event will create 1K 
> events, so overall it could be 5k*1k events for a 5K-node cluster.
> # Event processing is blocked till new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312945#comment-17312945
 ] 

Peter Bacsko commented on YARN-10720:
-

+1

thanks [~zhuqi] for the patch, committed to trunk.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following shows the proxy server with {color:#de350b}too many connections from one 
> client{color}; this caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following shows the abnormal AM; the proxy server doesn't know it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support in the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I kill the abnormal AM, the proxy server becomes healthy. This has 
> happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support in the web proxy server in this Jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10720:

Summary: YARN WebAppProxyServlet should support connection timeout to 
prevent proxy server from hanging  (was: YARN WebAppProxyServlet should support 
connection timeout to prevent proxy server hang.)

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following shows the proxy server with {color:#de350b}too many connections from one 
> client{color}; this caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following shows the abnormal AM; the proxy server doesn't know it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support in the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I kill the abnormal AM, the proxy server becomes healthy. This has 
> happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support in the web proxy server in this Jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312516#comment-17312516
 ] 

Peter Bacsko commented on YARN-9618:


Small things:

1.
{noformat}
//Is trigger RMAppNodeUpdateEvent
private Boolean isRMAppEvent = false;
//Is trigger NodesListManagerEvent
private Boolean isNodesListEvent = false;
{noformat}
a) No need for comments
 b) use ordinary "boolean" instead of "Boolean" (also, init to "false" is not 
necessary, it is "false" by default because it's dictated by the JVM spec).

 

2.
{noformat}
Assert.assertFalse(getIsRMAppEvent());
Assert.assertTrue(getIsNodesListEvent());
{noformat}
Add some assertion message here, like
{noformat}
Assert.assertFalse("Got unexpected RM app event", getIsRMAppEvent());
Assert.assertTrue("Received no NodesListManagerEvent", getIsNodesListEvent());
{noformat}
3. Return values of {{getIsNodesListEvent()}} and {{getIsRMAppEvent()}} should 
be just "boolean".
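
Putting 1) and 3) together, the fields and getters could simply be (sketch, reusing the names from the snippets above):
{noformat}
private boolean isRMAppEvent;      // defaults to false, no explicit init needed
private boolean isNodesListEvent;  // defaults to false, no explicit init needed

public boolean getIsRMAppEvent() {
  return isRMAppEvent;
}

public boolean getIsNodesListEvent() {
  return isNodesListEvent;
}
{noformat}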

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch
>
>
> The current NodesListManager event implementation blocks the async dispatcher and 
> can cause RM crashes and slow down event processing.
> # Cluster restart with 1K running apps: each usable event will create 1K 
> events, so overall it could be 5k*1k events for a 5K-node cluster.
> # Event processing is blocked till new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312492#comment-17312492
 ] 

Peter Bacsko commented on YARN-10720:
-

{noformat}
  } catch (InterruptedException e) {
    LOG.warn("doGet() interrupted", e);
    resp.setStatus(HttpServletResponse.SC_BAD_REQUEST);
  }
  resp.setStatus(HttpServletResponse.SC_OK);
}
{noformat}

This is not good - you set the response status to {{SC_BAD_REQUEST}} only to 
override it with {{SC_OK}}. You need a "return".
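
Something like this would be fine (sketch based on the snippet above):
{noformat}
} catch (InterruptedException e) {
  LOG.warn("doGet() interrupted", e);
  resp.setStatus(HttpServletResponse.SC_BAD_REQUEST);
  return;  // don't fall through and overwrite the status with SC_OK
}
resp.setStatus(HttpServletResponse.SC_OK);
{noformat}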

{noformat}
try {
  servlet.init(config);
} catch (ServletException e) {
  LOG.error(e.getMessage());
  fail("Failed to init servlet");
}

try {
  servlet.doGet(request, response);
} catch (ServletException e) {
  LOG.error(e.getMessage());
  fail("ServletException thrown during doGet.");
}
  }
{noformat}

You can remove the try-catch here and just add {{throws ServletException}}. If that 
exception is thrown for whatever reason, it will be a test error (which is desired - 
checking whether the servlet can init is not the purpose of the test), not a test 
failure.
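
So the test body can simply be something like this (sketch; the method name is a placeholder):
{noformat}
@Test
public void testDoGet() throws ServletException, IOException {
  servlet.init(config);
  servlet.doGet(request, response);
  // If init() or doGet() throws, JUnit reports a test error rather than a
  // failure, which is exactly what we want here.
}
{noformat}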

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server hang.
> ---
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> image-2021-03-29-14-04-33-776.png, image-2021-03-29-14-05-32-708.png
>
>
> The following shows the proxy server with {color:#de350b}too many connections from one 
> client{color}; this caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following shows the abnormal AM; the proxy server doesn't know it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support in the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I kill the abnormal AM, the proxy server becomes healthy. This has 
> happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support in the web proxy server in this Jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312253#comment-17312253
 ] 

Peter Bacsko commented on YARN-10720:
-

Thanks [~zhuqi] for the patch.

1. As you said, {{ExpectedException.none()}} has been deprecated. Either use the 
new {{assertThrows()}} or {{@Test(expected = SocketTimeoutException.class)}}; I 
think the second is easier (see the sketch after these comments).

2.
{noformat}
conf.setInt(YarnConfiguration.RM_PROXY_CONNECTION_TIMEOUT,
1 * 1000);
{noformat}
Just write "1000" instead of "1 * 1000".

3.
{noformat}
try {
  when(response.getOutputStream()).thenReturn(null);
} catch (IOException e) {
  e.printStackTrace();
}
{noformat}
Unnecessary try-catch block. The method already has a {{throws}} clause.

4.
{noformat}
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws ServletException, IOException {
  try {
    Thread.sleep(10 * 1000);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  resp.setStatus(HttpServletResponse.SC_OK);
}
{noformat}
Maybe a minor thing, but if you catch {{InterruptedException}}, don't just 
print the stack trace, log it with {{LOG.warn("doGet() interrupted", e)}}. In 
this case, I'd also return with {{HttpServletResponse.SC_BAD_REQUEST}}.

5. 
 {{The web proxy connection timeout, default is 60s(60 * 
1000ms).}}

This already goes to {{yarn-default.xml}}, so you can omit the part "default is 
60s(60 * 1000ms)" and just write "The web proxy connection timeout".
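
As for 1), the expected-exception variant could look roughly like this (sketch; the test method name and setup are placeholders, only the config key is taken from the snippet above):
{noformat}
@Test(expected = SocketTimeoutException.class)
public void testWebAppProxyServletConnectionTimeout() throws Exception {
  YarnConfiguration conf = new YarnConfiguration();
  conf.setInt(YarnConfiguration.RM_PROXY_CONNECTION_TIMEOUT, 1000);
  // ... start the proxy with this conf and issue a request against an AM
  // that sleeps longer than the timeout; the request should time out.
}
{noformat}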

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server hang.
> ---
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following shows the proxy server with {color:#de350b}too many connections from one 
> client{color}; this caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following shows the abnormal AM; the proxy server doesn't know it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support in the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I kill the abnormal AM, the proxy server becomes healthy. This has 
> happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support in the web proxy server in this Jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312203#comment-17312203
 ] 

Peter Bacsko commented on YARN-10718:
-

Committed to trunk. Closing.

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps the value with "<" and ">", so it's 
> wrong to add them again.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10718:

Labels: resourcemanager  (was: )

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: resourcemanager
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps the value with "<" and ">", so it's 
> wrong to add them again.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10718:

Labels: capacity-scheduler capacityscheduler  (was: resourcemanager)

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps the value with "<" and ">", so it's 
> wrong to add them again.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312195#comment-17312195
 ] 

Peter Bacsko commented on YARN-10718:
-

Thanks [~zhuqi], +1 LGTM.

Will commit this soon.

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps the value with "<" and ">", so it's 
> wrong to add them again.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307605#comment-17307605
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307602#comment-17307602
 ] 

Peter Bacsko commented on YARN-10674:
-

+1 LGTM. I'm going to commit this soon.

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10674:

Summary: fs2cs should generate auto-created queue deletion properties  
(was: fs2cs: should support auto created queue deletion.)

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306240#comment-17306240
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] I had a discussion with [~gandras], he will post an update soon.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10645) Fix queue state related update for auto created queue.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306203#comment-17306203
 ] 

Peter Bacsko commented on YARN-10645:
-

[~zhuqi] [~gandras] is this patch still needed? Looking at Andras' comment, it 
seems to me that this ticket is a duplicate. Is it a dup? 

> Fix queue state related update for auto created queue.
> --
>
> Key: YARN-10645
> URL: https://issues.apache.org/jira/browse/YARN-10645
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10645.001.patch
>
>
> Now the queue state of auto-created queues can't be updated after the refactor in 
> YARN-10504.
> We should fix the queue state related logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306157#comment-17306157
 ] 

Peter Bacsko commented on YARN-10503:
-

The question is this part:

{noformat}
public enum AbsoluteResourceType {
MEMORY, VCORES, GPUS, FPGAS
}
{noformat}

Do we want to treat GPUs and FPGAs like that? In other parts of the code, we 
have mem/vcore as primary resources, then an array of other resources.  For 
example, constructors from {{org.apache.hadoop.yarn.api.records.Resource}}:

{noformat}
  @Public
  @Stable
  public static Resource newInstance(long memory, int vCores,
      Map<String, Long> others) {
    if (others != null) {
      return new LightWeightResource(memory, vCores,
          ResourceUtils.createResourceTypesArray(others));
    } else {
      return newInstance(memory, vCores);
    }
  }

  @InterfaceAudience.Private
  @InterfaceStability.Unstable
  public static Resource newInstance(Resource resource) {
    Resource ret;
    int numberOfKnownResourceTypes = ResourceUtils
        .getNumberOfKnownResourceTypes();
    if (numberOfKnownResourceTypes > 2) {
      ret = new LightWeightResource(resource.getMemorySize(),
          resource.getVirtualCores(), resource.getResources());
    } else {
      ret = new LightWeightResource(resource.getMemorySize(),
          resource.getVirtualCores());
    }
    return ret;
  }
{noformat}

But with this modification, we sort of promote GPU and FPGA to the level of 
vcore and memory, at least from the perspective of the code, and it also becomes 
inconsistent with the existing code.
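
For comparison, a custom resource like GPU is normally carried through the generic {{others}} map rather than promoted next to memory/vcores; a minimal usage sketch ({{yarn.io/gpu}} is the standard GPU resource name, the numbers are arbitrary):
{noformat}
Map<String, Long> others = new HashMap<>();
others.put("yarn.io/gpu", 4L);  // the custom resource stays in the generic map
Resource resource = Resource.newInstance(8192, 4, others);  // memory and vcores stay first-class
{noformat}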

This is just my opinion though. cc [~epayne] [~ebadger].

> Support queue capacity in terms of absolute resources with gpu resourceType.
> 
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands for 
> different resource types.
>  
> This Jira will handle GPU first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306154#comment-17306154
 ] 

Peter Bacsko commented on YARN-10704:
-

Thanks [~zhuqi] I have some minor comments:

1.
{noformat}
sb.append("
{noformat}

> The CS effective capacity for absolute mode in UI should support GPU and 
> other custom resources.
> 
>
> Key: YARN-10704
> URL: https://issues.apache.org/jira/browse/YARN-10704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10704.001.patch, YARN-10704.002.patch, 
> image-2021-03-19-12-05-28-412.png, image-2021-03-19-12-08-35-273.png
>
>
> Actually there is no information about the effective GPU capacity in the 
> UI for absolute resource mode.
> !image-2021-03-19-12-05-28-412.png|width=873,height=136!
> But we have this information in QueueMetrics:
> !image-2021-03-19-12-08-35-273.png|width=613,height=268!
>  
> It's very important for our GPU users running in absolute mode; there is still 
> no way to see absolute GPU information in the CS queue UI. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups

2021-03-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971
 ] 

Peter Bacsko edited comment on YARN-10597 at 3/19/21, 3:35 PM:
---

[~shuzirra] is it really that simple? You told me that there were a bunch of unit 
test failures when you tried to change it months back. Anyway, it's great news 
if the change is tiny.


was (Author: pbacsko):
[~shuzirra] is it really that simple? You told me that there were a bunch of unit 
test failures. Anyway, it's great news if the change is tiny.

> CSMappingPlacementRule should not create new instance of Groups
> ---
>
> Key: YARN-10597
> URL: https://issues.apache.org/jira/browse/YARN-10597
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10597.001.patch
>
>
> As [~ahussein] pointed out in YARN-10425, no new Groups instance should be 
> created.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups

2021-03-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971
 ] 

Peter Bacsko commented on YARN-10597:
-

[~shuzirra] is it really that simple? You told me that there were a bunch of unit 
test failures. Anyway, it's great news if the change is tiny.

> CSMappingPlacementRule should not create new instance of Groups
> ---
>
> Key: YARN-10597
> URL: https://issues.apache.org/jira/browse/YARN-10597
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10597.001.patch
>
>
> As [~ahussein] pointed out in YARN-10425, no new Groups instance should be 
> created.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10641) Refactor the max app related update, and fix maxApllications update error when add new queues.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304117#comment-17304117
 ] 

Peter Bacsko commented on YARN-10641:
-

+1

Thanks for the patch [~zhuqi] and [~gandras] for the review. Committed to trunk.

> Refactor the max app related update, and fix maxApllications update error 
> when add new queues.
> --
>
> Key: YARN-10641
> URL: https://issues.apache.org/jira/browse/YARN-10641
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10641.001.patch, YARN-10641.002.patch, 
> YARN-10641.003.patch, YARN-10641.004.patch, YARN-10641.005.patch, 
> YARN-10641.006.patch, image-2021-02-20-15-49-58-677.png, 
> image-2021-02-20-15-53-51-099.png, image-2021-02-20-15-55-44-780.png, 
> image-2021-02-20-16-29-18-519.png, image-2021-02-20-16-31-13-714.png
>
>
> When the update logic was refactored in YARN-10504, the max applications update 
> based on absolute/percentage capacity became wrong. This should be fixed, 
> because max applications is a key part of limiting applications in CS.
> For example: 
> When adding a dynamic queue, the max app of the parent queue's other children is 
> not updated correctly:
> !image-2021-02-20-15-53-51-099.png|width=639,height=509!  
> The newly added queue's max app is updated correctly:
> !image-2021-02-20-15-55-44-780.png|width=542,height=426!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304089#comment-17304089
 ] 

Peter Bacsko commented on YARN-10692:
-

Thanks [~zhuqi] for the patch, committed to trunk.

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch, 
> YARN-10692.003.patch
>
>
> Currently there is no node-level GPU utilization metric; this issue will add it, 
> applying it to NodeMetrics first.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304078#comment-17304078
 ] 

Peter Bacsko commented on YARN-10692:
-

+1 LGTM.

Committing this soon.

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch, 
> YARN-10692.003.patch
>
>
> Currently there is no node-level GPU utilization metric; this issue will add it, 
> applying it to NodeMetrics first.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10685) Fix typos in AbstractCSQueue

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304041#comment-17304041
 ] 

Peter Bacsko commented on YARN-10685:
-

+1 thanks [~zhuqi] for the patch, committed to trunk.

> Fix typos in AbstractCSQueue
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch, YARN-10685.002.patch, 
> YARN-10685.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10685) Fix typos in AbstractCSQueue

2021-03-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10685:

Summary: Fix typos in AbstractCSQueue  (was: Fixed some Typo  in 
AbstractCSQueue.)

> Fix typos in AbstractCSQueue
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch, YARN-10685.002.patch, 
> YARN-10685.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304027#comment-17304027
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi] for the patch. I think we are very close.

I still have some comments:
 1.
{noformat}
  private FSConfigToCSConfigConverterParams.
  PreemptionMode disablePreemption;
  private FSConfigToCSConfigConverterParams.
  PreemptionMode preemptionMode;
{noformat}
We don't need two enums. We need only one which covers all states (enabled / 
observeonly / nopolicy).

You can extend {{PreemptionMode}} with a new variable which says whether it's 
enabled or disabled:
{noformat}
  public enum PreemptionMode {
    ENABLE("enable", true),
    NO_POLICY("nopolicy", false),
    OBSERVE_ONLY("observeonly", false);

    private String cliOption;
    private boolean enabled;

    PreemptionMode(String cliOption, boolean enabled) {
      this.cliOption = cliOption;
      this.enabled = enabled;
    }

    public String getCliOption() {
      return cliOption;
    }

    public boolean isEnabled() {
      return enabled;
    }
  }
{noformat}
So you just call {{preemptionMode.isEnabled()}} and don't need two variables 
just to hold the information whether it's enabled or not.

2. {{public static PreemptionMode fromString(String cliOption)}} --> this 
method never returns ENABLED, which is important (also, pls change "ENABLE" to 
"ENABLED", note the "D" at the end).

cc [~gandras] please review patch v14.
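
For point 2, a rough sketch of how that lookup could work (hypothetical, not 
the patch code; it assumes the constant is renamed to ENABLED as suggested 
above and reuses the {{getCliOption()}} accessor from the enum):
{noformat}
  // Sketch: map the CLI string to the enum; ENABLED when no option was given.
  public static PreemptionMode fromString(String cliOption) {
    if (cliOption == null || cliOption.trim().isEmpty()) {
      return ENABLED;
    }
    for (PreemptionMode mode : values()) {
      if (mode.getCliOption().equalsIgnoreCase(cliOption.trim())) {
        return mode;
      }
    }
    throw new IllegalArgumentException("Unknown preemption mode: " + cliOption);
  }
{noformat}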

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542
 ] 

Peter Bacsko edited comment on YARN-10692 at 3/17/21, 4:11 PM:
---

Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}
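
(Untested sketch, just an idea: the same average could also be computed with a 
primitive stream, which avoids the boxing; {{gpuList}} and the getters are the 
same as in the snippet above.)
{noformat}
float nodeGpuUtilization = 0;
if (gpuList != null && !gpuList.isEmpty()) {
  // average() already divides by the element count, so no manual division
  nodeGpuUtilization = (float) gpuList.stream()
      .mapToDouble(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .average()
      .orElse(0.0);
}
return nodeGpuUtilization;
{noformat}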

Also, you should consider renaming "totalGpuUtilization" to 
"nodeGpuUtilization" so that it matches the method name.


was (Author: pbacsko):
Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch
>
>
> Currently there is no node-level GPU utilization; this issue will add it, 
> applying it to NodeMetrics first.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542
 ] 

Peter Bacsko commented on YARN-10692:
-

Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch
>
>
> Currently there is no node-level GPU utilization; this issue will add it, 
> applying it to NodeMetrics first.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10497:

Labels: capacity-scheduler capacityscheduler  (was: )

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Fix For: 3.4.0
>
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> You can notice there're spaces between default, q1, q2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes a matching issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303365#comment-17303365
 ] 

Peter Bacsko commented on YARN-10497:
-

+1

Thanks [~wangda] / [~zhuqi] for the patch and [~gandras], [~shuzirra]  for the 
review. Committed to trunk.

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> You can notice there're spaces between default, q1, q2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes a matching issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303342#comment-17303342
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] good suggestions, thanks! [~zhuqi] please apply the suggested 
modifications. 

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303245#comment-17303245
 ] 

Peter Bacsko commented on YARN-10497:
-

I think it's good. Let's wait for Jenkins and I'll commit it.

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
> proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> } 
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>  
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) 
> {
> String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
> String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
> parentQueue + CapacitySchedulerConfiguration.DOT +
> CapacitySchedulerConfiguration.QUEUES;
> return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's capacity-scheduler.xml I got
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> You can notice there're spaces between default, q1, q2
> So conf.getStringCollection returns:
> {code:java}
> default
> q1
> ...
> {code}
> Which causes a matching issue when we try to delete the queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303222#comment-17303222
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] do you have further comments? I think the patch is in good shape now.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878
 ] 

Peter Bacsko edited comment on YARN-10370 at 3/16/21, 8:36 PM:
---

[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that this feature is ready and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?


was (Author: pbacsko):
[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that the umbrella is done and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?

> [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue 
> Mapping rules
> ---
>
> Key: YARN-10370
> URL: https://issues.apache.org/jira/browse/YARN-10370
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: MappingRuleEnhancements.pdf, Possible extensions of 
> mapping rule format in Capacity Scheduler.pdf
>
>
> To continue closing the feature gaps between Fair Scheduler and Capacity 
> Scheduler, and to help users migrate between the schedulers more easily, we need to 
> add some of the Fair Scheduler placement rules to the capacity scheduler's 
> queue mapping functionality.
> With [~snemeth] and [~pbacsko] we've created the following design docs about 
> the proposed changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878
 ] 

Peter Bacsko commented on YARN-10370:
-

[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that the umbrella is done and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?

> [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue 
> Mapping rules
> ---
>
> Key: YARN-10370
> URL: https://issues.apache.org/jira/browse/YARN-10370
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: MappingRuleEnhancements.pdf, Possible extensions of 
> mapping rule format in Capacity Scheduler.pdf
>
>
> To continue closing the feature gaps between Fair Scheduler and Capacity 
> Scheduler, and to help users migrate between the schedulers more easily, we need to 
> add some of the Fair Scheduler placement rules to the capacity scheduler's 
> queue mapping functionality.
> With [~snemeth] and [~pbacsko] we've created the following design docs about 
> the proposed changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302599#comment-17302599
 ] 

Peter Bacsko commented on YARN-10686:
-

+1

Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> Fix 
> TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
> -
>
> Key: YARN-10686
> URL: https://issues.apache.org/jira/browse/YARN-10686
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10686.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode

2021-03-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10686:

Summary: Fix 
TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
  (was: Fix testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode user 
error.)

> Fix 
> TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
> -
>
> Key: YARN-10686
> URL: https://issues.apache.org/jira/browse/YARN-10686
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10686.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302567#comment-17302567
 ] 

Peter Bacsko commented on YARN-10682:
-

+1

Thanks for the patch [~zhuqi] and [~gandras] for the review, committed to trunk.

> The scheduler monitor policies conf should trim values separated by comma
> -
>
> Key: YARN-10682
> URL: https://issues.apache.org/jira/browse/YARN-10682
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10682.001.patch
>
>
> When I configured the scheduler monitor policies with spaces, the RM failed 
> to start.
> The conf should trim the values around "," , for example:
> "a,b,c" is supported now, but "a,   b,  c" is not; this jira just adds the 
> trim.
>  
> It happened when testing multiple policies:
>  
> <property>
>   <name>yarn.resourcemanager.scheduler.monitor.policies</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy,
>     org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy</value>
> </property>
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma

2021-03-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10682:

Summary: The scheduler monitor policies conf should trim values separated 
by comma  (was: The scheduler monitor policies conf should support trim between 
",".)

> The scheduler monitor policies conf should trim values separated by comma
> -
>
> Key: YARN-10682
> URL: https://issues.apache.org/jira/browse/YARN-10682
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10682.001.patch
>
>
> When I configured the scheduler monitor policies with spaces, the RM failed 
> to start.
> The conf should trim the values around "," , for example:
> "a,b,c" is supported now, but "a,   b,  c" is not; this jira just adds the 
> trim.
>  
> It happened when testing multiple policies:
>  
> <property>
>   <name>yarn.resourcemanager.scheduler.monitor.policies</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy,
>     org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy</value>
> </property>
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302548#comment-17302548
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi], this definitely looks better. We're close to the final 
version.

Some comments:
 1.
{noformat}
Disable the preemption with nopolicy or observeonly mode, " +
"default mode is nopolicy with no arg." +
"When use nopolicy arg, it means to remove " +
"ProportionalCapacityPreemptionPolicy for CS preemption, " +
"When use observeonly arg, " +
"it means to set " +

"yarn.resourcemanager.monitor.capacity.preemption.observe_only " +
"to true"
{noformat}
I'd like to slightly modify this text:
{noformat}
Disable the preemption with \"nopolicy\" or \"observeonly\" mode.
Default is \"nopolicy\".
\"nopolicy\" removes ProportionalCapacityPreemptionPolicy from
the list of monitor policies.
\"observeonly\" sets
\"yarn.resourcemanager.monitor.capacity.preemption.observe_only\"
to true.
{noformat}
2. This definition:
 {{private String disablePreemptionMode;}}

This should be a simple enum like:
{noformat}
public enum DisablePreemptionMode {
  OBSERVE_ONLY {
    @Override
    String getCliOption() {
      return "observeonly";
    }
  },
  NO_POLICY {
    @Override
    String getCliOption() {
      return "nopolicy";
    }
  };

  abstract String getCliOption();
}
{noformat}
So you can also use them here:
{noformat}
  private static void checkDisablePreemption(CliOption cliOption,
      String disablePreemptionMode) {
    if (disablePreemptionMode == null ||
        disablePreemptionMode.trim().isEmpty()) {
      // The default mode is nopolicy.
      return;
    }

    try {
      DisablePreemptionMode.valueOf(disablePreemptionMode);
    } catch (IllegalArgumentException e) {
      throw new PreconditionException(
          String.format("Specified disable-preemption option %s is illegal, " +
              "use \"nopolicy\" or \"observeonly\"", disablePreemptionMode));
    }
  }
{noformat}
"disablePreemptionMode" should be an enum everywhere.

3.
{noformat}
  public void convertSiteProperties(Configuration conf,
      Configuration yarnSiteConfig, boolean drfUsed,
      boolean enableAsyncScheduler, boolean userPercentage,
      boolean disablePreemption, String disablePreemptionMode) {
{noformat}
Here "disablePreemptionMode" should be an enum also and make sure that it 
always has a value. If it always has a value, this part becomes much simpler:
{noformat}
  if (disablePreemption &&
      disablePreemptionMode == DisablePreemptionMode.NO_POLICY) {
    yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, "");
  }
}
{noformat}
4.
 {{AutoCreatedQueueDeletionPolicy.class.getCanonicalName())}}

This string is referenced very often in the tests. Instead, use a final String:
{noformat}
private static final String DELETION_POLICY_CLASS =
   AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
{noformat}
So the readability becomes much better.
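
One more note on point 2: {{Enum.valueOf}} matches the constant name (e.g. 
"NO_POLICY"), not the CLI string ("nopolicy"), so the validation there probably 
needs a lookup by {{getCliOption()}}. A rough sketch of what I mean 
(hypothetical, adjust to the actual patch):
{noformat}
  // Sketch: resolve the enum by CLI string instead of valueOf(),
  // since valueOf("nopolicy") would not match NO_POLICY.
  static DisablePreemptionMode fromCliOption(String cliOption) {
    for (DisablePreemptionMode mode : values()) {
      if (mode.getCliOption().equals(cliOption.trim())) {
        return mode;
      }
    }
    throw new IllegalArgumentException(
        "Unknown disable-preemption mode: " + cliOption);
  }
{noformat}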

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-12 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-12 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/12/21, 1:34 PM:
---

[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignores the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.


was (Author: pbacsko):
[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/11/21, 3:25 PM:
---

Ok, I did some research, I think we have 3 options to completely disable 
preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?


was (Author: pbacsko):
Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/11/21, 2:43 PM:
---

Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?


was (Author: pbacsko):
Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299466#comment-17299466
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] yes that's right.

This is the default setting for policies:

{noformat}
  
<property>
  <description>The list of SchedulingEditPolicy classes that interact with
    the scheduler. A particular module may be incompatible with the
    scheduler, other policies, or a configuration of either.
  </description>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
{noformat}

This is from {{yarn-default.xml}}. So when we don't use preemption, we should 
remove this policy.
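
To illustrate (just a sketch of the expected converter output, not taken from 
any patch), the generated yarn-site.xml for a "preemption off" conversion would 
simply leave that policy out of the list:
{noformat}
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <!-- ProportionalCapacityPreemptionPolicy intentionally not listed -->
  <value></value>
</property>
{noformat}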

But we actually have to think a little bit, because how we disable preemption 
affects our downstream Hadoop codebase. So let's wait until we figure out what 
is the best solution to turn off preemption.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299456#comment-17299456
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] hm - that's true. I just overcomplicated the whole thing (not 
that preemption in general is easy to begin with).

Yes, we don't need it if we don't have the policy.

[~zhuqi] please wait with the new patch. What Andras said is correct, but there 
might be other changes that I'll recommend.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299427#comment-17299427
 ] 

Peter Bacsko commented on YARN-10674:
-

I'll do a deeper review today.

[~gandras] you say: "Is setting observe only necessary here? This is an 
extremely subtle property.".

I'm not sure how subtle it is, but it is mentioned in the upstream 
documentation:
|{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}}|If true, run 
the policy but do not affect the cluster with preemption and kill events. 
Default value is false|

However, if someone thinks that disabling preemption for "root" is a better 
solution, I'm not against that. We might need other folks to chime in and share 
their thoughts.
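
For the record, a sketch of what that would mean in practice (I believe this 
goes into capacity-scheduler.xml rather than yarn-site.xml):
{noformat}
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.observe_only</name>
  <value>true</value>
</property>
{noformat}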

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10685) Fixed some Typo in AbstractCSQueue.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298874#comment-17298874
 ] 

Peter Bacsko commented on YARN-10685:
-

Sure, I'll check it out.

> Fixed some Typo  in AbstractCSQueue.
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298861#comment-17298861
 ] 

Peter Bacsko commented on YARN-10571:
-

[~gandras] thanks for the patch.

I just have one question: the class {{CapacitySchedulerAutoQueueHandler}} was 
renamed to {{CapacitySchedulerQueueHandler}}. But the latter is telling me that 
this is a class which handles all kinds of queues, not just auto-created queues. 
Wouldn't it make sense to keep the original name? Even the instance is called 
{{autoQueueHandler}}.

Also, there's a Javadoc and a checkstyle problem.

> Refactor dynamic queue handling logic
> -
>
> Key: YARN-10571
> URL: https://issues.apache.org/jira/browse/YARN-10571
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10571.001.patch
>
>
> As per YARN-10506 we have introduced another mode for auto queue creation 
> and a new class, which handles it. We should move the old, managed queue 
> related logic to CSAutoQueueHandler as well, and do additional cleanup 
> regarding queue management.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298842#comment-17298842
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok, here is what I found:

1. {{RM_SCHEDULER_ENABLE_MONITORS}} --> ok, this can be set to "true" in all 
cases.

2. If FS preemption is disabled ({{yarn.scheduler.fair.preemption}} = 
{{false}}) --> there is a property which is better than configuring the 
"root" queue: we should generate 
{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}} = {{true}}. 
This means that the monitor thread is still running, but we don't do any 
preemption. So we don't need to set "root.disable_preemption".

3. As I mentioned, the {{Configuration}} object is empty. The problem is, in 
order to use the preemption, we need to set the preemption policy, which is 
missing right now. So, if FS preemption is enabled, this line must be added:


{noformat}
   if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
       FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
     yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
         ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
     ...
{noformat}

So, the modified code should look like this:

{noformat}
   yarnSiteConfig.setBoolean(
       YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

   if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
       FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
     yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
         ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
     ...
   } else {
     // no preemption
     yarnSiteConfig.setBoolean(
         CapacitySchedulerConfiguration.PREEMPTION_OBSERVE_ONLY, true);
   }

   // new code comes here
   if (!userPercentage) {
     String policies =
         yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
     if (policies == null) {
     ...
{noformat}


Please modify the test cases accordingly and the checkstyle issues also.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/10/21, 12:03 PM:


[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is really acceptable.


was (Author: pbacsko):
[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is good for us.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is good for us.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298156#comment-17298156
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] this is very interesting. If we set RM monitors to enabled, the 
scheduler treats system-wide preemption as enabled, too:

AbstractCSQueue:
{noformat}
private boolean isQueueHierarchyPreemptionDisabled(CSQueue q,
    CapacitySchedulerConfiguration configuration) {
  boolean systemWidePreemption =
      csContext.getConfiguration()
          .getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
              YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS);
  CSQueue parentQ = q.getParent();

  // If the system-wide preemption switch is turned off, all of the queues in
  // the qPath hierarchy have preemption disabled, so return true.
  if (!systemWidePreemption) return true;
{noformat}
However, you already added a policy in YARN-10623, so it looks like this 
property always has to be enabled in weight mode. But what if we convert an FS 
configuration in which preemption is completely disabled?

I think the best thing we can do right now is to disable preemption for 
"root", which then propagates down to the rest of the queue hierarchy.

So I suggest the following approach:
 1. In percentage conversion mode, do not enable RM monitors by default, 
because it's not needed.
 2. In weight mode (which is the default now), we have to enable it. But if 
"yarn.scheduler.fair.preemption" is false, then 
"yarn.scheduler.capacity.root.disable_preemption" must be set to true, but only 
for "root". This can be done in {{FSQueueConverter}}.

cc [~bteke] [~gandras] [~snemeth], not sure if this is a good approach, but I 
can't see anything better.
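
Just to illustrate point 2, a minimal sketch of what the converter-side change 
could look like -- the {{fsConfig}} / {{capacitySchedulerConfig}} handles and 
the exact place inside {{FSQueueConverter}} are assumptions, the property key 
is the one mentioned above:

{code:java}
// Hypothetical sketch: disable preemption on "root" only; the setting then
// propagates down the rest of the queue hierarchy.
if (!fsConfig.getBoolean(FairSchedulerConfiguration.PREEMPTION,
    FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  capacitySchedulerConfig.setBoolean(
      "yarn.scheduler.capacity.root.disable_preemption", true);
}
{code}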

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298112#comment-17298112
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/9/21, 3:23 PM:
--

[~zhuqi] I have the following comments:

1. This change seems to always enable "RM monitors":
{noformat}
// This should be always true to trigger dynamic queue auto deletion
// when expired.
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{noformat}
But I don't think this is necessary. We need to enable it in two cases: 
preemption is enabled OR we're in weight mode. We don't have auto-queue delete 
in percentage mode (fs2cs can still convert to percentages with a command line 
switch).
 So I suggest that you pass an extra boolean "usePercentages".

Invocation from {{FSConfigToCSConfigConverter}}:
{noformat}
siteConverter.convertSiteProperties(inputYarnSiteConfig,
    convertedYarnSiteConfig, drfUsed,
    conversionOptions.isEnableAsyncScheduler(), usePercentages);  // <-- last argument is new
{noformat}
Then in the site converter:
{noformat}
if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
    FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  preemptionEnabled = true;
  ...
}

if (!usePercentages) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);  // setting it again is OK

  String policies =
      yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
  if (policies == null) {
    policies = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  } else {
    policies += "," + AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  }

  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      policies);

  // Set the expiration interval for auto-created queue deletion to 10s,
  // consistent with FS.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
If I think about it, {{yarnSiteConfig}} is the output config. So this cannot 
happen:
{noformat}
} else {
  policies += "," + AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();
}
{noformat}
This {{Configuration}} object is created with no entries. The {{else}} branch 
will never be taken.

So it can be simplified to:
{noformat}
if (!usePercentages) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

  String policy = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();

  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      policy);

  // Set the expiration interval for auto-created queue deletion to 10s,
  // consistent with FS.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
2. This also means two separate test cases:
 * When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false)
 * When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false)

I recommend the following naming:
 {{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario
 {{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario
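
A rough sketch of the two tests, assuming {{convertSiteProperties}} gets the 
extra {{usePercentages}} flag as proposed above and that {{converter}}, 
{{fsConfig}} (with preemption = false) and {{convertedConfig}} are prepared in 
the test setup -- all of these names are assumptions about the final patch:

{code:java}
@Test
public void testRmMonitorsAndPoliciesSetWhenUsingWeights() {
  converter.convertSiteProperties(fsConfig, convertedConfig,
      false /* drfUsed */, false /* asyncScheduler */, false /* usePercentages */);

  // Weight mode: monitors are on and the deletion policy is configured.
  assertTrue(convertedConfig.getBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, false));
  assertEquals(AutoCreatedQueueDeletionPolicy.class.getCanonicalName(),
      convertedConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES));
}

@Test
public void testRmMonitorsAndPoliciesSetWhenUsingPercentages() {
  converter.convertSiteProperties(fsConfig, convertedConfig,
      false /* drfUsed */, false /* asyncScheduler */, true /* usePercentages */);

  // Percentage mode with preemption off: neither property should be set.
  assertFalse(convertedConfig.getBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, false));
  assertNull(convertedConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES));
}
{code}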


was (Author: pbacsko):
[~zhuqi] I have the following comments:

1. This change seems to always enable "RM monitors":
{noformat}
// This should be always true to trigger dynamic queue auto deletion
// when expired.
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{noformat}
But I don't think this is necessary. We need to enable it in two cases: 
preemption is enabled OR we're in weight mode. We don't have auto-queue delete 
in percentage mode (fs2cs can still convert to percentages with a command line 
switch).
 So I suggest that you pass an extra boolean "usePercentages".

Invocation from {{FSConfigToCSConfigConverter}}:
{noformat}
siteConverter.convertSiteProperties(inputYarnSiteConfig,
convertedYarnSiteConfig, drfUsed,
conversionOptions.isEnableAsyncScheduler(), usePercentages);  <-- last 
argument is new
{noformat}
Then in the site converter:
{noformat}
if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  yarnSiteConfig.setBoolean(
  YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  preemptionEnabled = true;
  ...
}

if (!usePercentages) {
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);   

[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298112#comment-17298112
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] I have the following comments:

1. This change seems to always enable "RM monitors":
{noformat}
// This should be always true to trigger dynamic queue auto deletion
// when expired.
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{noformat}
But I don't think this is necessary. We need to enable it in two cases: 
preemption is enabled OR we're in weight mode. We don't have auto-queue delete 
in percentage mode (fs2cs can still convert to percentages with a command line 
switch).
 So I suggest that you pass an extra boolean "usePercentages".

Invocation from {{FSConfigToCSConfigConverter}}:
{noformat}
siteConverter.convertSiteProperties(inputYarnSiteConfig,
convertedYarnSiteConfig, drfUsed,
conversionOptions.isEnableAsyncScheduler(), usePercentages);  <-- last 
argument is new
{noformat}
Then in the site converter:
{noformat}
if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  yarnSiteConfig.setBoolean(
  YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  preemptionEnabled = true;
  ...
}

if (!usePercentages) {
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);   // 
setting it again is OK

String policies =
yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
if (policies == null) {
  policies = AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();
} else {
  policies += "," + AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();
}

yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
policies);

// Set the expired for deletion interval to 10s, consistent with fs.
yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
If I think about it, {{yarnSiteConfig}} is the output config. So this cannot 
happen:
{noformat}
} else {
  policies += "," + AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();
}
{noformat}
This {{Configuration}} object is created with no entries. The {{else}} branch 
will never be taken.

So it can be simplified to:
{noformat}
if (!usePercentages) {
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

String policy = AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();

yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
policy);

// Set the expired for deletion interval to 10s, consistent with fs.
yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
2. This also means two separate test cases:
 * When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false)
 * When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false)

I recommend the following naming:
 {{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario
 {{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298092#comment-17298092
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok thanks, I'll review this one soon.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298074#comment-17298074
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] am I right in thinking that this patch depends on YARN-10682? This 
change generates a config entry with "," in it, which is not supported right now.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


