[jira] [Updated] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10677:
--
Attachment: YARN-10677.003.patch

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch, 
> YARN-10677.003.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> The Logger is created with the wrong class, so log lines from
> SLSFairScheduler are attributed to another class. We need to fix this.






[jira] [Updated] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10681:
--
Attachment: YARN-10681.001.patch

> Fix assertion failure message in BaseSLSRunnerTest
> --
>
> Key: YARN-10681
> URL: https://issues.apache.org/jira/browse/YARN-10681
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Trivial
> Attachments: YARN-10681.001.patch
>
>
> There is this failure message: 
> https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
> The word "catched" should be replaced with "caught".






[jira] [Created] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest

2021-03-07 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10681:
-

 Summary: Fix assertion failure message in BaseSLSRunnerTest
 Key: YARN-10681
 URL: https://issues.apache.org/jira/browse/YARN-10681
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth


There is this failure message: 
https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
"catched" should be replaced with "caught".






[jira] [Updated] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10681:
--
Description: 
There is this failure message: 
https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
The word "catched" should be replaced with "caught".

  was:
There is this failure message: 
https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
"catched" should be replaced with "caught".


> Fix assertion failure message in BaseSLSRunnerTest
> --
>
> Key: YARN-10681
> URL: https://issues.apache.org/jira/browse/YARN-10681
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Trivial
>
> There is this failure message: 
> https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
> The word "catched" should be replaced with "caught".






[jira] [Updated] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10680:
--
Description: This jira is to revisit all try blocks without catch blocks 
but having finally blocks in SLS.  (was: In our internal environment, there was 
a test failure while running SLS tests with Jenkins.
It's difficult to align the uncaught exceptions (in this case an NPE) and the 
log itself as the exception is logged with {{e.printStackTrace()}}.
This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
exception)}}.)

> Revisit try blocks without catch blocks but having finally blocks
> -
>
> Key: YARN-10680
> URL: https://issues.apache.org/jira/browse/YARN-10680
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>
> This jira is to revisit all try blocks without catch blocks but having 
> finally blocks in SLS.






[jira] [Updated] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10680:
--
Labels: newbie  (was: )

> Revisit try blocks without catch blocks but having finally blocks
> -
>
> Key: YARN-10680
> URL: https://issues.apache.org/jira/browse/YARN-10680
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> This jira is to revisit all try blocks without catch blocks but having 
> finally blocks in SLS.






[jira] [Assigned] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-10680:
-

Assignee: Szilard Nemeth

> Revisit try blocks without catch blocks but having finally blocks
> -
>
> Key: YARN-10680
> URL: https://issues.apache.org/jira/browse/YARN-10680
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> This jira is to revisit all try blocks without catch blocks but having 
> finally blocks in SLS.






[jira] [Assigned] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-10680:
-

Assignee: (was: Szilard Nemeth)

> Revisit try blocks without catch blocks but having finally blocks
> -
>
> Key: YARN-10680
> URL: https://issues.apache.org/jira/browse/YARN-10680
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Priority: Major
>
> This jira is to revisit all try blocks without catch blocks but having 
> finally blocks in SLS.






[jira] [Updated] (YARN-10680) Revisit try blocks without catch blocks but having finally blocks

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10680:
--
Summary: Revisit try blocks without catch blocks but having finally blocks  
(was: CLONE - Better logging of uncaught exceptions throughout SLS)

> Revisit try blocks without catch blocks but having finally blocks
> -
>
> Key: YARN-10680
> URL: https://issues.apache.org/jira/browse/YARN-10680
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to align the uncaught exceptions (in this case an NPE) and the 
> log itself as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.






[jira] [Created] (YARN-10680) CLONE - Better logging of uncaught exceptions throughout SLS

2021-03-07 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10680:
-

 Summary: CLONE - Better logging of uncaught exceptions throughout 
SLS
 Key: YARN-10680
 URL: https://issues.apache.org/jira/browse/YARN-10680
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth


In our internal environment, there was a test failure while running SLS tests 
with Jenkins.
It's difficult to align the uncaught exceptions (in this case an NPE) and the 
log itself as the exception is logged with {{e.printStackTrace()}}.
This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
exception)}}.






[jira] [Updated] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10679:
--
Attachment: YARN-10679.001.patch

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10679.001.patch
>
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to align the uncaught exceptions (in this case an NPE) and the 
> log itself as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.






[jira] [Updated] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10679:
--
Description: 
In our internal environment, there was a test failure while running SLS tests 
with Jenkins.
It's difficult to align the uncaught exceptions (in this case an NPE) and the 
log itself as the exception is logged with {{e.printStackTrace()}}.
This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
exception)}}.
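A minimal before/after sketch of the intended change, with an illustrative message and an SLF4J logger (not the actual patch):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class UncaughtLoggingSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(UncaughtLoggingSketch.class);

  void step() {
    try {
      throw new NullPointerException("simulated failure");
    } catch (Exception e) {
      // Before: e.printStackTrace() writes to stderr, detached from the log.
      // After: the stack trace travels with a proper log entry.
      LOG.error("Exception in SLS simulation step", e);
    }
  }
}
{code}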

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to align the uncaught exceptions (in this case an NPE) and the 
> log itself as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.






[jira] [Created] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-07 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10679:
-

 Summary: Better logging of uncaught exceptions throughout SLS
 Key: YARN-10679
 URL: https://issues.apache.org/jira/browse/YARN-10679
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth









[jira] [Updated] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10678:
--
Attachment: YARN-10678.001.patch

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, 
> YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, 
> YARN-10678.001.patch, 
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log,
>  
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log
>
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118
> In our internal environment, there was a situation when an NPE was logged 
> from this method: 
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This can happen if the following events occur:
> 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
> allocate method 
> 2. In this case, the local variable called 'allocation' remains null: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
> 3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
> here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
> 4. Then, we have an NPE here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
> In this case, we lose the original exception thrown from
> FairScheduler#allocate.
> In order to fix this, a catch block should be introduced and the exception
> needs to be logged.
> The same applies to SLSCapacityScheduler as well.




[jira] [Commented] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-07 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296991#comment-17296991
 ] 

Szilard Nemeth commented on YARN-10678:
---

Added a demonstration patch of the issue. It's very simple: I added a
{code}
if (true) throw new RuntimeException("test unchecked exception");
{code}
statement to 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#allocate.
This way I was able to demonstrate that this exception is not logged anywhere 
and the NPE "overrides" it, see  
[^org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log].

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, 
> YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, 
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log,
>  
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log
>
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118
> In our internal environment, there was a situation when an NPE was logged 
> from this method: 
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This can happen if the following events occur:
> 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
> allocate method 
> 2. In this case, the local variable called 'allocation' remains null: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
> 3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
> here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
> 4. Then, we have an NPE here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
> In this case, we lose the original exception thrown from
> FairScheduler#allocate.
> In order to fix this, a catch block should be introduced and the exception
> needs to be logged.
> The same applies to SLSCapacityScheduler as well.

[jira] [Updated] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10678:
--
Attachment: YARN-10678-unchecked-exception-from-FS-allocate.diff
YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff

org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log

org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, 
> YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, 
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log,
>  
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log
>
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118
> In our internal environment, there was a situation when an NPE was logged 
> from this method: 
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This can happen if the following events occur:
> 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
> allocate method 
> 2. In this case, the local variable called 'allocation' remains null: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
> 3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
> here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
> 4. Then, we have an NPE here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
> In this case, we lose the original exception thrown from
> FairScheduler#allocate.
> In order to fix this, a catch block should be introduced and the exception
> needs to be logged.
> The same applies to SLSCapacityScheduler as well.

[jira] [Updated] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10678:
--
Description: 
In SLSFairScheduler, we have this try-finally block (without catch block) in 
the allocate method: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
Similarly, in SLSCapacityScheduler: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131

In the finally block, the updateQueueWithAllocateRequest is invoked: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118

In our internal environment, there was a situation when an NPE was logged from 
this method: 
{code}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
at 
org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
at 
org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
at 
org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
at 
org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
at 
org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at 
org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
at 
org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

This can happen if the following events occur:
1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
allocate method 
2. In this case, the local variable called 'allocation' remains null: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
here: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
4. Then, we have an NPE here: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
In this case, we lose the original exception thrown from FairScheduler#allocate.

In order to fix this, a catch block should be introduced and the exception
needs to be logged; see the sketch below.
The same applies to SLSCapacityScheduler as well.
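A standalone sketch of the failure pattern and the proposed catch block; the names mirror the SLS code, but this is illustrative, not the committed patch:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AllocateSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(AllocateSketch.class);

  Object allocate() {
    Object allocation = null;
    try {
      allocation = doRealAllocate(); // may throw a RuntimeException
      return allocation;
    } catch (RuntimeException e) {
      // Proposed catch block: log the original exception so the NPE raised
      // later in the finally block can no longer mask it.
      LOG.error("Exception thrown from allocate", e);
      throw e;
    } finally {
      // Runs even after the catch block rethrows; with 'allocation' still
      // null, this is where the NPE in the report above originated.
      updateQueueWithAllocateRequest(allocation);
    }
  }

  private Object doRealAllocate() {
    throw new RuntimeException("simulated scheduler failure");
  }

  private void updateQueueWithAllocateRequest(Object allocation) {
    allocation.hashCode(); // dereferences its argument; NPE when null
  }
}
{code}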

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118

[jira] [Created] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-07 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10678:
-

 Summary: Try blocks without catch blocks in SLS scheduler classes 
can swallow other exceptions
 Key: YARN-10678
 URL: https://issues.apache.org/jira/browse/YARN-10678
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth









[jira] [Updated] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10677:
--
Attachment: YARN-10677.002.patch

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> The Logger is created with the wrong class, so log lines from
> SLSFairScheduler are attributed to another class. We need to fix this.






[jira] [Updated] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10677:
--
Attachment: YARN-10677.001.patch

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> The Logger is created with the wrong class, so log lines from
> SLSFairScheduler are attributed to another class. We need to fix this.






[jira] [Created] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-07 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10677:
-

 Summary: Logger of SLSFairScheduler is provided with the wrong 
class
 Key: YARN-10677
 URL: https://issues.apache.org/jira/browse/YARN-10677
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth


In SLSFairScheduler, the Logger definition looks like: 
https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
The Logger is created with the wrong class, so log lines from SLSFairScheduler
are attributed to another class. We need to fix this.
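For reference, a minimal sketch of the intended definition, assuming SLF4J (which the SLS code uses); this is illustrative, not the actual patch:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SLSFairScheduler {
  // Parameterize the logger with the enclosing class itself so that its
  // log lines are attributed to SLSFairScheduler.
  private static final Logger LOG =
      LoggerFactory.getLogger(SLSFairScheduler.class);
}
{code}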






[jira] [Updated] (YARN-10675) Consolidate YARN-10672 and YARN-10447

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10675:
--
Attachment: YARN-10675.001.patch

> Consolidate YARN-10672 and YARN-10447
> -
>
> Key: YARN-10675
> URL: https://issues.apache.org/jira/browse/YARN-10675
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10675.001.patch
>
>
> Let's consolidate the solution applied for YARN-10672 and apply it to the 
> code changes introduced with YARN-10447.
> Quoting [~pbacsko]: 
> {quote}
> The solution is much more straightforward than mine in YARN-10447. Actually we 
> might consider applying this to TestLeafQueue by undoing my changes, 
> because that's more complicated (I had no patience to go deeper with Mockito 
> internal behavior, I just thought well, disable that thread and that's 
> enough).
> {quote}






[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-07 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: YARN-10672.branch-3.3.001.patch

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never produces 100 
> passes.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and the test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {code}

[jira] [Updated] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1

2021-03-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10676:
--
Attachment: YARN-10676.001.patch

> Improve code quality in TestTimelineAuthenticationFilterForV1
> -
>
> Key: YARN-10676
> URL: https://issues.apache.org/jira/browse/YARN-10676
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-10676.001.patch
>
>
> - In testcase "testDelegationTokenOperations", the exception message is 
> checked but in case it does not match the assertion, the exception is not 
> printed. This happens 3 times.
> - Assertion messages can be added
> - Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be 
> static final.
> - There's a typo in comment "avaiable" (happens 2 times)
> - There are some Assert.fail() calls, without messages.






[jira] [Updated] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1

2021-03-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10676:
--
Description: 
- In testcase "testDelegationTokenOperations", the exception message is checked 
but in case it does not match the assertion, the exception is not printed. This 
happens 3 times.
- Assertion messages can be added
- Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be static 
final.
- There's a typo in comment "avaiable" (happens 2 times)
- There are some Assert.fail() calls, without messages.
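An illustrative sketch of the first point; the helper name and message fragment are hypothetical, not the actual test code:

{code:java}
try {
  performDelegationTokenOperation(); // hypothetical helper
  Assert.fail("Expected the delegation token operation to throw");
} catch (Exception e) {
  // Include the caught exception in the assertion message so that a
  // mismatch still shows what was actually thrown.
  Assert.assertTrue("Unexpected exception: " + e,
      e.getMessage().contains("expected message fragment"));
}
{code}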



> Improve code quality in TestTimelineAuthenticationFilterForV1
> -
>
> Key: YARN-10676
> URL: https://issues.apache.org/jira/browse/YARN-10676
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>
> - In testcase "testDelegationTokenOperations", the exception message is 
> checked but in case it does not match the assertion, the exception is not 
> printed. This happens 3 times.
> - Assertion messages can be added
> - Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be 
> static final.
> - There's a typo in comment "avaiable" (happens 2 times)
> - There are some Assert.fail() calls, without messages.






[jira] [Created] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1

2021-03-05 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10676:
-

 Summary: Improve code quality in 
TestTimelineAuthenticationFilterForV1
 Key: YARN-10676
 URL: https://issues.apache.org/jira/browse/YARN-10676
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth









[jira] [Created] (YARN-10675) Consolidate YARN-10672 and YARN-10447

2021-03-05 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10675:
-

 Summary: Consolidate YARN-10672 and YARN-10447
 Key: YARN-10675
 URL: https://issues.apache.org/jira/browse/YARN-10675
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth


Let's consolidate the solution applied for YARN-10672 and apply it to the code 
changes introduced with YARN-10447.
Quoting [~pbacsko]: 
{quote}
The solution is much more straightforward than mine in YARN-10447. Actually we might 
consider applying this to TestLeafQueue by undoing my changes, because that's 
more complicated (I had no patience to go deeper with Mockito internal 
behavior, I just thought well, disable that thread and that's enough).
{quote}






[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296017#comment-17296017
 ] 

Szilard Nemeth commented on YARN-10672:
---

As per our offline discussion with [~pbacsko], I'm creating a follow-up to 
consolidate this and YARN-10447.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never produces 100 
> passes.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and the test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> {code}

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never produces 100 
passes.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs fails.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes so the AM can't be allocated and the test will fail:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - **Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}

h2. Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances should this be allowed to return any value other than 3.
 However, as mentioned above, sometimes the real method of 'getNumClusterNodes' 
is called on CapacityScheduler (see the stubbing sketch below).
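A hedged sketch of the doReturn-style stubbing that Mockito's warning (quoted under point 2 below) recommends for spies; this is illustrative, not necessarily the committed fix:

{code:java}
// Unlike when(cs.getNumClusterNodes()).thenReturn(3), doReturn() stubs the
// spy without invoking the real getNumClusterNodes() during stubbing.
// (Assumes a static import of org.mockito.Mockito.doReturn.)
doReturn(3).when(cs).getNumClusterNodes();
{code}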

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never yields 100 
passes.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs fails.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes, so the AM can't be allocated and the test fails:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - **Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}

h2. Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances should this be allowed to return any value other than 3.
 However, as mentioned above, the real 'getNumClusterNodes' method is sometimes 
called on CapacityScheduler.

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: YARN-10672.001.patch

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test fails:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 
> 3.
>  However, as mentioned above, the real 'getNumClusterNodes' method is 
> sometimes called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never yields 100 
passes.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs fails.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes, so the AM can't be allocated and the test fails:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - **Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}
Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances should this be allowed to return any value other than 3.
 However, as mentioned above, the real 'getNumClusterNodes' method is sometimes 
called on CapacityScheduler.

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never yields 100 
passes.
 For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
 Sometimes just 1 out of 100 runs fails.
 Screenshot is attached.

Stacktrace:
{code:java}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}
The test fails here:
{code:java}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}
With some debugging (patch attached), I realized that sometimes there are no 
registered nodes, so the AM can't be allocated and the test fails:
{code:java}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - **Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}
In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code:java}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}
Let's break this down:
 1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code:java}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances should this be allowed to return any value other than 3.
 However, as mentioned above, the real 'getNumClusterNodes' method is sometimes 
called on CapacityScheduler.

2. Sometimes, this gets printed to the console:
{code:java}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Description: 
All testcases in TestReservations are flaky

Running a particular test in TestReservations 100 times never yields 100 
passes.
For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
Sometimes just 1 out of 100 runs fails.
Screenshot is attached.

Stacktrace: 
{code}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}

The test fails here: 
{code}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}

With some debugging (patch attached), I realized that sometimes there are no 
registered nodes, so the AM can't be allocated and the test fails: 
{code}
2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
(RegularContainerAllocator.java:canAssign(312)) - **Can't assign container, 
no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
{code}

In these cases, this is also printed from 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
{code}
2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
(CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
getNumClusterNodes
{code}

Let's break this down:
1. The mocking happens in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
 boolean):
{code}
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

when(cs.getNumClusterNodes()).thenReturn(3);
{code}
Under no circumstances should this be allowed to return any value other than 3.
However, as mentioned above, the real 'getNumClusterNodes' method is sometimes 
called on CapacityScheduler.

2. Sometimes, this gets printed to the console: 
{code}
org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
Integer cannot be returned by isMultiNodePlacementEnabled()
isMultiNodePlacementEnabled() should return boolean
***
If you're unsure why you're getting above error read on.
Due to the nature of the syntax above problem might occur because:
1. This exception *might* occur in wrongly written multi-threaded tests.
   Please refer to Mockito FAQ on limitations of concurrency testing.
2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
spies - 
   - with doReturn|Throw() family of methods. More in javadocs for 
Mockito.spy() method.


at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png
Screenshot-mockitostubbing1-2021-03-04 at 22.34.01.png

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: Screenshot 2021-03-04 at 22.06.20.png
Screenshot 2021-03-04 at 21.34.18.png

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: (was: Screenshot 2021-03-04 at 21.28.18.png)

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: Screenshot 2021-03-04 at 21.28.18.png

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: (was: Screenshot 2021-03-04 at 21.34.18.png)

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: YARN-10672-debuglogs.patch

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, 
> YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: (was: YARN-10672-debuglogs.patch)

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, 
> YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10672:
--
Attachment: YARN-10672-debuglogs.patch
Screenshot 2021-03-04 at 21.34.18.png

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, 
> YARN-10672-debuglogs.patch
>
>
> Running a particular test in TestReservations 100 times never yields 100 
> passes.
> For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
> Screenshot is attached.
> Stacktrace: 
> {code}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here: 
> {code}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10672) All testcases in TestReservations are flaky

2021-03-04 Thread Szilard Nemeth (Jira)
Szilard Nemeth created YARN-10672:
-

 Summary: All testcases in TestReservations are flaky
 Key: YARN-10672
 URL: https://issues.apache.org/jira/browse/YARN-10672
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Szilard Nemeth
Assignee: Szilard Nemeth


Running a particular test in TestReservations 100 times never yields 100 
passes.
For example, let's run testReservationNoContinueLook 100 times. For me, it 
produced 39 failed and 61 passed results.
Screenshot is attached.

Stacktrace: 
{code}
java.lang.AssertionError: 
Expected :2048
Actual   :0


at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:633)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
{code}

The test fails here: 
{code}
 // Start testing...
// Only AM
TestUtils.applyResourceCommitRequest(clusterResource,
a.assignContainers(clusterResource, node_0,
new ResourceLimits(clusterResource),
SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
assertEquals(2 * GB, a.getUsedResources().getMemorySize());
{code}

With some debugging (patch attached), I realized that sometimes there are no 
registered nodes so the AM can't be allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-04 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294833#comment-17294833
 ] 

Szilard Nemeth edited comment on YARN-10652 at 3/4/21, 9:32 AM:


Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against 
"yarn.scheduler.capacity..user-settings..weight setting" 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given in point 3 above.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
{quote}
Why would we allow usernames with dots in more and more places in the code and 
forget about the generic solution? That doesn't make sense to me; it just leads 
to developer and user confusion, IMHO.
TBH, it's too confusing as it is now. As [~sahuja] said, users can submit jobs 
without a problem. Then someone defines a simple username-based placement rule 
and things stop working? That's just not consistent and not acceptable from 
the user's point of view.
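To make the consistency concern concrete, here is a small hypothetical sketch 
(the username and the rule are made up, and this is not the actual 
placement-rule code) of why a dotted username clashes with dot-delimited queue 
paths:
{code:java}
// Hypothetical illustration only: a naive, dot-delimited expansion of a
// username-based placement rule for a user whose name contains dots.
String user = "first.last";                  // real-world style dotted username
String rule = "root.users.%user";            // username-based placement rule
String queuePath = rule.replace("%user", user);
// queuePath == "root.users.first.last"
String[] segments = queuePath.split("\\.");  // ["root", "users", "first", "last"]
// A dot-delimited parser sees queue "last" under parent "first" under
// "users", instead of a single leaf queue for the user "first.last".
{code}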




was (Author: snemeth):
Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against 
"yarn.scheduler.capacity..user-settings..weight setting" 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given in point 3 above.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement 

[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294833#comment-17294833
 ] 

Szilard Nemeth edited comment on YARN-10652 at 3/3/21, 9:11 PM:


Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against 
"yarn.scheduler.capacity..user-settings..weight setting" 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given in point 3 above.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
{quote}
Why would we allow usernames with dots in more and more places in the code and 
forget about the generic solution? That doesn't make sense to me; it just leads 
to developer and user confusion, IMHO.
TBH, it's too confusing as it is now. As [~sahuja] said, users can submit jobs 
without a problem. Then someone defines a simple username-based placement rule 
and things stop working? That's just not consistent and not acceptable from 
the user's point of view.




was (Author: snemeth):
Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against 
"yarn.scheduler.capacity..user-settings..weight setting" 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given in point 3 above.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based 

[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294833#comment-17294833
 ] 

Szilard Nemeth edited comment on YARN-10652 at 3/3/21, 9:11 PM:


Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against 
"yarn.scheduler.capacity..user-settings..weight setting" 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given in point 3 above.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
{quote}
Why would we allow usernames with dots in more and more places in the code and 
forget about the generic solution? That doesn't make sense to me; it just leads 
to developer and user confusion, IMHO.
TBH, it's too confusing as it is now. As [~sahuja] said, users can submit jobs 
without a problem. Then someone defines a simple username-based placement rule 
and things stop working? That's just not consistent and not acceptable from 
the user's point of view.




was (Author: snemeth):
Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will echo previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but are they really supported in all parts of 
the system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots 
across the codebase, we need to handle these usernames in all aspects of the 
system. What if some customers are using usernames with dots and placement 
rules? There we have a problem; we need a more complete solution.

2. Support for usernames with dots: was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, that usernames are 
stored in the config objects, and that you are just retrieving them with that 
regex. The problem here is that "supporting" this is more of an overstatement, 
as ACL handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against the 
"yarn.scheduler.capacity.<queue-path>.user-settings.<username>.weight" setting, 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given under point 3.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
{quote}

[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294833#comment-17294833
 ] 

Szilard Nemeth commented on YARN-10652:
---

Hi [~sahuja]

First of all, thanks for working on this.
I can't really add too many things; [~pbacsko] and [~shuzirra] pretty much 
summarized my concerns.
I would like to state my own opinion at least.
Again, some of this will repeat previous comments, so bear with me.

1. As Gergo said, we need to keep consistency. It's one thing that usernames 
with dots are kind of supported, but is it really supported in all parts of the 
system? Definitely not for placement rules, as the rule Gergo mentioned 
("root.user.%user") could easily cause an issue. It's okay that some customers 
don't want to use placement rules and your change is not strictly related to 
placement rules. But if we are encouraging the use of usernames with dots across 
the codebase, we need to handle these usernames in all aspects of the system. 
What if some customers are using usernames with dots and placement rules? Then 
we have a problem, and we need a more complete solution.

2. Support for usernames with dots: Was this documented anywhere, or can this 
fact only be dug up from the codebase?

3. We also understand that this is a setting of a queue, and usernames are 
stored in the config objects, and you are just retrieving this with that regex. 
The problem here is that "supporting" this is more of an overstatement, as ACL 
handling / placement rules could be problematic areas.

4.  
{quote}
But we are supporting usernames with dots today. Users with dots in their 
usernames can submit jobs to the cluster having CS with no issues today (again, 
I am not talking about queue placement with queues with dots here). There are 
no errors reported when users with dots are supplied against the 
"yarn.scheduler.capacity.<queue-path>.user-settings.<username>.weight" setting, 
and in fact, there should NOT be any errors when it is done so. These are 
real-world usernames and we will have to accept them from any interface, 
whether it be UI or CLI or anything. 
{quote}

My answer to this is given under point 3.

5. [~wilfreds] I don't agree with this:
{quote}
If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
{quote}
Why would we allow usernames with dots in more and more places in the code and 
forget about the generic solution? It doesn't make sense to me; it just leads 
to developer and user confusion, IMHO.
TBH, it's too confusing as it is now. As [~sahuja] said, users can submit jobs 
without a problem. Then someone defines a simple username-based placement rule 
and things will stop working? That's just ridiculous.



> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them, i.e. they can be of the format 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (a word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment, which is no 
> good for AD usernames that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html
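> A minimal sketch of why {{\w+}} rejects dotted usernames while {{\S+}} accepts 
> them, assuming a pattern of this general shape (the exact pattern in 
> CapacitySchedulerConfiguration may differ):
> {code:java}
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class UserWeightRegexDemo {
>   public static void main(String[] args) {
>     String prefix = "yarn.scheduler.capacity.root.default.user-settings.";
>     String key = prefix + "firstname.lastname.weight";
>
>     // Word characters only: "." is not a word character, so no match.
>     Matcher word = Pattern
>         .compile(Pattern.quote(prefix) + "(\\w+)\\.weight").matcher(key);
>     System.out.println(word.matches()); // false
>
>     // Non-whitespace characters: matches and captures "firstname.lastname".
>     Matcher nonWs = Pattern
>         .compile(Pattern.quote(prefix) + "(\\S+)\\.weight").matcher(key);
>     System.out.println(nonWs.matches()); // true
>   }
> }
> {code}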



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (YARN-10653) Fixed the findbugs issues introduced by YARN-10647.

2021-02-25 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10653:
--
Summary: Fixed the findbugs issues introduced by YARN-10647.  (was: Fixed 
the findingbugs introduced by YARN-10647.)

> Fixed the findbugs issues introduced by YARN-10647.
> ---
>
> Key: YARN-10653
> URL: https://issues.apache.org/jira/browse/YARN-10653
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10653.001.patch
>
>
> In YARN-10647, I fixed TestRMNodeLabelsManager, which failed after YARN-10501.
> But the findbugs issues should also be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10639) Queueinfo related capacity, should be adjusted to weight mode.

2021-02-25 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10639:
--
Summary: Queueinfo related capacity, should be adjusted to weight mode.  (was: 
Queueinfo related capacity, should ajusted to weight mode.)

> Queueinfo related capacity, should be adjusted to weight mode.
> ---
>
> Key: YARN-10639
> URL: https://issues.apache.org/jira/browse/YARN-10639
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10639.001.patch, YARN-10639.002.patch
>
>
> The class QueueInfo capacity field should consider the weight mode.
> Now when a client uses getQueueInfo to get the queue capacity in weight mode, 
> it always returns 0, which is wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10628) Add node usage metrics in SLS

2021-02-17 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285953#comment-17285953
 ] 

Szilard Nemeth commented on YARN-10628:
---

Hi [~ananyo_rao],
If you need reviews on SLS, feel free to ping me.

> Add node usage metrics in SLS
> -
>
> Key: YARN-10628
> URL: https://issues.apache.org/jira/browse/YARN-10628
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler-load-simulator
>Affects Versions: 3.3.1
>Reporter: VADAGA ANANYO RAO
>Assignee: VADAGA ANANYO RAO
>Priority: Major
> Attachments: Nodes_memory_usage.png, Nodes_vcores_usage.png, 
> YARN-10628.0001.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Given the ongoing work on container packing in YARN schedulers, it would be 
> beneficial to have charts showing the per-node usage in SLS. This will help to 
> improve container packing algorithms for more efficient packing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10625) FairScheduler: add global flag to disable AM-preemption

2021-02-16 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285348#comment-17285348
 ] 

Szilard Nemeth commented on YARN-10625:
---

Thanks [~pbacsko] for working on this,
Patch LGTM, committed to trunk.
Thanks [~bteke] for the review.

> FairScheduler: add global flag to disable AM-preemption
> ---
>
> Key: YARN-10625
> URL: https://issues.apache.org/jira/browse/YARN-10625
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10625-001.patch
>
>
> YARN-9537 added a feature to disable AM preemption on a per-queue basis.
> This is a nice enhancement, but it's very inconvenient if the cluster has a 
> lot of queues, or queues are dynamically created/deleted regularly (static 
> queue configuration changes).
> It's a legitimate use-case to have AM preemption turned off completely. To 
> make it easier, add a property which acts as a global flag for this feature.
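> A sketch of how such a global switch could look in yarn-site.xml; the property 
> name below is an assumption based on this description, not necessarily the one 
> the patch introduces:
> {noformat}
> <property>
>   <name>yarn.scheduler.fair.am.preemption</name>
>   <value>false</value>
> </property>
> {noformat}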



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10625) FairScheduler: add global flag to disable AM-preemption

2021-02-16 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10625:
--
Fix Version/s: 3.4.0

> FairScheduler: add global flag to disable AM-preemption
> ---
>
> Key: YARN-10625
> URL: https://issues.apache.org/jira/browse/YARN-10625
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10625-001.patch
>
>
> YARN-9537 added a feature to disable AM preemption on a per-queue basis.
> This is a nice enhancement, but it's very inconvenient if the cluster has a 
> lot of queues, or queues are dynamically created/deleted regularly (static 
> queue configuration changes).
> It's a legitimate use-case to have AM preemption turned off completely. To 
> make it easier, add a property which acts as a global flag for this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10622) Fix preemption policy to exclude childless ParentQueues

2021-02-15 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10622:
--
Fix Version/s: 3.4.0

> Fix preemption policy to exclude childless ParentQueues
> ---
>
> Key: YARN-10622
> URL: https://issues.apache.org/jira/browse/YARN-10622
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10622.001.patch
>
>
> ProportionalCapacityPreemptionPolicy selects the potential LeafQueues to be 
> preempted by this logic:
> {code:java}
> private Set<String> getLeafQueueNames(TempQueuePerPartition q) {
>   // If its a ManagedParentQueue, it might not have any children
>   if ((q.children == null || q.children.isEmpty())
>       && !(q.parentQueue instanceof ManagedParentQueue)) {
>     return ImmutableSet.of(q.queueName);
>   }
>   Set<String> leafQueueNames = new HashSet<>();
>   for (TempQueuePerPartition child : q.children) {
>     leafQueueNames.addAll(getLeafQueueNames(child));
>   }
>   return leafQueueNames;
> }
> {code}
> This, however, does not take childless ParentQueues (introduced in 
> YARN-10596) into account. 
> A childless ParentQueue will throw an NPE in 
> FifoCandidatesSelector#selectCandidates:
> {code:java}
> LeafQueue leafQueue = preemptionContext.getQueueByPartition(queueName,
>   RMNodeLabelsManager.NO_LABEL).leafQueue;
> {code}
> TempQueuePerPartition has a leafQueue member variable, which is null if the 
> queue is not a LeafQueue. In the case of a childless ParentQueue it is null, 
> but its name is present in leafQueueNames, as stated before.
>  
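> One possible shape of the fix is a null guard before dereferencing; a sketch 
> only, assuming the surrounding types, and the actual patch may differ:
> {code:java}
> TempQueuePerPartition tq = preemptionContext.getQueueByPartition(queueName,
>     RMNodeLabelsManager.NO_LABEL);
> if (tq == null || tq.leafQueue == null) {
>   // e.g. a childless ParentQueue: nothing to preempt here, skip it
>   continue;
> }
> LeafQueue leafQueue = tq.leafQueue;
> {code}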



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10622) Fix preemption policy to exclude childless ParentQueues

2021-02-15 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284739#comment-17284739
 ] 

Szilard Nemeth commented on YARN-10622:
---

Thanks [~gandras] for working on this,
Patch LGTM, committed to trunk.


> Fix preemption policy to exclude childless ParentQueues
> ---
>
> Key: YARN-10622
> URL: https://issues.apache.org/jira/browse/YARN-10622
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10622.001.patch
>
>
> ProportionalCapacityPreemptionPolicy selects the potential LeafQueues to be 
> preempted by this logic:
> {code:java}
> private Set<String> getLeafQueueNames(TempQueuePerPartition q) {
>   // If its a ManagedParentQueue, it might not have any children
>   if ((q.children == null || q.children.isEmpty())
>       && !(q.parentQueue instanceof ManagedParentQueue)) {
>     return ImmutableSet.of(q.queueName);
>   }
>   Set<String> leafQueueNames = new HashSet<>();
>   for (TempQueuePerPartition child : q.children) {
>     leafQueueNames.addAll(getLeafQueueNames(child));
>   }
>   return leafQueueNames;
> }
> {code}
> This, however, does not take childless ParentQueues (introduced in 
> YARN-10596) into account. 
> A childless ParentQueue will throw an NPE in 
> FifoCandidatesSelector#selectCandidates:
> {code:java}
> LeafQueue leafQueue = preemptionContext.getQueueByPartition(queueName,
>   RMNodeLabelsManager.NO_LABEL).leafQueue;
> {code}
> TempQueuePerPartition has a leafQueue member variable, which is null if the 
> queue is not a LeafQueue. In the case of a childless ParentQueue it is null, 
> but its name is present in leafQueueNames, as stated before.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10624) Support max queues limit configuration in new auto created queue, consistent with old auto created.

2021-02-15 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10624:
--
Fix Version/s: 3.4.0

> Support max queues limit configuration in new auto created queue, consistent 
> with old auto created.
> ---
>
> Key: YARN-10624
> URL: https://issues.apache.org/jira/browse/YARN-10624
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10624.001.patch, YARN-10624.002.patch
>
>
> Since the old auto created leaf queue has a max leaf queues limit, I think we 
> should also support this in the new auto created queues; both the auto created 
> leaf and the auto created parent need limits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10624) Support max queues limit configuration in new auto created queue, consistent with old auto created.

2021-02-15 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284709#comment-17284709
 ] 

Szilard Nemeth commented on YARN-10624:
---

Thanks [~zhuqi] for working on this.
Patch LGTM, committed to trunk.
Thanks [~gandras] and [~bteke] for the reviews.

> Support max queues limit configuration in new auto created queue, consistent 
> with old auto created.
> ---
>
> Key: YARN-10624
> URL: https://issues.apache.org/jira/browse/YARN-10624
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10624.001.patch, YARN-10624.002.patch
>
>
> Since the old auto created leaf queue has a max leaf queues limit, I think we 
> should also support this in the new auto created queues; both the auto created 
> leaf and the auto created parent need limits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10618) RM UI2 Application page shows the AM preempted containers instead of the nonAM ones

2021-02-11 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10618:
--
Fix Version/s: 3.4.0

> RM UI2 Application page shows the AM preempted containers instead of the 
> nonAM ones
> ---
>
> Key: YARN-10618
> URL: https://issues.apache.org/jira/browse/YARN-10618
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10618.001.patch
>
>
> The YARN RM UIv2 application page shows the AM preempted containers under 
> both the _Num Non-AM container preempted_ and _Num AM container preempted_ 
> fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10618) RM UI2 Application page shows the AM preempted containers instead of the nonAM ones

2021-02-11 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282919#comment-17282919
 ] 

Szilard Nemeth commented on YARN-10618:
---

Hi [~bteke],
Thanks for working on this.
Patch LGTM, committed to trunk.


> RM UI2 Application page shows the AM preempted containers instead of the 
> nonAM ones
> ---
>
> Key: YARN-10618
> URL: https://issues.apache.org/jira/browse/YARN-10618
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Minor
> Attachments: YARN-10618.001.patch
>
>
> The YARN RM UIv2 application page shows the AM preempted containers under 
> both the _Num Non-AM container preempted_ and _Num AM container preempted_ 
> fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10593) Fix incorrect string comparison in GpuDiscoverer

2021-02-10 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282424#comment-17282424
 ] 

Szilard Nemeth commented on YARN-10593:
---

Thanks [~pbacsko] for working on this.
Patch LGTM, committed to trunk.
Thanks [~zhuqi] for the review.

> Fix incorrect string comparison in GpuDiscoverer
> 
>
> Key: YARN-10593
> URL: https://issues.apache.org/jira/browse/YARN-10593
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10593-001.patch
>
>
> The following comparison in {{GpuDiscoverer}} is invalid:
> {noformat}
>   binaryPath = configuredBinaryFile;
>   // If path exists but file name is incorrect don't execute the file
>   String fileName = binaryPath.getName();
>   if (DEFAULT_BINARY_NAME.equals(fileName)) {  <--- inverse condition needed
>     String msg = String.format("Please check the configuration value of"
>         + " %s. It should point to an %s binary.",
>         YarnConfiguration.NM_GPU_PATH_TO_EXEC,
>         DEFAULT_BINARY_NAME);
>     throwIfNecessary(new YarnException(msg), config);
>     LOG.warn(msg);
>   }{noformat}
> Obviously it should be the other way around - we should log a warning or 
> throw an exception if the file names *differ*, not when they're equal.
> Consider adding a unit test for this.
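> A rough sketch of such a test; the initialize/fail-fast behaviour, the helper 
> names and the exact setup are assumptions, not the actual GpuDiscoverer API:
> {code:java}
> @Test
> public void testWrongBinaryNameIsRejected() {
>   Configuration conf = new Configuration();
>   // Hypothetical path to an existing file whose name is not the expected binary
>   conf.set(YarnConfiguration.NM_GPU_PATH_TO_EXEC, "/usr/bin/some-other-tool");
>   try {
>     new GpuDiscoverer().initialize(conf);
>     fail("Expected a YarnException for a wrongly named binary");
>   } catch (YarnException e) {
>     // The warning/exception message references the misconfigured property
>     assertTrue(e.getMessage()
>         .contains(YarnConfiguration.NM_GPU_PATH_TO_EXEC));
>   }
> }
> {code}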



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10593) Fix incorrect string comparison in GpuDiscoverer

2021-02-10 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10593:
--
Fix Version/s: 3.4.0

> Fix incorrect string comparison in GpuDiscoverer
> 
>
> Key: YARN-10593
> URL: https://issues.apache.org/jira/browse/YARN-10593
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10593-001.patch
>
>
> The following comparison in {{GpuDiscoverer}} is invalid:
> {noformat}
>   binaryPath = configuredBinaryFile;
>   // If path exists but file name is incorrect don't execute the file
>   String fileName = binaryPath.getName();
>   if (DEFAULT_BINARY_NAME.equals(fileName)) {  <--- inverse condition needed
>     String msg = String.format("Please check the configuration value of"
>         + " %s. It should point to an %s binary.",
>         YarnConfiguration.NM_GPU_PATH_TO_EXEC,
>         DEFAULT_BINARY_NAME);
>     throwIfNecessary(new YarnException(msg), config);
>     LOG.warn(msg);
>   }{noformat}
> Obviously it should be the other way around - we should log a warning or 
> throw an exception if the file names *differ*, not when they're equal.
> Consider adding a unit test for this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion

2021-02-10 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282357#comment-17282357
 ] 

Szilard Nemeth commented on YARN-10620:
---

Hi [~pbacsko],
Thanks for working on this. 
Latest patch LGTM, just committed to trunk.
Thanks [~gandras] for the review.

> fs2cs: parentQueue for certain placement rules are not set during conversion
> 
>
> Key: YARN-10620
> URL: https://issues.apache.org/jira/browse/YARN-10620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: YARN-10620-001.patch, YARN-10620-002.patch
>
>
> There are some placement rules in FS which are currently not handled properly 
> by fs2cs:
> {noformat}
> <queuePlacementPolicy>
>   <rule name="user" create="true" />
>   <rule name="primaryGroup" create="true" />
> </queuePlacementPolicy>
> {noformat}
> The first rule means that if the user queue doesn't exist, it should be 
> created as {{root.<user>}}.
> The second means the same thing, except it refers to the primary group instead 
> of the submitting user: {{root.<primaryGroup>}}.
> The problem is that in order for the create="true" setting to take effect, we 
> must set the parent queue in the generated JSON:
> Current:
> {noformat}
> {
>   "rules" : [ {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "user",
>     "fallbackResult" : "skip",
>     "create" : true
>   }, {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "primaryGroup",
>     "fallbackResult" : "skip",
>     "create" : true
>   } ]
> }
> {noformat}
> Expected:
> {noformat}
> {
>   "rules" : [ {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "user",
>     "fallbackResult" : "skip",
>     "parentQueue": "root",
>     "create" : true
>   }, {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "primaryGroup",
>     "fallbackResult" : "skip",
>     "parentQueue": "root",
>     "create" : true
>   } ]
> }
> {noformat}
> This is missing right now and it needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion

2021-02-10 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10620:
--
Fix Version/s: 3.4.0

> fs2cs: parentQueue for certain placement rules are not set during conversion
> 
>
> Key: YARN-10620
> URL: https://issues.apache.org/jira/browse/YARN-10620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: YARN-10620-001.patch, YARN-10620-002.patch
>
>
> There are some placement rules in FS which are currently not handled properly 
> by fs2cs:
> {noformat}
> <queuePlacementPolicy>
>   <rule name="user" create="true" />
>   <rule name="primaryGroup" create="true" />
> </queuePlacementPolicy>
> {noformat}
> The first rule means that if the user queue doesn't exist, it should be 
> created as {{root.<user>}}.
> The second means the same thing, except it refers to the primary group instead 
> of the submitting user: {{root.<primaryGroup>}}.
> The problem is that in order for the create="true" setting to take effect, we 
> must set the parent queue in the generated JSON:
> Current:
> {noformat}
> {
>   "rules" : [ {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "user",
>     "fallbackResult" : "skip",
>     "create" : true
>   }, {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "primaryGroup",
>     "fallbackResult" : "skip",
>     "create" : true
>   } ]
> }
> {noformat}
> Expected:
> {noformat}
> {
>   "rules" : [ {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "user",
>     "fallbackResult" : "skip",
>     "parentQueue": "root",
>     "create" : true
>   }, {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "primaryGroup",
>     "fallbackResult" : "skip",
>     "parentQueue": "root",
>     "create" : true
>   } ]
> }
> {noformat}
> This is missing right now and it needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10619) CS Mapping Rule %specified rule catches default submissions

2021-02-09 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281900#comment-17281900
 ] 

Szilard Nemeth commented on YARN-10619:
---

Thanks [~shuzirra] for working on this.
Patch LGTM, committed to trunk.


> CS Mapping Rule %specified rule catches default submissions
> ---
>
> Key: YARN-10619
> URL: https://issues.apache.org/jira/browse/YARN-10619
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10619.001.patch
>
>
> If we have a mapping rule which places the application into the %specified 
> queue, then application submissions without a specified queue (default) will 
> get placed into default. 
> The expected behaviour would be to fail the %specified placement when no queue 
> was specified, and move on or reject based on the fallback action of the 
> rule. 
> Also, it is impossible to differentiate between an explicitly specified 
> 'default' and the case when the user does not specify any actual queue, so 
> these will be handled the same.
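> For reference, the rule in question would look roughly like this in the new 
> JSON mapping rule format (a sketch following the shape shown in YARN-10620, 
> not the exact rule):
> {noformat}
> {
>   "rules" : [ {
>     "type" : "user",
>     "matches" : "*",
>     "policy" : "specified",
>     "fallbackResult" : "skip"
>   } ]
> }
> {noformat}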



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10619) CS Mapping Rule %specified rule catches default submissions

2021-02-09 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10619:
--
Fix Version/s: 3.4.0

> CS Mapping Rule %specified rule catches default submissions
> ---
>
> Key: YARN-10619
> URL: https://issues.apache.org/jira/browse/YARN-10619
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10619.001.patch
>
>
> If we have a mapping rule which places the application into the %specified 
> queue, then application submissions without a specified queue (default) will 
> get placed into default. 
> The expected behaviour would be to fail the %specified placement when no queue 
> was specified, and move on or reject based on the fallback action of the 
> rule. 
> Also, it is impossible to differentiate between an explicitly specified 
> 'default' and the case when the user does not specify any actual queue, so 
> these will be handled the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10615) Fix Auto Queue Creation hierarchy construction to use queue path instead of short queue name

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10615:
--
Fix Version/s: 3.4.0

> Fix Auto Queue Creation hierarchy construction to use queue path instead of 
> short queue name
> 
>
> Key: YARN-10615
> URL: https://issues.apache.org/jira/browse/YARN-10615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10615.001.patch
>
>
> The CSAutoQueueHandler validates the parent hierarchy of a queue path on 
> creation. The queues are queried by their short name, which might cause 
> ambiguity (the short name of root.a is the same as that of root.b.a). We need 
> to query the queues from QueueManager by the full queue path.
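> A toy illustration of the ambiguity (not YARN code): keying queues by short 
> name cannot distinguish root.a from root.b.a, while keying by full path can:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> Map<String, String> byShortName = new HashMap<>();
> byShortName.put("a", "root.a");
> byShortName.put("a", "root.b.a"); // overwrites the previous entry - ambiguity
>
> Map<String, String> byFullPath = new HashMap<>();
> byFullPath.put("root.a", "a");
> byFullPath.put("root.b.a", "a");  // both queues can coexist
> {code}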



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10615) Fix Auto Queue Creation hierarchy construction to use queue path instead of short queue name

2021-02-05 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279796#comment-17279796
 ] 

Szilard Nemeth commented on YARN-10615:
---

Thanks [~gandras] for working on this,
Patch LGTM, committed to trunk.
Thanks [~zhuqi] for your review.

> Fix Auto Queue Creation hierarchy construction to use queue path instead of 
> short queue name
> 
>
> Key: YARN-10615
> URL: https://issues.apache.org/jira/browse/YARN-10615
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Critical
> Attachments: YARN-10615.001.patch
>
>
> The CSAutoQueueHandler validates the parent hierarchy of a queue path on 
> creation. The queues are queried by their short name, which might cause 
> ambiguity (the short name of root.a is the same as that of root.b.a). We need 
> to query the queues from QueueManager by the full queue path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10610) Add queuePath to RESTful API for CapacityScheduler consistent with FairScheduler queuePath

2021-02-05 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279789#comment-17279789
 ] 

Szilard Nemeth commented on YARN-10610:
---

Thanks [~zhuqi] for working on this.
Latest patch LGTM, committed to trunk.
Thanks [~shuzirra] and [~gandras] for the reviews.

> Add queuePath to RESTful API for CapacityScheduler consistent with 
> FairScheduler queuePath
> --
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> YARN-10610.003.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  
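> A sketch of the intended shape of the scheduler REST payload after the change 
> (illustrative values only):
> {noformat}
> {
>   "queueName" : "a1",
>   "queuePath" : "root.a.a1",
>   ...
> }
> {noformat}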



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10610) Add queuePath to RESTful API for CapacityScheduler consistent with FairScheduler queuePath

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10610:
--
Fix Version/s: 3.4.0

> Add queuePath to RESTful API for CapacityScheduler consistent with 
> FairScheduler queuePath
> --
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> YARN-10610.003.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10610) Add queuePath to RESTful API for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10610:
--
Summary: Add queuePath to RESTful API for CapacityScheduler consistent with 
FairScheduler queuePath.  (was: Add queuePath to RESTful api for 
CapacityScheduler consistent with FairScheduler queuePath.)

> Add queuePath to RESTful API for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> YARN-10610.003.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10610) Add queuePath to RESTful API for CapacityScheduler consistent with FairScheduler queuePath

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10610:
--
Summary: Add queuePath to RESTful API for CapacityScheduler consistent with 
FairScheduler queuePath  (was: Add queuePath to RESTful API for 
CapacityScheduler consistent with FairScheduler queuePath.)

> Add queuePath to RESTful API for CapacityScheduler consistent with 
> FairScheduler queuePath
> --
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> YARN-10610.003.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10610) Add queuePath to RESTful api for CapacityScheduler consistent with FairScheduler queuePath.

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10610:
--
Summary: Add queuePath to RESTful api for CapacityScheduler consistent with 
FairScheduler queuePath.  (was: Add queuePath to restful api for 
CapacityScheduler consistent with FairScheduler queuePath.)

> Add queuePath to RESTful api for CapacityScheduler consistent with 
> FairScheduler queuePath.
> ---
>
> Key: YARN-10610
> URL: https://issues.apache.org/jira/browse/YARN-10610
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10610.001.patch, YARN-10610.002.patch, 
> YARN-10610.003.patch, image-2021-02-03-13-47-13-516.png
>
>
> The CS only has queueName, but not the full queuePath.
> !image-2021-02-03-13-47-13-516.png|width=631,height=356!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

2021-02-05 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279777#comment-17279777
 ] 

Szilard Nemeth commented on YARN-10428:
---

Thanks [~yguang11], [~gandras] for working on this,
Latest patch LGTM, committed to trunk.


> Zombie applications in the YARN queue using FAIR + sizebasedweight
> --
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.5
>Reporter: Guang Yang
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch, 
> YARN-10428.003.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR + size-based weight 
> ordering policy.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications; with a 
> zombie job, it hits the limit even though the number of running applications 
> is far less than the limit. 
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in 
> `FairOrderingPolicy#schedulableEntities` (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set; however, if we check the 
> ResourceManager log, we can see RM already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears RM failed to remove the application from the set.
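> The failure mode is consistent with an ordered collection whose comparator 
> depends on mutable state; a standalone sketch (not YARN code) of how such a 
> set can fail to remove an element after its ordering key changes:
> {code:java}
> import java.util.TreeSet;
>
> public class MutableKeyDemo {
>   static class App {
>     final String id;
>     long usage; // mutable field the comparator depends on
>     App(String id, long usage) { this.id = id; this.usage = usage; }
>   }
>
>   public static void main(String[] args) {
>     TreeSet<App> set = new TreeSet<>((a, b) -> {
>       int c = Long.compare(a.usage, b.usage);
>       return c != 0 ? c : a.id.compareTo(b.id);
>     });
>     App b = new App("app-2", 20);
>     App a = new App("app-1", 10);
>     set.add(b);           // becomes the tree root
>     set.add(a);           // smaller usage, placed to the left
>
>     a.usage = 30;         // ordering key mutated while inside the set
>     System.out.println(set.remove(a)); // false: lookup now searches the right
>     System.out.println(set.size());    // still 2 - a zombie entry remains
>   }
> }
> {code}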



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

2021-02-05 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10428:
--
Fix Version/s: 3.4.0

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> --
>
> Key: YARN-10428
> URL: https://issues.apache.org/jira/browse/YARN-10428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 2.8.5
>Reporter: Guang Yang
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10428.001.patch, YARN-10428.002.patch, 
> YARN-10428.003.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR + size-based weight 
> ordering policy.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications; with a 
> zombie job, it hits the limit even though the number of running applications 
> is far less than the limit. 
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in 
> `FairOrderingPolicy#schedulableEntities` (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set; however, if we check the 
> ResourceManager log, we can see RM already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears RM failed to remove the application from the set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10612) Fix findbugs issue introduced in YARN-10585

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278606#comment-17278606
 ] 

Szilard Nemeth commented on YARN-10612:
---

Hi [~shuzirra],
Thanks for the explanation.
Fix looks good to me, committed to trunk.
Resolving this jira.

> Fix findbugs issue introduced in YARN-10585
> ---
>
> Key: YARN-10612
> URL: https://issues.apache.org/jira/browse/YARN-10612
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10612.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10612) Fix findbugs issue introduced in YARN-10585

2021-02-03 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10612:
--
Fix Version/s: 3.4.0

> Fix findbugs issue introduced in YARN-10585
> ---
>
> Key: YARN-10612
> URL: https://issues.apache.org/jira/browse/YARN-10612
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10612.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10612) Fix findbugs issue introduced in YARN-10585

2021-02-03 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10612:
--
Summary: Fix findbugs issue introduced in YARN-10585  (was: Fix find bugs 
issue introduced in YARN-10585)

> Fix findbugs issue introduced in YARN-10585
> ---
>
> Key: YARN-10612
> URL: https://issues.apache.org/jira/browse/YARN-10612
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10612.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth edited comment on YARN-10585 at 2/3/21, 8:10 PM:


Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since the middle of 2019, and I have been paying 
attention to and striving for the best code quality and Yetus results, making 
sure the code meets the quality standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: They are completely unrelated.
- 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
 This is Fair Scheduler related, and the patch is not.
- 
org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
 This is a well-known flaky test.


3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. What did you mean by that? I have 
never amended any commit, as it modifies git's commit history, and this is to 
be avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as said in 
bullet point 1, this is an exceptional situation. I have added 200+ commits 
and I can't recall a case where I committed a findbugs issue. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I can agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests [~shuzirra] 
introduced, and the coverage was more than enough. Such an NPE would have 
surfaced during the UT execution as well.

[~sunil.gov...@gmail.com] Please chime in on the topic of how to fix this, in 
a follow-up or by reopening this one, and please share your thoughts about the 
pros/cons.
Thanks



was (Author: snemeth):
Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since the middle of 2019, and I have been paying 
attention to and striving for the best code quality and Yetus results, making 
sure the code meets the quality standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: They are completely unrelated.
- 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
 This is Fair Scheduler related, and the patch is not.
- 
org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
 This is a well-known flaky test.


3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. What did you mean by that? I have 
never amended any commit, as it modifies git's commit history, and this is to 
be avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as said in 
bullet point 1, this is an exceptional situation. I have added 200+ commits 
and I can't recall a case where I committed a findbugs issue. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I can agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests [~shuzirra] 
introduced, and the coverage was more than enough. Such an NPE would have 
surfaced during the UT execution as well.


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make the transition easier, we need to create tooling to support the 
> migration effort. The first step is to create a class which can migrate from 
> the legacy format to the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth edited comment on YARN-10585 at 2/3/21, 8:10 PM:


Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since the middle of 2019, and I have been paying 
attention to and striving for the best code quality and Yetus results, making 
sure the code meets the quality standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: They are completely unrelated.
- 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
 This is Fair Scheduler related, and the patch is not.
- 
org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
 This is a well-known flaky test.


3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. What did you mean by that? I have 
never amended any commit, as it modifies git's commit history, and this is to 
be avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as said in 
bullet point 1, this is an exceptional situation. I have added 200+ commits 
and I can't recall a case where I committed a findbugs issue. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I can agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests [~shuzirra] 
introduced, and the coverage was more than enough. Such an NPE would have 
surfaced during the UT execution as well.

[~sunilg] Please chime in on the topic of how to fix this, in a follow-up or 
by reopening this one, and please share your thoughts about the pros/cons.
Thanks



was (Author: snemeth):
Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since the middle of 2019, and I have been paying 
attention to and striving for the best code quality and Yetus results, making 
sure the code meets the quality standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: They are completely unrelated.
- 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
 This is Fair Scheduler related, and the patch is not.
- 
org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
 This is a well-known flaky test.


3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. What did you mean by that? I have 
never amended any commit, as it modifies git's commit history, and this is to 
be avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as said in 
bullet point 1, this is an exceptional situation. I have added 200+ commits 
and I can't recall a case where I committed a findbugs issue. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I can agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests [~shuzirra] 
introduced, and the coverage was more than enough. Such an NPE would have 
surfaced during the UT execution as well.

[~sunil.gov...@gmail.com] Please chime in on the topic of how to fix this, in 
a follow-up or by reopening this one, and please share your thoughts about the 
pros/cons.
Thanks


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make the transition easier, we need to create tooling to support the 
> migration effort. The first step is to create a class which can migrate from 
> the legacy format to the new JSON format.

[jira] [Comment Edited] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth edited comment on YARN-10585 at 2/3/21, 8:09 PM:


Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I have never 
amended any commit, as it modifies git's commit history, and this is to be 
avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ added commits 
and I can't recall a case where I committed findbugs issues. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests [~shuzirra] 
introduced and the coverage was more than enough. Such an NPE would have 
surfaced during the UT execution as well.



was (Author: snemeth):
Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I have never 
amended any commit, as it modifies git's commit history, and this is to be 
avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ added commits 
and I can't recall a case where I committed findbugs issues. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth edited comment on YARN-10585 at 2/3/21, 8:08 PM:


Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I have never 
amended any commit, as it modifies git's commit history, and this is to be 
avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ added commits 
and I can't recall a case where I committed findbugs issues. So it's a bit of 
an overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.



was (Author: snemeth):
Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I have never 
amended any commit, as it modifies git's commit history, and this is to be 
avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ commits and I 
can't recall a case where I committed findbugs issues. So it's a bit of an 
overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth edited comment on YARN-10585 at 2/3/21, 8:08 PM:


Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I have never 
amended any commit, as it modifies git's commit history, and this is to be 
avoided on a repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ commits and I 
can't recall a case where I committed findbugs issues. So it's a bit of an 
overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.



was (Author: snemeth):
Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I never 
amended any commit, as it modified git history, and this is to be avoided on a 
repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ commits and I 
can't recall a case where I committed findbugs issues. So it's a bit of an 
overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278337#comment-17278337
 ] 

Szilard Nemeth commented on YARN-10585:
---

Hi [~ahussein],

My thoughts:

1. Apologies for merging this one with the Findbugs issue.
I have been a committer since mid-2019 and have always paid close attention to 
code quality and Yetus results, making sure the code meets the quality 
standards we expect at Hadoop.
This one is an exceptional case that simply fell through the cracks.

2. About the UT failures: they are completely unrelated.
- org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartOnMissingAttempts[FAIR]:
  this test is Fair Scheduler related, while the patch is not.
- org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken:
  this is a well-known flaky test.

3. I can see that [~shuzirra] already reported YARN-10612 and you also left a 
comment there.
I still don't understand how reopening this jira is a better approach than 
fixing it in a follow-up.
We will have one more commit on top of trunk nevertheless, as I would not 
revert this commit for the sake of a single findbugs warning.
You mentioned amending on the other jira. How did you mean that? I never 
amended any commit, as it modified git history, and this is to be avoided on a 
repository that is used by many, many people.

4. About scalability: I generally agree with your comment, but as I said in 
bullet point 1, this is an exceptional situation. I have 200+ commits and I 
can't recall a case where I committed findbugs issues. So it's a bit of an 
overstatement that this will cause a flood of commits.

5. Credibility: I agree that we need to strive for findbugs-error-free 
commits. However, I have carefully reviewed the unit tests Gergo introduced 
and the coverage was more than enough. Such an NPE would have surfaced during 
the UT execution as well.
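
To make the amend-vs-follow-up distinction debated above concrete: a follow-up 
adds a new commit on top, while amending rewrites the already published commit 
into a different one, forcing everyone who fetched it to reset. A sketch with 
hypothetical commit hashes and subjects:

{noformat}
# Follow-up: history is preserved, a second commit lands on trunk.
abc1234 YARN-10585. Create converter class ...
def5678 YARN-10612. Fix findbugs issue introduced in YARN-10585

# Amend: the original commit is rewritten in place and its hash changes,
# so every clone that already fetched abc1234 diverges from trunk.
fab9999 YARN-10585. Create converter class ... (amended)
{noformat}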


> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.
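
For illustration, a rough sketch of the conversion such a class performs, 
assuming a typical legacy queue-mapping string as input; the JSON field names 
below are assumptions based on the new mapping rule format, not the exact 
schema:

{noformat}
# Legacy mapping rule format (one possible input):
u:%user:%user,g:hadoop:root.hadoopers

# Hypothetical JSON output of the converter:
{
  "rules": [
    {"type": "user", "matches": "*", "policy": "user",
     "parentQueue": "root", "fallbackResult": "skip"},
    {"type": "group", "matches": "hadoop", "policy": "custom",
     "customPlacement": "root.hadoopers", "fallbackResult": "skip"}
  ]
}
{noformat}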



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10612) Fix find bugs issue introduced in YARN-10585

2021-02-03 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278335#comment-17278335
 ] 

Szilard Nemeth commented on YARN-10612:
---

Hi [~ahussein],
See my comment on the other jira (YARN-10585).

> Fix find bugs issue introduced in YARN-10585
> 
>
> Key: YARN-10612
> URL: https://issues.apache.org/jira/browse/YARN-10612
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Priority: Major
> Attachments: YARN-10612.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10605) Add queue-mappings-override.enable property in FS2CS conversions

2021-02-02 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276932#comment-17276932
 ] 

Szilard Nemeth commented on YARN-10605:
---

Thanks [~gandras] for working on this,

Patch is straightforward, LGTM, committed to trunk.

Thanks [~bteke] and [~shuzirra] for the reviews.

> Add queue-mappings-override.enable property in FS2CS conversions
> 
>
> Key: YARN-10605
> URL: https://issues.apache.org/jira/browse/YARN-10605
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10605.001.patch, YARN-10605.002.patch
>
>
> In Capacity Scheduler the
> {noformat}
> queue-mappings-override.enable
> {noformat}
> property is false by default. As this is not set during an FS2CS conversion, 
> the converted placement rules (aka. mapping rules in CS) are ignored during 
> application submission. We should enable this property in the conversion 
> logic if there are placement rules to be converted.
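
As a sketch, the conversion logic described above would need to emit something 
like the following into the generated capacity-scheduler.xml (standard Hadoop 
property syntax; emitting it only when placement rules exist):

{noformat}
<property>
  <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
  <value>true</value>
</property>
{noformat}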



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10604) Support auto queue creation without mapping rules

2021-02-02 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276923#comment-17276923
 ] 

Szilard Nemeth commented on YARN-10604:
---

Hi [~gandras],

Thanks for working on this.

Patch looks good to me, committed to trunk.

Thanks [~bteke] and [~shuzirra] for the reviews.

> Support auto queue creation without mapping rules
> -
>
> Key: YARN-10604
> URL: https://issues.apache.org/jira/browse/YARN-10604
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10604.001.patch
>
>
> Currently, the Capacity Scheduler skips auto queue creation entirely if the 
> ApplicationPlacementContext is null, which happens when the mapping rules 
> are turned off by:
> {noformat}
> <property>
>   <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
>   <value>false</value>
> </property>
> {noformat}
> We should allow the auto queue creation to be taken into consideration 
> without disrupting the application submission flow.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10604) Support auto queue creation without mapping rules

2021-02-02 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10604:
--
Fix Version/s: 3.4.0

> Support auto queue creation without mapping rules
> -
>
> Key: YARN-10604
> URL: https://issues.apache.org/jira/browse/YARN-10604
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10604.001.patch
>
>
> Currently, the Capacity Scheduler skips auto queue creation entirely if the 
> ApplicationPlacementContext is null, which happens when the mapping rules 
> are turned off by:
> {noformat}
> <property>
>   <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
>   <value>false</value>
> </property>
> {noformat}
> We should allow the auto queue creation to be taken into consideration 
> without disrupting the application submission flow.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10598:
--
Comment: was deleted

(was: Okay, since the patch is already committed and the findbugs issue is not 
caused by this one, I'm resolving this jira.)

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch, 
> YARN-10598.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273119#comment-17273119
 ] 

Szilard Nemeth commented on YARN-10598:
---

Okay, since the patch is already committed and the findbugs issue is not 
caused by this one, I'm resolving this jira.

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch, 
> YARN-10598.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273017#comment-17273017
 ] 

Szilard Nemeth edited comment on YARN-10598 at 1/27/21, 5:15 PM:
-

Thanks [~bteke],

Latest patch LGTM, committed to trunk.

The findbugs issue was not introduced by this patch, and the checkstyle issues 
can be ignored.

Thanks [~gandras] for the review.


was (Author: snemeth):
Thanks [~bteke],

Latest patch LGTM, committed to trunk.

Thanks [~gandras] for the review.

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch, 
> YARN-10598.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273017#comment-17273017
 ] 

Szilard Nemeth commented on YARN-10598:
---

Thanks [~bteke],

Latest patch LGTM, committed to trunk.

Thanks [~gandras] for the review.

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch, 
> YARN-10598.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10598:
--
Fix Version/s: 3.4.0

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch, 
> YARN-10598.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*
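
For illustration, a rough sketch of how these three fields might appear for a 
single queue in the RM's /scheduler REST response; the surrounding structure 
and the queue name are assumptions, only the three field names come from the 
description above:

{noformat}
"queue": {
  "queueName": "root.users",
  "queueType": "parent",
  "creationMethod": "static",
  "autoCreationEligibility": "flexible"
}
{noformat}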



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10598) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the creation type with additional information

2021-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272939#comment-17272939
 ] 

Szilard Nemeth commented on YARN-10598:
---

Hi [~bteke],

Latest patch looks good to me.

Two minor things only:
 # This method is not used: 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySchedDynamicConfig#submitApp
 # It would be useful to add javadoc for all static final variables that are 
among valid values of autoCreationType, queueType and auto creation 
eligibility: 
org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.helper.CapacitySchedulerInfoHelper
I can also accept this in a follow-up fix as it's not critical.

 

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to extend the 
> creation type with additional information
> --
>
> Key: YARN-10598
> URL: https://issues.apache.org/jira/browse/YARN-10598
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10598.001.patch, YARN-10598.002.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> To extend/modify the values added in YARN-10581 these 3 fields will describe 
> a queue:
>  * queueType : *parent/leaf*
>  * creationMethod : *static/dynamicLegacy/dynamicFlexible*
>  * autoCreationEligibility : *off/legacy/flexible*
> After this change here are some example cases:
>  * Static parent queue which has the auto-creation-enabled-v2 false:
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *off*
>  * Static managed parent (can have dynamic children):
>  ** queueType : *parent*
>  ** creationMethod : *static*
>  ** autoCreationEligibility : *legacy*
>  * Legacy auto-created leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicLegacy*
>  ** autoCreationEligibility : *off*
>  * Auto-created (v2) parent queue, (implicitly) auto-creation-enabled-v2 
> true: 
>  ** queueType : *parent*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *flexible*
>  * Auto-created (v2) leaf queue (cannot have children):
>  ** queueType : *leaf*
>  ** creationMethod : *dynamicFlexible*
>  ** autoCreationEligibility : *off*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10599) fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents

2021-01-27 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10599:
--
Fix Version/s: 3.4.0

> fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all 
> parents
> -
>
> Key: YARN-10599
> URL: https://issues.apache.org/jira/browse/YARN-10599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: YARN-10599-001.patch, YARN-10599-002.patch
>
>
> The property 
> {{yarn.scheduler.capacity.<queue-path>.auto-queue-creation-v2.enabled}} is 
> not enabled by default for parent queues. However, users who migrate from FS 
> need this property enabled for all parent queues, because FS allows them to 
> have dynamic children.
> Note that this is only relevant if we convert directly to weights; it's not 
> needed in percentage mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10599) fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all parents

2021-01-27 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272904#comment-17272904
 ] 

Szilard Nemeth commented on YARN-10599:
---

Thanks [~pbacsko],

Latest patch LGTM, committed to trunk.

> fs2cs should generate new "auto-queue-creation-v2.enabled" properties for all 
> parents
> -
>
> Key: YARN-10599
> URL: https://issues.apache.org/jira/browse/YARN-10599
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Fix For: 3.4.0
>
> Attachments: YARN-10599-001.patch, YARN-10599-002.patch
>
>
> The property 
> {{yarn.scheduler.capacity.<queue-path>.auto-queue-creation-v2.enabled}} is 
> not enabled by default for parent queues. However, users who migrate from FS 
> need this property enabled for all parent queues, because FS allows them to 
> have dynamic children.
> Note that this is only relevant if we convert directly to weights; it's not 
> needed in percentage mode.
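
A minimal sketch of the per-parent property fs2cs would generate in weight 
mode, assuming a parent queue root.users (the queue path is illustrative):

{noformat}
<property>
  <name>yarn.scheduler.capacity.root.users.auto-queue-creation-v2.enabled</name>
  <value>true</value>
</property>
{noformat}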



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-01-26 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10585:
--
Fix Version/s: 3.4.0

> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10585) Create a class which can convert from legacy mapping rule format to the new JSON format

2021-01-26 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272274#comment-17272274
 ] 

Szilard Nemeth commented on YARN-10585:
---

Thanks [~shuzirra] for this, good job!

Latest patch LGTM, committed to trunk.

> Create a class which can convert from legacy mapping rule format to the new 
> JSON format
> ---
>
> Key: YARN-10585
> URL: https://issues.apache.org/jira/browse/YARN-10585
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10585.001.patch, YARN-10585.002.patch, 
> YARN-10585.003.patch
>
>
> To make transition easier we need to create tooling to support the migration 
> effort. The first step is to create a class which can migrate from legacy to 
> the new JSON format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10581) CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include queue creation type for queues

2021-01-26 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272169#comment-17272169
 ] 

Szilard Nemeth commented on YARN-10581:
---

Resolving this Jira as [~pbacsko] committed it last week.

Thanks for the commit.

> CS Flexible Auto Queue Creation: Modify RM /scheduler endpoint to include 
> queue creation type for queues
> 
>
> Key: YARN-10581
> URL: https://issues.apache.org/jira/browse/YARN-10581
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10581.001.patch, YARN-10581.002.patch, 
> YARN-10581.003.patch
>
>
> Under this umbrella (YARN-10496), weight-mode has been implemented for CS 
> with YARN-10504.
> Auto-queue creation has also been implemented with YARN-10506.
> Connected to this effort, we would like to expose the type of the queue with 
> the RM's /scheduler REST endpoint.
> The queue type should hold these values: 
>  * Auto-created parent queue: *autoCreatedParent*
>  * Auto-created leaf queue: *autoCreatedLeaf*
>  * Static parent: *staticParent*
>  * Static leaf: *staticLeaf* 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10596) Allow static definition of childless ParentQueues with auto-queue-creation-v2 enabled

2021-01-26 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272161#comment-17272161
 ] 

Szilard Nemeth commented on YARN-10596:
---

Hi [~gandras],

Thanks for working on this. 

Latest patch looks good to me, committed to trunk.

Thanks [~pbacsko] and [~bteke] for the reviews.

> Allow static definition of childless ParentQueues with auto-queue-creation-v2 
> enabled
> -
>
> Key: YARN-10596
> URL: https://issues.apache.org/jira/browse/YARN-10596
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10596.001.patch, YARN-10596.002.patch
>
>
> The old auto queue creation/managed queue logic allowed childless parents to 
> be defined statically, if the auto-create-child-queue flag was turned on for 
> the parent (thus making it a ManagedParentQueue).
> Since it is not an edge case, we also need to support the creation of a 
> ParentQueue instead of a LeafQueue, if auto-queue-creation-v2 is enabled, 
> even when no child queue is defined under the parent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10596) Allow static definition of childless ParentQueues with auto-queue-creation-v2 enabled

2021-01-26 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10596:
--
Fix Version/s: 3.4.0

> Allow static definition of childless ParentQueues with auto-queue-creation-v2 
> enabled
> -
>
> Key: YARN-10596
> URL: https://issues.apache.org/jira/browse/YARN-10596
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10596.001.patch, YARN-10596.002.patch
>
>
> The old auto queue creation/managed queue logic allowed childless parents to 
> be defined statically, if the auto-create-child-queue flag was turned on for 
> the parent (thus making it a ManagedParentQueue).
> Since it is not an edge case, we also need to support the creation of a 
> ParentQueue instead of a LeafQueue, if auto-queue-creation-v2 is enabled, 
> even when no child queue is defined under the parent.
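
For illustration, a configuration sketch of such a childless parent (queue 
names are assumptions): root.dynamic defines no children, yet should become a 
ParentQueue because v2 auto-creation is enabled on it:

{noformat}
yarn.scheduler.capacity.root.queues = default,dynamic
yarn.scheduler.capacity.root.dynamic.auto-queue-creation-v2.enabled = true
# Note: no yarn.scheduler.capacity.root.dynamic.queues property is defined.
{noformat}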



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10515) Fix flaky test TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags

2021-01-21 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269592#comment-17269592
 ] 

Szilard Nemeth commented on YARN-10515:
---

Thanks [~pbacsko],

Straightforward patch, committed to trunk.

Do you plan to backport this to any of the older branches?

If so, please reopen this Jira and upload patches to those branches.

Thanks.
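
The failure quoted below comes from a QueueMetrics source surviving across 
ResourceManager instances within one test JVM. As a sketch of the usual 
mitigation (both calls exist in hadoop-metrics2 and the RM's QueueMetrics, but 
their use here is illustrative, not necessarily what the patch does):

{noformat}
// Before creating a fresh MockRM in the test:
DefaultMetricsSystem.shutdown();
QueueMetrics.clearQueueMetrics();
// Alternatively, mini-cluster mode tolerates duplicate source names:
DefaultMetricsSystem.setMiniClusterMode(true);
{noformat}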

> Fix flaky test 
> TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags
> --
>
> Key: YARN-10515
> URL: https://issues.apache.org/jira/browse/YARN-10515
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10515-001.patch
>
>
> The testcase 
> TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags 
> sometimes fails with the following error:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:174)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:884)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1296)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:339)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.serviceInit(MockRM.java:1018)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:158)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:134)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.(MockRM.java:130)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation$5.(TestCapacitySchedulerAutoQueueCreation.java:873)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation.testDynamicAutoQueueCreationWithTags(TestCapacitySchedulerAutoQueueCreation.java:873)
> ...
> Caused by: org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition=,q0=root,q1=a already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:317)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToQueue(QueueMetrics.java:513)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:308)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java:412)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.setupQueueConfigs(AbstractCSQueue.java:350)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setupQueueConfigs(ParentQueue.java:137)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.(ParentQueue.java:119)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractManagedParentQueue.(AbstractManagedParentQueue.java:52)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ManagedParentQueue.(ManagedParentQueue.java:57)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:261)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:289)
>   at 
> 
