[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298069#comment-17298069
 ] 

Peter Bacsko commented on YARN-9615:


+1

I had to commit twice because there are actually two authors.

Thanks for the patch [~jhung] / [~zhuqi] and [~bibinchundatt] for the review.

Committed to trunk.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, 
> YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, 
> image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.
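A rough illustration of the idea only (not the code in the attached patches; the wrapper class below is hypothetical): count events and accumulate processing time per event type around an EventHandler.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import org.apache.hadoop.yarn.event.Event;
import org.apache.hadoop.yarn.event.EventHandler;

// Hypothetical sketch: wraps a real handler and records per-event-type metrics.
public class TimedEventHandler<T extends Event> implements EventHandler<T> {
  private final EventHandler<T> delegate;
  private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

  public TimedEventHandler(EventHandler<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public void handle(T event) {
    String type = String.valueOf(event.getType());
    long start = System.nanoTime();
    try {
      delegate.handle(event);
    } finally {
      counts.computeIfAbsent(type, k -> new LongAdder()).increment();
      totalNanos.computeIfAbsent(type, k -> new LongAdder()).add(System.nanoTime() - start);
    }
  }
}
{code}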






[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298068#comment-17298068
 ] 

Peter Bacsko commented on YARN-9615:


Thanks [~zhuqi] patch v11 looks good, committing it soon.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, 
> YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, 
> image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298064#comment-17298064
 ] 

Peter Bacsko commented on YARN-10679:
-

+1 thanks [~snemeth] for the patch and [~shuzirra] for the review.

Committed to trunk.

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10679.001.patch
>
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to correlate the uncaught exceptions (in this case an NPE) with 
> the log itself, as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.
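For illustration, a minimal sketch of the kind of replacement described (generic class names, not the actual SLS code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExampleSimulatorTask implements Runnable {
  private static final Logger LOG = LoggerFactory.getLogger(ExampleSimulatorTask.class);

  @Override
  public void run() {
    try {
      doWork();
    } catch (Exception e) {
      // Before: e.printStackTrace(); -- goes to stderr and is hard to correlate with the log
      LOG.error("Uncaught exception in simulator task", e);
    }
  }

  private void doWork() throws Exception {
    // simulated work
  }
}
{code}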






[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298062#comment-17298062
 ] 

Peter Bacsko commented on YARN-10679:
-

Ok, this time the failed test is different, most likely a flaky one. Let's 
investigate it later.

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10679.001.patch
>
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to correlate the uncaught exceptions (in this case an NPE) with 
> the log itself, as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.






[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298057#comment-17298057
 ] 

Peter Bacsko commented on YARN-10679:
-

Re-triggered build to see what's going on with TestSLSRunner.

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10679.001.patch
>
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to correlate the uncaught exceptions (in this case an NPE) with 
> the log itself, as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.






[jira] [Commented] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298041#comment-17298041
 ] 

Peter Bacsko commented on YARN-10681:
-

+1 thanks [~snemeth] and [~shuzirra] for the patch and review, committed to 
trunk.

> Fix assertion failure message in BaseSLSRunnerTest
> --
>
> Key: YARN-10681
> URL: https://issues.apache.org/jira/browse/YARN-10681
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Trivial
> Attachments: YARN-10681.001.patch
>
>
> There is this failure message: 
> https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
> The word "catched" should be replaced with "caught".






[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298031#comment-17298031
 ] 

Peter Bacsko commented on YARN-10677:
-

+1 LGTM.

Thanks [~snemeth] for the patch and [~zhuqi] for the review. Committed to 
trunk. (Jenkins is running but I don't expect any issues).

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch, 
> YARN-10677.003.patch, YARN-10677.004.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> We need to fix this.
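The fix is essentially a one-liner; a sketch of its shape, assuming slf4j as used elsewhere in Hadoop (the class that was wrongly passed at the linked line is not repeated here):
{code:java}
// Pass the enclosing class so log lines are attributed to SLSFairScheduler.
private static final Logger LOG =
    LoggerFactory.getLogger(SLSFairScheduler.class);
{code}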






[jira] [Commented] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298010#comment-17298010
 ] 

Peter Bacsko commented on YARN-10678:
-

+1 thanks [~snemeth] for the patch and [~shuzirra] for the review.

Committed to trunk.

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, 
> YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, 
> YARN-10678.001.patch, 
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log,
>  
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log
>
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118
> In our internal environment, there was a situation when an NPE was logged 
> from this method: 
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This can happen if the following events occur:
> 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
> allocate method 
> 2. In this case, the local variable called 'allocation' remains null: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
> 3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
> here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
> 4. Then, we have an NPE here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
> In this case, we lost the original exception thrown from 
> FairScheduler#allocate.
> In order to fix this, a catch-block should be introduced and the exception 
> needs to be logged.
> The whole thing applies to SLSCapacityScheduler as well.
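A rough sketch of the proposed shape (simplified, with generic names and a placeholder update method; not the actual SLS code): catch and log the scheduler exception so that a secondary NPE in the finally block cannot hide the root cause.
{code:java}
import java.util.concurrent.Callable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AllocateWrapperSketch {
  private static final Logger LOG = LoggerFactory.getLogger(AllocateWrapperSketch.class);

  public <T> T callAndUpdate(Callable<T> allocateCall) throws Exception {
    T allocation = null;
    try {
      allocation = allocateCall.call();
      return allocation;
    } catch (Exception e) {
      // New catch block: keep the original exception visible before the finally runs.
      LOG.error("Caught exception from allocate", e);
      throw e;
    } finally {
      if (allocation != null) {   // guard also avoids the NPE described above
        updateQueueWithAllocateRequest(allocation);
      }
    }
  }

  private <T> void updateQueueWithAllocateRequest(T allocation) {
    // placeholder for the real queue-update logic
  }
}
{code}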





[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298007#comment-17298007
 ] 

Peter Bacsko commented on YARN-10677:
-

[~snemeth] please fix the whitespace and checkstyle, thanks. 

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch, 
> YARN-10677.003.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> We need to fix this.






[jira] [Commented] (YARN-10675) Consolidate YARN-10672 and YARN-10447

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297998#comment-17297998
 ] 

Peter Bacsko commented on YARN-10675:
-

+1 LGTM.

Thanks [~snemeth] for the patch, committed to trunk.

> Consolidate YARN-10672 and YARN-10447
> -
>
> Key: YARN-10675
> URL: https://issues.apache.org/jira/browse/YARN-10675
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10675.001.patch
>
>
> Let's consolidate the solution applied for YARN-10672 and apply it to the 
> code changes introduced with YARN-10447.
> Quoting [~pbacsko]: 
> {quote}
> The solution is much more straightforward than mine in YARN-10447. Actually, we 
> might consider applying this to TestLeafQueue and undoing my changes, because 
> that one is more complicated (I had no patience to dig deeper into Mockito's 
> internal behavior, I just thought: well, disable that thread and that's 
> enough).
> {quote}






[jira] [Commented] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297996#comment-17297996
 ] 

Peter Bacsko commented on YARN-10676:
-

+1 thanks [~snemeth] for the patch and [~bteke] / [~zhuqi] / [~shuzirra] for 
the review.

Committed to trunk.

> Improve code quality in TestTimelineAuthenticationFilterForV1
> -
>
> Key: YARN-10676
> URL: https://issues.apache.org/jira/browse/YARN-10676
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-10676.001.patch
>
>
> - In the testcase "testDelegationTokenOperations", the exception message is 
> checked, but if it does not match the assertion, the exception itself is not 
> printed. This happens 3 times.
> - Assertion messages can be added
> - Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be 
> static final.
> - There's a typo in comment "avaiable" (happens 2 times)
> - There are some Assert.fail() calls, without messages.
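A small, generic JUnit illustration of the first and last items (the client call, the expected-message constant and the assertion texts are placeholders, not the actual test code):
{code:java}
try {
  client.getDelegationToken();   // placeholder for the operation under test
  Assert.fail("Expected the delegation token operation to be rejected");
} catch (Exception e) {
  // Include the caught exception in the message so a mismatch is still diagnosable.
  Assert.assertTrue("Unexpected exception: " + e,
      e.getMessage().contains(EXPECTED_ERROR_MESSAGE));
}
{code}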






[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642
 ] 

Peter Bacsko edited comment on YARN-10178 at 3/8/21, 6:54 PM:
--

[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new 
Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: let's assume the test fails. We don't see the random seed that 
was used for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed);
   LOG.error(e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity check only works if the JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to normal JUnit assertion or just remove it.


was (Author: pbacsko):
[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new 
Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: let's assume the test fails. We don't see the random seed that 
was used for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed, e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity check only works if the JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to normal JUnit assertion or just remove it.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642
 ] 

Peter Bacsko commented on YARN-10178:
-

[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new 
Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: let's assume the test fails. We don't see the random seed that 
was used for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed, e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity check only works if the JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to normal JUnit assertion or just remove it.
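For reference, a sketch of what that conversion could look like with plain JUnit assertions (same variable names as in the quoted block):
{noformat}
assertNotNull("queueNames must not be null", queueNames);
assertNotNull("priorities must not be null", priorities);
assertNotNull("utilizations must not be null", utilizations);
assertTrue("queueNames must not be empty", queueNames.length > 0);
assertEquals("queueNames and priorities must have the same length",
    queueNames.length, priorities.length);
assertEquals("priorities and utilizations must have the same length",
    priorities.length, utilizations.length);
{noformat}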

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort requires 
> the comparator to obey its general contract:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y && y > z  -->  x > z
> 3. x.compareTo(y) == 0  implies  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> if the parameters passed to Arrays.sort do not satisfy this 
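For illustration only (not from the attached patches), a sketch of how a comparator that reads state mutated by another thread, analogous to queue utilization changing during the sort, can break this contract; the exception above may then be thrown, though not deterministically:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MutableComparatorDemo {
  // Shared state read by the comparator and mutated concurrently.
  static final int[] UTILIZATION = new int[1000];

  public static void main(String[] args) {
    List<Integer> queues = new ArrayList<>();
    for (int i = 0; i < UTILIZATION.length; i++) {
      queues.add(i);
      UTILIZATION[i] = i;
    }
    Thread mutator = new Thread(() -> {
      Random r = new Random();
      while (true) {
        UTILIZATION[r.nextInt(UTILIZATION.length)] = r.nextInt();
      }
    });
    mutator.setDaemon(true);
    mutator.start();

    for (int round = 0; round < 10_000; round++) {
      // The ordering can change mid-sort, violating TimSort's contract.
      Collections.sort(queues,
          (a, b) -> Integer.compare(UTILIZATION[a], UTILIZATION[b]));
    }
  }
}
{code}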

[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297504#comment-17297504
 ] 

Peter Bacsko commented on YARN-9615:


Let's wait for the Jenkins results of patch v10.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.poc.patch, 
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10672:

Fix Version/s: 3.2.3
   3.3.1

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method is 
> called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 
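The Mockito warning quoted above points at the usual remedy for spies; a minimal sketch of that style (this is not necessarily the committed change, which reorders the stubbing):
{code:java}
// when(spy.method()) invokes the real method while stubbing, which can interleave badly;
// the doReturn family stubs a spy without calling through.
// instead of: when(cs.getNumClusterNodes()).thenReturn(3);
Mockito.doReturn(3).when(cs).getNumClusterNodes();
{code}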

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297358#comment-17297358
 ] 

Peter Bacsko commented on YARN-10672:
-

+1 overall. Committed changes to branch-3.2 too.

Thanks [~snemeth] for the contribution.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method is 
> called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297345#comment-17297345
 ] 

Peter Bacsko commented on YARN-10672:
-

Ok, test failures seem to be totally unrelated. The change only concerns 
"TestReservations" and modifies the order of stubbing.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method is 
> called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Fix Version/s: 3.2.3

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack output, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: due to limited space, the 
> analysis is omitted.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which then holds both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Special note: a removed Node points to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but after 
> reading an item value from a Node it releases the lock. If take() is called at 
> that moment, the variable 'p' in forEachRemaining may end up pointing to a Node 
> that points to itself, and forEachRemaining goes into a dead loop. You can see 
> it in "deadloop.png".
> Here is a simple unit test: make forEachRemaining run more slowly than take() 
> and the problem reproduces. The unit test is MockForDeadLoop.java.
> I debugged MockForDeadLoop.java and saw a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281
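For illustration only (generic names, not the committed fix): the hazardous part is streaming over the live queue, which goes through LBQSpliterator.forEachRemaining; copying the contents first keeps the iteration off the queue's internal nodes.
{code:java}
// Hazardous on affected JDKs: stream() iterates the live LinkedBlockingQueue via
// LBQSpliterator.forEachRemaining while other threads put()/take() concurrently.
Map<String, Long> counts = eventQueue.stream()
    .collect(Collectors.groupingBy(e -> e.getType().toString(), Collectors.counting()));

// Safer sketch: toArray() copies the elements under the queue's locks and releases them,
// so the grouping below never touches the queue's internal Node links.
Object[] snapshot = eventQueue.toArray();
Map<String, Long> countsFromSnapshot = Arrays.stream(snapshot)
    .map(o -> ((Event) o).getType().toString())
    .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
{code}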




[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343
 ] 

Peter Bacsko commented on YARN-10642:
-

Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack output, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: due to limited space, the 
> analysis is omitted.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which then holds both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Special note: a removed Node points to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but after 
> reading an item value from a Node it releases the lock. If take() is called at 
> that moment, the variable 'p' in forEachRemaining may end up pointing to a Node 
> that points to itself, and forEachRemaining goes into a dead loop. You can see 
> it in "deadloop.png".
> Here is a simple unit test: make forEachRemaining run more slowly than take() 
> and the problem reproduces. The unit test is MockForDeadLoop.java.
> I debugged MockForDeadLoop.java and saw a Node pointing to itself. You can see 

[jira] [Comment Edited] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343
 ] 

Peter Bacsko edited comment on YARN-10642 at 3/8/21, 1:24 PM:
--

+1
Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 


was (Author: pbacsko):
Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack output, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: due to limited space, the 
> analysis is omitted.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which then holds both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> Let's look at a simple unit test, 
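
Since MockForDeadLoop.java is only attached to the JIRA and not reproduced in 
this digest, here is a minimal, hypothetical Java sketch of the same scenario: 
one thread streams over a LinkedBlockingQueue (which drives 
LBQSpliterator.forEachRemaining, just like printEventQueueDetails above) while 
another thread keeps taking elements. The sleep in the mapping step only serves 
to make forEachRemaining slower than take() and widen the race window; the 
class and method names are invented for illustration, and on an affected JDK 8 
the program may hang rather than fail deterministically.

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Hypothetical reproduction sketch; not the attached MockForDeadLoop.java.
public class LbqSpliteratorDeadLoopSketch {

  public static void main(String[] args) throws Exception {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    for (int i = 0; i < 1000; i++) {
      queue.put("event-" + i);
    }

    // Consumer thread: keeps dequeuing; each removed Node is self-linked by
    // LinkedBlockingQueue to help GC.
    Thread taker = new Thread(() -> {
      try {
        while (true) {
          queue.take();
        }
      } catch (InterruptedException ignored) {
        // exit quietly
      }
    });
    taker.setDaemon(true);
    taker.start();

    // Iterating thread: mimics printEventQueueDetails. The stream drives
    // LBQSpliterator.forEachRemaining; the sleep slows it down so take() can
    // remove the Node the spliterator is about to revisit. When the race hits,
    // this call never returns and holds both queue locks.
    String joined = queue.stream()
        .map(e -> {
          sleepQuietly(5);
          return e;
        })
        .collect(Collectors.joining(","));
    System.out.println("Iteration finished, joined length: " + joined.length());
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}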

[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297311#comment-17297311
 ] 

Peter Bacsko commented on YARN-9615:


[~zhuqi] could you investigate the failing unit tests? Something doesn't seem 
right, especially with "TestAsyncDispatcher".

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, 
> image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.
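
As background on what such metrics involve: per-event-type bookkeeping in a 
dispatcher comes down to a counter and a cumulative processing time keyed by 
the event type, updated around each handler invocation. The sketch below is a 
generic illustration of that idea, not the code in the attached patches; the 
class PerEventTypeStats and its method names are invented here.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Generic illustration only; not the implementation from the YARN-9615 patches.
public class PerEventTypeStats<T extends Enum<T>> {

  private final ConcurrentHashMap<T, LongAdder> counts = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<T, LongAdder> totalNanos = new ConcurrentHashMap<>();

  /** Record one handled event of the given type and how long its handler took. */
  public void record(T eventType, long elapsedNanos) {
    counts.computeIfAbsent(eventType, t -> new LongAdder()).increment();
    totalNanos.computeIfAbsent(eventType, t -> new LongAdder()).add(elapsedNanos);
  }

  public long getCount(T eventType) {
    LongAdder c = counts.get(eventType);
    return c == null ? 0L : c.sum();
  }

  public long getTotalProcessingNanos(T eventType) {
    LongAdder t = totalNanos.get(eventType);
    return t == null ? 0L : t.sum();
  }
}
{code}

A dispatcher could then wrap each handler call with something like 
"long start = System.nanoTime(); handler.handle(event); 
stats.record(event.getType(), System.nanoTime() - start);" and expose the sums 
through whatever metrics system is in use.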



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10674:

Labels: fs2cs  (was: )

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS, the auto deletion check interval is 10 seconds:
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297302#comment-17297302
 ] 

Peter Bacsko commented on YARN-10642:
-

Committed patch to branch-3.3, waiting for Jenkins results from branch-3.2.

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Fix Version/s: 3.3.1

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message 

[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297298#comment-17297298
 ] 

Peter Bacsko commented on YARN-10642:
-

Uploaded v2 for branch-3.2; the original patch doesn't compile there (slf4j vs. 
commons logging).

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Attachment: YARN-10642-branch-3.2.002.patch

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Attachment: YARN-10642-branch-3.2.001.patch

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, 
> YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Attachment: (was: YARN-10642-branch-3.2.001.patch)

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, 
> YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstack dumps, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: for lack of space, the 
> analysis is omitted.)
> The cause is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which takes both the putLock 
> and the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Reading LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note the special case: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator visits every Node, but 
> after reading an item value from a Node it releases the lock. If take() runs 
> at that moment, the variable 'p' in forEachRemaining may end up pointing at a 
> Node that points to itself, and forEachRemaining spins in an endless loop. 
> You can see this in "deadloop.png".
> A simple unit test reproduces the problem by making forEachRemaining run more 
> slowly than take; the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by 

[jira] [Commented] (YARN-10658) CapacityScheduler QueueInfo add queue path field to avoid ambiguous QueueName.

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297284#comment-17297284
 ] 

Peter Bacsko commented on YARN-10658:
-

+1

Thanks [~zhuqi] for the patch. Committed to trunk.

> CapacityScheduler QueueInfo add queue path field to avoid ambiguous QueueName.
> --
>
> Key: YARN-10658
> URL: https://issues.apache.org/jira/browse/YARN-10658
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10658.001.patch, YARN-10658.002.patch, 
> YARN-10658.003.patch
>
>
> Now that leaf queues can use the same name, the QueueInfo getQueueName method 
> can return an ambiguous queue name. We should add a queue path field to avoid 
> the ambiguity and make it consistent with the fair scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10672:

Attachment: YARN-10672.branch-3.2.001.patch

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times does not reliably 
> produce 100 passes.
>  For example, running testReservationNoContinueLook 100 times produced 39 
> failed and 61 passed results for me.
>  Sometimes only 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test fails:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method 
> is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> 
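The Mockito warning quoted above points at the likely mechanism: 
when(spy.foo()).thenReturn(...) invokes the real method on the spy while the 
stub is being recorded, which can interleave badly with other activity. The 
safer pattern it recommends is doReturn(...).when(spy).foo(), which never calls 
the real method. The fragment below is only an illustration of that pattern, 
reusing the names from the setup code quoted above (cs, spyRMContext, csConf); 
it is not the committed fix.

{code:java}
// Illustration only; cs, spyRMContext and csConf refer to the objects created
// in the TestReservations setup quoted above.
// Requires: import static org.mockito.Mockito.doReturn;
cs.setRMContext(spyRMContext);
cs.init(csConf);
cs.start();

// when(cs.getNumClusterNodes()).thenReturn(3) calls the real
// getNumClusterNodes() while Mockito records the stub; doReturn() records the
// stub without touching the real method, which is the recommended way to stub
// a spy.
doReturn(3).when(cs).getNumClusterNodes();
{code}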

[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10672:

Fix Version/s: 3.4.0

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times does not reliably 
> produce 100 passes.
>  For example, running testReservationNoContinueLook 100 times produced 39 
> failed and 61 passed results for me.
>  Sometimes only 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test fails:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method 
> is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296362#comment-17296362
 ] 

Peter Bacsko commented on YARN-10672:
-

+1 LGTM.

Thanks [~snemeth], committed to trunk. You might want to consider backporting 
this to branch-3.3 and branch-3.2.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times does not reliably 
> produce 100 passes.
>  For example, running testReservationNoContinueLook 100 times produced 39 
> failed and 61 passed results for me.
>  Sometimes only 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes, so the AM can't be allocated and the test fails:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real 'getNumClusterNodes' method 
> is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Commented] (YARN-10658) CapacityScheduler QueueInfo add queue path field to avoid ambiguous QueueName.

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296198#comment-17296198
 ] 

Peter Bacsko commented on YARN-10658:
-

[~zhuqi] please fix the remaining checkstyle issues, except the 
"ParameterNumber", because we can ignore that.

> CapacityScheduler QueueInfo add queue path field to avoid ambiguous QueueName.
> --
>
> Key: YARN-10658
> URL: https://issues.apache.org/jira/browse/YARN-10658
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10658.001.patch, YARN-10658.002.patch
>
>
> Now that leaf queues can use the same name, the QueueInfo getQueueName method 
> can return an ambiguous queue name. We should add a queue path field to avoid 
> the ambiguity and make it consistent with the fair scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YARN-8786.

Resolution: Fixed

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a YARN 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10640) Adjust the queue Configured capacity to Configured weight number for weight mode in UI.

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296067#comment-17296067
 ] 

Peter Bacsko commented on YARN-10640:
-

Thanks [~zhuqi] for the patch and [~gandras] + [~bteke] for the review.

Committed to trunk.

> Adjust the queue Configured capacity to  Configured weight number for weight 
> mode in UI.
> 
>
> Key: YARN-10640
> URL: https://issues.apache.org/jira/browse/YARN-10640
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10640.001.patch, YARN-10640.002.patch, 
> YARN-10640.003.patch, YARN-10640.004.patch, 
> image-2021-02-20-11-21-50-306.png, image-2021-02-20-14-18-56-261.png, 
> image-2021-02-20-14-19-30-767.png, image-2021-03-02-11-34-26-062.png
>
>
> In weight mode:
> Both static and dynamic queues in weight mode show the Configured Capacity as 
> 0. I think this should change to Configured Weight when weight mode is used; 
> that would be helpful.
> For example, in a dynamic weight mode queue:
> !image-2021-02-20-11-21-50-306.png|width=528,height=374!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10221) Nodemanager lockups on printEventQueueDetails

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296035#comment-17296035
 ] 

Peter Bacsko commented on YARN-10221:
-

[~jonbender-stripe] although this ticket was filed first, the fix went in under 
YARN-10642, so I'll close this as a duplicate if there are no objections.

> Nodemanager lockups on printEventQueueDetails
> -
>
> Key: YARN-10221
> URL: https://issues.apache.org/jira/browse/YARN-10221
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
> Environment: We're running stock Hadoop 3.2.1 with cgroups / 
> LinuxContainerExecutor.
> Java version:
> {noformat}
> openjdk version "1.8.0_242"
> OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
> OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode) {noformat}
>  
>Reporter: Jon Bender
>Assignee: Qi Zhu
>Priority: Major
>
> We are seeing a rare but critical bug on our production clusters running 
> Hadoop 3.2.1. The central issue is that the NodeManager locks up while trying 
> to print details about the event queues. This feature was added in YARN-8995.
> The main symptoms are:
> - Containers stuck in an Initing phase (ContainersIniting in jmx)
> - NM stops accepting RPC calls
> Failed job submissions manifest as socket timeouts to the RPC port:
> {code}
> INFO - diagnostics: Application application_1585693823779_0028 failed 1 times 
> (global limit =2; local limit is =1) due to Error launching 
> appattempt_1585693823779_0028_01. Got exception: 
> java.net.SocketTimeoutException: Call From 
> hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to 
> hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout 
> exception: java.net.SocketTimeoutException: 6 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 
> remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For 
> more details see:  http://wiki.apache.org/hadoop/SocketTimeout
> {code}
> Relevant output from {{jstack -l}} on an affected NodeManager. All IPC 
> threads are blocked waiting on the eventQueue lock.
> The thread printing the event queue details - it runs indefinitely:
> {code:java}
> "Public Localizer" #62 prio=5 os_prio=0 tid=0x7f488d948000 nid=0x1cee9 
> runnable [0x7f4890571000]"Public Localizer" #62 prio=5 os_prio=0 
> tid=0x7f488d948000 nid=0x1cee9 runnable [0x7f4890571000]   
> java.lang.Thread.State: RUNNABLE at 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) 
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) 
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>  - locked <0x7f4906f49230> (a 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
>  - locked <0x7f48f47a9658> (a 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
>  at 
> 

[jira] [Comment Edited] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296034#comment-17296034
 ] 

Peter Bacsko edited comment on YARN-8786 at 3/5/21, 2:14 PM:
-

[~jlowe] [~ebadger] [~jonbender-stripe] do we still need this JIRA open? Is the 
issue still happening after YARN-9833? (As it turned out, that fix is still not 
100% perfect, but close enough to 100% to be acceptable.)


was (Author: pbacsko):
[~jlowe] [~ebadger] [~jonbender-stripe] do we still need this JIRA open? Is the 
issue still happening after YARN-9833 (as it turned out that fix is still not 
100% perfect, but very close enough to 100% which makes it acceptable).

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 

[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296034#comment-17296034
 ] 

Peter Bacsko commented on YARN-8786:


[~jlowe] [~ebadger] [~jonbender-stripe] do we still need this JIRA open? Is the 
issue still happening after YARN-9833 (as it turned out, that fix is still not 
100% perfect, but close enough to 100% to be acceptable).

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (YARN-10643) Fix the race condition introduced by YARN-8995.

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YARN-10643.
-
Resolution: Duplicate

> Fix the race condition introduced by YARN-8995.
> ---
>
> Key: YARN-10643
> URL: https://issues.apache.org/jira/browse/YARN-10643
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Qi Zhu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: YARN-10643.001.patch
>
>
> The race condition introduced by -YARN-8995.-
> The problem has been raised in YARN-10221 and also in YARN-10642.
> I think we should fix it quickly.
> I will help fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10643) Fix the race condition introduced by YARN-8995.

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296016#comment-17296016
 ] 

Peter Bacsko commented on YARN-10643:
-

Hi [~zhengchenyu] / [~zhuqi] does this JIRA add anything new to YARN-10642? It 
looks like a duplicate.

Can I close it?

> Fix the race condition introduced by YARN-8995.
> ---
>
> Key: YARN-10643
> URL: https://issues.apache.org/jira/browse/YARN-10643
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Qi Zhu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: YARN-10643.001.patch
>
>
> The race condition introduced by -YARN-8995.-
> The problem has been raised in YARN-10221 and also in YARN-10642.
> I think we should fix it quickly.
> I will help fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Fix Version/s: 3.4.0

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in "deadloop.png".
> A simple unit test reproduces the problem when forEachRemaining runs more 
> slowly than take(); the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296006#comment-17296006
 ] 

Peter Bacsko commented on YARN-10642:
-

+1

Thanks [~zhengchenyu] for the analysis + patch and [~zhuqi] for the review. 
Committed to trunk.

Affected version is set to 3.2.1. This problem looks serious; the only solution 
is restarting the RM. Backporting this to branch-3.2 and branch-3.3 seems very 
reasonable.

[~zhengchenyu] can you create the branch-3.2 and branch-3.3 versions of the 
changes? They should be named like "YARN-10642-branch-3.2.001.patch". Just 
wait until Jenkins starts, then upload the branch-3.3 patch.

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Summary: Race condition: AsyncDispatcher can get stuck by the changes 
introduced in YARN-8995  (was: Race condition: AsyncDispatcher can get stuck by 
YARN-8995)

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in "deadloop.png".
> A simple unit test reproduces the problem when forEachRemaining runs more 
> slowly than take(); the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was 

[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296002#comment-17296002
 ] 

Peter Bacsko commented on YARN-10642:
-

I'm going to commit this soon. What's the difference between this JIRA and 
YARN-10643?

> Race condition: AsyncDispatcher can get stuck by YARN-8995
> --
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in "deadloop.png".
> A simple unit test reproduces the problem when forEachRemaining runs more 
> slowly than take(); the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296001#comment-17296001
 ] 

Peter Bacsko commented on YARN-10642:
-

I renamed the title a little bit.

[~zhengchenyu] [~zhuqi] this is a very interesting problem and looks like it 
has to do with Java itself.

Could it be that this should be reported to JDK developers? I don't think that 
this should be allowed, especially since {{LinkedBlockingQueue}} was 
specifically designed for multi-threaded applications.
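
For illustration, here is a minimal, self-contained sketch of the interaction in 
question (the class name and sizes are mine, it is not the attached 
MockForDeadLoop.java, and whether the hang actually triggers on a given run is 
timing-dependent on a JDK 8 LinkedBlockingQueue):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

public class LbqSpliteratorDeadLoopSketch {

  public static void main(String[] args) throws InterruptedException {
    LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
    for (int i = 0; i < 1_000_000; i++) {
      queue.put(i);
    }

    // Consumer thread: take() dequeues the head and self-links the removed
    // node (node.next = node) to help GC.
    Thread taker = new Thread(() -> {
      try {
        while (true) {
          queue.take();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    taker.setDaemon(true);
    taker.start();

    // Traversal thread: queue.stream().collect() drives
    // LBQSpliterator.forEachRemaining, which periodically releases and
    // re-acquires the queue's full lock. If its cursor node was dequeued and
    // self-linked in between, the traversal can spin forever while holding
    // both putLock and takeLock.
    String joined = queue.stream()
        .map(String::valueOf)
        .collect(Collectors.joining(","));
    System.out.println("Traversal finished, joined length = " + joined.length());
  }
}
{code}

If the traversal does land on a self-linked node, it spins inside 
forEachRemaining while holding the queue's locks, which is consistent with the 
jstacks above that show every other thread blocked on putLock.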



> Race condition: AsyncDispatcher can get stuck by YARN-8995
> --
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in "deadloop.png".
> A simple unit test reproduces the problem when forEachRemaining runs more 
> slowly than take(); the unit test is MockForDeadLoop.java.
> I debug MockForDeadLoop.java, and 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by YARN-8995

2021-03-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Summary: Race condition: AsyncDispatcher can get stuck by YARN-8995  (was: 
AsyncDispatcher will stuck introduced by YARN-8995.)

> Race condition: AsyncDispatcher can get stuck by YARN-8995
> --
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> YARN-10642.005.patch, deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time and found the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: the analysis itself is omitted for 
> lack of space.)
> The reason is that one thread holds the putLock the whole time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock and 
> the takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> Analyzing LinkedBlockingQueue's source code, I found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and 
> take() are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method, and 
> "eventQueue.stream().collect" ends up calling forEachRemaining.
> Why? "put.png" shows how put("a") works and "take.png" shows how take() works. 
> Note in particular that a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses it to visit every 
> Node, but after reading an item value from a Node it releases the lock. If 
> take() is called at that moment, the variable 'p' in forEachRemaining may end 
> up pointing to a Node that points to itself, and forEachRemaining goes into a 
> dead loop. You can see it in "deadloop.png".
> A simple unit test reproduces the problem when forEachRemaining runs more 
> slowly than take(); the unit test is MockForDeadLoop.java.
> Debugging MockForDeadLoop.java shows a Node pointing to itself; see 
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (YARN-10639) Queueinfo related capacity, should adjusted to weight mode.

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295999#comment-17295999
 ] 

Peter Bacsko commented on YARN-10639:
-

+1

Thanks [~zhuqi] for the patch and [~gandras] for the review.

Committed to trunk.

> Queueinfo related capacity, should adjusted to weight mode.
> ---
>
> Key: YARN-10639
> URL: https://issues.apache.org/jira/browse/YARN-10639
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10639.001.patch, YARN-10639.002.patch, 
> YARN-10639.003.patch, YARN-10639.004.patch, YARN-10639.005.patch
>
>
> {color:#172b4d}The QueueInfo capacity field should take weight mode into 
> account.{color}
> {color:#172b4d}Currently, when a client uses getQueueInfo to get the queue 
> capacity in weight mode, it always returns 0, which is wrong.{color}
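
As a purely client-side illustration of the symptom described above (the queue 
name and the plain YarnConfiguration are placeholders, not part of the patch):

{code:java}
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class QueueCapacityProbe {

  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      // Per the description, before this change a queue configured in weight
      // mode reported a capacity of 0 here, even though it had a non-zero
      // effective share of the cluster.
      QueueInfo info = yarnClient.getQueueInfo("default");
      System.out.println("capacity = " + info.getCapacity());
    } finally {
      yarnClient.stop();
    }
  }
}
{code}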



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295964#comment-17295964
 ] 

Peter Bacsko edited comment on YARN-10652 at 3/5/21, 10:37 AM:
---

Hi guys,

I think we can reach compromise: let's think about scenarios where dotted 
usernames can be problematic and address them in a follow-up JIRA. For example, 
we already know that placement rules involving username (%user placeholder) 
will definitely exhibit unexpected behavior (interestingly enough this has 
always been a problem, but just hasn't been reported). So in this case, we can 
go FS-way and just replace "." with {{_dot_}}. Also, FS does this to primary 
groups as well, that's another thing that we need to fix. Maybe the 
{{cleanName()}} approach is just fine?

When it comes to configuration, {{getValByRegex()}} is only used for this 
property, so it's likely that we're already good and in other cases, property 
names are concatenated and dot isn't an issue at all. In YARN-9930, I added 
"yarn.scheduler.capacity.user..max-parallel-apps", making it a 
potential suspect, but I don't use regex, just concat strings.

IMO we can handle these on a case-by-case basis.


was (Author: pbacsko):
Hi guys,

I think we can reach compromise: let's think about scenarios where dotted 
usernames can be problematic and address them in a follow-up JIRA. For example, 
we already know that placement rules involving username (%user placeholder) 
will definitely exhibit unexpected behavior (interestingly enough this has 
always been a problem, but just hasn't been reported). So in this case, we can 
go FS-way and just replace "." with "_dot_". Also, FS does this to primary 
groups as well, that's another thing that we need to fix. Maybe the 
{{cleanName()}} approach is just fine?

When it comes to configuration, {{getValByRegex()}} is only used for this 
property, so it's likely that we're already good and in other cases, property 
names are concatenated and dot isn't an issue at all. In YARN-9930, I added 
"yarn.scheduler.capacity.user..max-parallel-apps", making it a 
potential suspect, but I don't use regex, just concat strings.

IMO we can handle these on a case-by-case basis.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295964#comment-17295964
 ] 

Peter Bacsko edited comment on YARN-10652 at 3/5/21, 10:34 AM:
---

Hi guys,

I think we can reach compromise: let's think about scenarios where dotted 
usernames can be problematic and address them in a follow-up JIRA. For example, 
we already know that placement rules involving username (%user placeholder) 
will definitely exhibit unexpected behavior (interestingly enough this has 
always been a problem, but just hasn't been reported). So in this case, we can 
go FS-way and just replace "." with "_dot_". Also, FS does this to primary 
groups as well, that's another thing that we need to fix. Maybe the 
{{cleanName()}} approach is just fine?

When it comes to configuration, {{getValByRegex()}} is only used for this 
property, so it's likely that we're already good and in other cases, property 
names are concatenated and dot isn't an issue at all. In YARN-9930, I added 
"yarn.scheduler.capacity.user..max-parallel-apps", making it a 
potential suspect, but I don't use regex, just concat strings.

IMO we can handle these on a case-by-case basis.


was (Author: pbacsko):
Hi guys,

I think we can reach compromise: let's think about scenarios where dotted 
usernames can be problematic and address them in a follow-up JIRA. For example, 
we already know that placement rules involving username (%user placeholder) 
will definitely exhibit unexpected behavior (interestingly enough this has 
always been a problem, but just hasn't been reported). So in this case, we can 
go FS-way and just replace "." with "_dot_". Also, FS does this to primary 
groups as well, that's another thing that we need to fix. Maybe the cleanName() 
approach is just fine?

When it comes to configuration, {{getValByRegex()}} is only used for this 
property, so it's likely that we're already good and in other cases, property 
names are concatenated and dot isn't an issue at all. In YARN-9930, I added 
"yarn.scheduler.capacity.user..max-parallel-apps", making it a 
potential suspect, but I don't use regex, just concat strings.

IMO we can handle these on a case-by-case basis.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295964#comment-17295964
 ] 

Peter Bacsko commented on YARN-10652:
-

Hi guys,

I think we can reach compromise: let's think about scenarios where dotted 
usernames can be problematic and address them in a follow-up JIRA. For example, 
we already know that placement rules involving username (%user placeholder) 
will definitely exhibit unexpected behavior (interestingly enough this has 
always been a problem, but just hasn't been reported). So in this case, we can 
go FS-way and just replace "." with "_dot_". Also, FS does this to primary 
groups as well, that's another thing that we need to fix. Maybe the cleanName() 
approach is just fine?

When it comes to configuration, {{getValByRegex()}} is only used for this 
property, so it's likely that we're already good and in other cases, property 
names are concatenated and dot isn't an issue at all. In YARN-9930, I added 
"yarn.scheduler.capacity.user..max-parallel-apps", making it a 
potential suspect, but I don't use regex, just concat strings.

IMO we can handle these on a case-by-case basis.
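
For illustration only (the property string and patterns below are simplified 
stand-ins, not the exact regex used in CapacitySchedulerConfiguration), a small 
sketch of why a {{\w+}}-based pattern cannot pick up a dotted username while 
{{\S+}} can:

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DottedUserWeightRegexSketch {

  public static void main(String[] args) {
    // Hypothetical user-weight property for the user "firstname.lastname".
    String prop = "yarn.scheduler.capacity.root.default."
        + "user-settings.firstname.lastname.weight";

    // \w+ cannot match across the dot inside the username, so no user is found.
    Pattern wordChars = Pattern.compile("user-settings\\.(\\w+)\\.weight$");
    // \S+ matches any run of non-whitespace, so the dotted username is captured.
    Pattern nonWhitespace = Pattern.compile("user-settings\\.(\\S+)\\.weight$");

    Matcher narrow = wordChars.matcher(prop);
    System.out.println("\\w+ finds a user: " + narrow.find());      // false

    Matcher wide = nonWhitespace.matcher(prop);
    if (wide.find()) {
      System.out.println("\\S+ captured user: " + wide.group(1));   // firstname.lastname
    }
  }
}
{code}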

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295881#comment-17295881
 ] 

Peter Bacsko edited comment on YARN-10672 at 3/5/21, 8:56 AM:
--

The solution is much more straightforward than mine in YARN-10447. Actually, we 
might consider applying this to TestLeafQueue too, undoing my changes there, 
because that one is more complicated (I had no patience to dig deeper into 
Mockito's internal behavior; I just thought, well, disable that thread and 
that's enough).


was (Author: pbacsko):
The solution is much more straightforward than mine in YARN-10447. Actually, we 
might consider applying this to TestLeafQueue as well, while undoing my changes, 
because that one is more complicated (I had no patience to dig deeper into 
Mockito's internal behavior; I just thought, well, disable that thread and 
that's enough).

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295881#comment-17295881
 ] 

Peter Bacsko commented on YARN-10672:
-

The solution is much more straightforward than mine in YARN-10447. Actually, we 
might consider applying this to TestLeafQueue as well, while undoing my changes, 
because that one is more complicated (I had no patience to dig deeper into 
Mockito's internal behavior; I just thought, well, disable that thread and 
that's enough).
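
For reference, the Mockito warning quoted in the issue description below recommends doReturn-style stubbing for spies. A minimal sketch of the difference (the spied class here is a made-up stand-in, not the real CapacityScheduler):

{code:java}
import static org.mockito.Mockito.doReturn;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.when;

public class SpyStubbingExample {

  static class Scheduler {
    int getNumClusterNodes() {
      return 0; // stands in for a call that touches real scheduler state
    }
  }

  public static void main(String[] args) {
    Scheduler cs = spy(new Scheduler());

    // Risky on a spy: the real getNumClusterNodes() runs while stubbing,
    // which is exactly what can race with background threads in the tests.
    when(cs.getNumClusterNodes()).thenReturn(3);

    // Safer: no real method call happens during stubbing.
    doReturn(3).when(cs).getNumClusterNodes();

    System.out.println(cs.getNumClusterNodes()); // prints 3
  }
}
{code}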

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>  

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-05 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295875#comment-17295875
 ] 

Peter Bacsko commented on YARN-10672:
-

It's basically the same as YARN-10447. Must have been a good debugging 
session...

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the 
> time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this be allowed to return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> 

[jira] [Commented] (YARN-10640) Adjust the queue Configured capacity to Configured weight number for weight mode in UI.

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295519#comment-17295519
 ] 

Peter Bacsko commented on YARN-10640:
-

I think displaying the "Configured weight" in percentage mode is not an issue 
if you print something like "n/a" or "unset". But some people might think that 
-1 is a valid value. I know it's a minor thing, but this is just my opinion.

[~gandras] [~bteke] - what do you think?

> Adjust the queue Configured capacity to  Configured weight number for weight 
> mode in UI.
> 
>
> Key: YARN-10640
> URL: https://issues.apache.org/jira/browse/YARN-10640
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10640.001.patch, YARN-10640.002.patch, 
> YARN-10640.003.patch, image-2021-02-20-11-21-50-306.png, 
> image-2021-02-20-14-18-56-261.png, image-2021-02-20-14-19-30-767.png, 
> image-2021-03-02-11-34-26-062.png
>
>
> In weight mode:
> Both the weight mode static queue and the dynamic queue will show the 
> Configured Capacity to 0. I think this should change to Configured Weight if 
> we use weight mode, this will be helpful.
> Such as in dynamic weight mode queue:
> !image-2021-02-20-11-21-50-306.png|width=528,height=374!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295422#comment-17295422
 ] 

Peter Bacsko commented on YARN-10532:
-

+1

Thanks [~zhuqi] for the patch and [~gandras] and [~bteke] for the review.

Committed to trunk.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch, YARN-10532.014.patch, 
> YARN-10532.015.patch, YARN-10532.016.patch, YARN-10532.017.patch, 
> YARN-10532.018.patch, YARN-10532.019.patch, YARN-10532.020.patch, 
> YARN-10532.021.patch, YARN-10532.022.patch, YARN-10532.023.patch, 
> YARN-10532.024.patch, YARN-10532.025.patch, YARN-10532.026.patch, 
> YARN-10532.027.patch, image-2021-02-12-21-32-02-267.png
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10642) AsyncDispatcher will stuck introduced by YARN-8995.

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295281#comment-17295281
 ] 

Peter Bacsko commented on YARN-10642:
-

Ah sorry. I didn't look at the assignee properly. Yeah, I meant to ping 
[~zhengchenyu].

> AsyncDispatcher will stuck introduced by YARN-8995.
> ---
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time, and that revealed the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: for lack of space I omit the 
> analytical process.)
> The reason is that one thread holds the putLock all the time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock 
> and the takeLock. The AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and take 
> are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method; 
> "eventQueue.stream().collect" ends up calling the forEachRemaining method.
> Let's see why. "put.png" shows how put("a") works and "take.png" shows how 
> take() works. Special note: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses forEachRemaining to 
> visit all Nodes, but after it reads the item value from a Node it releases the 
> lock, and at that moment take() may be called.
> The variable 'p' in forEachRemaining may then point to a Node which points to 
> itself, and forEachRemaining ends up in a dead loop. You can see it in 
> "deadloop.png".
> In a simple unit test, if forEachRemaining is made slower than take, the 
> problem reproduces; the unit test is MockForDeadLoop.java.
> I debugged MockForDeadLoop.java and saw a Node pointing to itself; see the 
> picture "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281
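
To make the scenario above easier to picture, here is a stripped-down sketch of the same pattern (one thread iterating the queue through a stream while other threads put and take). It is only an illustration of the setup, not the attached MockForDeadLoop.java:

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class LbqStreamVsTake {

  public static void main(String[] args) throws Exception {
    LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producer keeps the queue non-empty, like handlers posting events.
    Thread producer = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          queue.put("event");
          TimeUnit.MILLISECONDS.sleep(1);
        }
      } catch (InterruptedException ignored) {
      }
    });

    // Consumer drains with take(), unlinking nodes concurrently,
    // like the dispatcher's event-handling thread.
    Thread consumer = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          queue.take();
        }
      } catch (InterruptedException ignored) {
      }
    });

    producer.start();
    consumer.start();

    // Meanwhile the main thread iterates the queue through a stream, the way
    // printEventQueueDetails does; the traversal goes through
    // LBQSpliterator.forEachRemaining.
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
    while (System.nanoTime() < deadline) {
      System.out.println("snapshot size: " + queue.stream().count());
      TimeUnit.MILLISECONDS.sleep(50);
    }

    producer.interrupt();
    consumer.interrupt();
  }
}
{code}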



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, 

[jira] [Commented] (YARN-10642) AsyncDispatcher will stuck introduced by YARN-8995.

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295260#comment-17295260
 ] 

Peter Bacsko commented on YARN-10642:
-

Ok, I found one minor issue:

{{dispatcher.stop();}} --> [~zhuqi] please move this to the finally block.
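
i.e. roughly this shape (just a sketch of the structure, not the actual test code):

{noformat}
AsyncDispatcher dispatcher = new AsyncDispatcher();
try {
  // ... test body that uses the dispatcher ...
} finally {
  dispatcher.stop();
}
{noformat}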

Otherwise looks good.

> AsyncDispatcher will stuck introduced by YARN-8995.
> ---
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time, and that revealed the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: for lack of space I omit the 
> analytical process.)
> The reason is that one thread holds the putLock all the time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock 
> and the takeLock. The AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and take 
> are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method; 
> "eventQueue.stream().collect" ends up calling the forEachRemaining method.
> Let's see why. "put.png" shows how put("a") works and "take.png" shows how 
> take() works. Special note: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses forEachRemaining to 
> visit all Nodes, but after it reads the item value from a Node it releases the 
> lock, and at that moment take() may be called.
> The variable 'p' in forEachRemaining may then point to a Node which points to 
> itself, and forEachRemaining ends up in a dead loop. You can see it in 
> "deadloop.png".
> In a simple unit test, if forEachRemaining is made slower than take, the 
> problem reproduces; the unit test is MockForDeadLoop.java.
> I debugged MockForDeadLoop.java and saw a Node pointing to itself; see the 
> picture "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (YARN-10642) AsyncDispatcher will stuck introduced by YARN-8995.

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295246#comment-17295246
 ] 

Peter Bacsko commented on YARN-10642:
-

I'm reviewing this patch soon.

> AsyncDispatcher will stuck introduced by YARN-8995.
> ---
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Attachments: MockForDeadLoop.java, YARN-10642.001.patch, 
> YARN-10642.002.patch, YARN-10642.003.patch, YARN-10642.004.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days and the 
> YARN client could not submit applications. I captured jstack output the second 
> time, and that revealed the reason.
> Analyzing all the jstacks, I found many threads stuck because they could not 
> acquire LinkedBlockingQueue.putLock. (Note: for lack of space I omit the 
> analytical process.)
> The reason is that one thread holds the putLock all the time: 
> printEventQueueDetails calls forEachRemaining, which holds both the putLock 
> and the takeLock. The AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> 
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> 
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and take 
> are called from different threads.
> YARN-8995 introduced the printEventQueueDetails method; 
> "eventQueue.stream().collect" ends up calling the forEachRemaining method.
> Let's see why. "put.png" shows how put("a") works and "take.png" shows how 
> take() works. Special note: a removed Node is made to point to itself to help GC.
> The key code is in forEachRemaining: LBQSpliterator uses forEachRemaining to 
> visit all Nodes, but after it reads the item value from a Node it releases the 
> lock, and at that moment take() may be called.
> The variable 'p' in forEachRemaining may then point to a Node which points to 
> itself, and forEachRemaining ends up in a dead loop. You can see it in 
> "deadloop.png".
> In a simple unit test, if forEachRemaining is made slower than take, the 
> problem reproduces; the unit test is MockForDeadLoop.java.
> I debugged MockForDeadLoop.java and saw a Node pointing to itself; see the 
> picture "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For 

[jira] [Commented] (YARN-10623) Capacity scheduler should support refresh queue automatically by a thread policy.

2021-03-04 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295221#comment-17295221
 ] 

Peter Bacsko commented on YARN-10623:
-

+1

[~zhuqi] thanks for the patch.
Also thanks [~shuzirra] for reviewing.

Committed to trunk.


> Capacity scheduler should support refresh queue automatically by a thread 
> policy.
> -
>
> Key: YARN-10623
> URL: https://issues.apache.org/jira/browse/YARN-10623
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10623.001.patch, YARN-10623.002.patch, 
> YARN-10623.003.patch, YARN-10623.004.patch, YARN-10623.005.patch, 
> YARN-10623.006.patch, YARN-10623.007.patch, YARN-10623.008.patch, 
> YARN-10623.009.patch, YARN-10623.010.patch
>
>
> In the fair scheduler, queue-related configuration can be refreshed 
> automatically by a reloading thread, but in the capacity scheduler queue 
> related changes can only be refreshed via refreshQueues; our cluster needs 
> this for queue management.
> cc [~wangda] [~ztang] [~pbacsko] [~snemeth] [~gandras]  [~bteke] [~shuzirra]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-03 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294705#comment-17294705
 ] 

Peter Bacsko commented on YARN-9615:


Thanks [~zhuqi] for the patch.

Minor things left:

1. {{GenericTestUtils.waitFor()}} --> remove the anonymous inner classes and 
use lambdas (see the sketch below, after item 3).

2. {{AsyncDispatcher}} is not closed in the tests. It's important to close it 
because it starts a thread.

Do this:
{noformat}
AsyncDispatcher dispatcher = null;
try {
  // ... test code ...
} finally {
  if (dispatcher != null) {
    dispatcher.close();
  }
}
{noformat}

3.
{noformat}
  /**
   * Run a test to submit values for Dispatcher metrics
   * histogram comes out correctly.
   * @throws Exception
   */
{noformat}
This comment can be removed, the test name says what it does.
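
For item 1, the lambda form would look roughly like this (the condition is only a placeholder, not the real assertion from the test):

{noformat}
// "conditionHolds()" stands in for whatever the test actually checks.
GenericTestUtils.waitFor(() -> conditionHolds(), 100, 10_000);
{noformat}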

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.poc.patch, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10655) Limit queue creation depth relative to its first static parent

2021-03-03 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294653#comment-17294653
 ] 

Peter Bacsko commented on YARN-10655:
-

Thanks [~gandras] for the patch and [~zhuqi] for the review.

Committed to trunk.

> Limit queue creation depth relative to its first static parent
> --
>
> Key: YARN-10655
> URL: https://issues.apache.org/jira/browse/YARN-10655
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10655.001.patch
>
>
> YARN-10506 introduced a limit on the maximum depth of auto queue creation. 
> This, however, only limits the levels of queue path relative to its first 
> existing parent queue. It poses an unnecessary limitation on users, while 
> providing no real safety net over rogue users (especially when YARN-10632 
> makes this limit configurable), because it could be incrementally 
> circumvented by creating a new queue under an existing dynamic parent queue. 
> By bounding this limit to the first static parent queue in the hierarchy, we 
> could have a safer alternative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10640) Adjust the queue Configured capacity to Configured weight number for weight mode in UI.

2021-03-03 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294643#comment-17294643
 ] 

Peter Bacsko commented on YARN-10640:
-

[~zhuqi] in the latest patch "Configured weight" is always displayed, 
regardless of the mode. Is this what we want? I think we should only display 
this in weight mode (users could be confused).

> Adjust the queue Configured capacity to  Configured weight number for weight 
> mode in UI.
> 
>
> Key: YARN-10640
> URL: https://issues.apache.org/jira/browse/YARN-10640
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10640.001.patch, YARN-10640.002.patch, 
> YARN-10640.003.patch, image-2021-02-20-11-21-50-306.png, 
> image-2021-02-20-14-18-56-261.png, image-2021-02-20-14-19-30-767.png, 
> image-2021-03-02-11-34-26-062.png
>
>
> In weight mode:
> Both the weight mode static queue and the dynamic queue will show the 
> Configured Capacity to 0. I think this should change to Configured Weight if 
> we use weight mode, this will be helpful.
> Such as in dynamic weight mode queue:
> !image-2021-02-20-11-21-50-306.png|width=528,height=374!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10655) Limit queue creation depth relative to its first static parent

2021-03-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293128#comment-17293128
 ] 

Peter Bacsko commented on YARN-10655:
-

I can't really comment on the changes because I'm very unfamiliar with the 
newly introduced class.

The test method can be simplified:
{noformat}
try {
  createQueue("root.a.a-auto.a2-auto.a3-auto");
  Assert.fail("Queue creation should not succeed because the distance " +
      "from the first static parent is above limit");
} catch (SchedulerDynamicEditException ignored) {
}
{noformat}

Instead of the try-catch block, just use {{@Test(expected = 
SchedulerDynamicEditException.class)}} and remove the assertion.
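
I.e. roughly like this (the method name is only illustrative):

{noformat}
@Test(expected = SchedulerDynamicEditException.class)
public void testAutoCreationDepthLimitFromStaticParent() throws Exception {
  createQueue("root.a.a-auto.a2-auto.a3-auto");
}
{noformat}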

> Limit queue creation depth relative to its first static parent
> --
>
> Key: YARN-10655
> URL: https://issues.apache.org/jira/browse/YARN-10655
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10655.001.patch
>
>
> YARN-10506 introduced a limit on the maximum depth of auto queue creation. 
> This, however, only limits the levels of queue path relative to its first 
> existing parent queue. It poses an unnecessary limitation on users, while 
> providing no real safety net over rogue users (especially when YARN-10632 
> makes this limit configurable), because it could be incrementally 
> circumvented by creating a new queue under an existing dynamic parent queue. 
> By bounding this limit to the first static parent queue in the hierarchy, we 
> could have a safer alternative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10623) Capacity scheduler should support refresh queue automatically by a thread policy.

2021-03-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293122#comment-17293122
 ] 

Peter Bacsko commented on YARN-10623:
-

[~zhuqi] thanks, I can see one thing: {{GenericTestUtils.waitFor()}} can be 
simplified with lambdas; you don't need the anonymous inner classes. That also 
lets you remove the imported Supplier interface.

> Capacity scheduler should support refresh queue automatically by a thread 
> policy.
> -
>
> Key: YARN-10623
> URL: https://issues.apache.org/jira/browse/YARN-10623
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10623.001.patch, YARN-10623.002.patch, 
> YARN-10623.003.patch, YARN-10623.004.patch, YARN-10623.005.patch, 
> YARN-10623.006.patch, YARN-10623.007.patch, YARN-10623.008.patch
>
>
> In the fair scheduler, queue-related configuration can be refreshed 
> automatically by a reloading thread, but in the capacity scheduler queue 
> related changes can only be refreshed via refreshQueues; our cluster needs 
> this for queue management.
> cc [~wangda] [~ztang] [~pbacsko] [~snemeth] [~gandras]  [~bteke] [~shuzirra]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-03-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293113#comment-17293113
 ] 

Peter Bacsko edited comment on YARN-10532 at 3/1/21, 7:16 PM:
--

[~zhuqi] please fix the things that [~bteke] mentioned and I think I'll commit 
this patch to trunk.

Also pay attention to the remaining checkstyle issues.


was (Author: pbacsko):
[~zhuqi] please fix the things that [~bteke] mentioned and I think I'll commit 
this patch to trunk.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch, YARN-10532.014.patch, 
> YARN-10532.015.patch, YARN-10532.016.patch, YARN-10532.017.patch, 
> YARN-10532.018.patch, YARN-10532.019.patch, YARN-10532.020.patch, 
> YARN-10532.021.patch, YARN-10532.022.patch, YARN-10532.023.patch, 
> YARN-10532.024.patch, YARN-10532.025.patch, image-2021-02-12-21-32-02-267.png
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-03-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293113#comment-17293113
 ] 

Peter Bacsko commented on YARN-10532:
-

[~zhuqi] please fix the things that [~bteke] mentioned and I think I'll commit 
this patch to trunk.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch, YARN-10532.014.patch, 
> YARN-10532.015.patch, YARN-10532.016.patch, YARN-10532.017.patch, 
> YARN-10532.018.patch, YARN-10532.019.patch, YARN-10532.020.patch, 
> YARN-10532.021.patch, YARN-10532.022.patch, YARN-10532.023.patch, 
> YARN-10532.024.patch, YARN-10532.025.patch, image-2021-02-12-21-32-02-267.png
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10658) CapacityScheduler QueueInfo getQueueName should change to queue path to avoid ambiguous QueueName.

2021-03-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17293110#comment-17293110
 ] 

Peter Bacsko commented on YARN-10658:
-

Thanks [~zhuqi] for the patch. How is this QueueInfo class used? If we set the 
full path instead of the short name, what changes? Are the UI or public 
interfaces affected in any way?

> CapacityScheduler QueueInfo getQueueName should change to queue path to avoid 
> ambiguous QueueName.
> --
>
> Key: YARN-10658
> URL: https://issues.apache.org/jira/browse/YARN-10658
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10658.001.patch
>
>
> Now that leaf queues can use the same name, the QueueInfo getQueueName method 
> should return the queue path to avoid ambiguous queue names, and to make it 
> consistent with the fair scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10627) Extend logging to give more information about weight mode

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291919#comment-17291919
 ] 

Peter Bacsko commented on YARN-10627:
-

Thanks for the patch [~bteke] and [~zhuqi] / [~gandras] for the review.

Committed to trunk.

> Extend logging to give more information about weight mode
> -
>
> Key: YARN-10627
> URL: https://issues.apache.org/jira/browse/YARN-10627
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10627.001.patch, YARN-10627.002.patch, 
> YARN-10627.003.patch, YARN-10627.004.patch, YARN-10627.005.patch, 
> YARN-10627.006.patch, image-2021-02-20-00-07-09-875.png
>
>
> In YARN-10504 weight mode was added, however the logged information about the 
> created queues or the toString methods weren't updated accordingly. Some 
> examples:
> ParentQueue#setupQueueConfigs:
> {code:java}
>  LOG.info(queueName + ", capacity=" + this.queueCapacities.getCapacity()
>   + ", absoluteCapacity=" + this.queueCapacities.getAbsoluteCapacity()
>   + ", maxCapacity=" + this.queueCapacities.getMaximumCapacity()
>   + ", absoluteMaxCapacity=" + this.queueCapacities
>   .getAbsoluteMaximumCapacity() + ", state=" + getState() + ", acls="
>   + aclsString + ", labels=" + labelStrBuilder.toString() + "\n"
>   + ", reservationsContinueLooking=" + reservationsContinueLooking
>   + ", orderingPolicy=" + getQueueOrderingPolicyConfigName()
>   + ", priority=" + priority
>   + ", allowZeroCapacitySum=" + allowZeroCapacitySum);
> {code}
> ParentQueue#toString:
> {code:java}
> public String toString() {
> return queueName + ": " +
> "numChildQueue= " + childQueues.size() + ", " + 
> "capacity=" + queueCapacities.getCapacity() + ", " +  
> "absoluteCapacity=" + queueCapacities.getAbsoluteCapacity() + ", " +
> "usedResources=" + queueUsage.getUsed() + 
> "usedCapacity=" + getUsedCapacity() + ", " + 
> "numApps=" + getNumApplications() + ", " + 
> "numContainers=" + getNumContainers();
>  }
> {code}
> LeafQueue#setupQueueConfigs:
> {code:java}
>   LOG.info(
>   "Initializing " + getQueuePath() + "\n" + "capacity = "
>   + queueCapacities.getCapacity()
>   + " [= (float) configuredCapacity / 100 ]" + "\n"
>   + "absoluteCapacity = " + queueCapacities.getAbsoluteCapacity()
>   + " [= parentAbsoluteCapacity * capacity ]" + "\n"
>   + "maxCapacity = " + queueCapacities.getMaximumCapacity()
>   + " [= configuredMaxCapacity ]" + "\n" + "absoluteMaxCapacity = 
> "
>   + queueCapacities.getAbsoluteMaximumCapacity()
>   + " [= 1.0 maximumCapacity undefined, "
>   + "(parentAbsoluteMaxCapacity * maximumCapacity) / 100 
> otherwise ]"
>   + "\n" + "effectiveMinResource=" +
>   getEffectiveCapacity(CommonNodeLabelsManager.NO_LABEL) + "\n"
>   + " , effectiveMaxResource=" +
>   getEffectiveMaxCapacity(CommonNodeLabelsManager.NO_LABEL)
>   + "\n" + "userLimit = " + usersManager.getUserLimit()
>   + " [= configuredUserLimit ]" + "\n" + "userLimitFactor = "
>   + usersManager.getUserLimitFactor()
>   + " [= configuredUserLimitFactor ]" + "\n" + "maxApplications = 
> "
>   + maxApplications
>   + " [= configuredMaximumSystemApplicationsPerQueue or"
>   + " (int)(configuredMaximumSystemApplications * 
> absoluteCapacity)]"
>   + "\n" + "maxApplicationsPerUser = " + maxApplicationsPerUser
>   + " [= (int)(maxApplications * (userLimit / 100.0f) * "
>   + "userLimitFactor) ]" + "\n"
>   + "maxParallelApps = " + getMaxParallelApps() + "\n"
>   + "usedCapacity = " +
>   + queueCapacities.getUsedCapacity() + " [= usedResourcesMemory 
> / "
>   + "(clusterResourceMemory * absoluteCapacity)]" + "\n"
>   + "absoluteUsedCapacity = " + absoluteUsedCapacity
>   + " [= usedResourcesMemory / clusterResourceMemory]" + "\n"
>   + "maxAMResourcePerQueuePercent = " + 
> maxAMResourcePerQueuePercent
>   + " [= configuredMaximumAMResourcePercent ]" + "\n"
>   + "minimumAllocationFactor = " + minimumAllocationFactor
>   + " [= (float)(maximumAllocationMemory - 
> minimumAllocationMemory) / "
>   + "maximumAllocationMemory ]" + "\n" + "maximumAllocation = "
>   + maximumAllocation + " [= configuredMaxAllocation ]" + "\n"
>   + "numContainers = 

[jira] [Commented] (YARN-10640) Adjust the queue Configured capacity to Configured weight number for weight mode in UI.

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291894#comment-17291894
 ] 

Peter Bacsko commented on YARN-10640:
-

[~zhuqi] thanks for the patch, it's useful.

There is one thing I was wondering about: in weight mode we configure weights, 
but inside CS we still calculate percentages. So how is it possible that 
"Configured capacity" is 0/0? The percentage values, for example, are shown 
properly; those are not 0.

[~gandras] [~bteke] do you know why {{}} is printed even 
for static queues?


> Adjust the queue Configured capacity to  Configured weight number for weight 
> mode in UI.
> 
>
> Key: YARN-10640
> URL: https://issues.apache.org/jira/browse/YARN-10640
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10640.001.patch, YARN-10640.002.patch, 
> image-2021-02-20-11-21-50-306.png, image-2021-02-20-14-18-56-261.png, 
> image-2021-02-20-14-19-30-767.png
>
>
> In weight mode:
> Both the weight mode static queue and the dynamic queue will show the 
> Configured Capacity to 0. I think this should change to Configured Weight if 
> we use weight mode, this will be helpful.
> Such as in dynamic weight mode queue:
> !image-2021-02-20-11-21-50-306.png|width=528,height=374!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291889#comment-17291889
 ] 

Peter Bacsko commented on YARN-10532:
-

[~zhuqi] I think it's very good now. Maybe I'll check another round on Monday 
but this one looks good enough to get committed.

[~gandras] [~bteke] you guys pls do a final review.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10532.001.patch, YARN-10532.002.patch, 
> YARN-10532.003.patch, YARN-10532.004.patch, YARN-10532.005.patch, 
> YARN-10532.006.patch, YARN-10532.007.patch, YARN-10532.008.patch, 
> YARN-10532.009.patch, YARN-10532.010.patch, YARN-10532.011.patch, 
> YARN-10532.012.patch, YARN-10532.013.patch, YARN-10532.014.patch, 
> YARN-10532.015.patch, YARN-10532.016.patch, YARN-10532.017.patch, 
> YARN-10532.018.patch, YARN-10532.019.patch, YARN-10532.020.patch, 
> YARN-10532.021.patch, YARN-10532.022.patch, YARN-10532.023.patch, 
> YARN-10532.024.patch, image-2021-02-12-21-32-02-267.png
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10656) Parsing error in CapacityScheduler.md

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291886#comment-17291886
 ] 

Peter Bacsko commented on YARN-10656:
-

Thanks [~aajisaka] for noticing this. I should have tried it before committing.

+1.

> Parsing error in CapacityScheduler.md
> -
>
> Key: YARN-10656
> URL: https://issues.apache.org/jira/browse/YARN-10656
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> mvn site failed: 
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/429/artifact/out/patch-mvnsite-root.txt
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-site-plugin:3.6:site (default-site) on project 
> hadoop-yarn-site: Error parsing 
> '/home/jenkins/jenkins-home/workspace/hadoop-qbt-trunk-java8-linux-x86_64/sourcedir/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/CapacityScheduler.md':
>  line [-1] Error parsing the model: Unable to execute macro in the document: 
> toc -> [Help 1]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10623) Capacity scheduler should support refresh queue automatically by a thread policy.

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291728#comment-17291728
 ] 

Peter Bacsko edited comment on YARN-10623 at 2/26/21, 4:10 PM:
---

Thanks [~zhuqi] for the update. I missed a couple of things in my previous 
review.

1.
{noformat}
clock = SystemClock.getInstance();
{noformat}

Use {{MonotonicClock}}. Looking at the comments of {{SystemClock}}, 
{{MonotonicClock}} is the better choice.
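
A minimal sketch of what I mean (the wrapper class here is only for
illustration, the real field lives in the refresh policy class):

{code:java}
import org.apache.hadoop.yarn.util.Clock;
import org.apache.hadoop.yarn.util.MonotonicClock;

class RefreshClockSketch {
  // MonotonicClock is not affected by system time adjustments,
  // which is what we want when measuring reload intervals.
  private final Clock clock = new MonotonicClock();

  long now() {
    return clock.getTime();
  }
}
{code}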

2. {{waitFor}} usage can be simplified:

{noformat}
GenericTestUtils.waitFor(() ->
  cs.getConfiguration().getMaximumSystemApplications() != 1,
500L, 10_000L);
{noformat}

I think 100ms interval time is a bit small, 500ms seems to be a better 
compromise. Watch out for the line length or checkstyle will complain.

3. 
{noformat}
configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");

configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");
{noformat}

You can also use 
{{FileSystemBasedConfigurationProvider.class.getCanonicalName()}} here.
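
i.e. (with the corresponding import added):

{code:java}
configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
    FileSystemBasedConfigurationProvider.class.getCanonicalName());
{code}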

4. 
{noformat}

csConf.set(CapacitySchedulerConfiguration.MAXIMUM_SYSTEM_APPLICATIONS,
"5000");
{noformat}

For consistency reasons, I suggest using {{setInt()}} here.

5.
{noformat}
  @VisibleForTesting
  public long getLastReloadAttempt() {
return lastReloadAttempt;
  }

  @VisibleForTesting
  public long getLastModified() {
return lastModified;
  }

  @VisibleForTesting
  public Clock getClock() {
return clock;
  }

  @VisibleForTesting
  public boolean getLastReloadAttemptFailed() {
return  lastReloadAttemptFailed;
  }
{noformat}

As I can see, these methods are only called from test cases. You can reduce the 
visibility to package private, so just remove "public".

6. Note that stack traces are still not visible in the latest patch:
{noformat}
LOG.error("Can't refresh queue: " + e.getMessage());
...
LOG.error("Can't get file status for refresh : " + e.getMessage());
{noformat}

This is preferred:
{noformat}
LOG.error("Can't refresh queue", e);
...
LOG.error("Can't get file status for refresh", e);
{noformat}


was (Author: pbacsko):
Thanks [~zhuqi] for the update. I missed a couple of things in my previous 
review.

1.
{noformat}
clock = SystemClock.getInstance();
{noformat}

I suggest using {{MonotonicClock}}. Looking at the comments of {{SystemClock}}, 
that class is a better choice.

2. {{waitFor}} usage can be simplified:

{noformat}
GenericTestUtils.waitFor(() ->
  cs.getConfiguration().getMaximumSystemApplications() != 1,
500L, 10_000L);
{noformat}

I think 100ms interval time is a bit small, 500ms seems to be a better 
compromise. Watch out for the line length or checkstyle will complain.

3. 
{noformat}
configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");

configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");
{noformat}

You can also use 
{{FileSystemBasedConfigurationProvider.class.getCanonicalName()}} here.

4. 
{noformat}

csConf.set(CapacitySchedulerConfiguration.MAXIMUM_SYSTEM_APPLICATIONS,
"5000");
{noformat}

For consistency reasons, I suggest using {{setInt()}} here.

5.
{noformat}
  @VisibleForTesting
  public long getLastReloadAttempt() {
return lastReloadAttempt;
  }

  @VisibleForTesting
  public long getLastModified() {
return lastModified;
  }

  @VisibleForTesting
  public Clock getClock() {
return clock;
  }

  @VisibleForTesting
  public boolean getLastReloadAttemptFailed() {
return  lastReloadAttemptFailed;
  }
{noformat}

As I can see, these methods are only called from test cases. You can reduce the 
visibility to package private, so just remove "public".

6. Note that stack traces are still not visible in the latest patch:
{noformat}
LOG.error("Can't refresh queue: " + e.getMessage());
...
LOG.error("Can't get file status for refresh : " + e.getMessage());
{noformat}

This is preferred:
{noformat}
LOG.error("Can't refresh queue", e);
...
LOG.error("Can't get file status for refresh", e);
{noformat}

> Capacity scheduler should support refresh queue automatically by a thread 
> policy.
> 

[jira] [Commented] (YARN-10623) Capacity scheduler should support refresh queue automatically by a thread policy.

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291728#comment-17291728
 ] 

Peter Bacsko commented on YARN-10623:
-

Thanks [~zhuqi] for the update. I missed a couple of things in my previous 
review.

1.
{noformat}
clock = SystemClock.getInstance();
{noformat}

I suggest using {{MonotonicClock}}. Looking at the comments of {{SystemClock}}, 
{{MonotonicClock}} is the better choice.

2. {{waitFor}} usage can be simplified:

{noformat}
GenericTestUtils.waitFor(() ->
  cs.getConfiguration().getMaximumSystemApplications() != 1,
500L, 10_000L);
{noformat}

I think 100ms interval time is a bit small, 500ms seems to be a better 
compromise. Watch out for the line length or checkstyle will complain.

3. 
{noformat}
configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");

configuration.set(YarnConfiguration.RM_CONFIGURATION_PROVIDER_CLASS,
"org.apache.hadoop.yarn.FileSystemBasedConfigurationProvider");
{noformat}

You can also use 
{{FileSystemBasedConfigurationProvider.class.getCanonicalName()}} here.

4. 
{noformat}

csConf.set(CapacitySchedulerConfiguration.MAXIMUM_SYSTEM_APPLICATIONS,
"5000");
{noformat}

For consistency reasons, I suggest using {{setInt()}} here.

5.
{noformat}
  @VisibleForTesting
  public long getLastReloadAttempt() {
return lastReloadAttempt;
  }

  @VisibleForTesting
  public long getLastModified() {
return lastModified;
  }

  @VisibleForTesting
  public Clock getClock() {
return clock;
  }

  @VisibleForTesting
  public boolean getLastReloadAttemptFailed() {
return  lastReloadAttemptFailed;
  }
{noformat}

As I can see, these methods are only called from test cases. You can reduce the 
visibility to package private, so just remove "public".

6. Note that stack traces are still not visible in the latest patch:
{noformat}
LOG.error("Can't refresh queue: " + e.getMessage());
...
LOG.error("Can't get file status for refresh : " + e.getMessage());
{noformat}

This is preferred:
{noformat}
LOG.error("Can't refresh queue", e);
...
LOG.error("Can't get file status for refresh", e);
{noformat}

> Capacity scheduler should support refresh queue automatically by a thread 
> policy.
> -
>
> Key: YARN-10623
> URL: https://issues.apache.org/jira/browse/YARN-10623
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10623.001.patch, YARN-10623.002.patch, 
> YARN-10623.003.patch, YARN-10623.004.patch, YARN-10623.005.patch
>
>
> In Fair Scheduler, queue-related configuration can be refreshed automatically 
> by a reloading thread, but in Capacity Scheduler queue changes can only be 
> refreshed via refreshQueues. Our cluster needs automatic refresh to manage 
> its queues.
> cc [~wangda] [~ztang] [~pbacsko] [~snemeth] [~gandras]  [~bteke] [~shuzirra]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10627) Extend logging to give more information about weight mode

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291718#comment-17291718
 ] 

Peter Bacsko commented on YARN-10627:
-

Thanks for the patch [~bteke], could you take care of those checkstyle issues? 
Thanks.

> Extend logging to give more information about weight mode
> -
>
> Key: YARN-10627
> URL: https://issues.apache.org/jira/browse/YARN-10627
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10627.001.patch, YARN-10627.002.patch, 
> YARN-10627.003.patch, YARN-10627.004.patch, YARN-10627.005.patch, 
> image-2021-02-20-00-07-09-875.png
>
>
> In YARN-10504 weight mode was added, however the logged information about the 
> created queues or the toString methods weren't updated accordingly. Some 
> examples:
> ParentQueue#setupQueueConfigs:
> {code:java}
>  LOG.info(queueName + ", capacity=" + this.queueCapacities.getCapacity()
>   + ", absoluteCapacity=" + this.queueCapacities.getAbsoluteCapacity()
>   + ", maxCapacity=" + this.queueCapacities.getMaximumCapacity()
>   + ", absoluteMaxCapacity=" + this.queueCapacities
>   .getAbsoluteMaximumCapacity() + ", state=" + getState() + ", acls="
>   + aclsString + ", labels=" + labelStrBuilder.toString() + "\n"
>   + ", reservationsContinueLooking=" + reservationsContinueLooking
>   + ", orderingPolicy=" + getQueueOrderingPolicyConfigName()
>   + ", priority=" + priority
>   + ", allowZeroCapacitySum=" + allowZeroCapacitySum);
> {code}
> ParentQueue#toString:
> {code:java}
> public String toString() {
> return queueName + ": " +
> "numChildQueue= " + childQueues.size() + ", " + 
> "capacity=" + queueCapacities.getCapacity() + ", " +  
> "absoluteCapacity=" + queueCapacities.getAbsoluteCapacity() + ", " +
> "usedResources=" + queueUsage.getUsed() + 
> "usedCapacity=" + getUsedCapacity() + ", " + 
> "numApps=" + getNumApplications() + ", " + 
> "numContainers=" + getNumContainers();
>  }
> {code}
> LeafQueue#setupQueueConfigs:
> {code:java}
>   LOG.info(
>   "Initializing " + getQueuePath() + "\n" + "capacity = "
>   + queueCapacities.getCapacity()
>   + " [= (float) configuredCapacity / 100 ]" + "\n"
>   + "absoluteCapacity = " + queueCapacities.getAbsoluteCapacity()
>   + " [= parentAbsoluteCapacity * capacity ]" + "\n"
>   + "maxCapacity = " + queueCapacities.getMaximumCapacity()
>   + " [= configuredMaxCapacity ]" + "\n" + "absoluteMaxCapacity = 
> "
>   + queueCapacities.getAbsoluteMaximumCapacity()
>   + " [= 1.0 maximumCapacity undefined, "
>   + "(parentAbsoluteMaxCapacity * maximumCapacity) / 100 
> otherwise ]"
>   + "\n" + "effectiveMinResource=" +
>   getEffectiveCapacity(CommonNodeLabelsManager.NO_LABEL) + "\n"
>   + " , effectiveMaxResource=" +
>   getEffectiveMaxCapacity(CommonNodeLabelsManager.NO_LABEL)
>   + "\n" + "userLimit = " + usersManager.getUserLimit()
>   + " [= configuredUserLimit ]" + "\n" + "userLimitFactor = "
>   + usersManager.getUserLimitFactor()
>   + " [= configuredUserLimitFactor ]" + "\n" + "maxApplications = 
> "
>   + maxApplications
>   + " [= configuredMaximumSystemApplicationsPerQueue or"
>   + " (int)(configuredMaximumSystemApplications * 
> absoluteCapacity)]"
>   + "\n" + "maxApplicationsPerUser = " + maxApplicationsPerUser
>   + " [= (int)(maxApplications * (userLimit / 100.0f) * "
>   + "userLimitFactor) ]" + "\n"
>   + "maxParallelApps = " + getMaxParallelApps() + "\n"
>   + "usedCapacity = " +
>   + queueCapacities.getUsedCapacity() + " [= usedResourcesMemory 
> / "
>   + "(clusterResourceMemory * absoluteCapacity)]" + "\n"
>   + "absoluteUsedCapacity = " + absoluteUsedCapacity
>   + " [= usedResourcesMemory / clusterResourceMemory]" + "\n"
>   + "maxAMResourcePerQueuePercent = " + 
> maxAMResourcePerQueuePercent
>   + " [= configuredMaximumAMResourcePercent ]" + "\n"
>   + "minimumAllocationFactor = " + minimumAllocationFactor
>   + " [= (float)(maximumAllocationMemory - 
> minimumAllocationMemory) / "
>   + "maximumAllocationMemory ]" + "\n" + "maximumAllocation = "
>   + maximumAllocation + " [= configuredMaxAllocation ]" + "\n"
>   + "numContainers = " + numContainers
> 

[jira] [Comment Edited] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291612#comment-17291612
 ] 

Peter Bacsko edited comment on YARN-10652 at 2/26/21, 12:26 PM:


[~sahuja] you're right in saying that it has no direct relation to the 
placement.

In the first part of my comment, I was just thinking out loud that MAYBE using 
"_" instead of "." in the property is also a solution, but it comes with its 
own problems.

The placement stuff is different, it's something that we haven't considered so 
far. Currently, the new placement engine simply replaces placeholders like 
"root.users.%user" to "root.users.firstname.lastname", which is likely not what 
we want. It will not work in percentage mode, because "firstname" is a parent 
and you can't create parents under a ManagedParentQueue. In the new weight 
mode, it can work, but again, the intention is to have something like 
"root.users.firstname_lastname", just a single leaf.


was (Author: pbacsko):
[~sahuja] you're right in saying that it has no direct relation in the 
placement.

In the first part of my comment, I was just thinking out loud that MAYBE using 
"_" instead of "." in the property is also a solution, but it comes with its 
own problems.

The placement stuff is different, it's something that we haven't considered so 
far. Currently, the new placement engine simply replaces placeholders like 
"root.users.%user" to "root.users.firstname.lastname", which is likely not what 
we want. It will not work in percentage mode, because "firstname" is a parent 
and you can't create parents under a ManagedParentQueue. In the new weight 
mode, it can work, but again, the intention is to have something like 
"root.users.firstname_lastname", just a single leaf.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291612#comment-17291612
 ] 

Peter Bacsko commented on YARN-10652:
-

[~sahuja] you're right in saying that it has no direct relation to the 
placement.

In the first part of my comment, I was just thinking out loud that MAYBE using 
"_" instead of "." in the property is also a solution, but it comes with its 
own problems.

The placement stuff is different, it's something that we haven't considered so 
far. Currently, the new placement engine simply replaces placeholders like 
"root.users.%user" to "root.users.firstname.lastname", which is likely not what 
we want. It will not work in percentage mode, because "firstname" is a parent 
and you can't create parents under a ManagedParentQueue. In the new weight 
mode, it can work, but again, the intention is to have something like 
"root.users.firstname_lastname", just a single leaf.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-26 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291572#comment-17291572
 ] 

Peter Bacsko commented on YARN-10652:
-

"Dot" in the username is clearly a problem. In FS, there is approach in certain 
situations when dots are replaced to underscores ("_").

Quoting the upstream docs:

{noformat}
user: the app is placed into a queue with the name of the user who submitted 
it. Periods in the username will be replace with “_dot_”, i.e. the queue name 
for user “first.last” is “first_dot_last”.

primaryGroup: the app is placed into a queue with the name of the primary group 
of the user who submitted it. Periods in the group name will be replaced with 
“_dot_”, i.e. the queue name for group “one.two” is “one_dot_two”.
{noformat}

Obviously this is slightly different here, because in this case, you'd refer to 
the username as "firstname_lastname" in a static configuration, which could be 
confusing. Also, "firstname.lastname" and "firstname_lastname" would clash 
(unrealistic, but can happen in theory).

But in the placement engine, we should definitely consider what FS does and 
replace "." with "_".

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-02-25 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290963#comment-17290963
 ] 

Peter Bacsko commented on YARN-9615:


Thanks for the patch [~zhuqi].

Some comments:

1. 
{noformat}
  eventTypeMetricsMap.get(event.getType().getClass())
  .incr(event.getType(),
   (System.nanoTime() - startTime) / 1000);
{noformat}

I'm not 100% confident in this, but most of the time, we rely on {{Clock}} 
implementations, like {{MonotonicClock}}. I suggest using 
{{MonotonicClock.getTime()}}.

It might be a good idea to introduce a new method to {{AsyncDispatcher}} like 
{{setClock()}} (mark it with VisibleForTesting). This way, you can replace the 
Clock instance with a mock or something else, so testability is much easier.
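
A rough sketch of what I have in mind (field and method names may of course
differ in the actual patch; {{clock.getTime()}} is in milliseconds):

{code:java}
// Inside AsyncDispatcher (sketch only):
private Clock clock = new MonotonicClock();

@VisibleForTesting
void setClock(Clock clock) {
  this.clock = clock;
}

// ...then around the event handling:
long startTime = clock.getTime();
// ... actual event handling ...
eventTypeMetricsMap.get(event.getType().getClass())
    .increment(event.getType(), clock.getTime() - startTime);
{code}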

2. Same thing applies to {{EventDispatcher}}.

3. Nit: {{public class DisableEventTypeMetrics implements EventTypeMetrics{}} 
-- add space after "EventTypeMetrics"

4. 
{noformat}
  @Override
  public void get(Enum type) {

  }
{noformat}

If this method does nothing, pls. add a comment like "//nop" to the method body 
(make it clear that no-op is normal).

5. 
{noformat}
  @Override
  public void get(T type) {
  }

  @Override
  public void getMetrics(MetricsCollector collector, boolean all) {
  }
{noformat}

Same here, add a short "// nop" comment in the method bodies.

6. ResourceManager.java: {{import org.apache.hadoop.yarn.event.*;}} --> avoid 
star imports

7. EventTypeMetrics.java:
{noformat}
void incr(T type, long processingTimeUs);
{noformat}

Nit: it's a minor thing, but if we can do it, let's write complete words, so 
I'd opt for {{increment()}} instead of just {{incr()}}.

8. Very important: there are NO tests for either {{EventDispatcher}} or 
{{AsyncDispatcher}}. Please add 1-2 unit tests that validate the correct 
behavior (and think of #1 and you can use a mock {{Clock}} instance for 
verification).
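
Roughly something like this for the test (sketch only; {{setClock()}} is the
hypothetical hook from point 1, the event/metric names are placeholders and
imports are omitted):

{code:java}
// Verify the recorded processing time with a controllable clock.
ControlledClock clock = new ControlledClock();
clock.setTime(0);

AsyncDispatcher dispatcher = new AsyncDispatcher();
dispatcher.setClock(clock); // hypothetical setter from point 1
dispatcher.init(new Configuration());
// the handler advances the clock by 50 ms to simulate processing time
dispatcher.register(TestEventType.class,
    event -> clock.setTime(clock.getTime() + 50));
dispatcher.start();

dispatcher.getEventHandler().handle(new TestEvent(TestEventType.SOMETHING));
// wait until the queue is drained (e.g. with GenericTestUtils.waitFor), then
// assert that the metrics recorded ~50 ms for TestEventType.SOMETHING
{code}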

Also fix checkstyle and FindBugs issues.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.poc.patch, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290237#comment-17290237
 ] 

Peter Bacsko commented on YARN-9615:


Thanks [~zhuqi] for the patch. I'll try to review this tomorrow.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.poc.patch, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289968#comment-17289968
 ] 

Peter Bacsko edited comment on YARN-10532 at 2/24/21, 8:12 PM:
---

FIRST round review.

I might post more, but these are the ones that stand out to me right now.

1.
 AbstractYarnScheduler:
{noformat}
  public void removeQueue(CSQueue queueName) throws YarnException {
throw new YarnException(getClass().getSimpleName()
+ " does not support removing queues");
  }
{noformat}
If this is an abstract class, just make this method abstract without 
implementation:
 {{public abstract void removeQueue(CSQueue queueName) throws YarnException;}}

2.
{noformat}
  // When this queue has application submit to?
  // This property only applies to dynamic queue,
  // and will be used to check when the queue need to be removed.
{noformat}
Rephrase this comment a little bit:
{noformat}
  // The timestamp of the last submitted application to this queue.
  // Only applies to dynamic queues.
{noformat}
3.
{noformat}
  // "Tab" the queue, so this queue won't be removed because of idle 
timeout.
  public void signalToSubmitToQueue() {
{noformat}
I'd comment that "Update the timestamp of the last submitted application".

Also, the method name sounds weird to me. What it does is really simple. Call 
it {{updateLastSubmittedTimeStamp()}}.
If you use the right naming, then the comment is probably unnecessary. We 
don't need comments if the method is simple and its purpose is easy to understand.
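
For illustration, something like this (a sketch; the field name and the time
source are whatever the patch already uses):

{code:java}
public void updateLastSubmittedTimeStamp() {
  // record when the last application was submitted to this dynamic queue
  this.lastSubmittedTimestamp = Time.now();
}
{code}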

4. Instead of this:
{noformat}
  // just for test
  public void setLastSubmittedTimestamp(long lastSubmittedTimestamp) {
{noformat}
use this:
{noformat}
  @VisibleForTesting
  public void setLastSubmittedTimestamp(long lastSubmittedTimestamp) {
{noformat}
5. This comment is completely unnecessary I think:
{noformat}
// Expired queue, when there are no applications in queue,
// and the last submit time has been expired.
// Delete queue when expired deletion enabled.
{noformat}
It's obvious what the method is doing. Or if you insist on having a comment 
there, just add "Timeout expired, delete the dynamic queue"

6. I suggest a better exception message:
{noformat}
throw new SchedulerDynamicEditException(
"The queue " + queue.getQueuePath()
+ " can't removed normally.");
{noformat}
It should say "The queue ABC cannot be removed because it's parent is null".

7. {{LOG.info("Removed queue: " + queue.getQueuePath());}} – not necessary to 
log a successful removal. If there is no message, it means that the removal was 
successful.

8. Typo in comment: {{// 300s for expired defualt}} --> "default"

9. These methods are used by the code itself, not just test:
{noformat}
  @VisibleForTesting
  public void prepareForAutoDeletion() {
  ...
  @VisibleForTesting
  public void triggerAutoDeletionForExpiredQueues() {
{noformat}
So "VisibleForTesting" should be removed.

10.
{noformat}
  private void queueAutoDeletion(CSQueue checkQueue) {
//Scheduler update is asynchronous
if (checkQueue != null) {
{noformat}
Three things:
 * {{queueAutoDeletion()}} - this method name is a noun. Ideally, method names 
begin with a verb. For example "deleteDynamicQueue()" or "deleteAutoCreatedQueue()".
 * Also, why is it called "checkQueue"? Just call it "queue".
 * The comment is confusing: "Scheduler update is asynchronous". Why is it 
there? This statement does not tell me anything in this context. Does it refer 
to the null-check?

11.
{noformat}
  @Before
  public void setUp() throws Exception {
// The expired time for deletion will be 1s
super.setUp();
  }
{noformat}
This method is unnecessary, the setUp() method in the super class will be 
called anyway.

12. Test methods: {{testEditSchedule}}, 
{{testCapacitySchedulerAutoQueueDeletion}}, 
{{testCapacitySchedulerAutoQueueDeletionDisabled}}
 These test methods are long, but it's not my main problem. There are 
{{Thread.sleep()}} calls inside. This is really problematic, especially short 
sleeps like {{Thread.sleep(100)}}.
 I have fixed many flaky tests where the test code was full of 
{{Thread.sleep()}}. This must be avoided whenever possible.

We should come up with a better solution, eg. polling a certain state 
regularly, for example:
{noformat}
GenericTestUtils.waitFor(() -> someObject.isConditionTrue(), 500, 10_000);
{noformat}
This method calls {{someObject.isConditionTrue()}} every 500 ms and times 
out after 10 seconds. In case of a timeout, a {{TimeoutException}} will be 
thrown.


was (Author: pbacsko):
FIRST round review.

I might post more but these are that stand out to me 

[jira] [Commented] (YARN-10623) Capacity scheduler should support refresh queue automatically by a thread policy.

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290235#comment-17290235
 ] 

Peter Bacsko commented on YARN-10623:
-

I have some minor comments.

1.  {{LOG.info("Auto refreshed queue successfully!");}} The sentence "Queue 
auto refresh completed successfully" sounds better.

2.  
{noformat}
LOG.error("Can't refresh queue: " + e.getMessage());
...
LOG.error("Can't get file status for refresh : " + e.getMessage());
{noformat}

We don't have the stack trace. Having the stack trace is very important for 
debugging, so either use {{LOG.error("Can't refresh queue", e);}} or log it 
separately.

3. 
{noformat}
  public FileSystem getFs() {
return fs;
  }

  public Path getAllocCsFile() {
return allocCsFile;
  }

  public ResourceCalculator getResourceCalculator() {
return rc;
  }

  public RMContext getRmContext() {
return rmContext;
  }

  public CapacityScheduler getScheduler() {
return scheduler;
  }
{noformat}

Are these methods used? To me it looks like not even the test code calls 
these methods. So remove the ones which are unused.

4. 
{noformat}
try {
  Thread.sleep(3000);
} catch (Exception e) {
  // do nothing
}
{noformat}

Just as I mentioned in a different review, we should refrain from 
{{Thread.sleep()}}. It unnecessarily slows down the test.
Use {{GenericTestUtils.waitFor()}}.
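
For example, instead of the sleep (a sketch that polls the refreshed value,
reusing the condition this test is actually waiting for):

{code:java}
GenericTestUtils.waitFor(() ->
    cs.getConfiguration().getMaximumSystemApplications() != 1,
    500, 10_000);
{code}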

5. 
{noformat}
try {
  rm = new MockRM(configuration);
  rm.init(configuration);
  rm.start();
} catch(Exception ex) {
  fail("Should not get any exceptions");
}
{noformat}

You don't have to catch the exceptions from MockRM. If it fails, the test fails 
anyway. In this case, it will be counted as a failed test. But if it cannot 
start, that's really a test error, which is a separate counter in JUnit. Just 
remove the try-catch block.
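
I.e. simply (the declared exceptions can go on the test method signature):

{code:java}
// Let any startup exception propagate and fail the test as an error:
rm = new MockRM(configuration);
rm.init(configuration);
rm.start();
{code}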

> Capacity scheduler should support refresh queue automatically by a thread 
> policy.
> -
>
> Key: YARN-10623
> URL: https://issues.apache.org/jira/browse/YARN-10623
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10623.001.patch, YARN-10623.002.patch, 
> YARN-10623.003.patch
>
>
> In Fair Scheduler, queue-related configuration can be refreshed automatically 
> by a reloading thread, but in Capacity Scheduler queue changes can only be 
> refreshed via refreshQueues. Our cluster needs automatic refresh to manage 
> its queues.
> cc [~wangda] [~ztang] [~pbacsko] [~snemeth] [~gandras]  [~bteke] [~shuzirra]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10609) Update the document for YARN-10531(Be able to disable user limit factor for CapacityScheduler Leaf Queue)

2021-02-24 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10609:

Hadoop Flags: Reviewed

> Update the document for YARN-10531(Be able to disable user limit factor for 
> CapacityScheduler Leaf Queue)
> -
>
> Key: YARN-10609
> URL: https://issues.apache.org/jira/browse/YARN-10609
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10609.001.patch, YARN-10609.002.patch, 
> YARN-10609.003.patch, YARN-10609.004.patch, YARN-10609.005.patch
>
>
> Since we have finished YARN-10531.
> We should update the corresponding document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10609) Update the document for YARN-10531(Be able to disable user limit factor for CapacityScheduler Leaf Queue)

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290217#comment-17290217
 ] 

Peter Bacsko commented on YARN-10609:
-

+1

Thanks [~zhuqi] for the patch and [~bteke] for the review.

Committed to master.

> Update the document for YARN-10531(Be able to disable user limit factor for 
> CapacityScheduler Leaf Queue)
> -
>
> Key: YARN-10609
> URL: https://issues.apache.org/jira/browse/YARN-10609
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10609.001.patch, YARN-10609.002.patch, 
> YARN-10609.003.patch, YARN-10609.004.patch, YARN-10609.005.patch
>
>
> Since we have finished YARN-10531.
> We should update the corresponding document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289968#comment-17289968
 ] 

Peter Bacsko commented on YARN-10532:
-

FIRST round review.

I might post more, but these are the ones that stand out to me right now.

1.
 AbstractYarnScheduler:
{noformat}
  public void removeQueue(CSQueue queueName) throws YarnException {
throw new YarnException(getClass().getSimpleName()
+ " does not support removing queues");
  }
{noformat}
If this is an abstract class, just make this method abstract without 
implementation:
 {{public abstract void removeQueue(CSQueue queueName) throws YarnException;}}

2.
{noformat}
  // When this queue has application submit to?
  // This property only applies to dynamic queue,
  // and will be used to check when the queue need to be removed.
{noformat}
Rephrase this comment a little bit:
{noformat}
  // The timestamp of the last submitted application to this queue.
  // Only applies to dynamic queues.
{noformat}
3.
{noformat}
  // "Tab" the queue, so this queue won't be removed because of idle 
timeout.
  public void signalToSubmitToQueue() {
{noformat}
I'd comment that "Update the timestamp of the last submitted application".

Also, the method name sounds weird to me. What it does is really simple. 
Call it {{updateLastSubmittedTimeStamp()}}.
If you use the right naming, then the comment is probably unnecessary. We 
don't need comments if the method is simple and its purpose is easy to understand.

4. Instead of this:
{noformat}
  // just for test
  public void setLastSubmittedTimestamp(long lastSubmittedTimestamp) {
{noformat}
use this:
{noformat}
  @VisibleForTesting
  public void setLastSubmittedTimestamp(long lastSubmittedTimestamp) {
{noformat}
5. This comment is completely unnecessary I think:
{noformat}
// Expired queue, when there are no applications in queue,
// and the last submit time has been expired.
// Delete queue when expired deletion enabled.
{noformat}
It's obvious what the method is doing. Or if you insist on having a comment 
there, just add "Timeout expired, delete the dynamic queue"

6. I suggest a better exception message:
{noformat}
throw new SchedulerDynamicEditException(
"The queue " + queue.getQueuePath()
+ " can't removed normally.");
{noformat}
It should say "The queue ABC cannot be removed because it's parent is null".

7. {{LOG.info("Removed queue: " + queue.getQueuePath());}} – not necessary to 
log a successful removal. If there is no message, it means that the removal was 
successful.

8. Typo in comment: {{// 300s for expired defualt}} --> "default"

9. These methods are used by the code itself, not just test:
{noformat}
  @VisibleForTesting
  public void prepareForAutoDeletion() {
  ...
  @VisibleForTesting
  public void triggerAutoDeletionForExpiredQueues() {
{noformat}
So "VisibleForTesting" should be removed.

10.
{noformat}
  private void queueAutoDeletion(CSQueue checkQueue) {
//Scheduler update is asynchronous
if (checkQueue != null) {
{noformat}
Three things:
 * {{queueAutoDeletion()}} - this method name is a noun. Ideally, method names 
begin with a verb. For example "deleteDynamicQueue()" or "deleteAutoCreatedQueue()".
 * Also, why is it called "checkQueue"? Just call it "queue".
 * The comment is confusing: "Scheduler update is asynchronous". Why is it 
there? This statement does not tell me anything in this context. Does it refer 
to the null-check?

11.
{noformat}
  @Before
  public void setUp() throws Exception {
// The expired time for deletion will be 1s
super.setUp();
  }
{noformat}
This method is unnecessary, the setUp() method in the super class will be 
called anyway.

12. Test methods: {{testEditSchedule}}, 
{{testCapacitySchedulerAutoQueueDeletion}}, 
{{testCapacitySchedulerAutoQueueDeletionDisabled}}
 These test methods are long, but it's not my main problem. There are 
{{Thread.sleep()}} calls inside. This is really problematic, especially short 
sleeps like {{Thread.sleep(100)}}.
 I have fixed many flaky tests where the test code was full of 
{{Thread.sleep()}}. This must be avoided whenever possible.

We should come up with a better solution, eg. polling a certain state 
regularly, for example:
{noformat}
GenericTestUtils.waitFor(() -> someObject.isConditionTrue(), 500, 10_000);
{noformat}
This method calls {{someObject.isConditionTrue()}} every 500 ms and times 
out after 10 seconds. In case of a timeout, a {{TimeoutException}} will be 
thrown.

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 

[jira] [Commented] (YARN-10640) Ajust the queue Configured capacity to Configured weight number for weight mode in UI.

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289929#comment-17289929
 ] 

Peter Bacsko commented on YARN-10640:
-

[~zhuqi] Thanks for your work. I've just started to review your patches, but 
there are many, so I'll try to do my best and give feedback sooner or later.

> Ajust the queue Configured capacity to  Configured weight number for weight 
> mode in UI.
> ---
>
> Key: YARN-10640
> URL: https://issues.apache.org/jira/browse/YARN-10640
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10640.001.patch, YARN-10640.002.patch, 
> image-2021-02-20-11-21-50-306.png, image-2021-02-20-14-18-56-261.png, 
> image-2021-02-20-14-19-30-767.png
>
>
> In weight mode:
> Both the weight mode static queue and the dynamic queue will show the 
> Configured Capacity as 0. I think this should change to Configured Weight if 
> we use weight mode; that would be helpful.
> Such as in dynamic weight mode queue:
> !image-2021-02-20-11-21-50-306.png|width=528,height=374!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10627) Extend logging to give more information about weight mode

2021-02-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289911#comment-17289911
 ] 

Peter Bacsko commented on YARN-10627:
-

# Still have 4 checkstyle issues, they're not serious, but if we're not in a 
rush, we should fix those.
 # testGetCapacityOrWeightStringUsingWeights / 
testGetCapacityOrWeightStringParentPctLeafWeights -> make sure MockRM is closed 
in a finally block
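
Something like this for those two tests (a sketch only; {{csConf}} stands for
whatever configuration the test already builds):

{code:java}
MockRM rm = null;
try {
  rm = new MockRM(csConf);
  rm.start();
  // ... assertions on the capacity/weight string ...
} finally {
  if (rm != null) {
    rm.close();
  }
}
{code}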

> Extend logging to give more information about weight mode
> -
>
> Key: YARN-10627
> URL: https://issues.apache.org/jira/browse/YARN-10627
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10627.001.patch, YARN-10627.002.patch, 
> YARN-10627.003.patch, image-2021-02-20-00-07-09-875.png
>
>
> In YARN-10504 weight mode was added, however the logged information about the 
> created queues or the toString methods weren't updated accordingly. Some 
> examples:
> ParentQueue#setupQueueConfigs:
> {code:java}
>  LOG.info(queueName + ", capacity=" + this.queueCapacities.getCapacity()
>   + ", absoluteCapacity=" + this.queueCapacities.getAbsoluteCapacity()
>   + ", maxCapacity=" + this.queueCapacities.getMaximumCapacity()
>   + ", absoluteMaxCapacity=" + this.queueCapacities
>   .getAbsoluteMaximumCapacity() + ", state=" + getState() + ", acls="
>   + aclsString + ", labels=" + labelStrBuilder.toString() + "\n"
>   + ", reservationsContinueLooking=" + reservationsContinueLooking
>   + ", orderingPolicy=" + getQueueOrderingPolicyConfigName()
>   + ", priority=" + priority
>   + ", allowZeroCapacitySum=" + allowZeroCapacitySum);
> {code}
> ParentQueue#toString:
> {code:java}
> public String toString() {
> return queueName + ": " +
> "numChildQueue= " + childQueues.size() + ", " + 
> "capacity=" + queueCapacities.getCapacity() + ", " +  
> "absoluteCapacity=" + queueCapacities.getAbsoluteCapacity() + ", " +
> "usedResources=" + queueUsage.getUsed() + 
> "usedCapacity=" + getUsedCapacity() + ", " + 
> "numApps=" + getNumApplications() + ", " + 
> "numContainers=" + getNumContainers();
>  }
> {code}
> LeafQueue#setupQueueConfigs:
> {code:java}
>   LOG.info(
>   "Initializing " + getQueuePath() + "\n" + "capacity = "
>   + queueCapacities.getCapacity()
>   + " [= (float) configuredCapacity / 100 ]" + "\n"
>   + "absoluteCapacity = " + queueCapacities.getAbsoluteCapacity()
>   + " [= parentAbsoluteCapacity * capacity ]" + "\n"
>   + "maxCapacity = " + queueCapacities.getMaximumCapacity()
>   + " [= configuredMaxCapacity ]" + "\n" + "absoluteMaxCapacity = 
> "
>   + queueCapacities.getAbsoluteMaximumCapacity()
>   + " [= 1.0 maximumCapacity undefined, "
>   + "(parentAbsoluteMaxCapacity * maximumCapacity) / 100 
> otherwise ]"
>   + "\n" + "effectiveMinResource=" +
>   getEffectiveCapacity(CommonNodeLabelsManager.NO_LABEL) + "\n"
>   + " , effectiveMaxResource=" +
>   getEffectiveMaxCapacity(CommonNodeLabelsManager.NO_LABEL)
>   + "\n" + "userLimit = " + usersManager.getUserLimit()
>   + " [= configuredUserLimit ]" + "\n" + "userLimitFactor = "
>   + usersManager.getUserLimitFactor()
>   + " [= configuredUserLimitFactor ]" + "\n" + "maxApplications = 
> "
>   + maxApplications
>   + " [= configuredMaximumSystemApplicationsPerQueue or"
>   + " (int)(configuredMaximumSystemApplications * 
> absoluteCapacity)]"
>   + "\n" + "maxApplicationsPerUser = " + maxApplicationsPerUser
>   + " [= (int)(maxApplications * (userLimit / 100.0f) * "
>   + "userLimitFactor) ]" + "\n"
>   + "maxParallelApps = " + getMaxParallelApps() + "\n"
>   + "usedCapacity = " +
>   + queueCapacities.getUsedCapacity() + " [= usedResourcesMemory 
> / "
>   + "(clusterResourceMemory * absoluteCapacity)]" + "\n"
>   + "absoluteUsedCapacity = " + absoluteUsedCapacity
>   + " [= usedResourcesMemory / clusterResourceMemory]" + "\n"
>   + "maxAMResourcePerQueuePercent = " + 
> maxAMResourcePerQueuePercent
>   + " [= configuredMaximumAMResourcePercent ]" + "\n"
>   + "minimumAllocationFactor = " + minimumAllocationFactor
>   + " [= (float)(maximumAllocationMemory - 
> minimumAllocationMemory) / "
>   + "maximumAllocationMemory ]" + "\n" + "maximumAllocation = "
>   + 

[jira] [Commented] (YARN-10513) CS Flexible Auto Queue Creation RM UIv2 modifications

2021-02-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288346#comment-17288346
 ] 

Peter Bacsko commented on YARN-10513:
-

+1

Committed to trunk. Thanks [~gandras] for the patch and [~bteke] for the 
review. 

> CS Flexible Auto Queue Creation RM UIv2 modifications
> -
>
> Key: YARN-10513
> URL: https://issues.apache.org/jira/browse/YARN-10513
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: Screenshot 2021-02-04 at 12.54.25.png, Screenshot 
> 2021-02-04 at 12.54.52.png, Screenshot 2021-02-04 at 12.55.10.png, Screenshot 
> 2021-02-08 at 10.34.32.png, Screenshot 2021-02-17 at 15.22.30.png, 
> YARN-10513.001.patch, YARN-10513.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10513) CS Flexible Auto Queue Creation RM UIv2 modifications

2021-02-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288341#comment-17288341
 ] 

Peter Bacsko edited comment on YARN-10513 at 2/22/21, 12:01 PM:


Patch LGTM, although I'm nowhere near a Javascript maestro. [~gandras] if you 
put enough effort into testing, I'll believe that it works :)

Going to commit this soon.


was (Author: pbacsko):
Patch LGTM, although I'm nowhere near a Javascript maestro. [~gandras] if you 
put enough effort into testing, I'll believe it :)

Going to commit this soon.

> CS Flexible Auto Queue Creation RM UIv2 modifications
> -
>
> Key: YARN-10513
> URL: https://issues.apache.org/jira/browse/YARN-10513
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: Screenshot 2021-02-04 at 12.54.25.png, Screenshot 
> 2021-02-04 at 12.54.52.png, Screenshot 2021-02-04 at 12.55.10.png, Screenshot 
> 2021-02-08 at 10.34.32.png, Screenshot 2021-02-17 at 15.22.30.png, 
> YARN-10513.001.patch, YARN-10513.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10513) CS Flexible Auto Queue Creation RM UIv2 modifications

2021-02-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288341#comment-17288341
 ] 

Peter Bacsko commented on YARN-10513:
-

Patch LGTM, although I'm nowhere near a Javascript maestro. [~gandras] if you 
put enough effort into testing, I'll believe it :)

Going to commit this soon.

> CS Flexible Auto Queue Creation RM UIv2 modifications
> -
>
> Key: YARN-10513
> URL: https://issues.apache.org/jira/browse/YARN-10513
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: Screenshot 2021-02-04 at 12.54.25.png, Screenshot 
> 2021-02-04 at 12.54.52.png, Screenshot 2021-02-04 at 12.55.10.png, Screenshot 
> 2021-02-08 at 10.34.32.png, Screenshot 2021-02-17 at 15.22.30.png, 
> YARN-10513.001.patch, YARN-10513.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10636) CS Auto Queue creation should reject submissions with empty path parts

2021-02-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287065#comment-17287065
 ] 

Peter Bacsko edited comment on YARN-10636 at 2/19/21, 1:00 PM:
---

Thanks [~shuzirra] for the patch and [~gandras]/[~zhuqi]/[~bteke] for the 
review.

Since we're in a big hurry, I already committed v2 w/o Jenkins results. Once 
Jenkins posts here, I'll check it and (hopefully) close the ticket.

EDIT: there will be no Jenkins run, because the changeset is already in :D 
Anyway, the v1-v2 difference is minimal.


was (Author: pbacsko):
Thanks [~shuzirra] for the patch and [~gandras]/[~zhuqi]/[~bteke] for the 
review.

Since we're in a big hurry, I already committed v2 w/o Jenkins results. Once 
Jenkins posts here, I'll check it and (hopefully) close the ticket.

> CS Auto Queue creation should reject submissions with empty path parts
> --
>
> Key: YARN-10636
> URL: https://issues.apache.org/jira/browse/YARN-10636
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10636.001.patch, YARN-10636.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10636) CS Auto Queue creation should reject submissions with empty path parts

2021-02-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287065#comment-17287065
 ] 

Peter Bacsko commented on YARN-10636:
-

Thanks [~shuzirra] for the patch and [~gandras]/[~zhuqi]/[~bteke] for the 
review.

Since we're in a big hurry, I already committed v2 w/o Jenkins results. Once 
Jenkins posts here, I'll check it and (hopefully) close the ticket.

> CS Auto Queue creation should reject submissions with empty path parts
> --
>
> Key: YARN-10636
> URL: https://issues.apache.org/jira/browse/YARN-10636
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10636.001.patch, YARN-10636.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10635) CSMapping rule can return paths with empty parts

2021-02-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287005#comment-17287005
 ] 

Peter Bacsko commented on YARN-10635:
-

+1

Thanks [~shuzirra] for the patch and [~bteke], [~shuzirra], [~zhuqi] for the 
review.

Patch has been committed to trunk.

> CSMapping rule can return paths with empty parts
> 
>
> Key: YARN-10635
> URL: https://issues.apache.org/jira/browse/YARN-10635
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10635.001.patch, YARN-10635.002.patch, 
> YARN-10635.003.patch
>
>
> When a variable to be substituted evaluates to an empty string, we might end 
> up with paths where one of the parts is empty. These paths are obviously 
> problematic, but sometimes (when the path includes a dynamicParent) we accept 
> them as valid paths instead of applying the fallback action of the rule.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10631) Document AM preemption related changes (YARN-9537 and YARN-10625)

2021-02-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10631:

Summary: Document AM preemption related changes (YARN-9537 and YARN-10625)  
(was: Document AM-preemption related changes (YARN-9537 and YARN-10625))

> Document AM preemption related changes (YARN-9537 and YARN-10625)
> -
>
> Key: YARN-10631
> URL: https://issues.apache.org/jira/browse/YARN-10631
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Preemption-related changes were introduced in YARN-9537 and YARN-10625.
> These also introduce new properties which are not documented for Fair 
> Scheduler. Extend the documentation with these enhancements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10631) Document AM-preemption related changes (YARN-9537 and YARN-10625)

2021-02-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-10631:
---

 Summary: Document AM-preemption related changes (YARN-9537 and 
YARN-10625)
 Key: YARN-10631
 URL: https://issues.apache.org/jira/browse/YARN-10631
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Preemption-related changes were introduced in YARN-9537 and YARN-10625.

These also introduce new properties which are not documented for Fair 
Scheduler. Extend the documentation with these enhancements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10625) FairScheduler: add global flag to disable AM-preemption

2021-02-15 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284749#comment-17284749
 ] 

Peter Bacsko commented on YARN-10625:
-

[~bteke] / [~snemeth] please review this patch if you have some free cycles.

I think a doc update is also desirable; I'd do that in a follow-up JIRA.

> FairScheduler: add global flag to disable AM-preemption
> ---
>
> Key: YARN-10625
> URL: https://issues.apache.org/jira/browse/YARN-10625
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10625-001.patch
>
>
> YARN-9537 added a feature to disable AM preemption on a per-queue basis.
> This is a nice enhancement, but it's very inconvenient if the cluster has a 
> lot of queues or if queues are dynamically created/deleted regularly (static 
> queue configuration changes).
> It's a legitimate use case to have AM preemption turned off completely. To 
> make it easier, add a property which acts as a global flag for this feature.
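
A minimal sketch of how such a global switch would be consumed, assuming it 
lands as a boolean property in yarn-site.xml; the key name below is an 
assumption used for illustration and should be checked against the committed 
patch and documentation.

{noformat}
import org.apache.hadoop.conf.Configuration;

public class GlobalAmPreemptionFlag {
  // Assumed property name, for illustration only -- see the patch/docs for the
  // real key and its default value.
  static final String AM_PREEMPTION_KEY = "yarn.scheduler.fair.am.preemption";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Cluster-wide switch: turns AM preemption off for every queue at once,
    // instead of flipping a per-queue setting in the allocation file.
    conf.setBoolean(AM_PREEMPTION_KEY, false);
    boolean amPreemptionEnabled = conf.getBoolean(AM_PREEMPTION_KEY, true);
    System.out.println("AM preemption enabled: " + amPreemptionEnabled);
  }
}
{noformat}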



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10625) FairScheduler: add global flag to disable AM-preemption

2021-02-15 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10625:

Attachment: YARN-10625-001.patch

> FairScheduler: add global flag to disable AM-preemption
> ---
>
> Key: YARN-10625
> URL: https://issues.apache.org/jira/browse/YARN-10625
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.3.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10625-001.patch
>
>
> YARN-9537 added a feature to disable AM preemption on a per-queue basis.
> This is a nice enhancement, but it's very inconvenient if the cluster has a 
> lot of queues or if queues are dynamically created/deleted regularly (static 
> queue configuration changes).
> It's a legitimate use case to have AM preemption turned off completely. To 
> make it easier, add a property which acts as a global flag for this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10625) FairScheduler: add global flag to disable AM-preemption

2021-02-12 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-10625:
---

 Summary: FairScheduler: add global flag to disable AM-preemption
 Key: YARN-10625
 URL: https://issues.apache.org/jira/browse/YARN-10625
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.3.0
Reporter: Peter Bacsko
Assignee: Peter Bacsko


YARN-9537 added a feature to disable AM preemption on a per-queue basis.

This is a nice enhancement, but it's very inconvenient if the cluster has a lot 
of queues or if queues are dynamically created/deleted regularly (static queue 
configuration changes).

It's a legitimate use case to have AM preemption turned off completely. To make 
it easier, add a property which acts as a global flag for this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion

2021-02-10 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10620:

Description: 
There are some placement rules in FS which are currently not handled properly 
by fs2cs:

{noformat}




{noformat}

The first rule means that if the user queue doesn't exist, it should be created 
as {{root.<user>}}.
The second means the same thing, except it refers to the primary group instead 
of the submitting user: {{root.<primary group>}}.

The problem is that in order for the create="true" setting to take effect, we 
must set the parent queue in the generated JSON:

Current:
{noformat}
{
  "rules" : [ {
"type" : "user",
"matches" : "*",
"policy" : "user",
"fallbackResult" : "skip",
"create" : true
  }, {
"type" : "user",
"matches" : "*",
"policy" : "primaryGroup",
"fallbackResult" : "skip",
"create" : true
  } ]
}
{noformat}

Expected:
{noformat}
{
  "rules" : [ {
"type" : "user",
"matches" : "*",
"policy" : "user",
"fallbackResult" : "skip",
"parentQueue": "root",
"create" : true
  }, {
"type" : "user",
"matches" : "*",
"policy" : "primaryGroup",
"fallbackResult" : "skip",
"parentQueue": "root",
"create" : true
  } ]
{noformat}

This is missing right now and it needs to be fixed.

  was:
There are some placement rules in FS which are currently not handled properly 
by fs2cs:

{noformat}




{noformat}

The first rule means that if the user queue doesn't exist, it should be created 
as {{root.<user>}}.
The second means the same thing, except it refers to the primary group instead 
of the submitting user: {{root.<primary group>}}.

The problem is that in order for the create="true" setting to take effect, we 
must set the parent queue in the generated JSON:

{noformat}
{
  "rules" : [ {
"type" : "user",
"matches" : "*",
"policy" : "user",
"fallbackResult" : "skip",
"create" : true
  }, {
"type" : "user",
"matches" : "*",
"policy" : "primaryGroup",
"fallbackResult" : "skip",
"create" : true
  } ]
}
{noformat}

This is missing right now and it needs to be fixed.


> fs2cs: parentQueue for certain placement rules are not set during conversion
> 
>
> Key: YARN-10620
> URL: https://issues.apache.org/jira/browse/YARN-10620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10620-001.patch, YARN-10620-002.patch
>
>
> There are some placement rules in FS which are currently not handled properly 
> by fs2cs:
> {noformat}
> 
> 
> 
> 
> {noformat}
> The first rule means that if the user queue doesn't exist, it should be 
> created as {{root.<user>}}.
> The second means the same thing, except it refers to the primary group 
> instead of the submitting user: {{root.<primary group>}}.
> The problem is that in order for the create="true" setting to take effect, we 
> must set the parent queue in the generated JSON:
> Current:
> {noformat}
> {
>   "rules" : [ {
> "type" : "user",
> "matches" : "*",
> "policy" : "user",
> "fallbackResult" : "skip",
> "create" : true
>   }, {
> "type" : "user",
> "matches" : "*",
> "policy" : "primaryGroup",
> "fallbackResult" : "skip",
> "create" : true
>   } ]
> }
> {noformat}
> Expected:
> {noformat}
> {
>   "rules" : [ {
> "type" : "user",
> "matches" : "*",
> "policy" : "user",
> "fallbackResult" : "skip",
> "parentQueue": "root",
> "create" : true
>   }, {
> "type" : "user",
> "matches" : "*",
> "policy" : "primaryGroup",
> "fallbackResult" : "skip",
> "parentQueue": "root",
> "create" : true
>   } ]
> {noformat}
> This is missing right now and it needs to be fixed.
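
As an illustration of the transformation only (this is not the fs2cs converter 
code, and Jackson is used here purely for the demonstration): a hedged sketch 
that turns the "Current" JSON above into the "Expected" one by pinning every 
create=true rule without an explicit parent to parentQueue "root".

{noformat}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class AddParentQueue {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Condensed version of the "Current" JSON from the description above.
    String current = "{\"rules\":[{\"type\":\"user\",\"matches\":\"*\",\"policy\":\"user\","
        + "\"fallbackResult\":\"skip\",\"create\":true},{\"type\":\"user\",\"matches\":\"*\","
        + "\"policy\":\"primaryGroup\",\"fallbackResult\":\"skip\",\"create\":true}]}";

    JsonNode root = mapper.readTree(current);
    for (JsonNode rule : root.get("rules")) {
      // Without an explicit parent, create=true cannot take effect for these
      // policies, so pin dynamic queue creation under "root".
      if (rule.path("create").asBoolean(false) && !rule.has("parentQueue")) {
        ((ObjectNode) rule).put("parentQueue", "root");
      }
    }
    System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(root));
  }
}
{noformat}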



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion

2021-02-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282322#comment-17282322
 ] 

Peter Bacsko commented on YARN-10620:
-

[~gandras] thanks, I modified the code a little bit and used a set instead of 
an array.
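
A hedged, out-of-context illustration of that change (the real one lives in the 
fs2cs patch): checking whether a rule's policy belongs to a fixed group reads 
more clearly, and runs in constant time, with a Set than with a scan over an 
array. The policy names below are illustrative.

{noformat}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PolicySetExample {
  // Policies that need an explicit parentQueue; names here are illustrative only.
  private static final Set<String> NEEDS_PARENT_QUEUE =
      new HashSet<>(Arrays.asList("user", "primaryGroup"));

  public static void main(String[] args) {
    System.out.println(NEEDS_PARENT_QUEUE.contains("user"));      // true
    System.out.println(NEEDS_PARENT_QUEUE.contains("specified")); // false
  }
}
{noformat}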

> fs2cs: parentQueue for certain placement rules are not set during conversion
> 
>
> Key: YARN-10620
> URL: https://issues.apache.org/jira/browse/YARN-10620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10620-001.patch, YARN-10620-002.patch
>
>
> There are some placement rules in FS which are currently not handled properly 
> by fs2cs:
> {noformat}
> 
> 
> 
> 
> {noformat}
> The first rule means that if the user queue doesn't exist, it should be 
> created as {{root.<user>}}.
> The second means the same thing, except it refers to the primary group 
> instead of the submitting user: {{root.<primary group>}}.
> The problem is that in order for the create="true" setting to take effect, we 
> must set the parent queue in the generated JSON:
> {noformat}
> {
>   "rules" : [ {
> "type" : "user",
> "matches" : "*",
> "policy" : "user",
> "fallbackResult" : "skip",
> "create" : true
>   }, {
> "type" : "user",
> "matches" : "*",
> "policy" : "primaryGroup",
> "fallbackResult" : "skip",
> "create" : true
>   } ]
> }
> {noformat}
> This is missing right now and it needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10620) fs2cs: parentQueue for certain placement rules are not set during conversion

2021-02-10 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10620:

Attachment: YARN-10620-002.patch

> fs2cs: parentQueue for certain placement rules are not set during conversion
> 
>
> Key: YARN-10620
> URL: https://issues.apache.org/jira/browse/YARN-10620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10620-001.patch, YARN-10620-002.patch
>
>
> There are some placement rules in FS which are currently not handled properly 
> by fs2cs:
> {noformat}
> 
> 
> 
> 
> {noformat}
> The first rule means that if the user queue doesn't exist, it should be 
> created as {{root.<user>}}.
> The second means the same thing, except it refers to the primary group 
> instead of the submitting user: {{root.<primary group>}}.
> The problem is that in order for the create="true" setting to take effect, we 
> must set the parent queue in the generated JSON:
> {noformat}
> {
>   "rules" : [ {
> "type" : "user",
> "matches" : "*",
> "policy" : "user",
> "fallbackResult" : "skip",
> "create" : true
>   }, {
> "type" : "user",
> "matches" : "*",
> "policy" : "primaryGroup",
> "fallbackResult" : "skip",
> "create" : true
>   } ]
> }
> {noformat}
> This is missing right now and it needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


