[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217288#comment-17217288
 ] 

Akira Ajisaka commented on YARN-10460:
--

Thank you [~pbacsko]!

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  // <-- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because various 
> objects are cached in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> clien

[jira] [Comment Edited] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217275#comment-17217275
 ] 

Akira Ajisaka edited comment on YARN-10460 at 10/20/20, 5:37 AM:
-

Hi [~pbacsko], I have a question. Besides this issue, are there any other tests 
that fail with JUnit 4.13?

Hi [~snemeth], I would like to backport this to older branches. I am now considering 
upgrading JUnit to 4.13.1 in Apache Hadoop because of CVE-2020-15250 
(HADOOP-17316). The severity is low, but the upgrade is worthwhile.
https://github.com/junit-team/junit4/security/advisories/GHSA-269g-pwp5-87pp


was (Author: ajisakaa):
Hi [~pbacsko], I have a question. Besides this issue, are there any other tests 
that fail with JUnit 4.13?

Hi [~snemeth], I would like to backport this to older branches. I am now considering 
upgrading JUnit to 4.13.1 in Apache Hadoop because of CVE-2020-15250. The 
severity is low, but the upgrade is worthwhile.
https://github.com/junit-team/junit4/security/advisories/GHSA-269g-pwp5-87pp

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  // <-- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because various 
> objects are cached in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.C

[jira] [Comment Edited] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217285#comment-17217285
 ] 

Wangda Tan edited comment on YARN-10178 at 10/20/20, 5:33 AM:
--

One of our customers recently hit the same issue, so I spent some time looking 
into this. Thanks [~tuyu] for the detailed analysis.

Apart from the issues mentioned by [~tuyu], which are related to async-scheduling, I 
think this can also happen when async-scheduling is disabled.

One possible place is completedContainer: it doesn't hold the scheduler's lock, so the 
problem can occur even when async-scheduling is disabled. (I checked the problematic 
log: a container release event happened at the exact same timestamp, 
within the same millisecond, as the RM crash.)

One possible way to fix the problem is to take a snapshot of the queue capacities 
inside PriorityUtilizationQueueOrderingPolicy. The snapshot includes:
{code:java}
 AbsoluteUsedCapacity
 UsedCapacity
 ConfiguredMinResource
 AbsoluteCapacity

plus the CSQueue's reference
{code}
Create a new internal class (like PriorityQueueResourcesForSorting) to hold 
these 4 fields. Instead of sorting the CSQueue objects directly, we will sort the new 
structure.

Copying the resource fields adds some cost, but it should be minimal 
in most cases (unless you have thousands of queues). cc: [~bteke] 
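
The snapshot idea described in this comment could look roughly like the following. This is only a minimal sketch under stated assumptions: {{LiveQueue}}, {{QueueSnapshot}} and {{SnapshotSortSketch}} are hypothetical names, not Hadoop classes, and only two of the four capacity values are modelled to keep it short. The point it illustrates is that the comparator reads immutable copies, so a concurrent update to a live queue cannot change the ordering in the middle of a sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a live queue; its usage values may change at any time.
final class LiveQueue {
    final String name;
    volatile float usedCapacity;
    volatile float absoluteUsedCapacity;

    LiveQueue(String name, float usedCapacity, float absoluteUsedCapacity) {
        this.name = name;
        this.usedCapacity = usedCapacity;
        this.absoluteUsedCapacity = absoluteUsedCapacity;
    }
}

// Immutable copy of the values the comparator reads, plus a reference to the queue.
final class QueueSnapshot {
    final LiveQueue queue;
    final float usedCapacity;
    final float absoluteUsedCapacity;

    QueueSnapshot(LiveQueue queue) {
        this.queue = queue;
        // Each live value is read exactly once, before sorting starts.
        this.usedCapacity = queue.usedCapacity;
        this.absoluteUsedCapacity = queue.absoluteUsedCapacity;
    }
}

final class SnapshotSortSketch {
    // Returns the queues ordered by the snapshotted usage values (ascending).
    static List<LiveQueue> sortByUsage(List<LiveQueue> queues) {
        List<QueueSnapshot> snapshots = new ArrayList<>(queues.size());
        for (LiveQueue q : queues) {
            snapshots.add(new QueueSnapshot(q));
        }
        // The comparator only touches final fields of the snapshots.
        snapshots.sort(Comparator
            .comparingDouble((QueueSnapshot s) -> s.usedCapacity)
            .thenComparingDouble(s -> s.absoluteUsedCapacity));
        List<LiveQueue> sorted = new ArrayList<>(snapshots.size());
        for (QueueSnapshot s : snapshots) {
            sorted.add(s.queue);
        }
        return sorted;
    }
}
```

The extra cost is one small object per queue per sort, which matches the estimate given in the comment.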


was (Author: wangda):
One of our customers recently hit the same issue, so I spent some time looking 
into this. Thanks [~tuyu] for the detailed analysis.

Apart from the issues mentioned by [~tuyu], which are related to async-scheduling, I 
think this can also happen when async-scheduling is disabled.

One possible place is completedContainer: it doesn't hold the scheduler's lock, so the 
problem can occur even when async-scheduling is disabled. (I checked the problematic 
log: a container release event happened at the exact same timestamp, 
within the same millisecond, as the RM crash.)

One possible way to fix the problem is to take a snapshot of the queue capacities 
inside PriorityUtilizationQueueOrderingPolicy. The snapshot includes:
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity

Create a new internal class (like PriorityQueueResourcesForSorting) to hold 
these 4 fields. Instead of sorting the CSQueue objects directly, we will sort the new 
structure.

Copying the resource fields adds some cost, but it should be minimal 
in most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler

[jira] [Commented] (YARN-10178) Global Scheduler asycthread crash caused by 'Comparison method violates its general contract'

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217285#comment-17217285
 ] 

Wangda Tan commented on YARN-10178:
---

One of our customers recently hit the same issue, so I spent some time looking 
into this. Thanks [~tuyu] for the detailed analysis.

Apart from the issues mentioned by [~tuyu], which are related to async-scheduling, I 
think this can also happen when async-scheduling is disabled.

One possible place is completedContainer: it doesn't hold the scheduler's lock, so the 
problem can occur even when async-scheduling is disabled. (I checked the problematic 
log: a container release event happened at the exact same timestamp, 
within the same millisecond, as the RM crash.)

One possible way to fix the problem is to take a snapshot of the queue capacities 
inside PriorityUtilizationQueueOrderingPolicy. The snapshot includes:
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity

Create a new internal class (like PriorityQueueResourcesForSorting) to hold 
these 4 fields. Instead of sorting the CSQueue objects directly, we will sort the new 
structure.

Copying the resource fields adds some cost, but it should be minimal 
in most cases (unless you have thousands of queues). cc: [~bteke] 

> Global Scheduler asycthread crash caused by 'Comparison method violates its 
> general contract'
> -
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Priority: Major
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses TimSort by default for object arrays, and TimSort 
> requires the comparison to satisfy a few rules:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z --> x > z
> 3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the elements being sorted do not satisfy these requirements, TimSort may throw 
> 'java.lang.IllegalArgumentException'.
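
As an illustration of the comparison requirements listed above, the following brute-force check tests a comparator against a small sample. It is a sketch for reasoning about the contract, not part of any patch on this issue, and {{ComparatorContractSketch}} and {{satisfiesContract}} are hypothetical names. A comparator whose inputs change while a sort is running can fail these checks even if its code looks correct.

```java
import java.util.Comparator;
import java.util.List;

final class ComparatorContractSketch {
    // Checks sign symmetry, transitivity and consistency of "equal" elements
    // over every pair and triple of the given samples.
    static <T> boolean satisfiesContract(Comparator<T> cmp, List<T> samples) {
        for (T x : samples) {
            for (T y : samples) {
                int xy = Integer.signum(cmp.compare(x, y));
                int yx = Integer.signum(cmp.compare(y, x));
                if (xy != -yx) {
                    return false; // rule 1: sgn(compare(x, y)) == -sgn(compare(y, x))
                }
                for (T z : samples) {
                    int yz = Integer.signum(cmp.compare(y, z));
                    int xz = Integer.signum(cmp.compare(x, z));
                    if (xy > 0 && yz > 0 && xz <= 0) {
                        return false; // rule 2: x > y and y > z implies x > z
                    }
                    if (xy == 0 && xz != yz) {
                        return false; // rule 3: equal elements compare alike against z
                    }
                }
            }
        }
        return true;
    }
}
```

For example, the natural ordering of integers passes, while a comparator that returns 1 for every pair of unequal elements fails the first rule.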
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can see that 
> the Capacity Scheduler uses these queue resource usage values for the comparison:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the Global Scheduler AsyncThread uses 
> PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a 
> container to, constructs a CSAssignment struct, and uses 
> submitResourceCommit

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217284#comment-17217284
 ] 

Peter Bacsko commented on YARN-10460:
-

[~aajisaka] there is one more test that is potentially affected by the same 
thing: 
{{org.apache.hadoop.mapreduce.task.reduce.TestFetcher.testCorruptedIFile()}}. 
I haven't been able to reproduce it locally, but I saw the same stack trace 
in Jenkins, so we have to keep an eye on that as well.
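
The stack trace in question ends in {{ThreadGroup.addUnstarted}}, which suggests that a thread is being created inside a thread group that has already been destroyed. The sketch below tries to show that mechanism in isolation. {{DestroyedThreadGroupSketch}} is a hypothetical name, and the outcome depends on the JDK: the JDK 8 line numbers in the trace correspond to a runtime where {{destroy()}} really destroys the group, whereas recent JDK releases have turned {{destroy()}} into a no-op, so the method reports what happened instead of assuming it.

```java
final class DestroyedThreadGroupSketch {
    // Creates a thread in a group after destroy() has been called on it and
    // reports whether the JDK rejected the new thread.
    @SuppressWarnings({"deprecation", "removal"})
    static String createThreadInDestroyedGroup() {
        ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
        // The group is empty, so destroy() is permitted here.
        group.destroy();
        try {
            // On JDKs where destroy() takes effect, the Thread constructor
            // fails while registering the unstarted thread with the group.
            new Thread(group, () -> { }, "worker");
            return "no exception";
        } catch (IllegalThreadStateException e) {
            return "IllegalThreadStateException";
        }
    }
}
```

A cached executor whose thread factory captured the test's thread group would presumably hit the same failure whenever it needs a new worker after the group is gone, which would explain why only some tests are affected.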

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  // <-- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because various 
> objects are cached in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNo

[jira] [Commented] (YARN-10453) Add partition resource info to get-node-labels and label-mappings api responses

2020-10-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217282#comment-17217282
 ] 

Hadoop QA commented on YARN-10453:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 15s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} |  | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} |  | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 14s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 59s{color} |  | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 48s{color} |  | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 36s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 52s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 38s{color} |  | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 39s{color} |  | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 34s{color} |  | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 46s{color} |  | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 45s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 50s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 53s{color} |  | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 53s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 44s{color} |  | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 44s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 30s{color} |  | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 15 unchanged - 1 fixed = 15 total (was 16) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 47s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 17s{color} |  | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 36s{color} |  | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 32s{color} |  | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 51s{color} |  | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} unit {color}

[jira] [Commented] (YARN-8737) Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

2020-10-19 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217279#comment-17217279
 ] 

Wangda Tan commented on YARN-8737:
--

I re-kicked Jenkins. After reviewing the case, the fix looks good to me, even 
though it covers only a small subset of the issues. I agree with moving the 
scheduling-related issues to YARN-10178.

> Race condition in ParentQueue when reinitializing and sorting child queues in 
> the meanwhile
> ---
>
> Key: YARN-8737
> URL: https://issues.apache.org/jira/browse/YARN-8737
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0, 2.9.3, 3.2.2, 3.1.4
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8737.001.patch
>
>
> An administrator submitted a queue update through the REST API. In the RM, the parent 
> queue was refreshing its child queues by calling ParentQueue#reinitialize while, 
> at the same time, the async-schedule threads were sorting the child queues by calling 
> ParentQueue#sortAndGetChildrenAllocationIterator. A race condition may occur 
> and throw the following exception, because TimSort does not handle concurrent 
> modification of the objects it is sorting:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeCollapse(TimSort.java:441)
>         at java.util.TimSort.sort(TimSort.java:245)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:175)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:291)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:804)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:817)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:636)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2494)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:2431)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersOnMultiNodes(CapacityScheduler.java:2588)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:2676)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.scheduleBasedOnNodeLabels(CapacityScheduler.java:927)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:962)
> {noformat}
> I think we can take the read lock in 
> ParentQueue#sortAndGetChildrenAllocationIterator to solve this problem; the 
> write lock will be held when updating child queues in 
> ParentQueue#reinitialize.
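
The locking scheme proposed at the end of this description can be sketched as follows. This is an assumption-laden illustration, not the attached patch: {{ParentQueueLockSketch}} and its methods are hypothetical, and plain strings stand in for child queues. Sorting works on a private copy taken under the read lock, so several scheduling threads can sort at once, while a reinitialization waits for the write lock.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ParentQueueLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Guarded by 'lock'; strings stand in for child queue objects.
    private List<String> childQueues = new ArrayList<>();

    // Stand-in for a reinitialize step: replaces the children under the write lock.
    void reinitialize(List<String> newChildren) {
        lock.writeLock().lock();
        try {
            childQueues = new ArrayList<>(newChildren);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Stand-in for the sorting step: copies and sorts under the read lock.
    Iterator<String> sortAndGetChildrenIterator() {
        lock.readLock().lock();
        try {
            // Sort a copy so that concurrent readers never mutate shared state.
            List<String> copy = new ArrayList<>(childQueues);
            copy.sort(Comparator.naturalOrder());
            return copy.iterator();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Note that this only protects against the child list being replaced during a sort. It does not stop the values a comparator reads from changing on another code path that does not take the same lock.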



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217275#comment-17217275
 ] 

Akira Ajisaka commented on YARN-10460:
--

Hi [~pbacsko], I have a question. Are there any failing tests other than this 
issue in JUnit 4.13?

Hi [~snemeth], I would like to backport this to older branches. I am now 
considering upgrading JUnit to 4.13.1 in Apache Hadoop because of 
CVE-2020-15250. The severity is low, but it is worth upgrading.
https://github.com/junit-team/junit4/security/advisories/GHSA-269g-pwp5-87pp

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in Junit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
> CallableStatement callable = new CallableStatement();
> FutureTask task = new FutureTask(callable);
> ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
> Thread thread = new Thread(threadGroup, task, "Time-limited test");
> try {
> thread.setDaemon(true);
> thread.start();
> callable.awaitStarted();
> Throwable throwable = getResult(task, thread);
> if (throwable != null) {
> throw throwable;
> }
> } finally {
> try {
> thread.join(1);
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> }
> try {
> threadGroup.destroy();  // <-- This
> } catch (IllegalThreadStateException e) {
> // If a thread from the group is still alive, the ThreadGroup 
> cannot be destroyed.
> // Swallow the exception to keep the same behavior prior to 
> this change.
> }
> }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes an issue because there are 
> all sorts of object caching in the IPC layer. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at 
> java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at 
> com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at 
> org.ap

[jira] [Commented] (YARN-10453) Add partition resource info to get-node-labels and label-mappings api responses

2020-10-19 Thread Sunil G (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217221#comment-17217221
 ] 

Sunil G commented on YARN-10453:


Kicked jenkins again.

> Add partition resource info to get-node-labels and label-mappings api 
> responses
> ---
>
> Key: YARN-10453
> URL: https://issues.apache.org/jira/browse/YARN-10453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Major
> Attachments: YARN-10453.001.patch, YARN-10453.002.patch
>
>
> This jira will add partition resource info to the responses of the 
> get-node-labels and label-mappings APIs.






[jira] [Commented] (YARN-10466) Fix NullPointerException in yarn-services Component.java

2020-10-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217118#comment-17217118
 ] 

Hadoop QA commented on YARN-10466:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
18s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} |  | {color:red} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
29s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
33s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 47s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m  
2s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
30s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
29s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
29s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
27s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 25s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
20s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} |  | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1

[jira] [Created] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

2020-10-19 Thread Haibo Chen (Jira)
Haibo Chen created YARN-10467:
-

 Summary: ContainerIdPBImpl objects can be leaked in 
RMNodeImpl.completedContainers
 Key: YARN-10467
 URL: https://issues.apache.org/jira/browse/YARN-10467
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.10.0
Reporter: Haibo Chen
Assignee: Haibo Chen


In one of our recent heap analyses, we found that the majority of the heap is 
occupied by {{RMNodeImpl.completedContainers}}, which accounts for 19 GB out 
of 24.3 GB. There are over 86 million ContainerIdPBImpl objects; in contrast, 
there are only 161,601 RMContainerImpl objects, which represent the number of 
active containers that the RM is still tracking. Inspecting some 
ContainerIdPBImpl objects shows that they belong to applications that have 
long finished. This indicates some sort of memory leak of ContainerIdPBImpl 
objects in RMNodeImpl.

 

Right now, when a container is reported by a NM as completed, it is immediately 
added to RMNodeImpl.completedContainers and later cleaned up after the AM has 
been notified of its completion in the AM-RM heartbeat. The cleanup can be 
broken into a few steps.
 * Step 1:  the completed container is first added to 
RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added to 
{{RMNodeImpl.completedContainers}}).
 * Step 2: During the AM-RM heartbeat, the container is removed from 
RMAppAttemptImpl.justFinishedContainers and added to 
RMAppAttemptImpl.finishedContainersSentToAM.

Once a completed container gets added to 
RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up 
from {{RMNodeImpl.completedContainers}}

 

However, if the AM exits (regardless of failure or success) before some 
recently completed containers can be added to 
RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there won’t 
be any future AM-RM heartbeat to perform the aforementioned step 2. Hence, these 
objects stay in RMNodeImpl.completedContainers forever.

We have observed in MR that AMs can decide to exit upon success of all their 
tasks without waiting for notification of the completion of every container, 
or an AM may just die suddenly (e.g. OOM). Spark and other frameworks may 
behave similarly.
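
The leak described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the three collections merely mirror the roles of RMNodeImpl.completedContainers, RMAppAttemptImpl.justFinishedContainers and RMAppAttemptImpl.finishedContainersSentToAM, and the "guaranteed cleanup" is simplified to happen immediately inside the heartbeat.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the bookkeeping described above; all names are hypothetical
// and this is NOT the actual ResourceManager code.
class CompletedContainerLeakSketch {
  // Plays the role of RMNodeImpl.completedContainers.
  final Set<String> nodeCompletedContainers = new HashSet<>();
  // Plays the role of RMAppAttemptImpl.justFinishedContainers.
  final List<String> justFinished = new ArrayList<>();
  // Plays the role of RMAppAttemptImpl.finishedContainersSentToAM.
  final List<String> sentToAM = new ArrayList<>();

  // The NM reports a completed container: tracked on the node side and,
  // as step 1, on the app attempt side.
  void containerCompleted(String containerId) {
    nodeCompletedContainers.add(containerId);
    justFinished.add(containerId);
  }

  // Step 2: an AM-RM heartbeat moves containers to sentToAM. Assumption:
  // cleanup from the node-side set is modeled as happening right away.
  void amHeartbeat() {
    sentToAM.addAll(justFinished);
    justFinished.clear();
    nodeCompletedContainers.removeAll(sentToAM);
  }

  public static void main(String[] args) {
    CompletedContainerLeakSketch rm = new CompletedContainerLeakSketch();
    rm.containerCompleted("container_1");
    rm.amHeartbeat();                      // container_1 is cleaned up
    rm.containerCompleted("container_2");
    // The AM exits here, so no further heartbeat ever runs step 2.
    System.out.println("Still tracked on node: " + rm.nodeCompletedContainers);
  }
}
```

In this model, anything completed after the last heartbeat stays in the node-side set, which matches the growth pattern reported in the heap analysis.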






[jira] [Commented] (YARN-10465) Support getClusterNodes, getNodeToLabels, getLabelsToNodes, getClusterNodeLabels API's for Federation

2020-10-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217073#comment-17217073
 ] 

Hadoop QA commented on YARN-10465:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 
20s{color} |  | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} |  | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch does not contain any @author tags. 
{color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} |  | {color:green} The patch appears to include 2 new or modified 
test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
 3s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
28s{color} |  | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 10s{color} |  | {color:green} branch has no errors when building and 
testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
27s{color} |  | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} |  | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  0m 
48s{color} |  | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
46s{color} |  | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
28s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
24s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} blanks {color} | {color:green}  0m  
0s{color} |  | {color:green} The patch has no blanks issues. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 14s{color} | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt|https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/240/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt]
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router: 
The patch generated 10 new + 0 unchanged - 0 fixed = 10 total (was 0) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} |  | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 15s{color} |  | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} |  | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} |  | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0u

[jira] [Commented] (YARN-8173) [Router] Implement missing FederationClientInterceptor#getApplications()

2020-10-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217059#comment-17217059
 ] 

Hadoop QA commented on YARN-8173:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} |  | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 11s{color} |  | {color:red} YARN-8173 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8173 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12927063/YARN-8173.007.patch |
| Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/241/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> [Router] Implement missing FederationClientInterceptor#getApplications()
> 
>
> Key: YARN-8173
> URL: https://issues.apache.org/jira/browse/YARN-8173
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Yiran Wu
>Assignee: Yiran Wu
>Priority: Major
> Attachments: YARN-8173.001.patch, YARN-8173.002.patch, 
> YARN-8173.003.patch, YARN-8173.004.patch, YARN-8173.005.patch, 
> YARN-8173.006.patch, YARN-8173.007.patch
>
>
> Implement the methods that Oozie depends on:
> {code:java}
> getApplications()
> getDelegationToken()
> {code}
>  






[jira] [Updated] (YARN-10466) Fix NullPointerException in yarn-services Component.java

2020-10-19 Thread D M Murali Krishna Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

D M Murali Krishna Reddy updated YARN-10466:

Attachment: YARN-10466.001.patch

> Fix NullPointerException in  yarn-services Component.java
> -
>
> Key: YARN-10466
> URL: https://issues.apache.org/jira/browse/YARN-10466
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Minor
> Attachments: YARN-10466.001.patch
>
>
> Due to changes in 
> [YARN-10219|https://issues.apache.org/jira/browse/YARN-10219], where the 
> constraint is initialised as null, there might be a few scenarios in which 
> an NPE can be thrown in the requestContainers method.
>  
>  






[jira] [Commented] (YARN-8173) [Router] Implement missing FederationClientInterceptor#getApplications()

2020-10-19 Thread D M Murali Krishna Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217056#comment-17217056
 ] 

D M Murali Krishna Reddy commented on YARN-8173:


[~yiran],  I would like to work on this, if you are not currently working on 
this task.

> [Router] Implement missing FederationClientInterceptor#getApplications()
> 
>
> Key: YARN-8173
> URL: https://issues.apache.org/jira/browse/YARN-8173
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Yiran Wu
>Assignee: Yiran Wu
>Priority: Major
> Attachments: YARN-8173.001.patch, YARN-8173.002.patch, 
> YARN-8173.003.patch, YARN-8173.004.patch, YARN-8173.005.patch, 
> YARN-8173.006.patch, YARN-8173.007.patch
>
>
> Implement the methods that Oozie depends on:
> {code:java}
> getApplications()
> getDelegationToken()
> {code}
>  






[jira] [Created] (YARN-10466) Fix NullPointerException in yarn-services Component.java

2020-10-19 Thread D M Murali Krishna Reddy (Jira)
D M Murali Krishna Reddy created YARN-10466:
---

 Summary: Fix NullPointerException in  yarn-services Component.java
 Key: YARN-10466
 URL: https://issues.apache.org/jira/browse/YARN-10466
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: D M Murali Krishna Reddy
Assignee: D M Murali Krishna Reddy


Due to changes in [YARN-10219|https://issues.apache.org/jira/browse/YARN-10219], 
where the constraint is initialised as null, there might be a few scenarios in 
which an NPE can be thrown in the requestContainers method.

 

 






[jira] [Updated] (YARN-10465) Support getClusterNodes, getNodeToLabels, getLabelsToNodes, getClusterNodeLabels API's for Federation

2020-10-19 Thread D M Murali Krishna Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

D M Murali Krishna Reddy updated YARN-10465:

Attachment: YARN-10465.001.patch

> Support getClusterNodes, getNodeToLabels, getLabelsToNodes, 
> getClusterNodeLabels API's for Federation
> -
>
> Key: YARN-10465
> URL: https://issues.apache.org/jira/browse/YARN-10465
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10465.001.patch
>
>







[jira] [Created] (YARN-10465) Support getClusterNodes, getNodeToLabels, getLabelsToNodes, getClusterNodeLabels API's for Federation

2020-10-19 Thread D M Murali Krishna Reddy (Jira)
D M Murali Krishna Reddy created YARN-10465:
---

 Summary: Support getClusterNodes, getNodeToLabels, 
getLabelsToNodes, getClusterNodeLabels API's for Federation
 Key: YARN-10465
 URL: https://issues.apache.org/jira/browse/YARN-10465
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: federation
Reporter: D M Murali Krishna Reddy
Assignee: D M Murali Krishna Reddy









[jira] [Commented] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-10-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216747#comment-17216747
 ] 

Jim Brennan commented on YARN-10450:


Thanks [~ebadger]!

> Add cpu and memory utilization per node and cluster-wide metrics
> 
>
> Key: YARN-10450
> URL: https://issues.apache.org/jira/browse/YARN-10450
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.3.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 3.4.0, 3.3.1, 3.1.5, 2.10.2, 3.2.3
>
> Attachments: NodesPage.png, YARN-10450-branch-2.10.003.patch, 
> YARN-10450-branch-3.1.003.patch, YARN-10450-branch-3.2.003.patch, 
> YARN-10450.001.patch, YARN-10450.002.patch, YARN-10450.003.patch
>
>
> Add metrics to show actual CPU and memory utilization for each node and 
> aggregated for the entire cluster.  This information is already passed 
> from NM to RM in the node status update.
> We have been running with this internally for quite a while and found it 
> useful to be able to quickly see the actual cpu/memory utilization on the 
> node/cluster.  It's especially useful if some form of overcommit is used.






[jira] [Commented] (YARN-10442) RM should make sure node label file highly available

2020-10-19 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1721#comment-1721
 ] 

Hadoop QA commented on YARN-10442:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
15s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  6m  
9s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 30m 
37s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 
27s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
42s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m  
4s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
22m 59s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
33s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
34s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
15s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
29s{color} | {color:blue}{color} | {color:blue} Maven dependency ordering for 
patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
24s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m  
8s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  9m  
8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
55s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
55s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 25s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2390/2/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt{color}
 | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 3 new + 
210 unchanged - 0 fixed = 213 total (was 210) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
31s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | 
{color:red}https://ci-hadoop.apache.org/job/

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216657#comment-17216657
 ] 

Szilard Nemeth commented on YARN-10460:
---

[~pbacsko] OK, thanks. Resolving this jira, then.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
>     threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     thread.setDaemon(true);
>     thread.start();
>     callable.awaitStarted();
>     Throwable throwable = getResult(task, thread);
>     if (throwable != null) {
>         throw throwable;
>     }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
>     ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     try {
>         thread.setDaemon(true);
>         thread.start();
>         callable.awaitStarted();
>         Throwable throwable = getResult(task, thread);
>         if (throwable != null) {
>             throw throwable;
>         }
>     } finally {
>         try {
>             thread.join(1);
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>         try {
>             threadGroup.destroy();  // <-- This
>         } catch (IllegalThreadStateException e) {
>             // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
>             // Swallow the exception to keep the same behavior prior to this change.
>         }
>     }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes a problem because the IPC 
> layer caches all sorts of objects. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop
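
The failure mode described in the quoted issue can be reproduced in isolation. The sketch below is only an illustration and not Hadoop or JUnit code: the class and method names are made up, and it assumes the mechanism suggested by the stack trace, namely that a thread factory created inside the time-limited test thread remembers that thread's group and is reused after the group has been destroyed. On the JDK 8 and JDK 11 builds mentioned in this thread, creating a thread in a destroyed group throws IllegalThreadStateException; on recent JDKs (19 and later, as far as I know) ThreadGroup.destroy() no longer does anything, so the method would simply return false there.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; none of these names come from Hadoop or JUnit.
class StaleThreadGroupSketch {

    // Wrapper so callers do not have to touch the deprecated API directly.
    @SuppressWarnings({"deprecation", "removal"})
    static boolean isDestroyed(ThreadGroup group) {
        return group.isDestroyed();
    }

    /**
     * Creates a thread factory inside a thread that belongs to the given group,
     * destroys the group once that thread has finished, and then asks the
     * "cached" factory for a new thread. Returns true if creation was rejected.
     */
    @SuppressWarnings({"deprecation", "removal"})
    static boolean cachedFactoryRejectsNewThreads(ThreadGroup group) {
        AtomicReference<ThreadFactory> cache = new AtomicReference<>();
        // The default factory remembers the group of the thread that created it.
        Thread testThread = new Thread(group,
                () -> cache.set(Executors.defaultThreadFactory()),
                "Time-limited test");
        testThread.start();
        try {
            testThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the test thread", e);
        }
        try {
            group.destroy(); // what JUnit 4.13 does in its finally block
        } catch (IllegalThreadStateException e) {
            // The group still had live threads; like JUnit, carry on.
        }
        try {
            cache.get().newThread(() -> { }); // reuse of the cached factory
            return false;
        } catch (IllegalThreadStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
        boolean rejected = cachedFactoryRejectsNewThreads(group);
        System.out.println("destroyed=" + isDestroyed(group) + ", rejected=" + rejected);
    }
}
```

The point of the sketch is that the failure surfaces in whichever later caller reuses the cached factory, not in the test that created it, which matches how odd the original test failure looked.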

[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216637#comment-17216637
 ] 

Peter Bacsko commented on YARN-10460:
-

I think it's OK to have it only on trunk for now. 


[jira] [Commented] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216635#comment-17216635
 ] 

Szilard Nemeth commented on YARN-10460:
---

Hi [~pbacsko],

Thanks for this fix. Committed to trunk.

Thanks [~aajisaka], [~ebadger] for the reviews.

Do you guys want to backport this to older branches?


[jira] [Updated] (YARN-10460) Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail

2020-10-19 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10460:
--
Fix Version/s: 3.4.0

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> -
>
> Key: YARN-10460
> URL: https://issues.apache.org/jira/browse/YARN-10460
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, test
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10460-001.patch, YARN-10460-002.patch, 
> YARN-10460-POC.patch
>
>
> In our downstream build environment, we're using JUnit 4.13. Recently, we 
> discovered a truly weird test failure in TestNodeStatusUpdater.
> The problem is that timeout handling has changed in JUnit 4.13. See the 
> difference between these two snippets:
> 4.12
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
>     threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     thread.setDaemon(true);
>     thread.start();
>     callable.awaitStarted();
>     Throwable throwable = getResult(task, thread);
>     if (throwable != null) {
>         throw throwable;
>     }
> }
> {noformat}
>  
>  4.13
> {noformat}
> @Override
> public void evaluate() throws Throwable {
>     CallableStatement callable = new CallableStatement();
>     FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
>     ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
>     Thread thread = new Thread(threadGroup, task, "Time-limited test");
>     try {
>         thread.setDaemon(true);
>         thread.start();
>         callable.awaitStarted();
>         Throwable throwable = getResult(task, thread);
>         if (throwable != null) {
>             throw throwable;
>         }
>     } finally {
>         try {
>             thread.join(1);
>         } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>         try {
>             threadGroup.destroy();  // <-- This
>         } catch (IllegalThreadStateException e) {
>             // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
>             // Swallow the exception to keep the same behavior prior to this change.
>         }
>     }
> }
> {noformat}
> The change comes from [https://github.com/junit-team/junit4/pull/1517].
> Unfortunately, destroying the thread group causes a problem because the IPC 
> layer caches all sorts of objects. The exception is:
> {noformat}
> java.lang.IllegalThreadStateException
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
>   at java.lang.Thread.init(Thread.java:402)
>   at java.lang.Thread.init(Thread.java:349)
>   at java.lang.Thread.<init>(Thread.java:675)
>   at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>   at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>   at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
>   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>   at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1458)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
>   at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>   at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
>   at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
> {noformat}
> Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the 
> client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine

[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-19 Thread Bilwa S T (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216580#comment-17216580
 ] 

Bilwa S T commented on YARN-10463:
--

Hi [~zhuqi]

Thanks for the patch. I have a few comments:
 1. In the code below, you need to call 
routerMetrics.incrAppAttemptsFailedRetrieved() instead of 
routerMetrics.incrAppsFailedRetrieved():

{code:java}
try {
  response = clientRMProxy.getApplicationAttemptReport(request);
} catch (Exception e) {
  routerMetrics.incrAppsFailedRetrieved();
  LOG.error("Unable to get the applicationAttempt report for "
  + request.getApplicationAttemptId() + "to SubCluster "
  + subClusterId.getId(), e);
  throw e;
}
{code}
 2. Add a null check for the application ID, i.e. 
request.getApplicationAttemptId().getApplicationId().

 3. In TestFederationClientInterceptor.java, change "Test 
FederationClientInterceptor: Get Application Report" to "Test 
FederationClientInterceptor: Get Application Attempt Report".

 4. In the test case, instead of Assert.fail() and catch, use 
LambdaTestUtils#intercept().
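
To illustrate the last comment, here is a rough sketch of the intercept style compared with the fail-and-catch style. So that it compiles on its own, it uses a simplified stand-in for the helper rather than Hadoop's LambdaTestUtils, and getReport() is a made-up method that plays the role of the call under test; both are assumptions for illustration only, and the real helper's behaviour may differ in detail.

```java
import java.util.concurrent.Callable;

// Hypothetical demo class; intercept() here is a simplified stand-in,
// not the implementation from Hadoop's LambdaTestUtils.
class InterceptSketch {

    // Made-up call under test: rejects a missing application id.
    static String getReport(String applicationId) throws Exception {
        if (applicationId == null) {
            throw new IllegalArgumentException("Missing ApplicationId");
        }
        return "report-for-" + applicationId;
    }

    /**
     * Runs the callable and returns the exception it throws, provided it has
     * the expected type and its message contains the given text. Anything
     * else, including a normal return, is reported as an AssertionError.
     */
    static <T, E extends Throwable> E intercept(
            Class<E> expected, String contained, Callable<T> eval) {
        T result;
        try {
            result = eval.call();
        } catch (Throwable t) {
            boolean messageMatches = contained == null
                    || String.valueOf(t.getMessage()).contains(contained);
            if (expected.isInstance(t) && messageMatches) {
                return expected.cast(t);
            }
            throw new AssertionError("Unexpected exception: " + t, t);
        }
        throw new AssertionError(
                "Expected " + expected.getName() + " but the call returned: " + result);
    }

    public static void main(String[] args) {
        // Old style would be: try { getReport(null); Assert.fail(); } catch (...) { ... }
        IllegalArgumentException e = intercept(
                IllegalArgumentException.class, "Missing ApplicationId",
                () -> getReport(null));
        System.out.println("Caught as expected: " + e.getMessage());
    }
}
```

The advantage over Assert.fail() inside a try block is that the expected type, the expected message and the call under test sit in one expression, and a call that unexpectedly succeeds cannot be mistaken for a pass.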

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org