[jira] [Resolved] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi resolved YARN-11662.
-----------------------------------
    Resolution: Duplicate

Duplicate of YARN-11538

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> --------------------------------------------------------------------
>
>                 Key: YARN-11662
>                 URL: https://issues.apache.org/jira/browse/YARN-11662
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.4.0
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule or an unsuccessful placement), the application is placed in the default queue instead of root.default. The parent queue won't be defined when there is no placement rule. This causes an inconsistency between the JMX endpoint (reporting that the app runs under root.default) and the RM Web API endpoint (reporting that the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue specified, the RM Web API endpoint will report the queue as the leaf queue name instead of the full queue path. However, the full queue path is the expected value, to be consistent with the JMX endpoint.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11669) cgroups v2 support for YARN
Ferenc Erdelyi created YARN-11669:
-------------------------------------

             Summary: cgroups v2 support for YARN
                 Key: YARN-11669
                 URL: https://issues.apache.org/jira/browse/YARN-11669
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: yarn
            Reporter: Ferenc Erdelyi

cgroup v2 is becoming the default on modern operating systems such as RHEL 9, so support for it has to be implemented in YARN.
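The practical trigger for this feature is that a node manager must first know which cgroup hierarchy the host mounts. A minimal illustrative sketch of that probe (not YARN's actual detection code; the function name and fallback values are assumptions) relies on the documented difference that a unified v2 hierarchy exposes a `cgroup.controllers` file at the mount root, while v1 exposes per-controller directories:

```python
import os

def detect_cgroup_version(mount_root="/sys/fs/cgroup"):
    """Classify the cgroup hierarchy mounted at mount_root.

    A unified (v2) hierarchy has a cgroup.controllers file at its root;
    a v1 hierarchy instead exposes per-controller subdirectories such
    as cpu/ and memory/.
    """
    if os.path.isfile(os.path.join(mount_root, "cgroup.controllers")):
        return "v2"
    if os.path.isdir(os.path.join(mount_root, "cpu")) or \
       os.path.isdir(os.path.join(mount_root, "memory")):
        return "v1"
    return "unknown"
```

On a RHEL 9 host this would report "v2" by default; the `mount_root` parameter exists mainly so the check can be exercised against a test directory.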
[jira] [Commented] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827133#comment-17827133 ]

Ferenc Erdelyi commented on YARN-11662:
---------------------------------------

Might be a duplicate of YARN-11538.
[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi updated YARN-11662:
----------------------------------
    Description: 
When a placement is not successful (because of the lack of a placement rule or an unsuccessful placement), the application is placed in the default queue instead of root.default. The parent queue won't be defined when there is no placement rule. This causes an inconsistency between the JMX endpoint (reporting that the app runs under root.default) and the RM Web API endpoint (reporting that the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue specified, the RM Web API endpoint will report the queue as the leaf queue name instead of the full queue path. However, the full queue path is the expected value, to be consistent with the JMX endpoint.

  was:
When a placement is not successful (because of the lack of a placement rule or an unsuccessful placement), the application is placed in the default queue instead of root.default. The parent queue won't be defined when there is no placement rule. This causes an inconsistency between the JMX endpoint (reporting that the app runs under root.default) and the RM Web API endpoint (reporting that the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue specified, the RM Web API endpoint will report the queue as the leaf queue name instead of the full queue path. However, the full queue path is the expected value, to be consistent with the JMX endpoint.

I propose using the scheduler's getQueueInfo in the RMAppManager to parse the queue name and get the full queue path for the placementQueueName, which fixes the above issue.
[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi updated YARN-11662:
----------------------------------
    Affects Version/s: 3.4.0
                           (was: 3.3.0)
[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi updated YARN-11662:
----------------------------------
    Affects Version/s: 3.3.0
[jira] [Assigned] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi reassigned YARN-11662:
-------------------------------------
    Assignee: Ferenc Erdelyi
[jira] [Created] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
Ferenc Erdelyi created YARN-11662:
-------------------------------------

             Summary: RM Web API endpoint queue reference differs from JMX endpoint for CS
                 Key: YARN-11662
                 URL: https://issues.apache.org/jira/browse/YARN-11662
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Ferenc Erdelyi

When a placement is not successful (because of the lack of a placement rule or an unsuccessful placement), the application is placed in the default queue instead of root.default. The parent queue won't be defined when there is no placement rule. This causes an inconsistency between the JMX endpoint (reporting that the app runs under root.default) and the RM Web API endpoint (reporting that the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue specified, the RM Web API endpoint will report the queue as the leaf queue name instead of the full queue path. However, the full queue path is the expected value, to be consistent with the JMX endpoint.

I propose using the scheduler's getQueueInfo in the RMAppManager to parse the queue name and get the full queue path for the placementQueueName, which fixes the above issue.
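The fix amounts to normalizing a short leaf-queue name into its full dotted path before reporting it, as the JMX endpoint already does. A minimal sketch of that normalization over a hypothetical set of known queue paths (the function name, data shape, and root.default fallback are illustrative, not the CapacityScheduler API):

```python
def full_queue_path(queue_paths, name):
    """Resolve a short queue name to its full dotted path.

    queue_paths is the set of full paths known to the scheduler,
    e.g. {"root", "root.default", "root.users.alice"}.  A name that is
    already a full path is returned unchanged; an unambiguous leaf name
    is expanded to its path; anything else falls back to root.default,
    mirroring the default-queue placement described in the issue.
    """
    if name in queue_paths:
        return name
    matches = [p for p in queue_paths if p.split(".")[-1] == name]
    if len(matches) == 1:
        return matches[0]
    return "root.default"
```

With this normalization, both an app submitted to "alice" and one submitted to "root.users.alice" would be reported under the same full path, removing the inconsistency between the two endpoints.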
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810358#comment-17810358 ]

Ferenc Erdelyi commented on YARN-11639:
---------------------------------------

[~bteke] a backport is required for both branch-3.3 and branch-3.2, and there is no conflict. Shall I open a separate backport Jira for each branch?

> ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-11639
>                 URL: https://issues.apache.org/jira/browse/YARN-11639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either a ConcurrentModificationException or an NPE in PriorityUtilizationQueueOrderingPolicy.
> Reproduced the NPE issue in Java8 and Java11 environments:
> {code:java}
> ... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removing queue: root.dyn.PmvkMgrEBQppu
> 2024-01-02 17:00:59,399 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-11,5,main] threw an Exception.
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
>     at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>     at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>     at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>     at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>     at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
>     at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
> {code}
> Observed the ConcurrentModificationException in a Java8 environment, but could not reproduce it yet:
> {code:java}
> 2023-10-27 02:50:37,584 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-15,5,main] threw an Exception.
> java.util.ConcurrentModificationException
>     at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at
[jira] [Assigned] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi reassigned YARN-11639:
-------------------------------------
    Assignee: Ferenc Erdelyi
[jira] [Created] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
Ferenc Erdelyi created YARN-11639:
-------------------------------------

             Summary: ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
                 Key: YARN-11639
                 URL: https://issues.apache.org/jira/browse/YARN-11639
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Ferenc Erdelyi

When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either a ConcurrentModificationException or an NPE in PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environments:
{code:java}
... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in a Java8 environment, but could not reproduce it yet:
{code:java}
2023-10-27 02:50:37,584 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-15,5,main] threw an Exception.
java.util.ConcurrentModificationException
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM. The workaround is to disable the deletion of dynamically created child queues.
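Both stack traces show the same race: the ordering policy streams over the live child-queue list while the deletion policy removes a dynamic queue from it. A common mitigation for this class of bug, sketched here in Python with illustrative names (this is not the actual PriorityUtilizationQueueOrderingPolicy fix), is to snapshot the collection under a lock before sorting, so a concurrent removal cannot invalidate the iteration:

```python
import threading

class QueueSet:
    """Toy stand-in for a parent queue's mutable list of child queues."""

    def __init__(self, queues):
        self._lock = threading.Lock()
        self._queues = list(queues)

    def remove(self, name):
        # Models the deletion policy removing a dynamically created queue.
        with self._lock:
            self._queues = [q for q in self._queues if q["name"] != name]

    def assignment_order(self):
        # Snapshot under the lock; sorting the copy is immune to
        # concurrent deletions happening on the live list.
        with self._lock:
            snapshot = list(self._queues)
        return sorted(snapshot, key=lambda q: q["usage"])
```

Sorting directly over the shared list (the pre-fix behavior) is what a fail-fast iterator rejects with ConcurrentModificationException in Java; the defensive copy trades a small allocation for a consistent view.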
[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi updated YARN-11639:
----------------------------------
    Description: 
When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either a ConcurrentModificationException or an NPE in PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environments:
{code:java}
... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in a Java8 environment, but could not reproduce it yet:
{code:java}
2023-10-27 02:50:37,584 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-15,5,main] threw an Exception.
java.util.ConcurrentModificationException
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM. The workaround is to disable the deletion of dynamically created child queues.

  was:
When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either ConcurrentModificationExceptionor NPE in PriorityUtilizationQueueOrderingPolicy. Reproduced the NPE issue in Java8 and Java11 environment: {code:java} ... INFO
[jira] [Commented] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773709#comment-17773709 ] Ferenc Erdelyi commented on YARN-11590: --- Thanks to [~bkosztolnik] for identifying the cause of the issue and suggesting a solution. > RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, > as netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed after calling confStore.format() to avoid the netty thread > waiting indefinitely, which renders the RM unresponsive after deleting the > confstore when started with the "-format-conf-store" arg. > The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11590: -- Description: YARN-11468 enabled Zookeeper SSL/TLS support for YARN. Curator uses ClientCnxnSocketNetty for secured connection and the thread needs to be closed after calling confStore.format() to avoid the netty thread waiting indefinitely, which renders the RM unresponsive after deleting the confstore when started with the "-format-conf-store" arg. The unclosed thread, which keeps RM running: {code:java} 2023-10-10 12:13:01,000 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING is stands at [sun.misc.Unsafe.park(Native Method), java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] {code} was: YARN-11468 enabled Zookeeper SSL/TLS support for YARN. Curator uses ClientCnxnSocketNetty for secured connection and the thread needs to be closed with confStore.close() after calling confStore.format() to avoid the netty thread to wait indefinitely, which renders the RM unresponsive after deleting the confstore when started with the "-format-conf-store" arg. 
The unclosed thread, which keeps RM running: {code:java} 2023-10-10 12:13:01,000 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING is stands at [sun.misc.Unsafe.park(Native Method), java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] {code} > RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, > as netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed after calling confStore.format() to avoid the netty thread > waiting indefinitely, which renders the RM unresponsive after deleting the > confstore when started with the "-format-conf-store" arg. 
> The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
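The fix described above can be sketched as follows. The ConfStore interface here is hypothetical shorthand, not the actual YARN configuration-store API; only format() and close() are taken from the issue text. The point is that close() must run after format() so the ZK client's netty SendThread terminates and the RM JVM can exit.

```java
// Hypothetical stand-in for the ZK-backed configuration store; only the
// format()/close() pair is taken from the issue description.
interface ConfStore {
    void format();

    // Shuts down the client connection, including the ClientCnxnSocketNetty
    // send thread that otherwise stays TIMED_WAITING and keeps the JVM alive.
    void close();
}

class ConfStoreFormatter {
    static void formatAndClose(ConfStore confStore) {
        try {
            confStore.format();
        } finally {
            // Without this close(), the netty SendThread shown in the log
            // above keeps the ResourceManager process from exiting after
            // -format-conf-store.
            confStore.close();
        }
    }
}
```

Putting the close() in a finally block also covers the case where format() fails partway through, so the connection is released either way.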
[jira] [Updated] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11590: -- Summary: RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely (was: RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely) > RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, > as netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed with confStore.close() after calling confStore.format() to > avoid the netty thread to wait indefinitely, which renders the RM > unresponsive after deleting the confstore when started with the > "-format-conf-store" arg. 
> The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11590) RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
[ https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi reassigned YARN-11590: - Assignee: Ferenc Erdelyi > RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as > netty thread waits indefinitely > - > > Key: YARN-11590 > URL: https://issues.apache.org/jira/browse/YARN-11590 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > > YARN-11468 enabled Zookeeper SSL/TLS support for YARN. > Curator uses ClientCnxnSocketNetty for secured connection and the thread > needs to be closed with confStore.close() after calling confStore.format() to > avoid the netty thread to wait indefinitely, which renders the RM > unresponsive after deleting the confstore when started with the > "-format-conf-store" arg. > The unclosed thread, which keeps RM running: > {code:java} > 2023-10-10 12:13:01,000 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The > Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING > is stands at [sun.misc.Unsafe.park(Native Method), > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), > > java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), > java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), > org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11590) RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely
Ferenc Erdelyi created YARN-11590: - Summary: RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely Key: YARN-11590 URL: https://issues.apache.org/jira/browse/YARN-11590 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Ferenc Erdelyi YARN-11468 enabled Zookeeper SSL/TLS support for YARN. Curator uses ClientCnxnSocketNetty for secured connection and the thread needs to be closed with confStore.close() after calling confStore.format() to avoid the netty thread to wait indefinitely, which renders the RM unresponsive after deleting the confstore when started with the "-format-conf-store" arg. The unclosed thread, which keeps RM running: {code:java} 2023-10-10 12:13:01,000 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING is stands at [sun.misc.Unsafe.park(Native Method), java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078), java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522), java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275), org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6974) Make CuratorBasedElectorService the default
[ https://issues.apache.org/jira/browse/YARN-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi reassigned YARN-6974: Assignee: Ferenc Erdelyi > Make CuratorBasedElectorService the default > --- > > Key: YARN-6974 > URL: https://issues.apache.org/jira/browse/YARN-6974 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.0.0-beta1 >Reporter: Robert Kanter >Assignee: Ferenc Erdelyi >Priority: Critical > > YARN-4438 (and cleanup in YARN-5709) added the > {{CuratorBasedElectorService}}, which does leader election via Curator. The > intention was to leave it off by default to allow time for it to bake, and > eventually make it the default and remove the > {{ActiveStandbyElectorBasedElectorService}}. > We should do that. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11468: -- Description: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication should be provided in the yarn-default.xml and the required parameters for the keystore and truststore should be picked up from the core-default.xml (HADOOP-18709) yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via yarn-site.xml to make sure Curator is used, otherwise we can't enable SSL. was: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication should be provided in the yarn-default.xml and the required parameters for the keystore and truststore should be picked up from the core-default.xml (HADOOP-18709) > Zookeeper SSL/TLS support > - > > Key: YARN-11468 > URL: https://issues.apache.org/jira/browse/YARN-11468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Critical > > Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its > clients. 
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] > The SSL communication should be possible in the different parts of YARN, > where it communicates with Zookeeper servers. The Zookeeper clients are used > in the following places: > * ResourceManager > * ZKConfigurationStore > * ZKRMStateStore > The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL > communication should be provided in the yarn-default.xml and the required > parameters for the keystore and truststore should be picked up from the > core-default.xml (HADOOP-18709) > yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via > yarn-site.xml to make sure Curator is used, otherwise we can't enable SSL. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
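Putting the two properties from the description together, the client-side enablement would look like the yarn-site.xml fragment below. This is a sketch based only on the property names given in this issue; the keystore and truststore settings are expected to come from core-site.xml per HADOOP-18709 and are not shown here.

```xml
<configuration>
  <!-- Use the Curator-based leader elector; per this issue, required
       before ZK client SSL can be enabled. -->
  <property>
    <name>yarn.resourcemanager.ha.curator-leader-elector.enabled</name>
    <value>true</value>
  </property>

  <!-- Enable SSL/TLS on the RM's ZooKeeper client connections
       (ZKConfigurationStore, ZKRMStateStore, leader election). -->
  <property>
    <name>yarn.resourcemanager.zk-client-ssl.enabled</name>
    <value>true</value>
  </property>
</configuration>
```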
[jira] [Assigned] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems
[ https://issues.apache.org/jira/browse/YARN-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi reassigned YARN-11499: - Assignee: Tamas Domok > Clear the queuemetrics object on queue deletion from the metricssystems > --- > > Key: YARN-11499 > URL: https://issues.apache.org/jira/browse/YARN-11499 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ferenc Erdelyi >Assignee: Tamas Domok >Priority: Major > > *Placeholder for:* > https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems
[ https://issues.apache.org/jira/browse/YARN-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11499: -- Description: *Placeholder for:* https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370 > Clear the queuemetrics object on queue deletion from the metricssystems > --- > > Key: YARN-11499 > URL: https://issues.apache.org/jira/browse/YARN-11499 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ferenc Erdelyi >Priority: Major > > *Placeholder for:* > https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems
Ferenc Erdelyi created YARN-11499: - Summary: Clear the queuemetrics object on queue deletion from the metricssystems Key: YARN-11499 URL: https://issues.apache.org/jira/browse/YARN-11499 Project: Hadoop YARN Issue Type: Improvement Reporter: Ferenc Erdelyi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11468: -- Description: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication should be provided in the yarn-default.xml and the required parameters for the keystore and truststore should be picked up from the core-default.xml (HADOOP-18709) was: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should be provided in the yarn-default.xml and the required parameters for the keystore and truststore should be picked up from the core-default.xml (HADOOP-18709) > Zookeeper SSL/TLS support > - > > Key: YARN-11468 > URL: https://issues.apache.org/jira/browse/YARN-11468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Critical > > Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its > clients. 
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] > The SSL communication should be possible in the different parts of YARN, > where it communicates with Zookeeper servers. The Zookeeper clients are used > in the following places: > * ResourceManager > * ZKConfigurationStore > * ZKRMStateStore > The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL > communication should be provided in the yarn-default.xml and the required > parameters for the keystore and truststore should be picked up from the > core-default.xml (HADOOP-18709) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi updated YARN-11468: -- Description: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should be provided in the yarn-default.xml and the required parameters for the keystore and truststore should be picked up from the core-default.xml (HADOOP-18709) was: Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The flag to enable SSL communication and the required parameters should be provided by different configuration parameters, corresponding to the different use cases. > Zookeeper SSL/TLS support > - > > Key: YARN-11468 > URL: https://issues.apache.org/jira/browse/YARN-11468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Critical > > Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its > clients. > [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] > The SSL communication should be possible in the different parts of YARN, > where it communicates with Zookeeper servers. 
The Zookeeper clients are used > in the following places: > * ResourceManager > * ZKConfigurationStore > * ZKRMStateStore > The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should > be provided in the yarn-default.xml and the required parameters for the > keystore and truststore should be picked up from the core-default.xml > (HADOOP-18709) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi reassigned YARN-11468: - Assignee: Ferenc Erdelyi > Zookeeper SSL/TLS support > - > > Key: YARN-11468 > URL: https://issues.apache.org/jira/browse/YARN-11468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Critical > > Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its > clients. > [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] > The SSL communication should be possible in the different parts of YARN, > where it communicates with Zookeeper servers. The Zookeeper clients are used > in the following places: > * ResourceManager > * ZKConfigurationStore > * ZKRMStateStore > The flag to enable SSL communication and the required parameters should be > provided by different configuration parameters, corresponding to the > different use cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11468) Zookeeper SSL/TLS support
Ferenc Erdelyi created YARN-11468: - Summary: Zookeeper SSL/TLS support Key: YARN-11468 URL: https://issues.apache.org/jira/browse/YARN-11468 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Ferenc Erdelyi Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its clients. [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide] The SSL communication should be possible in the different parts of YARN, where it communicates with Zookeeper servers. The Zookeeper clients are used in the following places: * ResourceManager * ZKConfigurationStore * ZKRMStateStore The flag to enable SSL communication and the required parameters should be provided by different configuration parameters, corresponding to the different use cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10379) Refactor ContainerExecutor exit code Exception handling
[ https://issues.apache.org/jira/browse/YARN-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi reassigned YARN-10379: - Assignee: Ferenc Erdelyi (was: Benjamin Teke) > Refactor ContainerExecutor exit code Exception handling > --- > > Key: YARN-10379 > URL: https://issues.apache.org/jira/browse/YARN-10379 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Benjamin Teke >Assignee: Ferenc Erdelyi >Priority: Minor > > Currently, every time a shell command is executed and returns with a > non-zero exitcode, an exception gets thrown. But along the call tree this > exception gets caught and, after some info/warn logging and other processing > steps, rethrown, possibly packaged into another exception. For example: > * from PrivilegedOperationExecutor.executePrivilegedOperation - > ExitCodeException catch (as IOException), PrivilegedOperationException thrown > * then in LinuxContainerExecutor.startLocalizer - > PrivilegedOperationException catch, exitCode collection, logging, IOException > rethrown > * then in ResourceLocalizationService.run - generic Exception catch, but > there is a TODO for separate ExitCodeException handling, however that > information is only present here in an error message string > This flow could be simplified and unified in the different executors. For > example, use one specific exception till the last possible step, catch it only > where it is necessary, and keep the exitcode as it could be used later in the > process. This change could help with maintainability and readability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
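The refactor proposed above could be sketched as a single exception type that carries the exit code across the call tree until the last catch site. The class and method names below are hypothetical illustrations, not the existing YARN classes.

```java
// Hypothetical sketch of the refactor idea: one exception type keeps the
// exit code from the failed shell command instead of being repackaged
// (ExitCodeException -> IOException -> ...) at every layer.
class ContainerExitException extends Exception {
    private final int exitCode;

    ContainerExitException(String message, int exitCode) {
        super(message);
        this.exitCode = exitCode;
    }

    int getExitCode() {
        return exitCode;
    }
}

class Launcher {
    // Intermediate layers would simply declare `throws ContainerExitException`
    // and let it propagate; only the final handler catches it, where the
    // original exit code is still available as a field rather than a string.
    static int runAndReport(int shellExitCode) {
        try {
            if (shellExitCode != 0) {
                throw new ContainerExitException("command failed", shellExitCode);
            }
            return 0;
        } catch (ContainerExitException e) {
            return e.getExitCode();
        }
    }
}
```

Compared with the current flow, the exit code survives as typed data instead of surviving only inside an error message string, which is the gap the ResourceLocalizationService TODO points at.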