[jira] [Resolved] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-04-23 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi resolved YARN-11662.
---
Resolution: Duplicate

Duplicate of YARN-11538

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11669) cgroups v2 support for YARN

2024-03-28 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11669:
-

 Summary: cgroups v2 support for YARN
 Key: YARN-11669
 URL: https://issues.apache.org/jira/browse/YARN-11669
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: yarn
Reporter: Ferenc Erdelyi


cgroups v2 is becoming the default on operating systems such as RHEL 9.
Support for it has to be implemented in YARN.
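As a rough, stand-alone illustration (an assumed detection approach, not existing YARN code), the unified cgroup v2 hierarchy can be recognised by probing for the cgroup.controllers file at the cgroup mount root:
{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal probe: cgroup.controllers only exists at the root of a cgroup v2
// (unified hierarchy) mount, so its presence distinguishes v2 from v1.
public class CgroupVersionProbe {
  public static void main(String[] args) {
    boolean v2 = Files.exists(Paths.get("/sys/fs/cgroup/cgroup.controllers"));
    System.out.println(v2 ? "cgroup v2 (unified hierarchy) detected"
                          : "cgroup v1 (legacy hierarchy) assumed");
  }
}
{code}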



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-14 Thread Ferenc Erdelyi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827133#comment-17827133
 ] 

Ferenc Erdelyi commented on YARN-11662:
---

Might be a duplicate of YARN-11538

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-14 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11662:
--
Description: 
When a placement is not successful (because of the lack of a placement rule or 
an unsuccessful placement), the application is placed in the default queue 
instead of the root.default. The parent queue won't be defined when there is no 
placement rule. This causes an inconsistency between the JMX endpoint 
(reporting the app. runs under the root.default) and the RM Web API endpoint 
(reporting the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue 
specified, the RM Web API endpoint will report the queue as the leaf queue name 
instead of the full queue path. However, the full queue path is the expected 
value to be consistent with the JMX endpoint.

  was:
When a placement is not successful (because of the lack of a placement rule or 
an unsuccessful placement), the application is placed in the default queue 
instead of the root.default. The parent queue won't be defined when there is no 
placement rule. This causes an inconsistency between the JMX endpoint 
(reporting the app. runs under the root.default) and the RM Web API endpoint 
(reporting the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue 
specified, the RM Web API endpoint will report the queue as the leaf queue name 
instead of the full queue path. However, the full queue path is the expected 
value to be consistent with the JMX endpoint.

I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
queue name and get the full queue path for the placementQueueName, which fixes 
the above issue.


> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-13 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11662:
--
Affects Version/s: 3.4.0
   (was: 3.3.0)

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.
> I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
> queue name and get the full queue path for the placementQueueName, which 
> fixes the above issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-13 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11662:
--
Affects Version/s: 3.3.0

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.3.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.
> I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
> queue name and get the full queue path for the placementQueueName, which 
> fixes the above issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-13 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11662:
-

Assignee: Ferenc Erdelyi

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app. runs under the root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.
> I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
> queue name and get the full queue path for the placementQueueName, which 
> fixes the above issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-13 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11662:
-

 Summary: RM Web API endpoint queue reference differs from JMX 
endpoint for CS
 Key: YARN-11662
 URL: https://issues.apache.org/jira/browse/YARN-11662
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Ferenc Erdelyi


When a placement is not successful (because of the lack of a placement rule or 
an unsuccessful placement), the application is placed in the default queue 
instead of the root.default. The parent queue won't be defined when there is no 
placement rule. This causes an inconsistency between the JMX endpoint 
(reporting the app. runs under the root.default) and the RM Web API endpoint 
(reporting the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue 
specified, the RM Web API endpoint will report the queue as the leaf queue name 
instead of the full queue path. However, the full queue path is the expected 
value to be consistent with the JMX endpoint.

I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
queue name and get the full queue path for the placementQueueName, which fixes 
the above issue.
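As a self-contained sketch of the intended normalisation (hypothetical helper and names, not the actual RMAppManager change), resolving a short queue name against the known full queue paths would make the Web API report root.default / root.a.b just like JMX; the real fix would obtain the path from the scheduler's getQueueInfo:
{code:java}
import java.util.List;
import java.util.Optional;

// Hypothetical, stand-alone illustration of normalising placementQueueName to
// a full queue path.
public class QueuePathNormalizer {

  static String toFullPath(String placementQueueName, List<String> knownQueuePaths) {
    if (placementQueueName.contains(".")) {
      return placementQueueName; // already a full path such as "root.a.b"
    }
    Optional<String> match = knownQueuePaths.stream()
        .filter(p -> p.endsWith("." + placementQueueName))
        .findFirst();
    // Fall back to the short name if the queue is unknown to the scheduler.
    return match.orElse(placementQueueName);
  }

  public static void main(String[] args) {
    List<String> paths = List.of("root.default", "root.users.alice");
    System.out.println(toFullPath("default", paths));      // root.default
    System.out.println(toFullPath("alice", paths));        // root.users.alice
    System.out.println(toFullPath("root.default", paths)); // root.default
  }
}
{code}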



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-24 Thread Ferenc Erdelyi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810358#comment-17810358
 ] 

Ferenc Erdelyi commented on YARN-11639:
---

[~bteke] the backport is required for both the branch-3.3 and branch-3.2 branches and 
there is no conflict. Shall I open a separate backport Jira for each branch?

> ConcurrentModificationException and NPE in 
> PriorityUtilizationQueueOrderingPolicy
> -
>
> Key: YARN-11639
> URL: https://issues.apache.org/jira/browse/YARN-11639
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>  Labels: pull-request-available
>
> When dynamic queue creation is enabled in weight mode and the deletion policy 
> coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
> resources because of either ConcurrentModificationException or NPE in 
> PriorityUtilizationQueueOrderingPolicy.
> Reproduced the NPE issue in Java8 and Java11 environment:
> {code:java}
> ... INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removing queue: root.dyn.PmvkMgrEBQppu
> 2024-01-02 17:00:59,399 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-11,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
>   at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>   at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>   at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>   at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
>   at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
> {code}
> Observed the ConcurrentModificationException in Java8 environment, but could 
> not reproduce yet:
> {code:java}
> 2023-10-27 02:50:37,584 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread 
> Thread[Thread-15,5, main] threw an Exception.
> java.util.ConcurrentModificationException
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 

[jira] [Assigned] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-09 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11639:
-

Assignee: Ferenc Erdelyi

> ConcurrentModificationException and NPE in 
> PriorityUtilizationQueueOrderingPolicy
> -
>
> Key: YARN-11639
> URL: https://issues.apache.org/jira/browse/YARN-11639
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When dynamic queue creation is enabled in weight mode and the deletion policy 
> coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
> resources because of either ConcurrentModificationException or NPE in 
> PriorityUtilizationQueueOrderingPolicy.
> Reproduced the NPE issue in Java8 and Java11 environment:
> {code:java}
> ... INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Removing queue: root.dyn.PmvkMgrEBQppu
> 2024-01-02 17:00:59,399 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[Thread-11,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
>   at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>   at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
>   at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>   at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
>   at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
> {code}
> Observed the ConcurrentModificationException in Java8 environment, but could 
> not reproduce yet:
> {code:java}
> 2023-10-27 02:50:37,584 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread 
> Thread[Thread-15,5, main] threw an Exception.
> java.util.ConcurrentModificationException
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> 

[jira] [Created] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-03 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11639:
-

 Summary: ConcurrentModificationException and NPE in 
PriorityUtilizationQueueOrderingPolicy
 Key: YARN-11639
 URL: https://issues.apache.org/jira/browse/YARN-11639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Ferenc Erdelyi


When dynamic queue creation is enabled in weight mode and the deletion policy 
coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
resources because of either ConcurrentModificationException or NPE in 
PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environment:
{code:java}
... INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in Java8 environment, but could 
not reproduce yet:
{code:java}
2023-10-27 02:50:37,584 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, 
main] threw an Exception.
java.util.ConcurrentModificationException
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM.
The workaround is to disable the deletion of dynamically created child queues. 
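For illustration only (a generic sketch of the race, not the scheduler code): collecting a stream over a plain list while another thread removes an element can throw ConcurrentModificationException, and snapshotting the child queues before sorting avoids it. Whether the exception fires on a given run depends on timing:
{code:java}
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;

// Stand-alone demo of the failure mode: one thread removes a "queue" while
// another streams/sorts the same list, as getAssignmentIterator does.
public class QueueSortRaceDemo {
  public static void main(String[] args) throws InterruptedException {
    List<String> childQueues =
        new ArrayList<>(List.of("root.dyn.a", "root.dyn.b", "root.dyn.c"));

    Thread deletionPolicy = new Thread(() -> childQueues.remove("root.dyn.b"));
    deletionPolicy.start();

    try {
      List<String> sorted = childQueues.stream().sorted().collect(Collectors.toList());
      System.out.println("sorted without hitting the race: " + sorted);
    } catch (ConcurrentModificationException e) {
      System.out.println("hit the race, as seen in the RM: " + e);
    }
    deletionPolicy.join();

    // Mitigation sketch: sort an immutable snapshot instead of the live list.
    List<String> snapshot = List.copyOf(childQueues);
    System.out.println("snapshot sort: "
        + snapshot.stream().sorted().collect(Collectors.toList()));
  }
}
{code}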






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: 

[jira] [Updated] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-03 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11639:
--
Description: 
When dynamic queue creation is enabled in weight mode and the deletion policy 
coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
resources because of either ConcurrentModificationException or NPE in 
PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environment:
{code:java}
... INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in Java8 environment, but could 
not reproduce yet:
{code:java}
2023-10-27 02:50:37,584 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, 
main] threw an Exception.
java.util.ConcurrentModificationException
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM.
The workaround is to disable the deletion of dynamically created child queues. 




  was:
When dynamic queue creation is enabled in weight mode and the deletion policy 
coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
resources because of either ConcurrentModificationExceptionor NPE in 
PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environment:
{code:java}
... INFO 

[jira] [Commented] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely

2023-10-10 Thread Ferenc Erdelyi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773709#comment-17773709
 ] 

Ferenc Erdelyi commented on YARN-11590:
---

Thanks to [~bkosztolnik] for identifying the cause of the issue and suggesting 
a solution.

> RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, 
>  as netty thread waits indefinitely
> -
>
> Key: YARN-11590
> URL: https://issues.apache.org/jira/browse/YARN-11590
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
> Curator uses ClientCnxnSocketNetty for secured connection and the thread 
> needs to be closed after calling confStore.format() to avoid the netty thread 
> waiting indefinitely, which renders the RM unresponsive after deleting the 
> confstore when started with the "-format-conf-store" arg.
> The unclosed thread, which keeps RM running:
> {code:java}
> 2023-10-10 12:13:01,000 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
> Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
>  is stands at [sun.misc.Unsafe.park(Native Method), 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
>  
> java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
>  java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
> org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
>  org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely

2023-10-10 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11590:
--
Description: 
YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
Curator uses ClientCnxnSocketNetty for secured connection and the thread needs 
to be closed after calling confStore.format() to avoid the netty thread waiting 
indefinitely, which renders the RM unresponsive after deleting the confstore 
when started with the "-format-conf-store" arg.

The unclosed thread, which keeps RM running:
{code:java}
2023-10-10 12:13:01,000 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
 is stands at [sun.misc.Unsafe.park(Native Method), 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
 
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
 java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
 org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
{code}


  was:
YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
Curator uses ClientCnxnSocketNetty for secured connection and the thread needs 
to be closed with confStore.close() after calling confStore.format() to avoid 
the netty thread to wait indefinitely, which renders the RM unresponsive after 
deleting the confstore when started with the "-format-conf-store" arg.

The unclosed thread, which keeps RM running:
{code:java}
2023-10-10 12:13:01,000 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
 is stands at [sun.misc.Unsafe.park(Native Method), 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
 
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
 java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
 org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
{code}



> RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, 
>  as netty thread waits indefinitely
> -
>
> Key: YARN-11590
> URL: https://issues.apache.org/jira/browse/YARN-11590
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
> Curator uses ClientCnxnSocketNetty for secured connection and the thread 
> needs to be closed after calling confStore.format() to avoid the netty thread 
> waiting indefinitely, which renders the RM unresponsive after deleting the 
> confstore when started with the "-format-conf-store" arg.
> The unclosed thread, which keeps RM running:
> {code:java}
> 2023-10-10 12:13:01,000 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
> Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
>  is stands at [sun.misc.Unsafe.park(Native Method), 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
>  
> java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
>  java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
> org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
>  org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11590) RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely

2023-10-10 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11590:
--
Summary: RM process stuck after calling confStore.format() when ZK SSL/TLS 
is enabled,  as netty thread waits indefinitely  (was: RM process stuck after 
confStore.format() when ZK SSL/TLS is enabled,  as netty thread waits 
indefinitely)

> RM process stuck after calling confStore.format() when ZK SSL/TLS is enabled, 
>  as netty thread waits indefinitely
> -
>
> Key: YARN-11590
> URL: https://issues.apache.org/jira/browse/YARN-11590
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
> Curator uses ClientCnxnSocketNetty for secured connection and the thread 
> needs to be closed with confStore.close() after calling confStore.format() to 
> avoid the netty thread to wait indefinitely, which renders the RM 
> unresponsive after deleting the confstore when started with the 
> "-format-conf-store" arg.
> The unclosed thread, which keeps RM running:
> {code:java}
> 2023-10-10 12:13:01,000 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
> Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
>  is stands at [sun.misc.Unsafe.park(Native Method), 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
>  
> java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
>  java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
> org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
>  org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11590) RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely

2023-10-10 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11590:
-

Assignee: Ferenc Erdelyi

> RM process stuck after confStore.format() when ZK SSL/TLS is enabled,  as 
> netty thread waits indefinitely
> -
>
> Key: YARN-11590
> URL: https://issues.apache.org/jira/browse/YARN-11590
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
> Curator uses ClientCnxnSocketNetty for secured connection and the thread 
> needs to be closed with confStore.close() after calling confStore.format() to 
> avoid the netty thread to wait indefinitely, which renders the RM 
> unresponsive after deleting the confstore when started with the 
> "-format-conf-store" arg.
> The unclosed thread, which keeps RM running:
> {code:java}
> 2023-10-10 12:13:01,000 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
> Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
>  is stands at [sun.misc.Unsafe.park(Native Method), 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
>  
> java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
>  java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
> org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
>  org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11590) RM process stuck after confStore.format() when ZK SSL/TLS is enabled, as netty thread waits indefinitely

2023-10-10 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11590:
-

 Summary: RM process stuck after confStore.format() when ZK SSL/TLS 
is enabled,  as netty thread waits indefinitely
 Key: YARN-11590
 URL: https://issues.apache.org/jira/browse/YARN-11590
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Ferenc Erdelyi


YARN-11468 enabled Zookeeper SSL/TLS support for YARN.
Curator uses ClientCnxnSocketNetty for secured connection and the thread needs 
to be closed with confStore.close() after calling confStore.format() to avoid 
the netty thread to wait indefinitely, which renders the RM unresponsive after 
deleting the confstore when started with the "-format-conf-store" arg.

The unclosed thread, which keeps RM running:
{code:java}
2023-10-10 12:13:01,000 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: The 
Thread[main-SendThread(ferdelyi-1.ferdelyi.root.hwx.site:2182),5,main]TIMED_WAITING
 is stands at [sun.misc.Unsafe.park(Native Method), 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215), 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078),
 
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522),
 java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684), 
org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:275),
 org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)]
{code}
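A minimal, self-contained sketch of the fix's shape (hypothetical ConfStore interface used purely for illustration; only format() and close() are taken from the issue text): close the store after formatting it so the ZooKeeper/netty send thread cannot keep the JVM alive.
{code:java}
// Stand-alone sketch: the store that was formatted must also be closed,
// otherwise the netty-backed ZK client thread outlives the format call.
public class FormatConfStoreSketch {

  interface ConfStore extends AutoCloseable {
    void format() throws Exception;
  }

  static void formatAndClose(ConfStore confStore) throws Exception {
    try {
      confStore.format();   // wipe the stored scheduler configuration
    } finally {
      confStore.close();    // releases the Curator/ZooKeeper client threads
    }
  }

  public static void main(String[] args) throws Exception {
    formatAndClose(new ConfStore() {
      @Override public void format() { System.out.println("format()"); }
      @Override public void close()  { System.out.println("close()"); }
    });
  }
}
{code}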




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6974) Make CuratorBasedElectorService the default

2023-09-25 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-6974:


Assignee: Ferenc Erdelyi

> Make CuratorBasedElectorService the default
> ---
>
> Key: YARN-6974
> URL: https://issues.apache.org/jira/browse/YARN-6974
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-beta1
>Reporter: Robert Kanter
>Assignee: Ferenc Erdelyi
>Priority: Critical
>
> YARN-4438 (and cleanup in YARN-5709) added the 
> {{CuratorBasedElectorService}}, which does leader election via Curator.  The 
> intention was to leave it off by default to allow time for it to bake, and 
> eventually make it the default and remove the 
> {{ActiveStandbyElectorBasedElectorService}}.  
> We should do that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support

2023-09-21 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11468:
--
Description: 
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication 
should be provided in the yarn-default.xml and the required parameters for the 
keystore and truststore should be picked up from the core-default.xml 
(HADOOP-18709)

yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via 
yarn-site.xml to make sure Curator is used; otherwise SSL cannot be enabled.

  was:
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication 
should be provided in the yarn-default.xml and the required parameters for the 
keystore and truststore should be picked up from the core-default.xml 
(HADOOP-18709)


> Zookeeper SSL/TLS support
> -
>
> Key: YARN-11468
> URL: https://issues.apache.org/jira/browse/YARN-11468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Critical
>
> Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
> clients.
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> The SSL communication should be possible in the different parts of YARN, 
> where it communicates with Zookeeper servers. The Zookeeper clients are used 
> in the following places:
>  * ResourceManager
>  * ZKConfigurationStore
>  * ZKRMStateStore
> The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL 
> communication should be provided in the yarn-default.xml and the required 
> parameters for the keystore and truststore should be picked up from the 
> core-default.xml (HADOOP-18709)
> yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via 
> yarn-site.xml to make sure Curator is used; otherwise SSL cannot be enabled.
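A minimal sketch of the two flags named above, set programmatically for brevity (in practice they would go into yarn-site.xml; the keystore and truststore settings come from core-site.xml/core-default.xml as per HADOOP-18709 and are omitted here):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: both property names below are quoted from the issue description.
public class ZkSslYarnConfSketch {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Curator-based leader election is a prerequisite for ZK SSL in the RM.
    conf.setBoolean("yarn.resourcemanager.ha.curator-leader-elector.enabled", true);
    // Enable SSL/TLS for the RM's ZooKeeper clients (state store, conf store).
    conf.setBoolean("yarn.resourcemanager.zk-client-ssl.enabled", true);
    return conf;
  }
}
{code}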



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems

2023-05-23 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11499:
-

Assignee: Tamas Domok

> Clear the queuemetrics object on queue deletion from the metricssystems
> ---
>
> Key: YARN-11499
> URL: https://issues.apache.org/jira/browse/YARN-11499
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ferenc Erdelyi
>Assignee: Tamas Domok
>Priority: Major
>
> *Placeholder for:*
> https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems

2023-05-23 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11499:
--
Description: 
*Placeholder for:*

https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370

 

> Clear the queuemetrics object on queue deletion from the metricssystems
> ---
>
> Key: YARN-11499
> URL: https://issues.apache.org/jira/browse/YARN-11499
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ferenc Erdelyi
>Priority: Major
>
> *Placeholder for:*
> https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11499) Clear the queuemetrics object on queue deletion from the metricssystems

2023-05-23 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11499:
-

 Summary: Clear the queuemetrics object on queue deletion from the 
metricssystems
 Key: YARN-11499
 URL: https://issues.apache.org/jira/browse/YARN-11499
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ferenc Erdelyi






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support

2023-05-17 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11468:
--
Description: 
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL communication 
should be provided in the yarn-default.xml and the required parameters for the 
keystore and truststore should be picked up from the core-default.xml 
(HADOOP-18709)

  was:
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should be 
provided in the yarn-default.xml and the required parameters for the keystore 
and truststore should be picked up from the core-default.xml (HADOOP-18709)


> Zookeeper SSL/TLS support
> -
>
> Key: YARN-11468
> URL: https://issues.apache.org/jira/browse/YARN-11468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Critical
>
> Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
> clients.
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> The SSL communication should be possible in the different parts of YARN, 
> where it communicates with Zookeeper servers. The Zookeeper clients are used 
> in the following places:
>  * ResourceManager
>  * ZKConfigurationStore
>  * ZKRMStateStore
> The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL 
> communication should be provided in the yarn-default.xml and the required 
> parameters for the keystore and truststore should be picked up from the 
> core-default.xml (HADOOP-18709)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support

2023-04-19 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi updated YARN-11468:
--
Description: 
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should be 
provided in the yarn-default.xml and the required parameters for the keystore 
and truststore should be picked up from the core-default.xml (HADOOP-18709)

  was:
Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The flag to enable SSL communication and the required parameters should be 
provided by different configuration parameters, corresponding to the different 
use cases. 


> Zookeeper SSL/TLS support
> -
>
> Key: YARN-11468
> URL: https://issues.apache.org/jira/browse/YARN-11468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Critical
>
> Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
> clients.
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> The SSL communication should be possible in the different parts of YARN, 
> where it communicates with Zookeeper servers. The Zookeeper clients are used 
> in the following places:
>  * ResourceManager
>  * ZKConfigurationStore
>  * ZKRMStateStore
> The yarn.zookeeper.ssl.client.enable flag to enable SSL communication should 
> be provided in the yarn-default.xml and the required parameters for the 
> keystore and truststore should be picked up from the core-default.xml 
> (HADOOP-18709)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11468) Zookeeper SSL/TLS support

2023-04-18 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-11468:
-

Assignee: Ferenc Erdelyi

> Zookeeper SSL/TLS support
> -
>
> Key: YARN-11468
> URL: https://issues.apache.org/jira/browse/YARN-11468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Critical
>
> Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
> clients.
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> The SSL communication should be possible in the different parts of YARN, 
> where it communicates with Zookeeper servers. The Zookeeper clients are used 
> in the following places:
>  * ResourceManager
>  * ZKConfigurationStore
>  * ZKRMStateStore
> The flag to enable SSL communication and the required parameters should be 
> provided by different configuration parameters, corresponding to the 
> different use cases. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11468) Zookeeper SSL/TLS support

2023-04-18 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11468:
-

 Summary: Zookeeper SSL/TLS support
 Key: YARN-11468
 URL: https://issues.apache.org/jira/browse/YARN-11468
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Ferenc Erdelyi


Zookeeper 3.5.5 server can operate with SSL/TLS secure connection with its 
clients.

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]

The SSL communication should be possible in the different parts of YARN, where 
it communicates with Zookeeper servers. The Zookeeper clients are used in the 
following places:
 * ResourceManager
 * ZKConfigurationStore
 * ZKRMStateStore

The flag to enable SSL communication and the required parameters should be 
provided by different configuration parameters, corresponding to the different 
use cases. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10379) Refactor ContainerExecutor exit code Exception handling

2023-01-26 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi reassigned YARN-10379:
-

Assignee: Ferenc Erdelyi  (was: Benjamin Teke)

> Refactor ContainerExecutor exit code Exception handling
> ---
>
> Key: YARN-10379
> URL: https://issues.apache.org/jira/browse/YARN-10379
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Benjamin Teke
>Assignee: Ferenc Erdelyi
>Priority: Minor
>
> Currently, every time a shell command is executed and returns with a 
> non-zero exit code, an exception gets thrown. But along the call tree this 
> exception gets caught and, after some info/warn logging and other processing 
> steps, rethrown, possibly packaged into another exception. For example:
>  * from PrivilegedOperationExecutor.executePrivilegedOperation - 
> ExitCodeException catch (as IOException), PrivilegedOperationException thrown
>  * then in LinuxContainerExecutor.startLocalizer - 
> PrivilegedOperationException catch, exitCode collection, logging, IOException 
> rethrown
>  * then in ResourceLocalizationService.run - generic Exception catch, but 
> there is a TODO for separate ExitCodeException handling, however that 
> information is only present here in an error message string
> This flow could be simplified and unified in the different executors. For 
> example use one specific exception till the last possible step, catch it only 
> where it is necessary and keep the exitcode as it could be used later in the 
> process. This change could help with maintainability and readability.
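As a rough sketch of the proposed shape (hypothetical names, not the actual refactor): a single exception type carries the exit code from the lowest layer up to the one caller that needs it, instead of repeated catch/log/rethrow with the code buried in message strings.
{code:java}
// Stand-alone illustration of keeping the exit code in one dedicated
// exception until the last possible step.
public class ExitCodeFlowSketch {

  static class ContainerExecutionException extends Exception {
    private final int exitCode;
    ContainerExecutionException(int exitCode, String message, Throwable cause) {
      super(message, cause);
      this.exitCode = exitCode;
    }
    int getExitCode() { return exitCode; }
  }

  // Lowest layer: the only place that turns a shell exit code into an exception.
  static void runShellCommand() throws ContainerExecutionException {
    int exitCode = 24; // pretend the container-executor returned a non-zero code
    if (exitCode != 0) {
      throw new ContainerExecutionException(exitCode, "container-executor failed", null);
    }
  }

  public static void main(String[] args) {
    try {
      runShellCommand(); // intermediate layers would simply declare "throws"
    } catch (ContainerExecutionException e) {
      // Only the top-level caller inspects the code and decides what to do.
      System.out.println("localizer failed with exit code " + e.getExitCode());
    }
  }
}
{code}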



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org