[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456178#comment-17456178 ] Andras Gyori commented on YARN-10178: - Thanks [~epayne] for the details. The root cause you are describing seems right to me. It's probably transitivity that is violated (namely, if q1 > q2 and q2 > q3 then q1 > q3, but by the time the sort reaches the q1, q3 comparison, the queues have already changed, thus breaking the TimSort requirements), though I am not entirely sure about that. All in all, the snapshot idea seems to be the correct one. As for {noformat} I read online that even the stream method of List is not a deep copy. Is that true? If we are only making a reference of the queue list, then the resource usages of each queue can change and cause the sorted list to be wrong during sorting.{noformat} I believe it is not a problem, as we are not making a copy, but creating new objects out of the queues, and only taking floats out of them, which are value types. However, configuredMinResource is indeed a reference and mutable as well, so we might need to clone that with Resources.clone() (I think that is the standard convention). > Global Scheduler async thread crash caused by 'Comparison method violates its > general contract' > --- > > Key: YARN-10178 > URL: https://issues.apache.org/jira/browse/YARN-10178 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.1 >Reporter: tuyu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10178.001.patch, YARN-10178.002.patch, > YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch > > > Global Scheduler Async Thread crash stack > {code:java} > ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received > RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, > Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: > Comparison method violates its general contract! 
>at > java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeForceCollapse(TimSort.java:457) > at java.util.TimSort.sort(TimSort.java:254) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1462) > at java.util.Collections.sort(Collections.java:177) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616) > {code} > Java 8 Arrays.sort uses the TimSort algorithm by default, and TimSort has a few requirements:
> {code:java} > 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) > 2. x > y, y > z --> x > z > 3. x = y --> sgn(x.compareTo(z)) == sgn(y.compareTo(z)) > {code} > if the array's elements do not satisfy these requirements, TimSort will throw > 'java.lang.IllegalArgumentException'. > Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can see that > Capacity Scheduler uses these queue resource usages to compare: > {code:java} > AbsoluteUsedCapacity > UsedCapacity > ConfiguredMinResource > AbsoluteCapacity > {code} > In Capacity Scheduler, the Global Scheduler AsyncThread uses > PriorityUtilizationQueueOrderingPolicy to choose a queue to assign a > container, and construct
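The violated requirement can be shown with a minimal, self-contained sketch (the class and field names below are illustrative stand-ins, not YARN types): if the state a comparator reads is mutated between two comparisons, the sign-symmetry rule sgn(compare(x, y)) == -sgn(compare(y, x)) no longer holds for the values the sort actually observed, which is exactly the condition under which TimSort reports 'Comparison method violates its general contract!'.

```java
import java.util.Comparator;

// Minimal illustration (made-up names, not YARN classes): a comparator that
// reads live mutable state can contradict itself between two calls.
public class ContractViolationDemo {
    static class Queue {
        volatile float usedCapacity;
        Queue(float usedCapacity) { this.usedCapacity = usedCapacity; }
    }

    public static void main(String[] args) {
        Comparator<Queue> live =
            (a, b) -> Float.compare(a.usedCapacity, b.usedCapacity);

        Queue q1 = new Queue(1.0f);
        Queue q2 = new Queue(2.0f);

        int first = live.compare(q1, q2);   // q1 < q2, so negative
        q2.usedCapacity = 0.0f;             // simulates a concurrent allocation update
        int second = live.compare(q2, q1);  // now q2 < q1, also negative

        // Both results are negative: sgn(compare(x, y)) == -sgn(compare(y, x))
        // is violated for the values the sort observed.
        if (!(first < 0 && second < 0)) {
            throw new AssertionError("expected both comparisons to be negative");
        }
        System.out.println("first=" + first + " second=" + second);
    }
}
```

A sort over enough elements whose comparator behaves like this is what surfaces as the IllegalArgumentException in the stack trace above.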
[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456058#comment-17456058 ] Eric Payne commented on YARN-10178: --- This is a complicated problem, and I'm still trying to get my brain around what exactly is happening and what would fix it. So, if I get some of the details wrong here, please correct me. [~gandras], bq. I was wondering whether we could avoid creating the snapshot altogether, by modifying the original comparator to acquire the necessary values immediately I think the problem happening within TimSort.sort() is that while the queue list is being sorted, the resources of the elements that have already been sorted are changing. So when TimSort.sort() tries to find the correct location for the new element, the sort order is wrong. So, I think the copy is needed so that a static list of queues is being sorted. [~zhuqi]/[~wangda]/others, I read online that even the stream method of List is not a deep copy. Is that true? If we are only making a reference of the queue list, then the resource usages of each queue can change and cause the sorted list to be wrong during sorting. bq. We should not use the Stream API because of older branches. I suggest rewriting getAssignmentIterator: I believe that the Stream API was introduced in JDK 8. If we choose to use it, we would not be able to backport this fix to anything prior to Hadoop 2.10. I am fine with that, but I am interested in others' opinions. {quote} Measuring performance is a delicate procedure. Including it in a unit test is incredibly volatile (On my local machine I have not been able to pass the test, for example), especially when naive time measurement is involved. Not sure if we can easily reproduce it, but I think in this case no test is better than a potentially intermittent test. {quote} I agree with [~gandras]. 
I have been trying to determine a way to write a unit test that can reproduce this, but so far I have had no luck. But I think a unit test that doesn't reproduce the error _and_ could fail intermittently is not ideal.
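On the deep-copy question raised above: collecting a Stream into a new List copies only the list structure; the element references are shared. A small sketch demonstrating this (using a generic mutable element type, not a YARN class):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Shows that a stream "copy" of a List shares its element references:
// mutating an element is visible through both lists, while structural
// changes (add/remove) stay independent.
public class ShallowCopyDemo {
    public static void main(String[] args) {
        List<AtomicInteger> original = new ArrayList<>();
        original.add(new AtomicInteger(1));

        // "Copy" via the Stream API: a new List, but the same elements.
        List<AtomicInteger> copy = original.stream().collect(Collectors.toList());

        original.get(0).set(42); // mutate an element through the original list
        if (copy.get(0).get() != 42) {
            throw new AssertionError("stream copy should share element references");
        }

        // Structural changes, by contrast, are independent.
        original.add(new AtomicInteger(7));
        if (copy.size() != 1) {
            throw new AssertionError("copy has its own list structure");
        }
        System.out.println("shared element seen through copy: " + copy.get(0));
    }
}
```

This is why copying the list alone does not help here: only copying the primitive values out of the queues (the snapshot approach) makes the comparator's inputs immutable.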
[jira] [Commented] (YARN-11006) Allow overriding user limit factor and maxAMResourcePercent with AQCv2 templates
[ https://issues.apache.org/jira/browse/YARN-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455914#comment-17455914 ] Benjamin Teke commented on YARN-11006: -- Thanks [~snemeth]. No need to backport it, this only impacts AQCv2, which will be part of 3.4. > Allow overriding user limit factor and maxAMResourcePercent with AQCv2 > templates > > > Key: YARN-11006 > URL: https://issues.apache.org/jira/browse/YARN-11006 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > [YARN-10801|https://issues.apache.org/jira/browse/YARN-10801] fixed the > template configurations for every queue property, but it introduced a strange > behaviour as well. When setting the template configurations > LeafQueue.setDynamicQueueProperties is called: > {code:java} > @Override > protected void setDynamicQueueProperties( > CapacitySchedulerConfiguration configuration) { > super.setDynamicQueueProperties(configuration); > // set to -1, to disable it > configuration.setUserLimitFactor(getQueuePath(), -1); > // Set Max AM percentage to a higher value > configuration.setMaximumApplicationMasterResourcePerQueuePercent( > getQueuePath(), 1f); > } > {code} > This sets the configured template properties in the configuration object and > then it overwrites the user limit factor and the maximum AM resource percent > values with the hardcoded ones. The order should be reversed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
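The ordering fix described in YARN-11006 can be sketched as follows. This is only an illustration of the write order under stated assumptions: a HashMap stands in for CapacitySchedulerConfiguration and the property key is made up; it is not the actual YARN API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ordering bug and fix: whichever write happens last wins,
// so hardcoded defaults must be applied before the template values.
public class TemplateOrderDemo {
    static final String USER_LIMIT_FACTOR = "user-limit-factor"; // illustrative key

    // Buggy order: template values first, then the hardcoded default clobbers them.
    static Map<String, String> buggy(Map<String, String> template) {
        Map<String, String> conf = new HashMap<>();
        conf.putAll(template);             // template properties applied
        conf.put(USER_LIMIT_FACTOR, "-1"); // hardcoded default overwrites the template
        return conf;
    }

    // Fixed (reversed) order: hardcoded defaults first, template values win.
    static Map<String, String> fixed(Map<String, String> template) {
        Map<String, String> conf = new HashMap<>();
        conf.put(USER_LIMIT_FACTOR, "-1"); // hardcoded default
        conf.putAll(template);             // template overrides the default
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> template = new HashMap<>();
        template.put(USER_LIMIT_FACTOR, "2"); // admin-provided AQCv2 template value

        if (!buggy(template).get(USER_LIMIT_FACTOR).equals("-1")) {
            throw new AssertionError("buggy order should lose the template value");
        }
        if (!fixed(template).get(USER_LIMIT_FACTOR).equals("2")) {
            throw new AssertionError("fixed order should keep the template value");
        }
        System.out.println("template value survives only with the reversed order");
    }
}
```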
[jira] [Resolved] (YARN-11005) Implement the core QUEUE_LENGTH_THEN_RESOURCES OContainer allocation policy
[ https://issues.apache.org/jira/browse/YARN-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri resolved YARN-11005. Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Implement the core QUEUE_LENGTH_THEN_RESOURCES OContainer allocation policy > --- > > Key: YARN-11005 > URL: https://issues.apache.org/jira/browse/YARN-11005 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Andrew Chung >Assignee: Andrew Chung >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > This sub-task contains the bulk of the work that will implement the new > resource-aware Opportunistic container allocation policy > {{QUEUE_LENGTH_THEN_RESOURCES}}. > The core tasks here are to: > # Allow {{ClusterNode}} to be allocated resource-aware using information from > {{RMNode}}, > # Add the {{QUEUE_LENGTH_THEN_RESOURCES}} {{LoadComparator}}, > # Implement the new sorting logic and the logic to determine whether a node > can queue Opportunistic containers in case the > {{QUEUE_LENGTH_THEN_RESOURCES}} policy is chosen, and > # Modify {{NodeQueueLoadMonitor.selectAnyNode}} to be request-resource-aware.
[jira] [Comment Edited] (YARN-10965) Centralize queue resource calculation based on CapacityVectors
[ https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343 ] Andras Gyori edited comment on YARN-10965 at 12/8/21, 3:56 PM: --- As it is a crucial part of CapacityScheduler, it would be helpful if a few community members took a look at this and maybe at the design doc as well. cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] [~adam.antal] was (Author: gandras): As it is a crucial part of CapacityScheduler, it would be helpful if a few community members took a look on this and maybe on the design doc as well. cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] > Centralize queue resource calculation based on CapacityVectors > -- > > Key: YARN-10965 > URL: https://issues.apache.org/jira/browse/YARN-10965 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > With the introduction of YARN-10930 it is possible to unify queue resource > calculation. In order to narrow down the scope of this patch, the base system > is implemented here, without refactoring the existing resource calculation in > updateClusterResource (which will be done in YARN-11000).
[jira] [Commented] (YARN-10965) Centralize queue resource calculation based on CapacityVectors
[ https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343 ] Andras Gyori commented on YARN-10965: - As it is a crucial part of CapacityScheduler, it would be helpful if a few community members took a look on this and maybe on the design doc as well. cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST]
[jira] [Updated] (YARN-11038) Fix testQueueSubmitWithACL* tests in TestAppManager
[ https://issues.apache.org/jira/browse/YARN-11038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11038: -- Fix Version/s: 3.4.0 > Fix testQueueSubmitWithACL* tests in TestAppManager > --- > > Key: YARN-11038 > URL: https://issues.apache.org/jira/browse/YARN-11038 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 20m > Remaining Estimate: 0h > > These two tests don't test anything: > - testQueueSubmitWithACLsEnabledWithQueueMapping > - testQueueSubmitWithACLsEnabledWithQueueMappingForAutoCreatedQueue > > Issues: > - no assert if the exception is not thrown > - the configuration isn't even loaded (csConf is not used for the mockRM) > - the placement manager did not even match the test scenario > - the successful submit was not asserted either >
[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455323#comment-17455323 ] Andras Gyori commented on YARN-10178: - We have faced the same issue in a production cluster recently. I agree with [~epayne] that this should be resolved as soon as possible. My feedback on the patch: * As this is a subtle concurrency issue, I have not been able to reproduce it yet, but I was wondering whether we could avoid creating the snapshot altogether, by modifying the original comparator to acquire the necessary values immediately, thus hopefully eliminating the possibility of violating the sorting's requirements. This would look like the following: {code:java} float q1AbsCapacity = q1.getQueueCapacities().getAbsoluteCapacity(p); float q2AbsCapacity = q2.getQueueCapacities().getAbsoluteCapacity(p); float q1AbsUsedCapacity = q1.getQueueCapacities().getAbsoluteUsedCapacity(p); float q2AbsUsedCapacity = q2.getQueueCapacities().getAbsoluteUsedCapacity(p); float q1UsedCapacity = q1.getQueueCapacities().getUsedCapacity(p); float q2UsedCapacity = q2.getQueueCapacities().getUsedCapacity(p); .{code} * We should not use the Stream API because of older branches. I suggest rewriting getAssignmentIterator: {code:java} @Override public Iterator<CSQueue> getAssignmentIterator(String partition) { // Since partitionToLookAt is a thread local variable, and every time we // copy and sort queues, so it's safe for multi-threading environment. PriorityUtilizationQueueOrderingPolicy.partitionToLookAt.set(partition); // Sort the snapshots instead of the queues directly, due to race conditions // See YARN-10178 for more information. 
List<QueueSnapshot> queueSnapshots = new ArrayList<>(); for (CSQueue queue : queues) { queueSnapshots.add(new QueueSnapshot(queue)); } queueSnapshots.sort(new PriorityQueueComparator()); List<CSQueue> sortedQueues = new ArrayList<>(); for (QueueSnapshot queueSnapshot : queueSnapshots) { sortedQueues.add(queueSnapshot.queue); } return sortedQueues.iterator(); } {code} * We do not need to keep the old logic * Measuring performance is a delicate procedure. Including it in a unit test is incredibly volatile (On my local machine I have not been able to pass the test, for example), especially when naive time measurement is involved. Not sure if we can easily reproduce it, but I think in this case no test is better than a potentially intermittent test.
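The snapshot approach under discussion can be reduced to the following sketch (simplified stand-in classes, not the real CSQueue/QueueSnapshot; per the thread, a real patch would also clone configuredMinResource with Resources.clone()): the primitive values are copied out of the mutable queues once, so later mutations cannot perturb the order the comparator observes during the sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the snapshot idea: freeze the float metrics at snapshot time
// so the comparator's inputs are immutable while TimSort runs.
public class SnapshotSortDemo {
    static class Queue {
        final String name;
        volatile float usedCapacity;
        Queue(String name, float usedCapacity) {
            this.name = name;
            this.usedCapacity = usedCapacity;
        }
    }

    // Immutable snapshot: the value is fixed at construction time.
    static class QueueSnapshot {
        final Queue queue;
        final float usedCapacity;
        QueueSnapshot(Queue queue) {
            this.queue = queue;
            this.usedCapacity = queue.usedCapacity;
        }
    }

    public static void main(String[] args) {
        Queue a = new Queue("a", 0.9f);
        Queue b = new Queue("b", 0.1f);
        List<QueueSnapshot> snapshots = new ArrayList<>();
        snapshots.add(new QueueSnapshot(a));
        snapshots.add(new QueueSnapshot(b));

        // A mutation after the snapshot (e.g. by an allocation thread) is
        // invisible to the comparator below.
        a.usedCapacity = 0.0f;

        snapshots.sort(Comparator.comparingDouble(s -> s.usedCapacity));
        if (!snapshots.get(0).queue.name.equals("b")) {
            throw new AssertionError("sort should use the snapshotted value of 'a' (0.9)");
        }
        System.out.println("least used (at snapshot time): " + snapshots.get(0).queue.name);
    }
}
```

Because every value the comparator touches is final, the TimSort contract cannot be violated mid-sort, whatever the live queues do concurrently.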
[jira] [Resolved] (YARN-11031) Improve the maintainability of RM webapp tests like TestRMWebServicesCapacitySched
[ https://issues.apache.org/jira/browse/YARN-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth resolved YARN-11031. --- Hadoop Flags: Reviewed Resolution: Fixed > Improve the maintainability of RM webapp tests like > TestRMWebServicesCapacitySched > -- > > Key: YARN-11031 > URL: https://issues.apache.org/jira/browse/YARN-11031 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > It's hard to maintain the asserts in TestRMWebServicesCapacitySched, > TestRMWebServicesCapacitySchedDynamicConfig test classes when the scheduler > response is modified. Currently only a subset of the scheduler response is > asserted in these tests.
[jira] [Updated] (YARN-11031) Improve the maintainability of RM webapp tests like TestRMWebServicesCapacitySched
[ https://issues.apache.org/jira/browse/YARN-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-11031: -- Fix Version/s: 3.4.0
[jira] [Created] (YARN-11040) Error in build log: hadoop-resourceestimator has missing dependencies: jsonschema2pojo-core-1.0.2.jar
Tamas Domok created YARN-11040: -- Summary: Error in build log: hadoop-resourceestimator has missing dependencies: jsonschema2pojo-core-1.0.2.jar Key: YARN-11040 URL: https://issues.apache.org/jira/browse/YARN-11040 Project: Hadoop YARN Issue Type: Bug Components: build, yarn Affects Versions: 3.4.0 Reporter: Tamas Domok Assignee: Tamas Domok There is an error in the log about a missing dependency during the package build. Reproduction: mvn clean package -Pyarn-ui -Pdist -Dtar -Dmaven.javadoc.skip=true -DskipTests -DskipShade 2>&1 | grep jsonschema2pojo {code:java} [INFO] --- jsonschema2pojo-maven-plugin:1.1.1:generate (default) @ hadoop-yarn-server-resourcemanager --- [INFO] Copying jsonschema2pojo-core-1.0.2.jar to /Users/tdomok/Work/hadoop/hadoop-tools/hadoop-federation-balance/target/lib/jsonschema2pojo-core-1.0.2.jar [INFO] org.jsonschema2pojo:jsonschema2pojo-core:jar:1.0.2 already exists in destination. [INFO] Copying jsonschema2pojo-core-1.0.2.jar to /Users/tdomok/Work/hadoop/hadoop-tools/hadoop-aws/target/lib/jsonschema2pojo-core-1.0.2.jar ERROR: hadoop-resourceestimator has missing dependencies: jsonschema2pojo-core-1.0.2.jar {code} The build is successful, but this error appears in the build log: {code:java} ERROR: hadoop-resourceestimator has missing dependencies: jsonschema2pojo-core-1.0.2.jar {code}
[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container
[ https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455224#comment-17455224 ] Hadoop QA commented on YARN-11020: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Logfile || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 43s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} | | {color:blue}0{color} | {color:blue} jshint {color} | {color:blue} 0m 0s{color} | {color:blue}{color} | {color:blue} jshint was not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} branch-3.3 Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 9s{color} | {color:green}{color} | {color:green} branch-3.3 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 48m 28s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 15s{color} | {color:green}{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 48s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green}{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 78m 33s{color} | {color:black}{color} | {color:black}{color} | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1258/artifact/out/Dockerfile | | JIRA Issue | YARN-11020 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13037122/YARN-11020-branch-3.3.001.patch | | Optional Tests | dupname asflicense shadedclient jshint | | uname | Linux 783477a68210 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | branch-3.3 / 1ee661d7da4 | | Max. process+thread count | 636 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui | | Console output | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1258/console | | versions | git=2.17.1 maven=3.6.0 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. 
> [UI2] No container is found for an application attempt with a single AM > container > - > > Key: YARN-11020 > URL: https://issues.apache.org/jira/browse/YARN-11020 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11020-branch-3.3.001.patch > > Time Spent: 1h > Remaining Estimate: 0h > > In UI2 for an application under the Logs tab, No container data available > message is shown if the application attempt only submitted a single container > (which is the AM container). > The culprit of the issue is that the response from YARN is not consistent, > because for a single container it looks like: > {noformat} > { > "containerLogsInfo": { > "containerLogInfo": [ > { > "fileName": "prelaunch.out", > "fileSize": "100", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "directory.info", > "fileSize": "2296", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "stderr", >
[jira] [Updated] (YARN-11034) Add enhanced headroom in AllocateResponse
[ https://issues.apache.org/jira/browse/YARN-11034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11034: -- Labels: pull-request-available (was: ) > Add enhanced headroom in AllocateResponse > - > > Key: YARN-11034 > URL: https://issues.apache.org/jira/browse/YARN-11034 > Project: Hadoop YARN > Issue Type: Task >Reporter: Minni Mittal >Assignee: Minni Mittal >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Add enhanced headroom in allocate response. This provides a channel for RMs > to return load information for AMRMProxy and decision making when rerouting > resource requests.
[jira] [Reopened] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container
[ https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori reopened YARN-11020: - > [UI2] No container is found for an application attempt with a single AM > container > - > > Key: YARN-11020 > URL: https://issues.apache.org/jira/browse/YARN-11020 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: YARN-11020-branch-3.3.001.patch > > Time Spent: 1h > Remaining Estimate: 0h > > In UI2 for an application under the Logs tab, No container data available > message is shown if the application attempt only submitted a single container > (which is the AM container). > The culprit of the issue is that the response from YARN is not consistent, > because for a single container it looks like: > {noformat} > { > "containerLogsInfo": { > "containerLogInfo": [ > { > "fileName": "prelaunch.out", > "fileSize": "100", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "directory.info", > "fileSize": "2296", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "stderr", > "fileSize": "1722", > "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021" > }, > { > "fileName": "prelaunch.err", > "fileSize": "0", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "stdout", > "fileSize": "0", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > }, > { > "fileName": "syslog", > "fileSize": "38551", > "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021" > }, > { > "fileName": "launch_container.sh", > "fileSize": "5013", > "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021" > } > ], > "logAggregationType": "AGGREGATED", > "containerId": "container_1638174027957_0008_01_01", > "nodeId": "da175178c179:43977" > } > }{noformat} > As for applications with multiple containers it looks like: > {noformat} > { > "containerLogsInfo": [{ 
> > }, { }] > }{noformat} > We can not change the response of the endpoint due to backward compatibility, > therefore we need to make UI2 be able to handle both scenarios. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
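The workaround described above — making UI2 tolerate both response shapes since the endpoint cannot change — can be sketched as a small normalization step. This is a hypothetical illustration (the real fix lives in the UI2 Ember serializer; the class and method names below are invented), showing the idea of coercing the single-container object into a one-element list before the rest of the UI iterates over it:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ContainerLogsNormalizer {

    /**
     * YARN may return "containerLogsInfo" either as a single JSON object
     * (an attempt with only the AM container) or as a JSON array
     * (multiple containers). Normalize both shapes to a list so the
     * consuming code can iterate uniformly.
     */
    @SuppressWarnings("unchecked")
    public static List<Map<String, Object>> normalize(Object containerLogsInfo) {
        if (containerLogsInfo instanceof List) {
            return (List<Map<String, Object>>) containerLogsInfo;
        }
        // Single-container response: wrap the lone object in a list.
        return Collections.singletonList((Map<String, Object>) containerLogsInfo);
    }
}
```

With this in place, the "No container data available" branch is only reached when the normalized list is genuinely empty, not when the response happened to be a bare object.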
[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container
[ https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-11020: Attachment: (was: YARN-11020-branch-3.3.001.patch)
[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container
[ https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455174#comment-17455174 ] Andras Gyori commented on YARN-11020: - The container log fetching is missing from branch-3.2, so I only backported it to branch-3.3.
[jira] [Commented] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455173#comment-17455173 ] Szilard Nemeth commented on YARN-11023: --- Thanks [~tdomok] for the confirmation. Resolving this jira then. > Extend the root QueueInfo with max-parallel-apps in CapacityScheduler > - > > Key: YARN-11023 > URL: https://issues.apache.org/jira/browse/YARN-11023 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > YARN-10891 extended the QueueInfo with the maxParallelApps property, but for > the root queue this property is missing.
[jira] [Comment Edited] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455173#comment-17455173 ] Szilard Nemeth edited comment on YARN-11023 at 12/8/21, 11:20 AM: -- Thanks [~tdomok] for the confirmation. was (Author: snemeth): Thanks [~tdomok] for the confirmation. Resolving this jira then.
[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container
[ https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-11020: Attachment: YARN-11020-branch-3.3.001.patch
[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero
[ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455172#comment-17455172 ] Szilard Nemeth commented on YARN-11016: --- Thanks [~gandras] for the confirmation. Resolving this jira then. > Queue weight is incorrectly reset to zero > - > > Key: YARN-11016 > URL: https://issues.apache.org/jira/browse/YARN-11016 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > QueueCapacities#clearConfigurableFields set WEIGHT capacity to 0, which could > cause problems like in the following scenario: > 1. Initializing queues > 2. Parent 'parent' have accessibleNodeLabels set, and since accessible node > labels are inherited, its children, for example 'child' has 'test' label as > its accessible-node-label. > 3. In LeafQueue#updateClusterResource, we call > LeafQueue#activateApplications, which then calls > LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see > getNodeLabelsForQueue). > In this case, the labels are the accessible node labels (the inherited > 'test). > During this event, the ResourceUsage object is updated for the label 'test', > thus extending its nodeLabelsSet with 'test'. > 4. In a following updateClusterResource call, for example an addNode event, > we now have 'test' label in ResourceUsage even though it was never explicitly > configured and we call CSQueueUtils#updateQueueStatistics, that takes the > union of the node labels from QueueCapacities and ResourceUsage (this union > is now the empty default label AND 'test') and updates QueueCapacities with > the label 'perf-test'. > Now QueueCapacities has 'test' in its nodeLabelsSet as well! > 5. 
After a reinitialization (like an update from mutation API), the > CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the > QueueCapacities values to zero (even weight, which is wrong in my opinion) > and loads the values again from the config. > The problem here is that values are reset for all node labels in > QueueCapacities (even for 'test'), but we only load the values for the > configured node labels (which we did not set, so it is defaulted to the empty > label). > 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities > and that is why the update fails. > It even explains why validation passes, because the validation endpoint > instantiates a brand new CapacityScheduler for which this cascade of effects > cannot accumulate (as there are no multiple updateClusterResource calls). > This scenario manifests as an error when updating via the mutation API: > {noformat} > Failed to re-init queues : Parent queue 'parent' have children queue used > mixed of weight mode, percentage and absolute mode, it is not allowed, please > double check, details:{noformat}
[jira] [Created] (YARN-11039) LogAggregationFileControllerFactory::getFileControllerForRead should close FS
Rajesh Balamohan created YARN-11039: --- Summary: LogAggregationFileControllerFactory::getFileControllerForRead should close FS Key: YARN-11039 URL: https://issues.apache.org/jira/browse/YARN-11039 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Reporter: Rajesh Balamohan LogAggregationFileControllerFactory::getFileControllerForRead internally opens a new FS object every time, and it is never closed. When a cloud connector (e.g. s3a) is used along with Knox, this leaks a KnoxTokenMonitor for every unclosed FS object, causing thread leaks in the NM. Lines of interest: [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L167]
{noformat}
try {
  Path remoteAppLogDir = fileController.getOlderRemoteAppLogDir(appId, appOwner);
  if (LogAggregationUtils.getNodeFiles(conf, remoteAppLogDir, appId, appOwner).hasNext()) {
    return fileController;
  }
} catch (Exception ex) {
  diagnosticsMsg.append(ex.getMessage() + "\n");
  continue;
}
{noformat}
[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java#L252]
{noformat}
public static RemoteIterator<FileStatus> getNodeFiles(Configuration conf,
    Path remoteAppLogDir, ApplicationId appId, String appOwner)
    throws IOException {
  Path qualifiedLogDir =
      FileContext.getFileContext(conf).makeQualified(remoteAppLogDir);
  return FileContext.getFileContext(
      qualifiedLogDir.toUri(), conf).listStatus(remoteAppLogDir);
}
{noformat}
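The shape of the fix suggested by the report is that whatever filesystem handle getNodeFiles opens should be tied to a scope that closes it deterministically. The sketch below stubs the Hadoop types with an invented stand-in Closeable (FakeFs and both method names are hypothetical, purely for illustration); the point is the try-with-resources pattern that guarantees close() runs, versus the leak-prone shape where the handle escapes through a returned iterator and nothing ever closes it:

```java
import java.io.Closeable;
import java.util.Iterator;
import java.util.List;

public class FsLeakSketch {

    /** Stand-in for a Hadoop FileSystem/FileContext handle (hypothetical). */
    static class FakeFs implements Closeable {
        boolean closed = false;
        List<String> listStatus() { return List.of("app_1_node1.log"); }
        @Override public void close() { closed = true; }
    }

    /**
     * Leak-prone shape, analogous to the report: the handle escapes
     * through the returned iterator and the caller never closes it.
     */
    static Iterator<String> leaky(FakeFs fs) {
        return fs.listStatus().iterator();
    }

    /**
     * Fixed shape: answer the caller's question (here, "are there any
     * node files?") while the handle is open, then let
     * try-with-resources close it even if listing throws.
     */
    static boolean hasAnyLogs(FakeFs fs) {
        try (FakeFs handle = fs) {
            return !handle.listStatus().isEmpty();
        }
    }
}
```

Since getFileControllerForRead only calls hasNext() on the iterator to pick a controller, materializing that one boolean inside the closing scope preserves its behavior while freeing the handle.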
[jira] [Updated] (YARN-10850) TimelineService v2 lists containers for all attempts when filtering for one
[ https://issues.apache.org/jira/browse/YARN-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-10850: -- Labels: pull-request-available (was: ) > TimelineService v2 lists containers for all attempts when filtering for one > --- > > Key: YARN-10850 > URL: https://issues.apache.org/jira/browse/YARN-10850 > Project: Hadoop YARN > Issue Type: Bug > Components: timelinereader >Reporter: Benjamin Teke >Assignee: Tibor Kovács >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When using the command > {code:java} > yarn container -list > {code} > with an application attempt ID based on the help only the containers for that > attempt should be listed. > {code:java} > -list List containers for application > attempt when application > attempt ID is provided. When > application name is provided, > then it finds the instances of > the application based on app's > own implementation, and > -appTypes option must be > specified unless it is the > default yarn-service type. With > app name, it supports optional > use of -version to filter > instances based on app version, > -components to filter instances > based on component names, > -states to filter instances > based on instance state. > {code} > When TimelineService v2 is enabled all of the containers for the application > are returned. 
> {code:java} > hrt_qa@ctr-e172-1620330694487-146061-01-02:/hwqe/hadoopqe$ yarn > applicationattempt -list application_1625124233002_0007 > 21/07/01 09:32:23 INFO impl.TimelineReaderClientImpl: Initialized > TimelineReader > URI=http://ctr-e172-1620330694487-146061-01-04.hwx.site:8198/ws/v2/timeline/, > clusterId=yarn-cluster > 21/07/01 09:32:24 INFO client.AHSProxy: Connecting to Application History > server at ctr-e172-1620330694487-146061-01-04.hwx.site/172.27.113.4:10200 > 21/07/01 09:32:24 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Total number of application attempts :2 > ApplicationAttempt-Id State > AM-Container-IdTracking-URL > appattempt_1625124233002_0007_01FAILED > container_e43_1625124233002_0007_01_01 > http://ctr-e172-1620330694487-146061-01-03.hwx.site:8088/proxy/application_1625124233002_0007/ > appattempt_1625124233002_0007_02KILLED > container_e43_1625124233002_0007_02_01 > http://ctr-e172-1620330694487-146061-01-03.hwx.site:8088/proxy/application_1625124233002_0007/ > {code} > Querying the 2 app attempts produces the same output: > {code:java} > hrt_qa@ctr-e172-1620330694487-146061-01-02:/hwqe/hadoopqe$ yarn container > -list appattempt_1625124233002_0007_01 > 21/07/01 09:32:35 INFO impl.TimelineReaderClientImpl: Initialized > TimelineReader > URI=http://ctr-e172-1620330694487-146061-01-04.hwx.site:8198/ws/v2/timeline/, > clusterId=yarn-cluster > 21/07/01 09:32:35 INFO client.AHSProxy: Connecting to Application History > server at ctr-e172-1620330694487-146061-01-04.hwx.site/172.27.113.4:10200 > 21/07/01 09:32:35 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > 21/07/01 09:32:36 INFO conf.Configuration: found resource resource-types.xml > at file:/etc/hadoop/7.1.7.0-504/0/resource-types.xml > Total number of containers :12 > Container-Id Start Time Finish > Time StateHost Node Http Address > LOG-URL > container_e43_1625124233002_0007_02_04 N/A > N/ACOMPLETE > 
ctr-e172-1620330694487-146061-01-02.hwx.site:25454 > ctr-e172-1620330694487-146061-01-02.hwx.site:8042 >
[jira] [Resolved] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tamas Domok resolved YARN-11023. Resolution: Fixed Hi [~snemeth], there is no need for a backport; this feature was not backported to branch-3.3 or branch-3.2.
[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andras Gyori updated YARN-10888: Attachment: (was: capacity_scheduler_queue_capacity.html) > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if every place where a capacity can be defined could be > defined the same way: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (e.g. 10% of all memory, vcore, GPU) > ** Percentage per resource type (e.g. 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queue. > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in each case.