[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456178#comment-17456178
 ] 

Andras Gyori commented on YARN-10178:
-

Thanks [~epayne] for the details. I think the root cause you are describing is 
correct. It is probably transitivity that is violated (namely, if q1 > q2 and 
q2 > q3 then q1 > q3 must hold, but by the time the sort reaches the q1, q3 
comparison the queues have already changed, thus breaking the TimSort 
requirements), though I am not entirely sure about that.
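As a purely illustrative toy (not YARN code, and not deterministic), the sketch 
below uses a comparator whose answers are not self-consistent, standing in for 
queues whose usage changes while the sort is running; with enough elements this 
usually, though not always, makes TimSort detect the broken contract and throw 
the exception from the stack trace quoted below.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class UnstableSortDemo {
  public static void main(String[] args) {
    List<Integer> items = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      items.add(i);
    }
    Random rnd = new Random();
    // A comparator that answers inconsistently, like a comparator reading
    // queue usage values that another thread keeps changing mid-sort.
    Comparator<Integer> unstable = (a, b) -> rnd.nextInt(3) - 1;
    try {
      items.sort(unstable); // ArrayList.sort delegates to TimSort
    } catch (IllegalArgumentException e) {
      // Typically: "Comparison method violates its general contract!"
      System.out.println(e.getMessage());
    }
  }
}
{code}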

All in all, the snapshot idea seems to be the correct one. As for 
{noformat}
 I read online that even the stream method of List is not a deep copy. Is that 
true? If we are only making a reference of the queue list, then the resource 
usages of each queue can change and cause the sorted list to be wrong during 
sorting.{noformat}
I believe it is not a problem, as we are not making a copy but creating new 
objects out of the queues, and we only take floats from them, which are value 
types. However, configuredMinResource is indeed a mutable reference as well, so 
we might need to clone it with Resources.clone() (I think that is the standard 
convention).
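To make that concrete, here is a minimal sketch of what such a snapshot could 
hold; the class name QueueSnapshot, the constructor shape, and passing the 
partition explicitly are assumptions made for illustration, not the committed 
patch (the real patch may obtain the partition differently, e.g. through a 
thread-local).
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue;
import org.apache.hadoop.yarn.util.resource.Resources;

// Hypothetical snapshot holder: the float capacities are copied by value, and
// the mutable Resource is defensively cloned, so later queue updates cannot
// change the keys an ongoing sort is comparing on.
class QueueSnapshot {
  final CSQueue queue;
  final float usedCapacity;
  final float absoluteUsedCapacity;
  final float absoluteCapacity;
  final Resource configuredMinResource;

  QueueSnapshot(CSQueue queue, String partition) {
    this.queue = queue;
    this.usedCapacity = queue.getQueueCapacities().getUsedCapacity(partition);
    this.absoluteUsedCapacity =
        queue.getQueueCapacities().getAbsoluteUsedCapacity(partition);
    this.absoluteCapacity =
        queue.getQueueCapacities().getAbsoluteCapacity(partition);
    // Resources.clone() copies the Resource, so the snapshot does not share
    // the mutable instance with the live queue.
    this.configuredMinResource = Resources.clone(
        queue.getQueueResourceQuotas().getConfiguredMinResource(partition));
  }
}
{code}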

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort has a few 
> requirements:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z  implies  x > z
> 3. x == y  implies  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the array elements do not satisfy these requirements, TimSort will throw 
> 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can 
> see that the Capacity Scheduler compares on these queue resource usage values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the Global Scheduler AsyncThread uses the 
> PriorityUtilizationQueueOrderingPolicy to choose a queue to assign a 
> container to, and constructs 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456058#comment-17456058
 ] 

Eric Payne commented on YARN-10178:
---

This is a complicated problem, and I'm still trying to get my brain around what 
exactly is happening and what would fix it. So, if I get some of the details 
wrong here, please correct me.
[~gandras], 
bq. I was wondering whether we could avoid creating the snapshot altogether, by 
modifying the original comparator to acquire the necessary values immediately
I think the problem within TimSort.sort() is that while the queue list is being 
sorted, the resources of the elements that have already been sorted are 
changing. So when TimSort.sort() tries to find the correct location for the 
next element, the sort order is wrong. That is why I think the copy is needed, 
so that a static list of queues is being sorted.

[~zhuqi]/[~wangda]/others, I read online that even the stream method of List is 
not a deep copy. Is that true? If the copied list only holds references to the 
same queue objects, then the resource usages of each queue can still change and 
cause the sorted list to be wrong during sorting.
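For reference, collecting a stream into a new list copies only the element 
references, not the elements themselves; a small standalone demonstration (the 
Queue class here is just an illustrative stand-in, not a YARN type):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ShallowCopyDemo {
  static class Queue {
    float usedCapacity;
    Queue(float usedCapacity) { this.usedCapacity = usedCapacity; }
  }

  public static void main(String[] args) {
    List<Queue> queues = new ArrayList<>();
    queues.add(new Queue(0.1f));

    // The "copy" is a new List object, but it holds the same Queue references.
    List<Queue> copy = queues.stream().collect(Collectors.toList());

    queues.get(0).usedCapacity = 0.9f;            // mutate through the original
    System.out.println(copy.get(0).usedCapacity); // prints 0.9: the copy sees it
  }
}
{code}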

bq. We should not use the Stream API because of older branches. I suggest 
rewriting getAssignmentIterator: 
I believe that the Stream API was introduced in JDK 8. If we choose to use it, 
we would not be able to backport this fix to anything prior to Hadoop 2.10. I 
am fine with that, but I am interested in others' opinions.

{quote}
Measuring performance is a delicate procedure. Including it in a unit test is 
incredibly volatile (on my local machine, for example, I have not been able to 
pass the test), especially when naive time measurement is involved. Not sure if 
we can easily reproduce it, but I think in this case no test is better than a 
potentially intermittent test.
{quote}
I agree with [~gandras]. I have been trying to determine a way to write a unit 
test that can reproduce this, but so far I have had no luck. But I think a unit 
test that doesn't reproduce the error _and_ could fail intermittently is not 
ideal.


> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> 

[jira] [Commented] (YARN-11006) Allow overriding user limit factor and maxAMResourcePercent with AQCv2 templates

2021-12-08 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455914#comment-17455914
 ] 

Benjamin Teke commented on YARN-11006:
--

Thanks [~snemeth]. No need to backport it, this only impacts AQCv2, which will 
be part of 3.4.

> Allow overriding user limit factor and maxAMResourcePercent with AQCv2 
> templates
> 
>
> Key: YARN-11006
> URL: https://issues.apache.org/jira/browse/YARN-11006
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> [YARN-10801|https://issues.apache.org/jira/browse/YARN-10801] fixed the 
> template configurations for every queue property, but it introduced a strange 
> behaviour as well. When setting the template configurations 
> LeafQueue.setDynamicQueueProperties is called:
> {code:java}
>   @Override
>   protected void setDynamicQueueProperties(
>   CapacitySchedulerConfiguration configuration) {
> super.setDynamicQueueProperties(configuration);
> // set to -1, to disable it
> configuration.setUserLimitFactor(getQueuePath(), -1);
> // Set Max AM percentage to a higher value
> configuration.setMaximumApplicationMasterResourcePerQueuePercent(
> getQueuePath(), 1f);
>   }
> {code}
> This sets the configured template properties in the configuration object and 
> then it overwrites the user limit factor and the maximum AM resource percent 
> values with the hardcoded ones. The order should be reversed.
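A minimal sketch of the reversal described above, mirroring the snippet quoted 
earlier (illustrative only; the actual change may differ in detail): set the 
hardcoded defaults first so that the template properties applied by the parent 
call can override them.
{code:java}
@Override
protected void setDynamicQueueProperties(
    CapacitySchedulerConfiguration configuration) {
  // Hardcoded defaults first...
  configuration.setUserLimitFactor(getQueuePath(), -1);
  configuration.setMaximumApplicationMasterResourcePerQueuePercent(
      getQueuePath(), 1f);
  // ...then the template properties, so they can override the defaults above.
  super.setDynamicQueueProperties(configuration);
}
{code}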



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11005) Implement the core QUEUE_LENGTH_THEN_RESOURCES OContainer allocation policy

2021-12-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/YARN-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri resolved YARN-11005.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Implement the core QUEUE_LENGTH_THEN_RESOURCES OContainer allocation policy
> ---
>
> Key: YARN-11005
> URL: https://issues.apache.org/jira/browse/YARN-11005
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Andrew Chung
>Assignee: Andrew Chung
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> This sub-task contains the bulk of the work that will implement the new 
> resource-aware Opportunistic container allocation policy 
> {{QUEUE_LENGTH_THEN_RESOURCES}}.
> The core tasks here are to:
> # Allow {{ClusterNode}} to be allocated resource-aware using information from 
> {{RMNode}},
> # Add the {{QUEUE_LENGTH_THEN_RESOURCES}} {{LoadComparator}},
> # Implement the new sorting logic and the logic to determine whether a node 
> can queue Opportunistic containers in case the 
> {{QUEUE_LENGTH_THEN_RESOURCES}} policy is chosen, and
> # Modify {{NodeQueueLoadMonitor.selectAnyNode}} to be request-resource-aware.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343
 ] 

Andras Gyori edited comment on YARN-10965 at 12/8/21, 3:56 PM:
---

As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look on this and maybe on the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] [~adam.antal] 


was (Author: gandras):
As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look on this and maybe on the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] 

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in YARN-11000).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10965) Centralize queue resource calculation based on CapacityVectors

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455343#comment-17455343
 ] 

Andras Gyori commented on YARN-10965:
-

As it is a crucial part of CapacityScheduler, it would be helpful if a few 
community members took a look on this and maybe on the design doc as well.
cc. [~jbrennan] [~epayne] [~sunilg] [~zhuqi] [~BilwaST] 

> Centralize queue resource calculation based on CapacityVectors
> --
>
> Key: YARN-10965
> URL: https://issues.apache.org/jira/browse/YARN-10965
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> With the introduction of YARN-10930 it is possible to unify queue resource 
> calculation. In order to narrow down the scope of this patch, the base system 
> is implemented here, without refactoring the existing resource calculation in 
> updateClusterResource (which will be done in YARN-11000).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11038) Fix testQueueSubmitWithACL* tests in TestAppManager

2021-12-08 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11038:
--
Fix Version/s: 3.4.0

> Fix testQueueSubmitWithACL* tests in TestAppManager
> ---
>
> Key: YARN-11038
> URL: https://issues.apache.org/jira/browse/YARN-11038
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> These two tests don't test anything:
>  - testQueueSubmitWithACLsEnabledWithQueueMapping
>  - testQueueSubmitWithACLsEnabledWithQueueMappingForAutoCreatedQueue
>  
> Issues:
>  - no assert when the expected exception is not thrown (see the sketch below)
>  - the configuration isn't even loaded (csConf is not used for the mockRM)
>  - the placement manager did not even match the test scenario
>  - the successful submit was not asserted either
>  
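As a generic illustration of the first "Issues" point (placeholder names, not 
the real TestAppManager code): fail explicitly when the expected rejection 
never happens, so the test cannot pass silently.
{code:java}
import static org.junit.Assert.fail;

import org.apache.hadoop.yarn.exceptions.YarnException;

public class AclAssertionSketch {
  // Placeholder for whatever performs the submission in the real test.
  interface Submitter {
    void submit(String queue, String user) throws YarnException;
  }

  static void assertSubmissionRejected(Submitter submitter, String queue,
      String user) {
    try {
      submitter.submit(queue, user);
      fail("Submission to " + queue + " by " + user
          + " should have been rejected by the ACLs");
    } catch (YarnException expected) {
      // expected: the ACL check rejected the submission
    }
  }
}
{code}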



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455323#comment-17455323
 ] 

Andras Gyori commented on YARN-10178:
-

We have faced the same issue in a production cluster recently. I agree with 
[~epayne] that this should be resolved as soon as possible. My feedback on the 
patch:
 * As this is a subtle concurrency issue, I have not been able to reproduce it 
yet, but I was wondering whether we could avoid creating the snapshot 
altogether by modifying the original comparator to acquire the necessary 
values immediately, thus hopefully eliminating the possibility of violating the 
sort's requirements. This would look like the following:

{code:java}
float q1AbsCapacity = q1.getQueueCapacities().getAbsoluteCapacity(p);
float q2AbsCapacity = q2.getQueueCapacities().getAbsoluteCapacity(p);
float q1AbsUsedCapacity = q1.getQueueCapacities().getAbsoluteUsedCapacity(p);
float q2AbsUsedCapacity = q2.getQueueCapacities().getAbsoluteUsedCapacity(p); 
float q1UsedCapacity = q1.getQueueCapacities().getUsedCapacity(p);
float q2UsedCapacity = q2.getQueueCapacities().getUsedCapacity(p); 
...
{code}
 

 * We should not use the Stream API because of older branches. I suggest 
rewriting getAssignmentIterator:
{code:java}
@Override
public Iterator<CSQueue> getAssignmentIterator(String partition) {
  // partitionToLookAt is a thread-local variable, and we copy and sort the
  // queues every time, so this is safe in a multi-threaded environment.
  PriorityUtilizationQueueOrderingPolicy.partitionToLookAt.set(partition);

  // Sort snapshots instead of the queues directly, due to race conditions.
  // See YARN-10178 for more information.
  List<QueueSnapshot> queueSnapshots = new ArrayList<>();
  for (CSQueue queue : queues) {
    queueSnapshots.add(new QueueSnapshot(queue));
  }
  queueSnapshots.sort(new PriorityQueueComparator());

  List<CSQueue> sortedQueues = new ArrayList<>();
  for (QueueSnapshot queueSnapshot : queueSnapshots) {
    sortedQueues.add(queueSnapshot.queue);
  }

  return sortedQueues.iterator();
}
{code}

 * We do not need to keep the old logic
 * Measuring performance is a delicate procedure. Including it in a unit test 
is incredibly volatile (on my local machine, for example, I have not been able 
to pass the test), especially when naive time measurement is involved. Not sure 
if we can easily reproduce it, but I think in this case no test is better than 
a potentially intermittent test.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> 

[jira] [Resolved] (YARN-11031) Improve the maintainability of RM webapp tests like TestRMWebServicesCapacitySched

2021-12-08 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth resolved YARN-11031.
---
Hadoop Flags: Reviewed
  Resolution: Fixed

> Improve the maintainability of RM webapp tests like 
> TestRMWebServicesCapacitySched
> --
>
> Key: YARN-11031
> URL: https://issues.apache.org/jira/browse/YARN-11031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's hard to maintain the asserts in TestRMWebServicesCapacitySched, 
> TestRMWebServicesCapacitySchedDynamicConfig test classes when the scheduler 
> response is modified. Currently only a subset of the scheduler response is 
> asserted in these tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11031) Improve the maintainability of RM webapp tests like TestRMWebServicesCapacitySched

2021-12-08 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11031:
--
Fix Version/s: 3.4.0

> Improve the maintainability of RM webapp tests like 
> TestRMWebServicesCapacitySched
> --
>
> Key: YARN-11031
> URL: https://issues.apache.org/jira/browse/YARN-11031
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's hard to maintain the asserts in TestRMWebServicesCapacitySched, 
> TestRMWebServicesCapacitySchedDynamicConfig test classes when the scheduler 
> response is modified. Currently only a subset of the scheduler response is 
> asserted in these tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11040) Error in build log: hadoop-resourceestimator has missing dependencies: jsonschema2pojo-core-1.0.2.jar

2021-12-08 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11040:
--

 Summary: Error in build log: hadoop-resourceestimator has missing 
dependencies: jsonschema2pojo-core-1.0.2.jar
 Key: YARN-11040
 URL: https://issues.apache.org/jira/browse/YARN-11040
 Project: Hadoop YARN
  Issue Type: Bug
  Components: build, yarn
Affects Versions: 3.4.0
Reporter: Tamas Domok
Assignee: Tamas Domok


There is an error in the build log about a missing dependency during the package build.

Reproduction:

mvn clean package -Pyarn-ui -Pdist -Dtar -Dmaven.javadoc.skip=true -DskipTests 
-DskipShade 2>&1 | grep jsonschema2pojo
{code:java}
[INFO] --- jsonschema2pojo-maven-plugin:1.1.1:generate (default) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Copying jsonschema2pojo-core-1.0.2.jar to 
/Users/tdomok/Work/hadoop/hadoop-tools/hadoop-federation-balance/target/lib/jsonschema2pojo-core-1.0.2.jar
[INFO] org.jsonschema2pojo:jsonschema2pojo-core:jar:1.0.2 already exists in 
destination.
[INFO] Copying jsonschema2pojo-core-1.0.2.jar to 
/Users/tdomok/Work/hadoop/hadoop-tools/hadoop-aws/target/lib/jsonschema2pojo-core-1.0.2.jar
ERROR: hadoop-resourceestimator has missing dependencies: 
jsonschema2pojo-core-1.0.2.jar {code}

The build is successful but there is this error in the build log:
{code:java}
ERROR: hadoop-resourceestimator has missing dependencies: 
jsonschema2pojo-core-1.0.2.jar  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455224#comment-17455224
 ] 

Hadoop QA commented on YARN-11020:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 12m 
43s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:blue}0{color} | {color:blue} jshint {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} jshint was not available. 
{color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
|| || || || {color:brown} branch-3.3 Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 32m 
 9s{color} | {color:green}{color} | {color:green} branch-3.3 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
48m 28s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
15s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 48s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} || ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green}{color} | {color:green} The patch does not generate 
ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 78m 33s{color} | 
{color:black}{color} | {color:black}{color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1258/artifact/out/Dockerfile
 |
| JIRA Issue | YARN-11020 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13037122/YARN-11020-branch-3.3.001.patch
 |
| Optional Tests | dupname asflicense shadedclient jshint |
| uname | Linux 783477a68210 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 
17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | branch-3.3 / 1ee661d7da4 |
| Max. process+thread count | 636 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1258/console |
| versions | git=2.17.1 maven=3.6.0 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2 for an application under the Logs tab, No container data available 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>               

[jira] [Updated] (YARN-11034) Add enhanced headroom in AllocateResponse

2021-12-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11034:
--
Labels: pull-request-available  (was: )

> Add enhanced headroom in AllocateResponse
> -
>
> Key: YARN-11034
> URL: https://issues.apache.org/jira/browse/YARN-11034
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add enhanced headroom in allocate response. This provides a channel for RMs 
> to return load information for AMRMProxy and decision making when rerouting 
> resource requests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori reopened YARN-11020:
-

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2 for an application under the Logs tab, No container data available 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We can not change the response of the endpoint due to backward compatibility, 
> therefore we need to make UI2 be able to handle both scenarios.
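Illustrative only (the real fix lives in the UI2 JavaScript code, and the class 
and method names below are made up for this sketch): one way to normalize a 
field that arrives either as a single object or as an array, shown here in Java 
with Jackson.
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.List;

public class ContainerLogsNormalizer {
  // Returns one entry per containerLogsInfo element, regardless of whether the
  // server sent a single object (single AM container) or an array.
  public static List<JsonNode> normalize(String json) throws Exception {
    JsonNode node = new ObjectMapper().readTree(json).get("containerLogsInfo");
    List<JsonNode> result = new ArrayList<>();
    if (node == null) {
      return result;               // no containers at all
    }
    if (node.isArray()) {
      node.forEach(result::add);   // multi-container shape
    } else {
      result.add(node);            // single-container shape
    }
    return result;
  }
}
{code}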



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Attachment: (was: YARN-11020-branch-3.3.001.patch)

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2 for an application under the Logs tab, No container data available 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We can not change the response of the endpoint due to backward compatibility, 
> therefore we need to make UI2 be able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455174#comment-17455174
 ] 

Andras Gyori commented on YARN-11020:
-

The container log fetching is missing from branch-3.2, so I only backported it to 
branch-3.3.

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2 for an application under the Logs tab, No container data available 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We can not change the response of the endpoint due to backward compatibility, 
> therefore we need to make UI2 be able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler

2021-12-08 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455173#comment-17455173
 ] 

Szilard Nemeth commented on YARN-11023:
---

Thanks [~tdomok] for the confirmation. Resolving this jira then.

> Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
> -
>
> Key: YARN-11023
> URL: https://issues.apache.org/jira/browse/YARN-11023
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN-10891 extended the QueueInfo with the maxParallelApps property, but for 
> the root queue this property is missing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler

2021-12-08 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455173#comment-17455173
 ] 

Szilard Nemeth edited comment on YARN-11023 at 12/8/21, 11:20 AM:
--

Thanks [~tdomok] for the confirmation.


was (Author: snemeth):
Thanks [~tdomok] for the confirmation. Resolving this jira then.

> Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
> -
>
> Key: YARN-11023
> URL: https://issues.apache.org/jira/browse/YARN-11023
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN-10891 extended the QueueInfo with the maxParallelApps property, but for 
> the root queue this property is missing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11020) [UI2] No container is found for an application attempt with a single AM container

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-11020:

Attachment: YARN-11020-branch-3.3.001.patch

> [UI2] No container is found for an application attempt with a single AM 
> container
> -
>
> Key: YARN-11020
> URL: https://issues.apache.org/jira/browse/YARN-11020
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11020-branch-3.3.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In UI2 for an application under the Logs tab, No container data available 
> message is shown if the application attempt only submitted a single container 
> (which is the AM container). 
> The culprit of the issue is that the response from YARN is not consistent, 
> because for a single container it looks like:
> {noformat}
> {
>     "containerLogsInfo": {
>         "containerLogInfo": [
>             {
>                 "fileName": "prelaunch.out",
>                 "fileSize": "100",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "directory.info",
>                 "fileSize": "2296",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stderr",
>                 "fileSize": "1722",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "prelaunch.err",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "stdout",
>                 "fileSize": "0",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             },
>             {
>                 "fileName": "syslog",
>                 "fileSize": "38551",
>                 "lastModifiedTime": "Mon Nov 29 09:28:28 + 2021"
>             },
>             {
>                 "fileName": "launch_container.sh",
>                 "fileSize": "5013",
>                 "lastModifiedTime": "Mon Nov 29 09:28:16 + 2021"
>             }
>         ],
>         "logAggregationType": "AGGREGATED",
>         "containerId": "container_1638174027957_0008_01_01",
>         "nodeId": "da175178c179:43977"
>     }
> }{noformat}
> As for applications with multiple containers it looks like:
> {noformat}
> {
>     "containerLogsInfo": [{
>         
>     }, {  }]
> }{noformat}
> We can not change the response of the endpoint due to backward compatibility, 
> therefore we need to make UI2 be able to handle both scenarios.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11016) Queue weight is incorrectly reset to zero

2021-12-08 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455172#comment-17455172
 ] 

Szilard Nemeth commented on YARN-11016:
---

Thanks [~gandras] for the confirmation. Resolving this jira then.

> Queue weight is incorrectly reset to zero
> -
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields set WEIGHT capacity to 0, which could 
> cause problems like in the following scenario:
> 1. Initializing queues
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node 
> labels are inherited, its children, for example 'child', have the 'test' label 
> as their accessible-node-label.
> 3. In LeafQueue#updateClusterResource, we call 
> LeafQueue#activateApplications, which then calls 
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see 
> getNodeLabelsForQueue). 
> In this case, the labels are the accessible node labels (the inherited 
> 'test). 
> During this event, the ResourceUsage object is updated for the label 'test', 
> thus extending its nodeLabelsSet with 'test'.
> 4. In a subsequent updateClusterResource call, for example an addNode event, 
> we now have the 'test' label in ResourceUsage even though it was never 
> explicitly configured, and we call CSQueueUtils#updateQueueStatistics, which 
> takes the union of the node labels from QueueCapacities and ResourceUsage 
> (this union is now the empty default label AND 'test') and updates 
> QueueCapacities with the label 'test'. 
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (like an update from mutation API), the 
> CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the 
> QueueCapacities values to zero (even weight, which is wrong in my opinion) 
> and loads the values again from the config. 
> The problem here is that values are reset for all node labels in 
> QueueCapacities (even for 'test'), but we only load the values for the 
> configured node labels (which we did not set, so it is defaulted to the empty 
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities 
> and that is why the update fails. 
> It even explains why validation passes: the validation endpoint instantiates a 
> brand new CapacityScheduler, for which this cascade of effects cannot 
> accumulate (as there are no repeated updateClusterResource calls).
> This scenario manifests as an error when updating via mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used 
> mixed of weight mode, percentage and absolute mode, it is not allowed, please 
> double check, details:{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11039) LogAggregationFileControllerFactory::getFileControllerForRead should close FS

2021-12-08 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created YARN-11039:
---

 Summary: 
LogAggregationFileControllerFactory::getFileControllerForRead should close FS 
 Key: YARN-11039
 URL: https://issues.apache.org/jira/browse/YARN-11039
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Reporter: Rajesh Balamohan


LogAggregationFileControllerFactory::getFileControllerForRead internally opens a 
new FS object every time, and it is never closed.

When cloud connectors (e.g. s3a) are used along with Knox, a KnoxTokenMonitor is 
leaked for every unclosed FS object, causing thread leaks in the NM.

Lines of interest:

[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/LogAggregationFileControllerFactory.java#L167]
{noformat}
   try {
  Path remoteAppLogDir = fileController.getOlderRemoteAppLogDir(appId,
  appOwner);
  if (LogAggregationUtils.getNodeFiles(conf, remoteAppLogDir, appId,
  appOwner).hasNext()) {
return fileController;
  }
} catch (Exception ex) {
  diagnosticsMsg.append(ex.getMessage() + "\n");
  continue;
}
{noformat}
[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java#L252]
{noformat}

  public static RemoteIterator getNodeFiles(Configuration conf,
  Path remoteAppLogDir, ApplicationId appId, String appOwner)
  throws IOException {
Path qualifiedLogDir =
FileContext.getFileContext(conf).makeQualified(remoteAppLogDir);
return FileContext.getFileContext(
qualifiedLogDir.toUri(), conf).listStatus(remoteAppLogDir);
  }
{noformat}
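Given the two snippets above, one possible direction, sketched purely for 
illustration and not the actual YARN code path (process() is a hypothetical 
placeholder): make the FileSystem lifetime explicit so the underlying connector 
resources are released once the listing is done.
{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class NodeFilesListing {
  static void listNodeFiles(Configuration conf, Path remoteAppLogDir)
      throws IOException {
    URI uri = remoteAppLogDir.toUri();
    // try-with-resources closes the FS, so per-FS helpers (e.g. Knox token
    // monitors in cloud connectors) are shut down instead of leaking.
    try (FileSystem fs = FileSystem.newInstance(uri, conf)) {
      RemoteIterator<FileStatus> files = fs.listStatusIterator(remoteAppLogDir);
      while (files.hasNext()) {
        process(files.next()); // hypothetical placeholder for the caller's logic
      }
    }
  }

  static void process(FileStatus status) {
    System.out.println(status.getPath());
  }
}
{code}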



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10850) TimelineService v2 lists containers for all attempts when filtering for one

2021-12-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10850:
--
Labels: pull-request-available  (was: )

> TimelineService v2 lists containers for all attempts when filtering for one
> ---
>
> Key: YARN-10850
> URL: https://issues.apache.org/jira/browse/YARN-10850
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelinereader
>Reporter: Benjamin Teke
>Assignee: Tibor Kovács
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using the command
> {code:java}
> yarn container -list 
> {code}
> with an application attempt ID based on the help only the containers for that 
> attempt should be listed.
> {code:java}
> -list List containers for application
>   attempt when application
>   attempt ID is provided. When
>   application name is provided,
>   then it finds the instances of
>   the application based on app's
>   own implementation, and
>   -appTypes option must be
>   specified unless it is the
>   default yarn-service type. With
>   app name, it supports optional
>   use of -version to filter
>   instances based on app version,
>   -components to filter instances
>   based on component names,
>   -states to filter instances
>   based on instance state.
> {code}
> When TimelineService v2 is enabled all of the containers for the application 
> are returned. 
> {code:java}
> hrt_qa@ctr-e172-1620330694487-146061-01-02:/hwqe/hadoopqe$ yarn 
> applicationattempt -list application_1625124233002_0007
> 21/07/01 09:32:23 INFO impl.TimelineReaderClientImpl: Initialized 
> TimelineReader 
> URI=http://ctr-e172-1620330694487-146061-01-04.hwx.site:8198/ws/v2/timeline/,
>  clusterId=yarn-cluster
> 21/07/01 09:32:24 INFO client.AHSProxy: Connecting to Application History 
> server at ctr-e172-1620330694487-146061-01-04.hwx.site/172.27.113.4:10200
> 21/07/01 09:32:24 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Total number of application attempts :2
>  ApplicationAttempt-Id   State
> AM-Container-IdTracking-URL
> appattempt_1625124233002_0007_01FAILED
> container_e43_1625124233002_0007_01_01  
> http://ctr-e172-1620330694487-146061-01-03.hwx.site:8088/proxy/application_1625124233002_0007/
> appattempt_1625124233002_0007_02KILLED
> container_e43_1625124233002_0007_02_01  
> http://ctr-e172-1620330694487-146061-01-03.hwx.site:8088/proxy/application_1625124233002_0007/
> {code}
> Querying the 2 app attempts produces the same output:
> {code:java}
> hrt_qa@ctr-e172-1620330694487-146061-01-02:/hwqe/hadoopqe$ yarn container 
> -list appattempt_1625124233002_0007_01
> 21/07/01 09:32:35 INFO impl.TimelineReaderClientImpl: Initialized 
> TimelineReader 
> URI=http://ctr-e172-1620330694487-146061-01-04.hwx.site:8198/ws/v2/timeline/,
>  clusterId=yarn-cluster
> 21/07/01 09:32:35 INFO client.AHSProxy: Connecting to Application History 
> server at ctr-e172-1620330694487-146061-01-04.hwx.site/172.27.113.4:10200
> 21/07/01 09:32:35 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 21/07/01 09:32:36 INFO conf.Configuration: found resource resource-types.xml 
> at file:/etc/hadoop/7.1.7.0-504/0/resource-types.xml
> Total number of containers :12
>   Container-Id  Start Time Finish 
> Time   StateHost   Node Http Address  
>   LOG-URL
> container_e43_1625124233002_0007_02_04 N/A
>  N/ACOMPLETE
> ctr-e172-1620330694487-146061-01-02.hwx.site:25454  
> ctr-e172-1620330694487-146061-01-02.hwx.site:8042   
> 

[jira] [Resolved] (YARN-11023) Extend the root QueueInfo with max-parallel-apps in CapacityScheduler

2021-12-08 Thread Tamas Domok (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamas Domok resolved YARN-11023.

Resolution: Fixed

Hi [~snemeth],

There is no need for the backport; this feature was not backported to 
branch-3.3 or branch-3.2.

> Extend the root QueueInfo with max-parallel-apps in CapacityScheduler
> -
>
> Key: YARN-11023
> URL: https://issues.apache.org/jira/browse/YARN-11023
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.4.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN-10891 extended the QueueInfo with the maxParallelApps property, but for 
> the root queue this property is missing.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10888) [Umbrella] New capacity modes for CS

2021-12-08 Thread Andras Gyori (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Gyori updated YARN-10888:

Attachment: (was: capacity_scheduler_queue_capacity.html)

> [Umbrella] New capacity modes for CS
> 
>
> Key: YARN-10888
> URL: https://issues.apache.org/jira/browse/YARN-10888
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: capacity_scheduler_queue_capacity.pdf
>
>
> *Investigate how resource allocation configuration could be more consistent 
> in CapacityScheduler*
> It would be nice if every place where a capacity can be defined allowed it to 
> be defined the same way:
>  * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU)
>  * With percentages
>  ** Percentage of all resources (eg 10% of all memory, vcore, GPU)
>  ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU)
>  * Allow mixing different modes under one hierarchy but not under the same 
> parent queues.
> We need to determine all configuration options where capacities can be 
> defined, and see if it is possible to extend the configuration, or if it 
> makes sense in that case.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org