[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10178:
----------------------------------
    Description: 
Stack trace:
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)

{code}
In JDK 8, Arrays.sort uses the TimSort algorithm by default for object arrays, 
and TimSort requires the comparison to satisfy a few requirements:
{code:java}
1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
2. x > y, y > z --> x > z
3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
{code}
If the comparison used to sort the Array / List violates any of these 
requirements, TimSort may throw a java.lang.IllegalArgumentException.
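
The three requirements above can be checked by brute force over a small sample. The following is a minimal sketch, not code from Hadoop: the class name ComparatorContractCheck, the method satisfiesContract and the sample values are all hypothetical and only illustrate what "violating the contract" means for a comparator.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorContractCheck {

  /** Returns true if cmp obeys the three rules for every combination of the samples. */
  public static <T> boolean satisfiesContract(Comparator<T> cmp, List<T> samples) {
    for (T x : samples) {
      for (T y : samples) {
        // Rule 1: swapping the arguments must flip the sign.
        if (Integer.signum(cmp.compare(x, y)) != -Integer.signum(cmp.compare(y, x))) {
          return false;
        }
        for (T z : samples) {
          // Rule 2: transitivity.
          if (cmp.compare(x, y) > 0 && cmp.compare(y, z) > 0 && cmp.compare(x, z) <= 0) {
            return false;
          }
          // Rule 3: elements that compare as equal must order identically against z.
          if (cmp.compare(x, y) == 0
              && Integer.signum(cmp.compare(x, z)) != Integer.signum(cmp.compare(y, z))) {
            return false;
          }
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Hypothetical usage values; any small sample works.
    List<Float> usage = Arrays.asList(0.1f, 0.5f, 0.5f, 0.9f);
    Comparator<Float> good = Float::compare;
    // Broken on purpose: never returns 0, so rule 1 fails for equal values.
    Comparator<Float> broken = (a, b) -> a < b ? -1 : 1;
    System.out.println(satisfiesContract(good, usage));   // prints true
    System.out.println(satisfiesContract(broken, usage)); // prints false
  }
}
```

A check like this only covers the sampled values, and it cannot detect the concurrency problem described in this issue, where a correct comparator reads values that change while the sort is running.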

 

1. If we take a look into the PriorityUtilizationQueueOrderingPolicy.compare 
method, we can see that Capacity Scheduler uses these queue fields in order to 
compare resource usage:
{code:java}
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
{code}
 

2. In CS, this IllegalArgumentException is thrown during the execution of 
AsyncScheduleThread, while PriorityUtilizationQueueOrderingPolicy is sorting 
the queues to choose the queue to assign the container to.

 

3. If we take a look into ResourceCommitterService, it tries to commit a 
CSAssignment coming from the ResourceCommitRequest. In the tryCommit function, 
the queue resource usage is updated:
{code:java}
public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
    boolean updatePending) {
  long commitStart = System.nanoTime();
  ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
      (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
 
  ...
  boolean isSuccess = false;
  if (attemptId != null) {
    FiCaSchedulerApp app = getApplicationAttempt(attemptId);
    // Required sanity check for attemptId - when async-scheduling enabled,
    // proposal might be outdated if AM failover just finished
    // and proposal queue was not be consumed in time
    if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
      if (app.accept(cluster, request, updatePending)
          && app.apply(cluster, request, updatePending)) { // apply this resource
        ...
      }
    }
  }
  return isSuccess;
}
{code}
{code:java}
public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
    FiCaSchedulerNode> request, boolean updatePending) {
...
    if (!reReservation) {
        getCSLeafQueue().apply(cluster, request); 
    }
...
}
{code}
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply 
invokes 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource:
{code:java}
void allocateResource(Resource clusterResource,
    Resource resource, String nodePartition) {
  try {
    writeLock.lock(); // only the leaf queue's lock is taken
    queueUsage.incUsed(nodePartition, resource);
 
    ++numContainers;
 
    CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
        this, labelManager, nodePartition); // queue statistics are updated here
  } finally {
    writeLock.unlock();
  }
}
{code}
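
The writeLock taken above belongs to the leaf queue only. As a minimal sketch, assuming every queue owns its own ReentrantReadWriteLock (the class SeparateLocksDemo and the variables parentLock and leafLock are hypothetical, not Hadoop code), the following shows that a read lock held on one queue does not stop a writer that locks a different queue:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SeparateLocksDemo {

  /** Holds the parent's read lock and reports whether the leaf's write lock is still available. */
  public static boolean leafWritableWhileParentReadLocked() {
    ReentrantReadWriteLock parentLock = new ReentrantReadWriteLock(); // hypothetical parent queue lock
    ReentrantReadWriteLock leafLock = new ReentrantReadWriteLock();   // hypothetical leaf queue lock
    parentLock.readLock().lock();
    try {
      boolean acquired = leafLock.writeLock().tryLock();
      if (acquired) {
        leafLock.writeLock().unlock();
      }
      return acquired;
    } finally {
      parentLock.readLock().unlock();
    }
  }

  /** For contrast: with a single shared lock, a held read lock blocks the write lock. */
  public static boolean writableWhileSameLockReadLocked() {
    ReentrantReadWriteLock sharedLock = new ReentrantReadWriteLock();
    sharedLock.readLock().lock();
    try {
      boolean acquired = sharedLock.writeLock().tryLock();
      if (acquired) {
        sharedLock.writeLock().unlock();
      }
      return acquired;
    } finally {
      sharedLock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    System.out.println(leafWritableWhileParentReadLocked()); // prints true
    System.out.println(writableWhileSameLockReadLocked());   // prints false
  }
}
```

Both methods run on a single thread to keep the result deterministic. In the real scheduler the reader and the writer are different threads.
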
5. We can see that ResourceCommitterService only locks the Leaf Queue to 
update the queue statistics, whereas the AsyncScheduleThread only locks the Root 
Queue (in ParentQueue#sortAndGetChildrenAllocationIterator):
{code:java}
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
      String partition) {
    try {
      readLock.lock();
      return queueOrderingPolicy.getAssignmentIterator(partition);
    } finally {
      readLock.unlock();
    }
  }
{code}
So if one or more threads are comparing queue usage statistics while 
ResourceCommitterService concurrently updates the Leaf Queue statistics, the 
comparison results can change in the middle of a sort. This breaks the TimSort 
algorithm's requirements and causes the thread crash.
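
The effect can be reproduced deterministically on a single thread by changing a value between two comparisons. The sketch below is illustrative only: LiveQueue, QueueSnapshot and all method names are hypothetical, and sorting on an immutable snapshot is shown as one possible mitigation, not necessarily what the attached patches do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SnapshotSortSketch {

  /** Hypothetical stand-in for a queue whose usage another thread may update. */
  public static final class LiveQueue {
    final String name;
    volatile float usedCapacity;

    public LiveQueue(String name, float usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }

  /** Immutable copy of the values, taken once before sorting. */
  static final class QueueSnapshot {
    final String name;
    final float usedCapacity;

    QueueSnapshot(LiveQueue queue) {
      this.name = queue.name;
      this.usedCapacity = queue.usedCapacity;
    }
  }

  static final Comparator<LiveQueue> LIVE =
      (a, b) -> Float.compare(a.usedCapacity, b.usedCapacity);
  static final Comparator<QueueSnapshot> SNAPSHOT =
      (a, b) -> Float.compare(a.usedCapacity, b.usedCapacity);

  /** Simulates an update landing between two comparisons; true means rule 1 was broken. */
  public static boolean liveComparatorBreaksAntisymmetry() {
    LiveQueue a = new LiveQueue("a", 0.2f);
    LiveQueue b = new LiveQueue("b", 0.5f);
    int first = LIVE.compare(a, b);  // a is less used than b
    a.usedCapacity = 0.8f;           // what a committer thread could do mid-sort
    int second = LIVE.compare(b, a); // now b is less used than a
    return Integer.signum(first) != -Integer.signum(second);
  }

  /** Sorts copies of the values, so later updates cannot affect the running sort. */
  public static List<String> sortBySnapshot(List<LiveQueue> queues) {
    List<QueueSnapshot> snapshots = new ArrayList<>();
    for (LiveQueue queue : queues) {
      snapshots.add(new QueueSnapshot(queue));
    }
    snapshots.sort(SNAPSHOT);
    List<String> names = new ArrayList<>();
    for (QueueSnapshot snapshot : snapshots) {
      names.add(snapshot.name);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(liveComparatorBreaksAntisymmetry()); // prints true
    List<LiveQueue> queues = Arrays.asList(
        new LiveQueue("c", 0.9f), new LiveQueue("a", 0.2f), new LiveQueue("b", 0.5f));
    System.out.println(sortBySnapshot(queues)); // prints [a, b, c]
  }
}
```

The snapshot may already be slightly stale when the sort finishes, but every comparison within one sort sees the same values, which is all the comparison contract requires.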

  was:
Stack trace:
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
Comparison method violates its general contract!                                
                                     at 
java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)

{code}
In JDK 8, Arrays.sort by default is using the timsort algorithm, and timsort 
has a few requirements:
{code:java}
1.x.compareTo(y) != y.compareTo(x)
2.x>y,y>z --> x > z
3.x=y, x.compareTo(z) == y.compareTo(z)
{code}
If the Array / List does not satisfy any of these requirements, TimSort will 
throw a java.lang.IllegalArgumentException.

 

1. If we take a look into PriorityUtilizationQueueOrderingPolicy.compare 
method, we can see that Capacity Scheduler these queue fields in order to 
compare resource usage:
{code:java}
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
{code}
 

2. In CS, during the execution of AsyncScheduleThread while the queues are 
being sorted in PriorityUtilizationQueueOrderingPolicy, for choosing the queue 
to assign the container to this IllegalArgumentException is thrown.

 

3. If we take a look into the ResourceCommitterService method, it tries to 
commit a CSAssignment coming from the ResourceCommitRequest, look tryCommit 
function, the queue resource usage is being updated.
{code:java}
public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
    boolean updatePending) {
  long commitStart = System.nanoTime();
  ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
      (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
 
  ...
  boolean isSuccess = false;
  if (attemptId != null) {
    FiCaSchedulerApp app = getApplicationAttempt(attemptId);
    // Required sanity check for attemptId - when async-scheduling enabled,
    // proposal might be outdated if AM failover just finished
    // and proposal queue was not be consumed in time
    if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
      if (app.accept(cluster, request, updatePending)
          && app.apply(cluster, request, updatePending)) { // apply this 
resource
        ...
        }
    }
  }
  return isSuccess;
}
}
{code}
{code:java}
public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
    FiCaSchedulerNode> request, boolean updatePending) {
...
    if (!reReservation) {
        getCSLeafQueue().apply(cluster, request); 
    }
...
}
{code}
4. 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply
 invokes 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource:
{code:java}
void allocateResource(Resource clusterResource,
    Resource resource, String nodePartition) {
  try {
    writeLock.lock(); // only lock leaf queue lock
    queueUsage.incUsed(nodePartition, resource);
 
    ++numContainers;
 
    CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
        this, labelManager, nodePartition); // there will update queue 
statistics
  } finally {
    writeLock.unlock();
  }
}
{code}
5. We can see that ResourceCommitterService will only lock the Leaf Queue to 
update the queue statistics, but the AsyncScheduleThread do only lock the Root 
Queue (in ParentQueue#sortAndGetChildrenAllocationIterator)
{code:java}
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
      String partition) {
    try {
      readLock.lock();
      return queueOrderingPolicy.getAssignmentIterator(partition);
    } finally {
      readLock.unlock();
    }
  }
{code}
so if multi threads are comparing queue usage statistics and 
ResourceCommitterService applies Leaf Queue changes in statistics in a 
concurrent manner, it will break the TimSort algorithm's requirements, causing 
a thread crash.


> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-10178
>                 URL: https://issues.apache.org/jira/browse/YARN-10178
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.2.1
>            Reporter: tuyu
>            Assignee: Andras Gyori
>            Priority: Major
>             Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4
>
>         Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch, 
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch


