[ 
https://issues.apache.org/jira/browse/YARN-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-11785:
----------------------------
    Description: 
Below is the error stack trace:
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-12, that exited unexpectedly: org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=xxx already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:286)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:529)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1490)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1146)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1697)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1632)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1787)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1536)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:606)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:653)
 {code}
 

Steps to reproduce:
 1. An RPC handler thread executes the refreshQueue command and adds a new
entry to the QUEUE_METRICS map.
 2. Meanwhile, the async-scheduling thread fails to retrieve an existing
PartitionQueueMetrics from QueueMetrics#QUEUE_METRICS (the lookup returns
null) and then attempts to re-register the same queue name. This triggers a
MetricsException (metrics source "already exists", as in the stack trace
above) and causes the async-scheduling thread to exit unexpectedly.
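
The failure mode in step 2 can be illustrated with a minimal, deterministic sketch. It does not use the Hadoop classes: Registry, getOrRegister and staleMissThrows are hypothetical stand-ins, and the stale read is simulated with a separate empty map, because the real race cannot be triggered on demand.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DuplicateRegistrationSketch {
    // Hypothetical stand-in for a metrics system that rejects duplicate source names.
    static final class Registry {
        private final Set<String> names = new HashSet<>();

        synchronized void register(String name) {
            if (!names.add(name)) {
                throw new IllegalStateException("Metrics source " + name + " already exists!");
            }
        }
    }

    // Check-then-act lookup: a cache miss leads straight to a registration attempt.
    static Object getOrRegister(Map<String, Object> cache, Registry registry, String name) {
        Object metrics = cache.get(name);
        if (metrics == null) {
            registry.register(name);
            metrics = new Object();
            cache.put(name, metrics);
        }
        return metrics;
    }

    // Returns true if a lookup that misses an already registered name fails.
    static boolean staleMissThrows() {
        Registry registry = new Registry();
        String name = "q0=root,q1=xxx";
        // First caller registers the name and caches the metrics object.
        getOrRegister(new HashMap<>(), registry, name);
        // Assumption: an empty map stands in for a reader whose get() returned null.
        Map<String, Object> staleView = new HashMap<>();
        try {
            getOrRegister(staleView, registry, name);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("stale miss throws: " + staleMissThrows());
    }
}
```

The point of the sketch is that the registry and the cache are two separate sources of truth: once a lookup misses an entry that the registry already knows, the second registration is rejected.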

The root cause is that the QUEUE_METRICS field in QueueMetrics is a HashMap,
which is not thread-safe, yet it is accessed concurrently: as the steps above
show, both the async-scheduling thread and RPC handler threads use it.
Concurrent put and get operations on a HashMap can cause visibility issues, so
a get may miss an entry that already exists. This can be fixed by making
QUEUE_METRICS a ConcurrentHashMap to ensure thread-safe access.
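
A minimal sketch of the proposed direction, under stated assumptions: it is not the actual patch and does not use the Hadoop classes. ConcurrentMetricsCacheSketch and its members are hypothetical, and the use of computeIfAbsent to make lookup and registration a single atomic step is an assumption of this sketch, not something stated in this issue.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ConcurrentMetricsCacheSketch {
    // Hypothetical stand-in for the metrics system's set of registered source names.
    private final Set<String> registeredNames = ConcurrentHashMap.newKeySet();
    // Thread-safe replacement for the plain HashMap cache.
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final AtomicInteger registrations = new AtomicInteger();

    // computeIfAbsent runs the mapping function at most once per absent key.
    Object getOrRegister(String name) {
        return cache.computeIfAbsent(name, key -> {
            if (!registeredNames.add(key)) {
                throw new IllegalStateException("Metrics source " + key + " already exists!");
            }
            registrations.incrementAndGet();
            return new Object();
        });
    }

    // Starts the given number of threads on the same name and returns the number
    // of registrations performed, or -1 if any thread failed or timed out.
    static int registrationsUnderContention(int threads) {
        ConcurrentMetricsCacheSketch sketch = new ConcurrentMetricsCacheSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all threads at the same moment
                    sketch.getOrRegister("q0=root,q1=xxx");
                } catch (Exception e) {
                    failures.incrementAndGet();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                return -1;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return failures.get() == 0 ? sketch.registrations.get() : -1;
    }

    public static void main(String[] args) {
        System.out.println("registrations: " + registrationsUnderContention(8));
    }
}
```

Note that switching the map type alone addresses the visibility problem for get and put. A separate get followed by put would still be a check-then-act sequence, which is why this sketch folds both into one atomic call.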


> Race condition in QueueMetrics due to non-thread-safe HashMap causes 
> MetricsException
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-11785
>                 URL: https://issues.apache.org/jira/browse/YARN-11785
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.4, 3.3.6, 3.4.1
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major


