[ https://issues.apache.org/jira/browse/YARN-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan resolved YARN-11785.
-------------------------------
    Fix Version/s: 3.5.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> Race condition in QueueMetrics due to non-thread-safe HashMap causes MetricsException
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-11785
>                 URL: https://issues.apache.org/jira/browse/YARN-11785
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>   Affects Versions: 3.2.4, 3.3.6, 3.4.1
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> Below is the error stack trace:
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-12, that exited unexpectedly:
> org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=xxx already exists!
>     at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:286)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:529)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1490)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1146)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1697)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1632)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1787)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1536)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:606)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:653)
> {code}
>
> Steps to reproduce:
> 1. An RPC-handling thread executes the refreshQueue command and adds a new item to the QUEUE_METRICS map.
> 2. Meanwhile, the async-scheduling thread fails to retrieve an existing PartitionQueueMetrics from QueueMetrics#QUEUE_METRICS (the lookup returns null) and then attempts to re-register the same queue name. This triggers the MetricsException shown above ("Metrics source ... already exists!") and causes the async-scheduling thread to exit unexpectedly.
>
> The root cause is that the QUEUE_METRICS field in QueueMetrics is a HashMap, which is not thread-safe, yet it is accessed concurrently: as the steps above show, both the async-scheduling thread and RPC threads can reach it.
> Concurrent put and get operations on a HashMap can lead to visibility issues, so a get may return null for a key that is already present. This can be fixed by making the QUEUE_METRICS field a ConcurrentHashMap, which ensures thread-safe access.
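
The lookup-then-register pattern described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual YARN-11785 patch: QueueMetricsRegistrySketch, FakeMetricsSystem, getOrRegister and countFailures are hypothetical names, the fake metrics system only mimics the "already exists" check, and the sketch uses ConcurrentHashMap.computeIfAbsent, which may differ from how QueueMetrics itself performs the lookup.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueMetricsRegistrySketch {

    // Hypothetical stand-in for the metrics system: it rejects a second
    // registration of the same source name, like the error in the report.
    static final class FakeMetricsSystem {
        private final Set<String> names = ConcurrentHashMap.newKeySet();

        void register(String name) {
            if (!names.add(name)) {
                throw new IllegalStateException(
                    "Metrics source " + name + " already exists!");
            }
        }
    }

    private final FakeMetricsSystem metricsSystem = new FakeMetricsSystem();

    // Thread-safe map playing the role of QUEUE_METRICS in this sketch.
    private final Map<String, Object> queueMetrics = new ConcurrentHashMap<>();

    // Returns the cached entry, registering it only if the key is absent.
    // computeIfAbsent runs the mapping function at most once per key.
    Object getOrRegister(String queueName) {
        return queueMetrics.computeIfAbsent(queueName, name -> {
            metricsSystem.register(name);
            return new Object();
        });
    }

    private static void tryRegister(QueueMetricsRegistrySketch registry,
                                    String name, AtomicInteger failures) {
        try {
            registry.getOrRegister(name);
        } catch (IllegalStateException e) {
            failures.incrementAndGet();
        }
    }

    // One writer adds new queues (like a queue refresh) while several readers
    // look up an existing queue (like the scheduling thread). Returns the
    // number of duplicate-registration failures observed.
    static int countFailures(int readers, int newQueues) {
        QueueMetricsRegistrySketch registry = new QueueMetricsRegistrySketch();
        registry.getOrRegister("root.existing");
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(readers + 1);
        pool.execute(() -> {
            for (int i = 0; i < newQueues; i++) {
                tryRegister(registry, "root.new" + i, failures);
            }
        });
        for (int r = 0; r < readers; r++) {
            pool.execute(() -> {
                for (int i = 0; i < newQueues; i++) {
                    tryRegister(registry, "root.existing", failures);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures(4, 10_000));
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, a reader could in principle miss the existing key while the writer is modifying the map and then hit the duplicate-registration error; that interleaving is timing-dependent, so the sketch only demonstrates the thread-safe variant.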