[ 
https://issues.apache.org/jira/browse/YARN-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18090564#comment-18090564
 ] 

zhaozhao commented on YARN-11966:
---------------------------------

How to Reproduce

This bug can be reliably reproduced on a live RM by simulating the state 
inconsistency that occurs during RM failover — clearing the {{QUEUE_METRICS}} 
map while {{DefaultMetricsSystem}} still holds the registered sources.

Prerequisites:

* Hadoop 3.2.2 cluster with FairScheduler
* Arthas attached to the ResourceManager process

Steps:

Use Arthas to clear the metrics cache (this simulates what 
{{ResourceManager.reinitialize()}} does during HA failover):
{code:bash}
# In the Arthas console attached to the RM process, clear the QUEUE_METRICS map
ognl '@org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics@clearQueueMetrics()'
{code}

Immediately submit applications to trigger the allocate and nodeUpdate code paths:
{code:bash}
# Launch 10 small MapReduce "pi" jobs in the background
for i in $(seq 1 10); do
  yarn jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 1 1 &
done
{code}

Observe the RM log. Within seconds, a FATAL error appears and the EventDispatcher 
exits:
{code}
FATAL o.a.h.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root already exists!
{code}

Why this works:

{{clearQueueMetrics()}} empties the static {{QUEUE_METRICS}} ConcurrentHashMap 
but does NOT call {{metricsSystem.unregisterSource()}} for the 
already-registered sources. When the submitted apps trigger 
{{getUserMetrics()}}, {{getPartitionQueueMetrics()}}, or {{forQueue()}}, these 
methods see an empty map, attempt to re-register the same source name, and 
{{DefaultMetricsSystem.newSourceName()}} throws because the name still exists 
in its internal {{sourceNames}} map.

This is exactly what happens during a real RM HA failover when 
{{ResourceManager.reinitialize()}} → {{QueueMetrics.clearQueueMetrics()}} is 
called.
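
The mechanism can be illustrated with a small, self-contained model. This is only a sketch: {{StaleRegistryDemo}}, {{registeredNames}}, {{cache}}, and {{clearCacheOnly()}} are hypothetical stand-ins for the name registry in {{DefaultMetricsSystem}}, the {{QUEUE_METRICS}} map, and {{clearQueueMetrics()}}. It does not use any Hadoop classes, and it throws {{IllegalStateException}} where Hadoop throws {{MetricsException}}.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the two pieces of state; not Hadoop code.
class StaleRegistryDemo {
    // Stand-in for the metrics system's set of registered source names.
    static final Set<String> registeredNames = new HashSet<>();
    // Stand-in for the static QUEUE_METRICS cache.
    static final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Rejects a name that is already registered.
    static void register(String name) {
        if (!registeredNames.add(name)) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    // Lookup-or-create: registers a source only when the cache misses.
    static Object forQueue(String name) {
        Object metrics = cache.get(name);
        if (metrics == null) {
            register(name);
            metrics = new Object();
            cache.put(name, metrics);
        }
        return metrics;
    }

    // Empties the cache but leaves the registry untouched.
    static void clearCacheOnly() {
        cache.clear();
    }

    // Returns true if the second lookup fails with a duplicate-name error.
    static boolean reproduces() {
        registeredNames.clear();
        cache.clear();
        forQueue("QueueMetrics,q0=root");
        clearCacheOnly();
        try {
            forQueue("QueueMetrics,q0=root");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("duplicate registration reproduced: " + reproduces());
    }
}
```

In this model the second {{forQueue()}} call misses the cache and tries to register a name the registry still holds, which is the same ordering the Arthas step forces on the live RM.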

After applying the patch:

The same reproduction steps complete without error — applications run normally 
because {{unregisterSource()}} cleans up stale entries before re-registration.
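
The unregister-before-register pattern can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the contents of the attached patch: {{UnregisterBeforeRegisterDemo}}, {{registeredNames}}, and {{cache}} are hypothetical names standing in for the metrics system's name registry and the {{QUEUE_METRICS}} map, and {{unregister()}} is modeled as a no-op for unknown names, matching how the issue describes {{unregisterSource()}}.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the fix; not Hadoop code.
class UnregisterBeforeRegisterDemo {
    // Stand-in for the metrics system's set of registered source names.
    static final Set<String> registeredNames = new HashSet<>();
    // Stand-in for the static QUEUE_METRICS cache.
    static final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Removing an unknown name is a no-op.
    static void unregister(String name) {
        registeredNames.remove(name);
    }

    // Rejects a name that is already registered.
    static void register(String name) {
        if (!registeredNames.add(name)) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    // Lookup-or-create under a lock shared by all callers.
    static Object forQueue(String name) {
        synchronized (cache) {
            Object metrics = cache.get(name);
            if (metrics == null) {
                unregister(name);   // drop any stale entry first
                register(name);     // now guaranteed not to collide
                metrics = new Object();
                cache.put(name, metrics);
            }
            return metrics;
        }
    }

    // Returns true if a lookup after clearing only the cache succeeds.
    static boolean survivesCacheClear() {
        registeredNames.clear();
        cache.clear();
        String name = "QueueMetrics,q0=root";
        forQueue(name);
        cache.clear();              // registry still holds the name
        forQueue(name);             // does not throw
        return registeredNames.contains(name) && cache.containsKey(name);
    }

    public static void main(String[] args) {
        System.out.println("survives cache clear: " + survivesCacheClear());
    }
}
```

Locking on the shared map rather than on an individual instance means two callers cannot both miss the cache and race to register the same name.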

> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> PartitionQueueMetrics,partition=,q0=root,q1=default,user=* already exists!
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11966
>                 URL: https://issues.apache.org/jira/browse/YARN-11966
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: zhaozhao
>            Priority: Major
>         Attachments: 
> 0001-Fix-MetricsException-already-exists-in-PartitionQueu.patch
>
>
> Summary
> In Hadoop 3.2.2 with FairScheduler and RM HA enabled, the ResourceManager's 
> EventDispatcher exits with a FATAL error due to {{MetricsException: Metrics 
> source ... already exists!}} thrown from multiple metrics registration paths. 
> This causes the RM scheduler thread to die and the RM becomes unresponsive.
> Environment
> * Hadoop 3.2.2
> * FairScheduler
> * RM HA enabled
> Root Cause
> {{ResourceManager.reinitialize()}} calls {{QueueMetrics.clearQueueMetrics()}} 
> which clears the static {{QUEUE_METRICS}} ConcurrentHashMap. However, the 
> metrics sources already registered in {{DefaultMetricsSystem}} (via 
> {{MetricsSystemImpl.allSources}} and {{DefaultMetricsSystem.sourceNames}}) 
> are NOT cleared.
> When the scheduler resumes processing events, methods like 
> {{getUserMetrics()}}, {{getPartitionQueueMetrics()}}, and {{forQueue()}} 
> check {{QUEUE_METRICS}} map, find it empty, and attempt to re-register the 
> source. {{DefaultMetricsSystem.newSourceName()}} detects the source name 
> already exists and throws {{MetricsException}}.
> YARN-10329 addressed a similar issue in test cases but did not fix the 
> production code paths.
> Affected Registration Points (4 locations missed)
> ||Class||Method||Source Name Pattern||
> |{{PartitionQueueMetrics}}|{{getUserMetrics()}}|{{PartitionQueueMetrics,partition=,q0=root,q1=default,user=xxx}}|
> |{{QueueMetrics}}|{{getUserMetrics()}}|{{QueueMetrics,q0=root,user=xxx}}|
> |{{QueueMetrics}}|{{forQueue()}}|{{QueueMetrics,q0=root}}|
> |{{FSQueueMetrics}}|{{forQueue()}}|{{QueueMetrics,q0=root}} (FairScheduler path)|
> Additionally, {{QueueMetrics.getPartitionQueueMetrics()}} in 3.2.2 uses 
> {{synchronized}} on the instance ({{this}}), which does not prevent 
> concurrent registration from different QueueMetrics instances. This was 
> changed to {{synchronized (QUEUE_METRICS)}} with double-checked locking.
> Fix
> Add {{metricsSystem.unregisterSource(sourceName)}} before 
> {{metricsSystem.register()}} in all affected methods. {{unregisterSource()}} 
> is idempotent — no-op if the source doesn't exist.
> Stack Traces
> {code:java}
> 2026-06-22 17:11:49,489 ERROR o.a.h.yarn.server.resourcemanager.ResourceManager: Error in handling event type ATTEMPT_ADDED for applicationAttempt appattempt_1782119232412_0022_000001
> org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=default,user=bigdata_deploy already exists!
>     at o.a.h.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at o.a.h.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at o.a.h.yarn.server.resourcemanager.scheduler.PartitionQueueMetrics.getUserMetrics(PartitionQueueMetrics.java:80)
>     at o.a.h.yarn.server.resourcemanager.scheduler.QueueMetrics.internalIncrPendingResources(QueueMetrics.java:627)
> {code}
> {code:java}
> 2026-06-22 17:11:50,220 FATAL o.a.h.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
> org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,user=bigdata_deploy already exists!
>     at o.a.h.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at o.a.h.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at o.a.h.yarn.server.resourcemanager.scheduler.PartitionQueueMetrics.getUserMetrics(PartitionQueueMetrics.java:80)
>     at o.a.h.yarn.server.resourcemanager.scheduler.QueueMetrics.internalAllocateResources(QueueMetrics.java:788)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
