[
https://issues.apache.org/jira/browse/YARN-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhaozhao updated YARN-11966:
----------------------------
Description:
Summary
In Hadoop 3.2.2 with FairScheduler and RM HA enabled, the ResourceManager's
EventDispatcher exits with a FATAL error due to {{MetricsException: Metrics
source ... already exists!}} thrown from multiple metrics registration paths.
As a result, the RM scheduler thread dies and the RM becomes unresponsive.
Environment
Hadoop 3.2.2
FairScheduler
RM HA enabled
Root Cause
{{ResourceManager.reinitialize()}} calls {{QueueMetrics.clearQueueMetrics()}},
which clears the static {{QUEUE_METRICS}} ConcurrentHashMap. However, the
metrics sources already registered in {{DefaultMetricsSystem}} (via
{{MetricsSystemImpl.allSources}} and {{DefaultMetricsSystem.sourceNames}}) are
NOT cleared.
When the scheduler resumes processing events, methods like
{{getUserMetrics()}}, {{getPartitionQueueMetrics()}}, and {{forQueue()}} check
the {{QUEUE_METRICS}} map, find it empty, and attempt to re-register the source.
{{DefaultMetricsSystem.newSourceName()}} detects that the source name already
exists and throws {{MetricsException}}.
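The sequence above can be reproduced in a toy model. This is only a sketch: {{StaleSourceDemo}} and its fields are hypothetical stand-ins written for illustration, not Hadoop classes. A plain set plays the role of the source names kept by the metrics system, and a map plays the role of {{QUEUE_METRICS}}.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the failure; none of these types are Hadoop's. */
class StaleSourceDemo {
    /** Stand-in for the source names remembered by the metrics system. */
    private final Set<String> sourceNames = new HashSet<>();
    /** Stand-in for the static QUEUE_METRICS cache. */
    private final Map<String, Object> queueMetrics = new ConcurrentHashMap<>();

    /** Rejects duplicate names, like the behaviour described for newSourceName(). */
    void register(String name) {
        if (!sourceNames.add(name)) {
            throw new IllegalStateException("Metrics source " + name + " already exists!");
        }
    }

    /** Lookup that registers a new source on a cache miss. */
    Object forQueue(String name) {
        Object metrics = queueMetrics.get(name);
        if (metrics == null) {
            register(name);
            metrics = new Object();
            queueMetrics.put(name, metrics);
        }
        return metrics;
    }

    /** Clears only the cache; registered names are left behind. */
    void clearQueueMetrics() {
        queueMetrics.clear();
    }

    /** Runs register, clear, register and returns the resulting error message. */
    static String reproduce() {
        StaleSourceDemo demo = new StaleSourceDemo();
        demo.forQueue("QueueMetrics,q0=root");
        demo.clearQueueMetrics();
        try {
            demo.forQueue("QueueMetrics,q0=root");
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
    }
}
```

In this model the second {{forQueue()}} call fails because the cache and the name registry have drifted apart, which is the same shape as the production failure.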
YARN-10329 addressed a similar issue in test cases but did not fix the
production code paths.
Affected Registration Points (4 locations missed)
||Class||Method||Source Name Pattern||
|{{PartitionQueueMetrics}}|{{getUserMetrics()}}|{{PartitionQueueMetrics,partition=,q0=root,q1=default,user=xxx}}|
|{{QueueMetrics}}|{{getUserMetrics()}}|{{QueueMetrics,q0=root,user=xxx}}|
|{{QueueMetrics}}|{{forQueue()}}|{{QueueMetrics,q0=root}}|
|{{FSQueueMetrics}}|{{forQueue()}}|{{QueueMetrics,q0=root}} (FairScheduler path)|
Additionally, {{QueueMetrics.getPartitionQueueMetrics()}} in 3.2.2 uses
{{synchronized}} on the instance ({{this}}), which does not prevent concurrent
registration from different QueueMetrics instances. This was changed to
{{synchronized (QUEUE_METRICS)}} with double-checked locking.
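A minimal sketch of that locking pattern follows, assuming a shared static map as the lock object. {{SharedLockDemo}}, {{CACHE}}, and {{REGISTRATIONS}} are hypothetical names used for illustration; the counter stands in for the real registration call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model: many instances share one cache and must register a name only once. */
class SharedLockDemo {
    /** Shared by all instances, like a static QUEUE_METRICS map. */
    static final Map<String, Object> CACHE = new ConcurrentHashMap<>();
    /** Counts how often the (simulated) registration ran. */
    static final AtomicInteger REGISTRATIONS = new AtomicInteger();

    Object getOrRegister(String name) {
        Object metrics = CACHE.get(name);          // first check, without the lock
        if (metrics == null) {
            synchronized (CACHE) {                 // one lock for every instance
                metrics = CACHE.get(name);         // second check, under the lock
                if (metrics == null) {
                    REGISTRATIONS.incrementAndGet(); // stands in for register()
                    metrics = new Object();
                    CACHE.put(name, metrics);
                }
            }
        }
        return metrics;
    }

    /** Starts one thread per instance and returns how many registrations happened. */
    static int run(int threads) {
        CACHE.clear();
        REGISTRATIONS.set(0);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            SharedLockDemo instance = new SharedLockDemo();
            workers[i] = new Thread(() -> {
                try {
                    start.await();                 // release all threads together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                instance.getOrRegister("PartitionQueueMetrics,partition=,q0=root");
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return REGISTRATIONS.get();
    }

    public static void main(String[] args) {
        System.out.println("registrations: " + run(8));
    }
}
```

Because every instance synchronizes on the same object, only one thread can pass the second check for a given name. Locking on {{this}} would give each instance its own monitor and offer no such guarantee.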
Fix
Add {{metricsSystem.unregisterSource(sourceName)}} before
{{metricsSystem.register()}} in all affected methods. {{unregisterSource()}} is
idempotent: it is a no-op if the source doesn't exist.
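The shape of the fix can be sketched in a toy model. {{UnregisterFirstDemo}} is a hypothetical stand-in, not the attached patch or Hadoop code; its {{register()}} and {{unregisterSource()}} only imitate the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "unregister, then register"; none of these types are Hadoop's. */
class UnregisterFirstDemo {
    /** Stand-in for the sources known to the metrics system. */
    private final Map<String, Object> sources = new HashMap<>();
    /** Stand-in for the QUEUE_METRICS cache. */
    private final Map<String, Object> cache = new HashMap<>();

    /** Rejects duplicate names. */
    void register(String name, Object source) {
        if (sources.containsKey(name)) {
            throw new IllegalStateException("Metrics source " + name + " already exists!");
        }
        sources.put(name, source);
    }

    /** Removes the source if present; does nothing otherwise. */
    void unregisterSource(String name) {
        sources.remove(name);
    }

    Object forQueue(String name) {
        Object metrics = cache.get(name);
        if (metrics == null) {
            metrics = new Object();
            unregisterSource(name);   // drop any stale registration first
            register(name, metrics);  // can no longer collide with a leftover name
            cache.put(name, metrics);
        }
        return metrics;
    }

    /** Clears only the cache, as in the failure scenario. */
    void clearQueueMetrics() {
        cache.clear();
    }

    /** True if a lookup after clearing succeeds and the new object is the registered one. */
    static boolean survivesClear() {
        UnregisterFirstDemo demo = new UnregisterFirstDemo();
        String name = "QueueMetrics,q0=root";
        Object before = demo.forQueue(name);
        demo.clearQueueMetrics();
        Object after = demo.forQueue(name);
        return before != after && demo.sources.get(name) == after;
    }

    public static void main(String[] args) {
        System.out.println("survives clear: " + survivesClear());
    }
}
```

Note that in this model the stale source is replaced by a fresh object, so any values held by the old one are discarded. Whether that matters for the real metrics is not addressed here.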
Stack Traces
{code:java}
2026-06-22 17:11:49,489 ERROR o.a.h.yarn.server.resourcemanager.ResourceManager: Error in handling event type ATTEMPT_ADDED for applicationAttempt appattempt_1782119232412_0022_000001
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=default,user=bigdata_deploy already exists!
    at o.a.h.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at o.a.h.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at o.a.h.yarn.server.resourcemanager.scheduler.PartitionQueueMetrics.getUserMetrics(PartitionQueueMetrics.java:80)
    at o.a.h.yarn.server.resourcemanager.scheduler.QueueMetrics.internalIncrPendingResources(QueueMetrics.java:627)
{code}
{code:java}
2026-06-22 17:11:50,220 FATAL o.a.h.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,user=bigdata_deploy already exists!
    at o.a.h.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at o.a.h.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at o.a.h.yarn.server.resourcemanager.scheduler.PartitionQueueMetrics.getUserMetrics(PartitionQueueMetrics.java:80)
    at o.a.h.yarn.server.resourcemanager.scheduler.QueueMetrics.internalAllocateResources(QueueMetrics.java:788)
{code}
> org.apache.hadoop.metrics2.MetricsException: Metrics source
> PartitionQueueMetrics,partition=,q0=root,q1=default,user=* already exists!
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-11966
> URL: https://issues.apache.org/jira/browse/YARN-11966
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.2
> Reporter: zhaozhao
> Priority: Major
> Attachments:
> 0001-Fix-MetricsException-already-exists-in-PartitionQueu.patch
>
>