[ 
https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724973#comment-17724973
 ] 

ASF GitHub Bot commented on YARN-11490:
---------------------------------------

tomicooler commented on code in PR #5644:
URL: https://github.com/apache/hadoop/pull/5644#discussion_r1200597052


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java:
##########
@@ -334,13 +336,15 @@ public synchronized QueueMetrics 
getPartitionQueueMetrics(String partition) {
       QueueMetrics queueMetrics =
           new PartitionQueueMetrics(metricsSystem, this.queueName, parentQueue,
               this.enableUserMetrics, this.conf, partition);
-      registerMetrics(
+      metricsSystem.register(
           pSourceName(partitionJMXStr).append(qSourceName(this.queueName))
               .toString(),
           "Metrics for queue: " + this.queueName,
           queueMetrics.tag(PARTITION_INFO, partitionJMXStr).tag(QUEUE_INFO,
               this.queueName));
-      getQueueMetrics().put(metricName, queueMetrics);
+      if (!conf.getBoolean(CONFIGURATION_VALIDATION, false)) {

Review Comment:
   I added some helpers and documented that this is internal only.





> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
>                 Key: YARN-11490
>                 URL: https://issues.apache.org/jira/browse/YARN-11490
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.4.0
>            Reporter: Tamas Domok
>            Assignee: Tamas Domok
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: addqueue.xml, defaultqueue.json, 
> hadoop-tdomok-resourcemanager-tdomok-MBP16.log, removequeue.xml, 
> stopqueue.json
>
>
> Reproduction steps:
> 1. Submit a long running job
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar 
> hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar
>  sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2)
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json 
> localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check 2. and 3. again. The cluster metrics should still work but the JMX 
> endpoint will show 0 running apps, that's the bug.
> It is caused by YARN-11211, reverting that patch (or only removing the 
> _QueueMetrics.clearQueueMetrics();_ line) fixes the issue. But I think that 
> would re-introduce the memory leak.
> It looks like the QUEUE_METRICS hash map is "add-only", the 
> clearQueueMetrics() was only called from ResourceManager.reinitialize() 
> method (transitionToActive/transitionToStandby) prior to YARN-11211. 
> Constantly adding and removing queues with unique names would cause a leak as 
> well, because there is no remove from QUEUE_METRICS, so it is not just the 
> validation API that has this problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to