[
https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721723#comment-17721723
]
ASF GitHub Bot commented on YARN-11490:
---------------------------------------
tomicooler opened a new pull request, #5644:
URL: https://github.com/apache/hadoop/pull/5644
### Description of PR
YARN-11211 broke the JMX QueueMetrics; a detailed root cause analysis is in the
[Jira](https://issues.apache.org/jira/browse/YARN-11490?focusedCommentId=17721370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17721370).
### How was this patch tested?
```shell
# run a long sleep job
hadoop-3.4.0-SNAPSHOT/bin/yarn jar \
  hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar \
  sleep -m 1 -r 1 -rt 1200000 -mt 20
# validated an add-queue operation
curl -X POST -H 'Content-Type: application/xml' -d @addqueue.xml \
  localhost:8088/ws/v1/cluster/scheduler-conf/validate
# verified that no root.a is in the jmx response
curl http://localhost:8088/jmx | jq
# validated a queue config change
curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json \
  localhost:8088/ws/v1/cluster/scheduler-conf/validate
# submitted other jobs
# repeated these steps multiple times and verified that the jmx endpoint
# still shows valid data (e.g. appsRunning/appsPending works, there is
# no root.a in the response)
# created the root.a queue
curl -X PUT -H 'Content-Type: application/xml' -d @removequeue.xml \
  localhost:8088/ws/v1/cluster/scheduler-conf
# verified that the jmx response contains root.a twice:
# "QueueMetrics,q0=root,q1=a" and "PartitionQueueMetrics,partition=,q0=root,q1=a"
curl http://localhost:8088/jmx | jq
```
```shell
# restarted yarn, then created the root.a queue, and compared the jmx
# response to the previous test
curl -X PUT -H 'Content-Type: application/xml' -d @removequeue.xml \
  localhost:8088/ws/v1/cluster/scheduler-conf
curl http://localhost:8088/jmx | jq
```
Note: based on my understanding of the code, `PartitionQueueMetrics` cannot
show appsRunning/appsPending (at least not for the CapacityScheduler).
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
> Key: YARN-11490
> URL: https://issues.apache.org/jira/browse/YARN-11490
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.4.0
> Reporter: Tamas Domok
> Assignee: Tamas Domok
> Priority: Major
> Attachments: addqueue.xml, defaultqueue.json,
> hadoop-tdomok-resourcemanager-tdomok-MBP16.log, removequeue.xml,
> stopqueue.json
>
>
> Reproduction steps:
> 1. Submit a long running job
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar \
>   hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar \
>   sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2)
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json \
>     localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check 2. and 3. again. The cluster metrics still work, but the JMX
> endpoint now shows 0 running apps; that is the bug.
> It is caused by YARN-11211; reverting that patch (or just removing the
> _QueueMetrics.clearQueueMetrics();_ line) fixes the issue, but that would
> re-introduce the memory leak.
> It looks like the QUEUE_METRICS hash map is "add-only": prior to
> YARN-11211, clearQueueMetrics() was only called from the
> ResourceManager.reinitialize() method
> (transitionToActive/transitionToStandby). Constantly adding and removing
> queues with unique names would leak as well, because nothing ever removes
> entries from QUEUE_METRICS, so the validation API is not the only path
> with this problem.
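The failure mode described in the quoted analysis can be sketched in a few lines of Java. This is purely illustrative and not the actual Hadoop code: the map, `forQueue`, and `clearQueueMetrics` only mirror the names in `QueueMetrics`, and the metrics object is reduced to a single counter.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of why an "add-only" static cache plus a blanket
// clear breaks live metrics: a queue keeps a reference to its metrics
// object, but once the cache is cleared, the next lookup under the same
// name registers a fresh, zeroed instance, and that is what the JMX
// endpoint then reports instead of the live object.
public class QueueMetricsCacheSketch {
    // Stands in for QueueMetrics.QUEUE_METRICS: entries are added on
    // queue creation and never removed at runtime.
    private static final Map<String, int[]> QUEUE_METRICS = new HashMap<>();

    // Roughly what forQueue() does: return the cached instance,
    // registering a new one on a cache miss.
    static int[] forQueue(String name) {
        return QUEUE_METRICS.computeIfAbsent(name, k -> new int[1]);
    }

    // Roughly what clearQueueMetrics() does: drop every cached instance.
    static void clearQueueMetrics() {
        QUEUE_METRICS.clear();
    }

    public static void main(String[] args) {
        int[] live = forQueue("root.default");
        live[0] = 1;          // one running app, tracked by the live queue

        clearQueueMetrics();  // what the validation path ends up triggering

        // A later lookup re-registers a zeroed instance; the queue's own
        // reference still says 1, but the "JMX view" says 0.
        int[] jmxView = forQueue("root.default");
        System.out.println("live=" + live[0] + " jmxView=" + jmxView[0]);
    }
}
```

Under this (simplified) model, removing only the entries that the validation created, instead of clearing the whole map, would avoid both the stale-metrics bug and the leak.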
--
This message was sent by Atlassian Jira
(v8.20.10#820010)