[
https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720520#comment-17720520
]
Tamas Domok commented on YARN-11490:
------------------------------------
After init:
{code}
2023-05-08 14:55:00,080 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root - null
2023-05-08 14:55:00,081 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics2 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:00,082 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics4 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:00,088 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:00,109 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0,
absoluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*,
labels=*,
, reservationsContinueLooking=true, orderingPolicy=utilization, priority=0,
allowZeroCapacitySum=false
2023-05-08 14:55:00,112 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root.default - null
2023-05-08 14:55:00,112 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics2 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:00,112 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics4 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:00,113 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:00,122 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue:
Initializing root.default
{code}
After first validation:
{code}
2023-05-08 14:55:34,327 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:34,327 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:34,331 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0,
absoluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*,
labels=*,
, reservationsContinueLooking=true, orderingPolicy=utilization, priority=0,
allowZeroCapacitySum=false
2023-05-08 14:55:34,331 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:34,331 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:34,332 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue:
Initializing root.default
2023-05-08 14:55:34,333 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Re-initializing queues...
2023-05-08 14:55:34,337 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue root: numChildQueue= 1, capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
numApps=0, numContainers=0 -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@27eedb64
2023-05-08 14:55:34,340 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0,
absoluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*,
labels=*,
, reservationsContinueLooking=true, orderingPolicy=utilization, priority=0,
allowZeroCapacitySum=false
2023-05-08 14:55:34,340 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue root.default: capacity=1.0, absoluteCapacity=1.0,
usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0,
numApps=0, numContainers=0, effectiveMinResource=<memory:8192, vCores:8> ,
effectiveMaxResource=<memory:8192, vCores:8> -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@53fd0d10
2023-05-08 14:55:34,341 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue:
Initializing root.default
{code}
At this point the JMX would still show 1 running app, because the validation
reused the existing CSQueueMetrics instances. But the validation ends with
QUEUE_METRICS.clear(), so the next validation creates new queue metrics and
the unregister / register happens.
After second validation:
{code}
2023-05-08 14:58:34,650 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root - null
2023-05-08 14:58:34,650 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics2 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@4ccc9396
2023-05-08 14:58:34,651 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics3 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@4ccc9396
2023-05-08 14:58:34,651 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics4 root -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@4ccc9396
2023-05-08 14:58:34,651 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@4ccc9396
2023-05-08 14:58:34,654 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0,
absoluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*,
labels=*,
, reservationsContinueLooking=true, orderingPolicy=utilization, priority=0,
allowZeroCapacitySum=false
2023-05-08 14:58:34,654 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics1 root.default - null
2023-05-08 14:58:34,654 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics2 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@352a074a
2023-05-08 14:58:34,654 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics3 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@352a074a
2023-05-08 14:58:34,654 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics:
tomi CSQueueMetrics4 root.default -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@352a074a
2023-05-08 14:58:34,654 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue null -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@352a074a
2023-05-08 14:58:34,656 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue:
Initializing root.default
2023-05-08 14:58:34,657 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Re-initializing queues...
2023-05-08 14:58:34,658 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth
failed for 192.168.50.222:58993 / 192.168.50.222:58993:null (DIGEST-MD5: IO
error acquiring password) with true cause:
(appattempt_1683550395454_0001_000001 not found in AMRMTokenSecretManager.)
2023-05-08 14:58:34,660 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue root: numChildQueue= 1, capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
numApps=0, numContainers=0 -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@4ccc9396
2023-05-08 14:58:34,662 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root, capacity=1.0, absoluteCapacity=1.0, maxCapacity=1.0,
absoluteMaxCapacity=1.0, state=RUNNING, acls=ADMINISTER_QUEUE:*SUBMIT_APP:*,
labels=*,
, reservationsContinueLooking=true, orderingPolicy=utilization, priority=0,
allowZeroCapacitySum=false
2023-05-08 14:58:34,662 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
tomi AbstractCSQueue root.default: capacity=1.0, absoluteCapacity=1.0,
usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0,
numApps=0, numContainers=0, effectiveMinResource=<memory:8192, vCores:8> ,
effectiveMaxResource=<memory:8192, vCores:8> -
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueMetrics@352a074a
2023-05-08 14:58:34,663 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue:
Initializing root.default
{code}
After this, the JMX shows 0 running apps. Without the unregister there would be
exceptions such as: org.apache.hadoop.metrics2.MetricsException: Metrics source
QueueMetrics,q0=root already exists!
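
The sequence in the three log excerpts can be modelled with a small toy
program. This is only a sketch of the mechanism as I understand it, not the
Hadoop code: the class QueueMetricsSketch, the CACHE and REGISTRY maps and the
forQueue/validate/jmxAppsRunning helpers are all made-up names. CACHE stands in
for QUEUE_METRICS and REGISTRY for whatever the JMX endpoint reads from.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model, not Hadoop code: a static cache plus a name-keyed registry. */
final class QueueMetricsSketch {
    /** Stand-in for a per-queue metrics object. */
    static final class Metrics {
        int appsRunning;
    }

    // Stand-in for the static QUEUE_METRICS cache (assumption).
    static final Map<String, Metrics> CACHE = new HashMap<>();
    // Stand-in for the metrics system that JMX reads from (assumption).
    static final Map<String, Metrics> REGISTRY = new HashMap<>();

    /** Returns the cached metrics for a queue, creating and registering on a miss. */
    static Metrics forQueue(String queue) {
        Metrics m = CACHE.get(queue);
        if (m == null) {
            m = new Metrics();
            CACHE.put(queue, m);
            // Unregister / register: the new, empty source replaces the old one.
            REGISTRY.remove(queue);
            REGISTRY.put(queue, m);
        }
        return m;
    }

    /** Models one validation: build a throwaway queue tree, then clear the cache. */
    static void validate() {
        forQueue("root");
        forQueue("root.default");
        CACHE.clear();
    }

    /** What a JMX reader would see for the queue in this model. */
    static int jmxAppsRunning(String queue) {
        return REGISTRY.get(queue).appsRunning;
    }

    public static void main(String[] args) {
        Metrics live = forQueue("root.default"); // init
        live.appsRunning = 1;                    // one running app
        validate(); // cache hit, same instance stays registered
        System.out.println("after 1st validation: " + jmxAppsRunning("root.default"));
        validate(); // cache miss, a fresh instance replaces the registered one
        System.out.println("after 2nd validation: " + jmxAppsRunning("root.default"));
        System.out.println("live queue still counts: " + live.appsRunning);
    }
}
```

In this model the first validation prints 1 and the second prints 0, while the
object held by the live queue still counts 1 app. That matches the instance
ids in the logs: the ids stay the same after the first validation and change
after the second one.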
> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
> Key: YARN-11490
> URL: https://issues.apache.org/jira/browse/YARN-11490
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.4.0
> Reporter: Tamas Domok
> Assignee: Tamas Domok
> Priority: Major
>
> Reproduction steps:
> 1. Submit a long running job
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar
> hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar
> sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2)
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json
> localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check 2. and 3. again. The cluster metrics should still work, but the JMX
> endpoint will show 0 running apps. That's the bug.
> It is caused by YARN-11211. Reverting that patch (or only removing the
> _QueueMetrics.clearQueueMetrics();_ line) fixes the issue, but I think that
> would re-introduce the memory leak.
> It looks like the QUEUE_METRICS hash map is "add-only": prior to YARN-11211,
> clearQueueMetrics() was only called from the ResourceManager.reinitialize()
> method (transitionToActive/transitionToStandby).
> Constantly adding and removing queues with unique names would cause a leak as
> well, because nothing removes entries from QUEUE_METRICS, so it is not just
> the validation API that has this problem.