[
https://issues.apache.org/jira/browse/YARN-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029286#comment-18029286
]
ASF GitHub Bot commented on YARN-11878:
---------------------------------------
qq619618919 commented on PR #8026:
URL: https://github.com/apache/hadoop/pull/8026#issuecomment-3394100330
### Performance Verification in Production
We tested this patch in a production YARN cluster and used Arthas to monitor
RM node event handling performance via:
```bash
monitor -c 5 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher handle
```
### Result:
- Before patch (original YARN-11003 behavior): average NM heartbeat
  handling time ≈ 1.10 ms
- After patch (getCapability() skipped or cached when Opportunistic containers
  are disabled): average NM heartbeat handling time ≈ 0.09 ms

This is an over 12× reduction in heartbeat event processing latency, which
significantly reduces load on the RM AsyncDispatcher thread and improves
scheduling responsiveness in large clusters.
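The ratio quoted above can be checked from the two reported averages. The following is only a minimal arithmetic sketch. The class and method names are hypothetical, and the inputs are the 1.10 ms and 0.09 ms figures from the measurement above.

```java
public class SpeedupSketch {
    /** Ratio of the old average handling time to the new one. */
    static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    /** Relative reduction in handling time, as a percentage of the old value. */
    static double reductionPercent(double beforeMs, double afterMs) {
        return 100.0 * (beforeMs - afterMs) / beforeMs;
    }

    public static void main(String[] args) {
        double beforeMs = 1.10; // reported average before the patch
        double afterMs = 0.09;  // reported average after the patch
        System.out.printf("speedup: %.1fx%n", speedup(beforeMs, afterMs));
        System.out.printf("reduction: %.1f%%%n", reductionPercent(beforeMs, afterMs));
    }
}
```

Both values are averages over 5-second Arthas monitoring cycles, so the ratio should be read as approximate.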
### Conclusion:
The patch removes unnecessary getCapability() calls when the Opportunistic
container feature is disabled, reducing CPU overhead and improving event queue
turnover rate.
This optimization has already proven effective in production with
substantial gains in RM performance.
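The idea can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: `FakeContainerStatus`, `handleStatusUpdate`, and the `opportunisticEnabled` flag are hypothetical stand-ins, assumed here only to show how a feature-flag guard avoids the costly capability lookup on every heartbeat.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatGuardSketch {
    /** Hypothetical stand-in for a container status with a costly capability lookup. */
    static final class FakeContainerStatus {
        /** Counts lookups so the effect of the guard is observable. */
        static final AtomicInteger CAPABILITY_CALLS = new AtomicInteger();
        private final long memoryMb;

        FakeContainerStatus(long memoryMb) {
            this.memoryMb = memoryMb;
        }

        /** Simulates the expensive call (protobuf parsing in the real class). */
        long getCapabilityMemoryMb() {
            CAPABILITY_CALLS.incrementAndGet();
            return memoryMb;
        }
    }

    /**
     * Sums the memory of reported containers, but only when the
     * (hypothetical) opportunistic-container flag is enabled.
     * When it is disabled, no capability lookup happens at all.
     */
    static long handleStatusUpdate(List<FakeContainerStatus> statuses,
                                   boolean opportunisticEnabled) {
        if (!opportunisticEnabled) {
            return 0L; // feature off: skip the costly per-container lookup
        }
        long totalMb = 0L;
        for (FakeContainerStatus status : statuses) {
            totalMb += status.getCapabilityMemoryMb();
        }
        return totalMb;
    }

    public static void main(String[] args) {
        List<FakeContainerStatus> statuses =
            List.of(new FakeContainerStatus(1024), new FakeContainerStatus(2048));
        handleStatusUpdate(statuses, false);
        System.out.println("lookups with feature off: "
            + FakeContainerStatus.CAPABILITY_CALLS.get());
        handleStatusUpdate(statuses, true);
        System.out.println("lookups with feature on: "
            + FakeContainerStatus.CAPABILITY_CALLS.get());
    }
}
```

The guard is placed before the loop so that the per-container cost disappears entirely when the feature is off, rather than being paid and then discarded.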
> AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
> -------------------------------------------------------------------------
>
> Key: YARN-11878
> URL: https://issues.apache.org/jira/browse/YARN-11878
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: api, RM
> Affects Versions: 3.4.0, 3.4.1
> Reporter: victor
> Priority: Major
> Labels: pull-request-available
>
> In large-scale YARN clusters with a high number of nodes, the AsyncDispatcher
> event queue can grow to several million events. More than 90% of these are
> STATUS_UPDATE events.
> Profiling shows that within StatusUpdateWhenHealthyTransition.transition,
> more than 90% of the CPU time is spent in:
> ContainerStatusPBImpl.getCapability()
> This method appears to repeatedly parse or build protobuf capability objects
> on every STATUS_UPDATE, causing severe CPU overhead.
> Observed Logs (excerpt):
> 2025-09-29 01:23:32,546 INFO event.AsyncDispatcher: Size of event-queue is 2256000
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: STATUS_UPDATE, Event record counter: 2081612
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: KILL, Event record counter: 27808
> 2025-09-29 01:23:32,543 INFO event.AsyncDispatcher: Event type: NODE_UPDATE, Event record counter: 224
> ...
> EVENT statistics example:
> STATUS_UPDATE: 2,081,612
> KILL: 27,808
> CONTAINER_FINISHED: 706
> NODE_USABLE: 207
> NODE_UPDATE: 224
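The "more than 90%" claim in the issue description can be cross-checked against the five counters listed above. This is a rough sketch with hypothetical class and method names. It uses only the listed event types, and since the log excerpt is elided, the true share over the whole queue (reported size 2,256,000) may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EventShareSketch {
    /** Counters copied from the event statistics quoted in the issue. */
    static Map<String, Long> reportedCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("STATUS_UPDATE", 2_081_612L);
        counts.put("KILL", 27_808L);
        counts.put("CONTAINER_FINISHED", 706L);
        counts.put("NODE_USABLE", 207L);
        counts.put("NODE_UPDATE", 224L);
        return counts;
    }

    /** Percentage of all counted events that have the given type. */
    static double sharePercent(Map<String, Long> counts, String type) {
        long total = 0L;
        for (long count : counts.values()) {
            total += count;
        }
        if (total == 0L) {
            return 0.0; // avoid dividing by zero on an empty sample
        }
        return 100.0 * counts.getOrDefault(type, 0L) / total;
    }

    public static void main(String[] args) {
        double share = sharePercent(reportedCounts(), "STATUS_UPDATE");
        System.out.printf("STATUS_UPDATE share of listed events: %.2f%%%n", share);
    }
}
```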
--
This message was sent by Atlassian Jira
(v8.20.10#820010)