[ 
https://issues.apache.org/jira/browse/YARN-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029286#comment-18029286
 ] 

ASF GitHub Bot commented on YARN-11878:
---------------------------------------

qq619618919 commented on PR #8026:
URL: https://github.com/apache/hadoop/pull/8026#issuecomment-3394100330

   ### Performance Verification in Production
   We tested this patch in a production YARN cluster and used Arthas to monitor 
RM node event handling performance via:
   ```bash
   monitor -c 5 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher handle
   ```
   
   ### Result:
   - Before patch (original YARN-11003 behavior): average NM heartbeat handling time ≈ 1.10 ms
   - After patch (getCapability() skipped or cached when opportunistic containers are disabled): average NM heartbeat handling time ≈ 0.09 ms

   This is a more than 12× reduction in heartbeat event processing latency, which significantly lowers the load on the RM AsyncDispatcher thread and improves scheduling responsiveness in large clusters.
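
   As a quick check of the figures above, taking the two rounded averages at face value (the unrounded measurements are not shown here, so the results are approximate):

   ```latex
   \text{speedup} = \frac{1.10\ \text{ms}}{0.09\ \text{ms}} \approx 12.2
   \qquad
   \text{reduction} = \frac{1.10 - 0.09}{1.10} \approx 91.8\%
   ```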
   
   ### Conclusion:
   The patch removes unnecessary getCapability() calls when the opportunistic 
containers feature is disabled, which reduces CPU overhead and improves the 
event queue turnover rate.
   The optimization is already running in our production cluster, where it 
produced the RM heartbeat handling gains reported above.
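
   The idea behind the change can be illustrated with a small, self-contained sketch. This is not the code from the PR: `CapabilityGuardSketch`, `FakeContainerStatus`, and `handleStatusUpdate` are hypothetical names, and the sketch assumes the capability is needed only for opportunistic-container accounting. It shows that guarding the call on the feature flag avoids the per-heartbeat lookup entirely.

   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.concurrent.atomic.AtomicInteger;

   public class CapabilityGuardSketch {

       /** Hypothetical stand-in for a container status whose capability lookup is costly. */
       static final class FakeContainerStatus {
           private final int memoryMb;
           private final AtomicInteger lookups;

           FakeContainerStatus(int memoryMb, AtomicInteger lookups) {
               this.memoryMb = memoryMb;
               this.lookups = lookups;
           }

           /** Counts every call so the sketch can show how often the "expensive" path runs. */
           int getCapability() {
               lookups.incrementAndGet();
               return memoryMb;
           }
       }

       /**
        * Sums container memory for opportunistic accounting.
        * When the feature is disabled, returns early and never calls getCapability().
        */
       static long handleStatusUpdate(List<FakeContainerStatus> statuses,
                                      boolean opportunisticEnabled) {
           if (!opportunisticEnabled) {
               return 0L; // assumption: nothing else needs the capability in this path
           }
           long total = 0L;
           for (FakeContainerStatus status : statuses) {
               total += status.getCapability();
           }
           return total;
       }

       public static void main(String[] args) {
           AtomicInteger lookups = new AtomicInteger();
           List<FakeContainerStatus> statuses = Arrays.asList(
                   new FakeContainerStatus(1024, lookups),
                   new FakeContainerStatus(2048, lookups));

           handleStatusUpdate(statuses, false);
           System.out.println("lookups with feature disabled: " + lookups.get()); // 0

           handleStatusUpdate(statuses, true);
           System.out.println("lookups with feature enabled: " + lookups.get()); // 2
       }
   }
   ```

   With the flag off, the counter stays at zero no matter how many container statuses a heartbeat carries, which is the effect the measurement above is meant to capture.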




> AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
> -------------------------------------------------------------------------
>
>                 Key: YARN-11878
>                 URL: https://issues.apache.org/jira/browse/YARN-11878
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, RM
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: victor
>            Priority: Major
>              Labels: pull-request-available
>
> In large-scale YARN clusters with a high number of nodes, the AsyncDispatcher 
> event queue can grow to several million events. More than 90% of these are 
> STATUS_UPDATE events.
> Profiling shows that within StatusUpdateWhenHealthyTransition.transition, 
> more than 90% of the CPU time is spent in: 
> ContainerStatusPBImpl.getCapability()
> This method appears to repeatedly parse or build protobuf capability objects 
> on every STATUS_UPDATE, causing severe CPU overhead.
> Observed Logs (excerpt):
> 2025-09-29 01:23:32,546 INFO event.AsyncDispatcher: Size of event-queue is 2256000
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: STATUS_UPDATE, Event record counter: 2081612
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: KILL, Event record counter: 27808
> 2025-09-29 01:23:32,543 INFO event.AsyncDispatcher: Event type: NODE_UPDATE, Event record counter: 224
> ...
> EVENT statistics example:
> STATUS_UPDATE: 2,081,612
> KILL: 27,808
> CONTAINER_FINISHED: 706
> NODE_USABLE: 207
> NODE_UPDATE: 224



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
