[jira] [Commented] (YARN-11878) AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events

ASF GitHub Bot (Jira) Sat, 18 Oct 2025 11:19:46 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-11878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029282#comment-18029282
 ]


ASF GitHub Bot commented on YARN-11878:
---------------------------------------

qq619618919 opened a new pull request, #8026:
URL: https://github.com/apache/hadoop/pull/8026

   ### Description of PR
   JIRA: [YARN-11878](https://issues.apache.org/jira/browse/YARN-11878). 
AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
   
   Avoid costly ContainerStatusPBImpl.getCapability() calls in STATUS_UPDATE 
when Opportunistic containers are disabled
   
   ### Background
   This behavior was introduced by 
[YARN-11003](https://issues.apache.org/jira/browse/YARN-11003) to support the 
Opportunistic containers optimization in the ResourceManager.
   
   To implement that optimization, `StatusUpdateWhenHealthyTransition` calls 
`ContainerStatusPBImpl.getCapability()` during every `STATUS_UPDATE` event.
   This ensures that container resource capability info is always available 
for scheduling decisions when opportunistic containers are enabled.
   
   However, in clusters where **opportunistic containers are disabled**,
   retrieving `capability` in every `STATUS_UPDATE` becomes **unnecessary**,
   since the capability value is not used in most workflows.
   
   ### Currently
   - **NodeManager heartbeat**: frequent `STATUS_UPDATE` events are sent to the 
ResourceManager.
   - **Each STATUS_UPDATE processing**: triggers 
`ContainerStatusPBImpl.getCapability()`.
   - **Problem**: Even when the opportunistic container feature is **off**, the 
same costly protobuf parsing and `ResourcePBImpl` object construction still 
happens for each event.

   This leads to:
   1. High CPU usage in the AsyncDispatcher event processing thread
   2. Millions of repeated, unused protobuf parses in large clusters
   3. Increased event queue latency and slower scheduling decisions
   
   ### Impact
   In clusters with thousands of nodes, `STATUS_UPDATE` events can account for 
>90% of the AsyncDispatcher queue.
   Profiling shows that `getCapability()` calls consume >90% of CPU time in 
`StatusUpdateWhenHealthyTransition.transition()` when opportunistic containers 
are disabled.
   The overhead is **pure waste** under these conditions and can be entirely 
skipped.
   
   ### Proposed Changes
   1. Skip capability retrieval logic when `opportunisticContainersEnabled` is 
false.
   2. Cache `remoteContainer.getCapability()` result in a local variable to 
prevent multiple protobuf parsing calls within the same STATUS_UPDATE handling.
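
The two changes above can be illustrated with a minimal, self-contained sketch. It does not use the real Hadoop classes: `StatusUpdateSketch`, `FakeContainerStatus`, `handleStatusUpdate`, and `parseCount` are hypothetical names invented for illustration, and the sketch assumes that every `getCapability()` call redoes its work, which is the behavior this PR describes. The actual patch may be structured differently.

```java
public class StatusUpdateSketch {

    /** Hypothetical stand-in for a status record whose capability is costly to read. */
    static final class FakeContainerStatus {
        private final long memoryMb;
        private final int vcores;
        private int parseCount; // how many times the "expensive" getter ran

        FakeContainerStatus(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }

        /** Assumption: every call redoes parsing and allocation, so we count calls. */
        long[] getCapability() {
            parseCount++;
            return new long[] {memoryMb, vcores};
        }

        int parseCount() {
            return parseCount;
        }
    }

    /**
     * Handles one simulated STATUS_UPDATE. Returns memoryMb + vcores when the
     * feature is enabled and 0 when it is disabled.
     */
    static long handleStatusUpdate(FakeContainerStatus status,
                                   boolean opportunisticContainersEnabled) {
        if (!opportunisticContainersEnabled) {
            return 0L; // proposed change 1: never touch the capability
        }
        long[] capability = status.getCapability(); // proposed change 2: read once
        return capability[0] + capability[1];       // reuse the local copy
    }

    public static void main(String[] args) {
        FakeContainerStatus status = new FakeContainerStatus(2048, 2);
        handleStatusUpdate(status, false);
        System.out.println("parses with feature off: " + status.parseCount());
        handleStatusUpdate(status, true);
        System.out.println("parses with feature on: " + status.parseCount());
    }
}
```

In this sketch the disabled path returns before the getter is reached, so the parse counter stays at zero, and the enabled path reads the capability exactly once per event regardless of how many fields are used afterwards.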
   
   
   ### How was this patch tested?
   CI
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'YARN-11878. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
> -------------------------------------------------------------------------
>
>                 Key: YARN-11878
>                 URL: https://issues.apache.org/jira/browse/YARN-11878
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, RM
>    Affects Versions: 3.4.0, 3.4.1
>            Reporter: victor
>            Priority: Major
>
> In large-scale YARN clusters with a high number of nodes, the AsyncDispatcher 
> event queue can grow to several million events. More than 90% of these are 
> STATUS_UPDATE events.
> Profiling shows that within StatusUpdateWhenHealthyTransition.transition, 
> more than 90% of the CPU time is spent in: 
> ContainerStatusPBImpl.getCapability()
> This method appears to repeatedly parse or build protobuf capability objects 
> on every STATUS_UPDATE, causing severe CPU overhead.
> Observed Logs (excerpt):
> 2025-09-29 01:23:32,546 INFO event.AsyncDispatcher: Size of event-queue is 
> 2256000
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: 
> STATUS_UPDATE, Event record counter: 2081612
> 2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: KILL, Event 
> record counter: 27808
> 2025-09-29 01:23:32,543 INFO event.AsyncDispatcher: Event type: NODE_UPDATE, 
> Event record counter: 224
> ...
> EVENT statistics example:
> STATUS_UPDATE: 2,081,612
> KILL: 27,808
> CONTAINER_FINISHED: 706
> NODE_USABLE: 207
> NODE_UPDATE: 224



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
