victor created YARN-11878:
-----------------------------
Summary: AsyncDispatcher event queue backlog with millions of
STATUS_UPDATE events
Key: YARN-11878
URL: https://issues.apache.org/jira/browse/YARN-11878
Project: Hadoop YARN
Issue Type: Bug
Components: yarn-service
Affects Versions: 3.4.1, 3.4.0
Reporter: victor
Fix For: 3.5.0, 3.4.3
In large-scale YARN clusters with a high number of nodes, the AsyncDispatcher
event queue can grow to several million events. More than 90% of these are
STATUS_UPDATE events.
Profiling shows that within StatusUpdateWhenHealthyTransition.transition, more
than 90% of the CPU time is spent in ContainerStatusPBImpl.getCapability().
This method appears to parse or rebuild the protobuf capability object on
every STATUS_UPDATE, causing severe CPU overhead that likely keeps the
dispatcher from draining the queue as fast as events arrive.
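
One possible mitigation, sketched under assumptions: if the cost really comes
from converting the protobuf message into a capability object on every call,
the converted object could be built once and reused. The classes below
(CapabilityProto, Capability, CachedContainerStatus) are hypothetical
stand-ins written for illustration. They are not the Hadoop classes, and this
is not necessarily the change that will be made for this issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a protobuf message holding raw resource fields. */
final class CapabilityProto {
    final long memoryMb;
    final int vcores;

    CapabilityProto(long memoryMb, int vcores) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }
}

/** Hypothetical stand-in for the converted resource object. */
final class Capability {
    /** Counts conversions so the effect of caching can be observed. */
    static final AtomicInteger CONVERSIONS = new AtomicInteger();

    final long memoryMb;
    final int vcores;

    private Capability(long memoryMb, int vcores) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }

    /** Simulates the conversion that is assumed to be expensive. */
    static Capability fromProto(CapabilityProto proto) {
        CONVERSIONS.incrementAndGet();
        return new Capability(proto.memoryMb, proto.vcores);
    }
}

/** Hypothetical status wrapper that converts the capability at most once. */
final class CachedContainerStatus {
    private final CapabilityProto proto;
    private Capability capability; // built lazily on first access, then reused

    CachedContainerStatus(CapabilityProto proto) {
        this.proto = proto;
    }

    synchronized Capability getCapability() {
        if (capability == null) {
            capability = Capability.fromProto(proto);
        }
        return capability;
    }
}
```

With this shape, repeated getCapability() calls on the same status object
return the same instance, so the conversion cost is paid once per status
rather than once per call.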
Observed Logs (excerpt):
2025-09-29 01:23:32,546 INFO event.AsyncDispatcher: Size of event-queue is 2256000
2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: STATUS_UPDATE, Event record counter: 2081612
2025-09-29 01:23:32,544 INFO event.AsyncDispatcher: Event type: KILL, Event record counter: 27808
2025-09-29 01:23:32,543 INFO event.AsyncDispatcher: Event type: NODE_UPDATE, Event record counter: 224
...
EVENT statistics example:
STATUS_UPDATE: 2,081,612
KILL: 27,808
CONTAINER_FINISHED: 706
NODE_USABLE: 207
NODE_UPDATE: 224
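
For anyone reproducing these numbers, a per-type tally like the one above can
be computed with a simple count over the pending events. The sketch below is
illustrative only: EventType and EventQueueStats are hypothetical names, the
queue holds bare event types rather than real event objects, and none of it is
the AsyncDispatcher implementation.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical event types, named after those in the statistics above. */
enum EventType { STATUS_UPDATE, KILL, CONTAINER_FINISHED, NODE_USABLE, NODE_UPDATE }

final class EventQueueStats {
    /** Counts pending events per type without removing them from the queue. */
    static Map<EventType, Long> countByType(Queue<EventType> queue) {
        Map<EventType, Long> counts = new EnumMap<>(EventType.class);
        for (EventType type : queue) {
            counts.merge(type, 1L, Long::sum);
        }
        return counts;
    }

    /** Returns count as a percentage of total, or 0 when total is 0. */
    static double sharePercent(long count, long total) {
        return total == 0 ? 0.0 : 100.0 * count / total;
    }

    public static void main(String[] args) {
        Queue<EventType> queue = new LinkedBlockingQueue<>();
        queue.add(EventType.STATUS_UPDATE);
        queue.add(EventType.STATUS_UPDATE);
        queue.add(EventType.KILL);
        System.out.println(countByType(queue));

        // Share of STATUS_UPDATE in the reported queue size of 2256000.
        System.out.println(sharePercent(2_081_612L, 2_256_000L));
    }
}
```

Applied to the reported figures, sharePercent(2081612, 2256000) comes out
above 90, consistent with the description.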