Zhanqi Cai created YARN-10739:
---------------------------------
Summary: GenericEventHandler.printEventQueueDetails cause RM
recovery cost too much time
Key: YARN-10739
URL: https://issues.apache.org/jira/browse/YARN-10739
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.4.0, 3.3.1, 3.2.3
Reporter: Zhanqi Cai
YARN-10642 added GenericEventHandler.printEventQueueDetails to
AsyncDispatcher. When the event queue is very large, each call to
printEventQueueDetails takes a long time, and the RM becomes slow to
process events.
For example:
Suppose the cluster has 4K nodes and 4K running apps. When we do a switch,
the NodeManagers register with the RM again, and for each node the RM calls
NodesListManager to send an RMAppNodeUpdateEvent to every app, with code like
the following:
    for (RMApp app : rmContext.getRMApps().values()) {
      if (!app.isAppFinalStateStored()) {
        this.rmContext
            .getDispatcher()
            .getEventHandler()
            .handle(
                new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                    appNodeUpdateType));
      }
    }
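The fan-out of the loop above can be sketched as a simple product: one event per (registering node, unfinished app) pair. The class and method names below are hypothetical and only illustrate the arithmetic; they are not part of YARN.

```java
public class EventFanOut {
    // Hypothetical helper: the loop above emits one RMAppNodeUpdateEvent
    // per unfinished app, and it runs once for every registering node.
    static long totalEvents(long registeringNodes, long unfinishedApps) {
        return registeringNodes * unfinishedApps;
    }

    public static void main(String[] args) {
        // 4K nodes and 4K apps, as in the example in this report.
        System.out.println(totalEvents(4_000, 4_000)); // prints 16000000
    }
}
```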
So the total number of events is 4K * 4K = 16 million. During this window,
GenericEventHandler.printEventQueueDetails is called frequently to print the
event queue details. Once the queue size exceeds 1 million, iterating over
the queue in printEventQueueDetails becomes very slow, as in the code below:
    private void printEventQueueDetails() {
      Iterator<Event> iterator = eventQueue.iterator();
      Map<Enum, Long> counterMap = new HashMap<>();
      while (iterator.hasNext()) {
        Enum eventType = iterator.next().getType();
        ...
As a result, RM recovery takes far too long.
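To get a rough feel for the cost of one such scan, here is a minimal standalone sketch. It assumes the dispatcher queue behaves like a java.util.concurrent.LinkedBlockingQueue; the EventType enum and the countByType method are hypothetical stand-ins, not YARN code. Each call walks every queued element, so the work per call grows linearly with the queue size, and repeated calls multiply that cost.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueScanSketch {
    // Hypothetical stand-in for the dispatcher's event types.
    enum EventType { NODE_UPDATE, APP_UPDATE }

    // Counts queued events per type by iterating over the whole queue,
    // mirroring the iterator loop in the excerpt above.
    static Map<EventType, Long> countByType(BlockingQueue<EventType> queue) {
        Map<EventType, Long> counts = new EnumMap<>(EventType.class);
        for (EventType type : queue) {
            counts.merge(type, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumption: 1 million queued events, as in the report.
        BlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1_000_000; i++) {
            queue.add(i % 2 == 0 ? EventType.NODE_UPDATE : EventType.APP_UPDATE);
        }

        long start = System.nanoTime();
        Map<EventType, Long> counts = countByType(queue);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Timing depends on the machine and on concurrent producers/consumers.
        System.out.println(counts + " scanned in " + elapsedMs + " ms");
    }
}
```

The sketch runs single-threaded; in the RM the scan would also compete with threads that are adding to and draining the same queue, so the numbers here are only indicative.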