Zhanqi Cai created YARN-10739:
---------------------------------

             Summary: GenericEventHandler.printEventQueueDetails cause RM 
recovery cost too much time
                 Key: YARN-10739
                 URL: https://issues.apache.org/jira/browse/YARN-10739
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.4.0, 3.3.1, 3.2.3
            Reporter: Zhanqi Cai


YARN-10642 added GenericEventHandler.printEventQueueDetails to 
AsyncDispatcher. When the event queue is very large, 
printEventQueueDetails takes a long time to run, and the RM takes a long time to 
process events.

For example:
Suppose the cluster has 4K nodes and 4K running apps. After an RM switch, the 
NodeManagers register with the RM, and the RM calls NodesListManager to send an 
RMAppNodeUpdateEvent to each app, with code like the following:

for (RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}
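
As a rough check on the scale of this loop, the sketch below counts one event per (node, unfinished app) pair. The class and method names are hypothetical and not part of YARN, and it assumes every app is still unfinished when each node registers.

```java
public class FanOutSketch {
    // Hypothetical helper: one RMAppNodeUpdateEvent per unfinished app
    // for every node that registers (assumes no app finishes meanwhile).
    static long countEvents(int registeringNodes, int unfinishedApps) {
        return (long) registeringNodes * unfinishedApps;
    }

    public static void main(String[] args) {
        // 4K nodes and 4K running apps, as in the example above.
        System.out.println(countEvents(4000, 4000)); // prints 16000000
    }
}
```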
So the total number of events is 4K * 4K = 16 million. During this window, 
GenericEventHandler.printEventQueueDetails is called frequently to print the 
event queue details. Once the event queue size exceeds 1 million, iterating 
over the queue in printEventQueueDetails becomes very slow, as shown below:

private void printEventQueueDetails() {
  Iterator<Event> iterator = eventQueue.iterator();
  Map<Enum, Long> counterMap = new HashMap<>();
  while (iterator.hasNext()) {
    Enum eventType = iterator.next().getType();
    // ... (rest of the method omitted)

As a result, RM recovery takes far too long.
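
To make the cost concrete, here is a minimal, self-contained sketch of the same scan pattern outside of YARN. It assumes the dispatcher's queue behaves like a java.util.concurrent.LinkedBlockingQueue. The SketchEventType enum and the class name are hypothetical stand-ins for the real event types. Each call walks the entire backlog, so its cost grows linearly with the queue size, and it is paid again on every call.

```java
import java.util.EnumMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueScanSketch {
    // Hypothetical event types standing in for YARN's event enums.
    enum SketchEventType { NODE_UPDATE, APP_UPDATE }

    // Same shape as the loop in the report: iterate over the whole queue
    // and count the queued events per type.
    static Map<SketchEventType, Long> countByType(BlockingQueue<SketchEventType> queue) {
        Map<SketchEventType, Long> counters = new EnumMap<>(SketchEventType.class);
        Iterator<SketchEventType> iterator = queue.iterator();
        while (iterator.hasNext()) {
            counters.merge(iterator.next(), 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        BlockingQueue<SketchEventType> queue = new LinkedBlockingQueue<>();
        int backlog = 1_000_000; // backlog size at which the report sees slowness
        for (int i = 0; i < backlog; i++) {
            queue.add(i % 2 == 0 ? SketchEventType.NODE_UPDATE : SketchEventType.APP_UPDATE);
        }
        long start = System.nanoTime();
        Map<SketchEventType, Long> counters = countByType(queue);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(counters + " scanned in " + elapsedMs + " ms");
    }
}
```

The elapsed time depends on the machine, so no figure is quoted here. The point is only that a full scan per call adds up quickly when the call is frequent and the backlog is in the millions.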



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
