[jira] [Comment Edited] (YARN-10739) GenericEventHandler.printEventQueueDetails cause RM recovery cost too much time

Qi Zhu (Jira) Fri, 16 Apr 2021 05:38:05 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323774#comment-17323774
 ]


Qi Zhu edited comment on YARN-10739 at 4/16/21, 12:37 PM:
----------------------------------------------------------

Thanks [~zhanqi.cai] for reporting this. Using a time gap to throttle the 
print is not a good choice.

And you should use Time.monotonicNow(), which is replacing 
System.currentTimeMillis() in the Hadoop project.
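
If a time-based gap is kept anyway, the following is a minimal sketch of measuring it on a monotonic clock, so that wall-clock adjustments cannot shorten or stretch the interval. The class ThrottleSketch and its methods are hypothetical names, not part of AsyncDispatcher or the attached patch. The sketch calls System.nanoTime() directly so that it runs without Hadoop on the classpath; inside Hadoop code, Time.monotonicNow() would take its place.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper: permits an action at most once per interval. */
class ThrottleSketch {
  private final long intervalMs;
  private final LongSupplier clockMs; // must return monotonic milliseconds
  private long lastRunMs;
  private boolean hasRun = false;

  ThrottleSketch(long intervalMs, LongSupplier clockMs) {
    this.intervalMs = intervalMs;
    this.clockMs = clockMs;
  }

  /** Returns true if the caller may run the throttled action now. */
  synchronized boolean tryAcquire() {
    long now = clockMs.getAsLong();
    if (!hasRun || now - lastRunMs >= intervalMs) {
      hasRun = true;
      lastRunMs = now;
      return true;
    }
    return false;
  }

  /** Monotonic milliseconds; stand-in for Hadoop's Time.monotonicNow(). */
  static long monotonicNowMs() {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
  }

  public static void main(String[] args) {
    // 30s interval, as proposed in the issue description.
    ThrottleSketch throttle =
        new ThrottleSketch(30_000, ThrottleSketch::monotonicNowMs);
    System.out.println(throttle.tryAcquire()); // first call is allowed
    System.out.println(throttle.tryAcquire()); // immediate retry is suppressed
  }
}
```

The clock is injected as a LongSupplier only so the interval logic can be exercised with a fake clock in a test; that is a design choice of this sketch, not something the patch does.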

I think the key problem is that the queue size is too big, not just the 
iterator print.

I just finished YARN-9618.
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}{code}
The loop above is not reasonable; I have improved it in YARN-9618.
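
To make the queue-size concern concrete, here is a rough, scaled-down sketch of the fan-out described in the issue: every node event enqueues one event per unfinished app, so the queue grows as nodes times apps, and any per-type count has to walk the whole queue. FanOutSketch, its EventType enum, and both methods are hypothetical stand-ins, not the real RM classes, and the numbers are reduced from 4K x 4K to 400 x 400.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical model of the nodes-times-apps event fan-out. */
class FanOutSketch {
  // Illustrative event types only; not the real RM event enums.
  enum EventType { NODE_UPDATE, APP_NODE_UPDATE }

  /** Enqueues one app-level event per (node, running app) pair. */
  static BlockingQueue<EventType> enqueueNodeUpdates(int nodes, int runningApps) {
    BlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
    for (int n = 0; n < nodes; n++) {
      for (int a = 0; a < runningApps; a++) {
        queue.add(EventType.APP_NODE_UPDATE);
      }
    }
    return queue;
  }

  /** Counts events per type by iterating the whole queue: O(queue size). */
  static Map<EventType, Long> countByType(BlockingQueue<EventType> queue) {
    Map<EventType, Long> counters = new HashMap<>();
    for (EventType type : queue) {
      counters.merge(type, 1L, Long::sum);
    }
    return counters;
  }

  public static void main(String[] args) {
    BlockingQueue<EventType> queue = enqueueNodeUpdates(400, 400);
    System.out.println("queued events: " + queue.size()); // 160000
    System.out.println(countByType(queue));
  }
}
```

Under these assumptions, throttling the count only reduces how often the full walk happens; the queue itself still holds nodes times apps entries.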


was (Author: zhuqi):
Thanks [~zhanqi.cai] for reporting this. Using a time gap to throttle the 
print is not a good choice.

But you should use Time.monotonicNow(), which is replacing 
System.currentTimeMillis() in the Hadoop project.

I think the key problem is that the queue size is too big, not just the 
iterator print.

I just finished YARN-9618.
{code:java}
for(RMApp app : rmContext.getRMApps().values()) {
  if (!app.isAppFinalStateStored()) {
    this.rmContext
        .getDispatcher()
        .getEventHandler()
        .handle(
            new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                appNodeUpdateType));
  }
}{code}
The loop above is not reasonable; I have improved it in YARN-9618.

> GenericEventHandler.printEventQueueDetails cause RM recovery cost too much 
> time
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10739
>                 URL: https://issues.apache.org/jira/browse/YARN-10739
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.4.0, 3.3.1, 3.2.3
>            Reporter: Zhanqi Cai
>            Priority: Critical
>         Attachments: YARN-10739-001.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue is too large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If we have 4K nodes in the cluster and 4K apps running and we do a switch, 
> the node managers register with the RM, and the RM calls NodesListManager to 
> dispatch an RMAppNodeUpdateEvent for each app, with code like below:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
>     this.rmContext
>         .getDispatcher()
>         .getEventHandler()
>         .handle(
>             new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
>                 appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K * 4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails is called frequently to print the 
> event queue details. Once the event queue size reaches 1 million+, iterating 
> over the queue in printEventQueueDetails becomes very slow; refer to the code 
> below:
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
>     Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery takes too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 12000000
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: 
> NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
> Event record counter: 1
> {code}
> Between AsyncDispatcher.handle and printEventQueueDetails, more than 1s is 
> spent iterating over the queue.
> I uploaded a patch to ensure that printEventQueueDetails is called only once 
> per 30s.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
