[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413
 ] 

Eric Badger edited comment on YARN-9618 at 3/23/21, 8:52 PM:
-------------------------------------------------------------

bq. Actually, why we use an other async dispatcher here is try to make the 
rmDispatcher#eventQueue not boom to affect other event process. The boom will 
transformed to nodeListManagerDispatcher#eventQueue.
I think [~gandras]'s point is that all of the events are going to go through 
{{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will 
get the event in the eventQueue and will also do the processing. With this 
proposed change, {{rmDispatcher}} will get the event and then it will copy it 
over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will 
do the processing. But in both cases, {{rmDispatcher}} is dealing with 
{{RMAppNodeUpdateEvent}} in some way. 

So the question is whether copying the event or processing the event takes more 
time. If copying the event takes more time than processing the event, then this 
change only makes things worse. If processing the event takes more time than 
copying the event to the new async dispatcher, then this change makes sense and 
will remove some load on the {{rmDispatcher}}.
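
To make the comparison concrete, here is a minimal sketch of the two designs. It does not use Hadoop's actual dispatcher classes: the class name {{DispatcherHandoffSketch}}, the {{run}} method, and the two single-thread executors standing in for {{rmDispatcher}} and {{nodeListManagerDispatcher}} are all hypothetical. It only counts where the work lands, it does not measure how long copying or processing takes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: two single-thread executors stand in for the two dispatchers.
class DispatcherHandoffSketch {

    static final class Result {
        final int touchedByMain;    // events the main dispatcher thread had to dequeue
        final int processedOnMain;  // events whose processing ran on the main dispatcher thread
        final int processedTotal;   // events processed on any thread

        Result(int touchedByMain, int processedOnMain, int processedTotal) {
            this.touchedByMain = touchedByMain;
            this.processedOnMain = processedOnMain;
            this.processedTotal = processedTotal;
        }
    }

    static Result run(int events, boolean handOff) {
        ExecutorService mainDispatcher =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "main-dispatcher"));
        ExecutorService nodeListDispatcher =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "node-list-dispatcher"));

        AtomicInteger touchedByMain = new AtomicInteger();
        AtomicInteger processedOnMain = new AtomicInteger();
        AtomicInteger processedTotal = new AtomicInteger();

        // Stand-in for the real event processing; records which thread ran it.
        Runnable process = () -> {
            processedTotal.incrementAndGet();
            if ("main-dispatcher".equals(Thread.currentThread().getName())) {
                processedOnMain.incrementAndGet();
            }
        };

        for (int i = 0; i < events; i++) {
            mainDispatcher.execute(() -> {
                // The main dispatcher dequeues every event in both designs.
                touchedByMain.incrementAndGet();
                if (handOff) {
                    nodeListDispatcher.execute(process); // copy to the second queue
                } else {
                    process.run();                       // process inline
                }
            });
        }

        try {
            // Drain the main dispatcher first so every hand-off has been submitted.
            mainDispatcher.shutdown();
            if (!mainDispatcher.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("main dispatcher did not drain");
            }
            nodeListDispatcher.shutdown();
            if (!nodeListDispatcher.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("node list dispatcher did not drain");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new Result(touchedByMain.get(), processedOnMain.get(), processedTotal.get());
    }

    public static void main(String[] args) {
        for (boolean handOff : new boolean[] {false, true}) {
            Result r = run(1000, handOff);
            System.out.println("handOff=" + handOff
                + " touchedByMain=" + r.touchedByMain
                + " processedOnMain=" + r.processedOnMain
                + " processedTotal=" + r.processedTotal);
        }
    }
}
```

In both modes the main dispatcher thread touches every event. The hand-off only moves the processing off that thread, so it helps only if the per-event hand-off is cheaper than the per-event processing.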

[~gandras], is that right?


> NodeListManager event improvement
> ---------------------------------
>
>                 Key: YARN-9618
>                 URL: https://issues.apache.org/jira/browse/YARN-9618
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bibin Chundatt
>            Assignee: Qi Zhu
>            Priority: Critical
>         Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch
>
>
> The current NodeListManager event implementation blocks the async dispatcher, 
> which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event will create 1K 
> events; overall, that could be 5K*1K events for a 5K cluster.
> # Event processing is blocked till new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler.
> # Instead of adding events to the dispatcher, directly call the RMApp event handler.


