[
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307413#comment-17307413
]
Eric Badger edited comment on YARN-9618 at 3/23/21, 8:52 PM:
-------------------------------------------------------------
bq. Actually, why we use an other async dispatcher here is try to make the
rmDispatcher#eventQueue not boom to affect other event process. The boom will
transformed to nodeListManagerDispatcher#eventQueue.
I think [~gandras]'s point is that all of the events are going to go through
{{rmDispatcher}} either way. Without the proposed change, {{rmDispatcher}} will
get the event in the eventQueue and will also do the processing. With this
proposed change, {{rmDispatcher}} will get the event and then it will copy it
over to {{nodeListManagerDispatcher}}. Then {{nodeListManagerDispatcher}} will
do the processing. But in both cases, {{rmDispatcher}} is dealing with
{{RMAppNodeUpdateEvent}} in some way.
So the question is whether copying the event or processing the event takes more
time. If copying the event takes more time than processing the event, then this
change only makes things worse. If processing the event takes more time than
copying the event to the new async dispatcher, then this change makes sense and
will remove some load on the {{rmDispatcher}}.
[~gandras], is that right?
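The trade-off above can be illustrated with a minimal, self-contained sketch. It does not use Hadoop's {{AsyncDispatcher}}; the class name, thread names, and the {{run}} helper are all hypothetical stand-ins. It only shows that every event passes through the main queue in both modes, and that with the hand-off the main thread pays one enqueue per event instead of the processing itself.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the YARN-9618 patch and not Hadoop code.
public class DispatcherHandoffSketch {

    /** Records on which dispatcher thread the per-event processing ran. */
    public static final class Result {
        public final int processedOnMain;
        public final int processedOnSecondary;

        Result(int processedOnMain, int processedOnSecondary) {
            this.processedOnMain = processedOnMain;
            this.processedOnSecondary = processedOnSecondary;
        }
    }

    // Marker that tells a dispatcher thread to stop draining its queue.
    private static final Runnable POISON = () -> { };

    /** Starts a thread that runs queued events in order until POISON arrives. */
    private static Thread startDispatcher(String name, BlockingQueue<Runnable> queue) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Runnable event = queue.take();
                    if (event == POISON) {
                        return;
                    }
                    event.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.start();
        return t;
    }

    /** Sends eventCount events through the main queue, inline or with a hand-off. */
    public static Result run(boolean handOff, int eventCount) {
        BlockingQueue<Runnable> mainQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Runnable> secondaryQueue = new LinkedBlockingQueue<>();
        AtomicInteger onMain = new AtomicInteger();
        AtomicInteger onSecondary = new AtomicInteger();

        Thread main = startDispatcher("main-dispatcher", mainQueue);
        Thread secondary = startDispatcher("secondary-dispatcher", secondaryQueue);

        // Stand-in for the real per-event work: it only notes which thread ran it.
        Runnable process = () -> {
            if (Thread.currentThread().getName().equals("main-dispatcher")) {
                onMain.incrementAndGet();
            } else {
                onSecondary.incrementAndGet();
            }
        };
        // With the hand-off, the main thread only copies the event to the other queue.
        Runnable forward = () -> secondaryQueue.add(process);

        try {
            for (int i = 0; i < eventCount; i++) {
                // Every event enters through the main queue in both modes.
                mainQueue.put(handOff ? forward : process);
            }
            mainQueue.put(POISON);
            main.join();               // all hand-offs are done once main exits
            secondaryQueue.put(POISON);
            secondary.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new Result(onMain.get(), onSecondary.get());
    }

    public static void main(String[] args) {
        Result inline = run(false, 1000);
        Result handedOff = run(true, 1000);
        System.out.println("inline:   main=" + inline.processedOnMain
                + " secondary=" + inline.processedOnSecondary);
        System.out.println("hand-off: main=" + handedOff.processedOnMain
                + " secondary=" + handedOff.processedOnSecondary);
    }
}
```

The sketch deliberately measures where the work runs rather than how long it takes. Whether the hand-off helps in practice depends on the relative cost of the enqueue and the processing, which would have to be measured on a real cluster.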
> NodeListManager event improvement
> ---------------------------------
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: Qi Zhu
> Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch,
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch
>
>
> In the current implementation, NodeListManager events block the async
> dispatcher, which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event creates 1K
> events, so overall there could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, call the RMApp event handler
> directly.