[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bence Kosztolnik updated YARN-11656:
------------------------------------
Description:
h2. Problem statement

I observed a YARN cluster that had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel Pi example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory (CPU was the bottleneck).
Eventually I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- expose metrics for the RMStateStore event queue, so the problem can be identified easily if it occurs on a cluster
{panel:title=Issue visible on UI2}
!issue.png|height=250!
{panel}
Another way to identify the issue is to check whether storing the application info takes too long after an app reaches the NEW_SAVING state.
{panel:title=How the issue can look in the log}
!log.png!
{panel}
h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface and runs event handlers on a pool of parallel threads (a simplified sketch follows the list below). The dispatcher creates a separate metrics object called _Event metrics for "rm-state-store"_ where we can see:
- how many unhandled events are currently in the event queue, per event type
- how many events were handled, per event type
- the average execution time, per event type
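To make the design concrete, here is a minimal, hypothetical sketch of such a multi-threaded dispatcher (not the actual YARN-11656 patch): a bounded ThreadPoolExecutor replaces the single dispatcher thread, so one slow state-store write blocks only one worker instead of the whole queue. The {{Event}}/{{EventHandler}} stand-ins and every name below are simplified assumptions, not YARN's real interfaces.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Simplified stand-ins for YARN's event types (assumptions, not the real API).
interface Event { Enum<?> getType(); }
interface EventHandler<T extends Event> { void handle(T event); }

public class MultiDispatcherSketch {

  // Handlers keyed by the enum class of the event type.
  private final Map<Class<?>, EventHandler<Event>> handlers = new ConcurrentHashMap<>();

  // Mirrors the configs in the table below: core pool 4 (default-pool-size),
  // max 8 (max-pool-size), 10s keep-alive (keep-alive-seconds), and a bounded
  // queue of 1,000,000 events (queue-size). Extra threads beyond the core pool
  // are only created once the queue is full.
  private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
      4, 8, 10, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1_000_000));

  @SuppressWarnings("unchecked")
  public void register(Class<? extends Enum<?>> eventType,
      EventHandler<? extends Event> handler) {
    handlers.put(eventType, (EventHandler<Event>) handler);
  }

  // Hand the event to a worker thread instead of a single dispatcher thread.
  public void dispatch(Event event) {
    EventHandler<Event> handler = handlers.get(event.getType().getDeclaringClass());
    if (handler == null) {
      throw new IllegalStateException("No handler registered for " + event.getType());
    }
    pool.execute(() -> handler.handle(event));
  }

  // graceful-stop-seconds: let queued events drain before forcing shutdown.
  public void stop(long gracefulStopSeconds) throws InterruptedException {
    pool.shutdown();
    if (!pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS)) {
      pool.shutdownNow();
    }
  }
}
{code}

Note that a real implementation also has to preserve ordering between events that belong to the same application, which this sketch deliberately ignores.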
The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store); a hypothetical configuration example follows the table:
||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|Number of threads executing events in parallel|4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the number of execution threads scales up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1000000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits up to this many seconds to process the incoming events before terminating the worker threads|60|
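For instance, assuming these keys are picked up from yarn-site.xml like other YARN settings (an assumption: the table above only defines the key names and defaults), an operator with a known-slow state store might raise the pool sizes and the graceful-stop window:

{code:xml}
<!-- Hypothetical yarn-site.xml snippet: key names come from the table above,
     with the {} placeholder replaced by the dispatcher name rm-state-store. -->
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.default-pool-size</name>
  <value>8</value>
</property>
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.max-pool-size</name>
  <value>16</value>
</property>
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.graceful-stop-seconds</name>
  <value>120</value>
</property>
{code}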
h2. Testing

> RMStateStore event queue blocked
> --------------------------------
>
>                 Key: YARN-11656
>                 URL: https://issues.apache.org/jira/browse/YARN-11656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.1
>            Reporter: Bence Kosztolnik
>            Priority: Major
>         Attachments: issue.png, log.png