[jira] [Assigned] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2
[ https://issues.apache.org/jira/browse/YARN-11687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik reassigned YARN-11687:
---------------------------------------
Assignee: Bence Kosztolnik

> Update CGroupsResourceCalculator to track usages using cgroupv2
> ---------------------------------------------------------------
>
> Key: YARN-11687
> URL: https://issues.apache.org/jira/browse/YARN-11687
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Assignee: Bence Kosztolnik
> Priority: Major
>
> [CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java]
> should also be updated to handle the cgroup v2 changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11656:
Description:

h2. Problem statement

I observed that the Yarn cluster had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where each node had 8 cores and plenty of memory (so the CPU was the bottleneck). Eventually I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can easily be identified if it occurs on a cluster

{panel:title=Issue visible on UI2}
!issue.png|height=250!
{panel}

Another way to identify the issue is to check whether too much time is needed to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
!log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The dispatcher creates a separate metrics object called _Event metrics for "rm-state-store"_ where we can see:
- how many unhandled events are currently in the event queue for a specific event type
- how many events were handled for a specific event type
- the average execution time for a specific event type

The dispatcher has the following configs (the placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads execute the events|4|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|0|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds to process the incoming events before terminating them|60|
|yarn.dispatcher.multi-thread.{}.*metrics-enabled*|Whether the dispatcher should publish metrics data to the metrics system|false|

{panel:title=Example output from RM JMX API}
{noformat}
...
{
"name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
"modelerType": "Event metrics for rm-state-store",
"tag.Context": "yarn",
"tag.Hostname": CENSORED,
"RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
"RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
"RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
"RMStateStoreEventType#STORE_APP_Current": 124,
"RMStateStoreEventType#STORE_APP_NumOps": 46,
"RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
"RMStateStoreEventType#UPDATE_APP_Current": 31,
"RMStateStoreEventType#UPDATE_APP_NumOps": 16,
"RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
"RMStateStoreEventType#REMOVE_APP_Current": 12,
"RMStateStoreEventType#REMOVE_APP_NumOps": 3,
"RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
"RMStateStoreEventType#FENCED_Current": 0,
"RMStateStoreEventType#FENCED_NumOps": 0,
"RMStateStoreEventType#FENCED_AvgTime": 0.0,
"RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
"RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
"RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
"RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
"RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
"RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
"RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
"RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
"RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
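The dispatcher described above (a bounded event queue drained by a pool of worker threads, with per-event-type counters) can be sketched roughly as follows. This is a minimal illustration, not the actual MultiDispatcher from the patch: the names (MiniMultiDispatcher, Event, dispatch, handledCount, stop) are hypothetical, and the real implementation also publishes the per-event-type metrics to the Hadoop metrics system and honors all the configs in the table.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a dispatcher that drains one bounded event queue
// with a pool of worker threads and counts handled events per event type.
class MiniMultiDispatcher {

    static final class Event {
        final String type;    // e.g. "STORE_APP"
        final Runnable work;  // the (slow) persist operation
        Event(String type, Runnable work) { this.type = type; this.work = work; }
    }

    private final ExecutorService pool;
    private final BlockingQueue<Event> queue;  // bounded, cf. "queue-size"
    private final Map<String, AtomicLong> handled = new ConcurrentHashMap<>();

    MiniMultiDispatcher(int poolSize, int queueSize) {  // cf. "default-pool-size"
        queue = new LinkedBlockingQueue<>(queueSize);
        pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.submit(this::drain);
        }
    }

    private void drain() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Event e = queue.take();  // block until an event arrives
                e.work.run();
                handled.computeIfAbsent(e.type, k -> new AtomicLong()).incrementAndGet();
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();  // worker asked to stop
        }
    }

    void dispatch(Event e) throws InterruptedException {
        queue.put(e);  // blocks if the queue is full
    }

    long handledCount(String type) {
        AtomicLong c = handled.get(type);
        return c == null ? 0 : c.get();
    }

    // Rough analogue of "graceful-stop-seconds": let queued events drain
    // for up to the given time, then interrupt the workers.
    void stop(int gracefulStopSeconds) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(gracefulStopSeconds);
        while (!queue.isEmpty() && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        pool.shutdownNow();
        pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS);
    }
}
```

With a pool size of 4, four slow persist calls run concurrently instead of serializing behind a single dispatcher thread, which is the core of the improvement: a 1~20 second FileSystemRMStateStore write no longer blocks every other pending state-store event.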
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Summary: RMStateStore event queue blocked (was: RMStateStore eventqueue blocked) > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Attachments: issue.png, log.png > > > h2. Problem statement > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel} > Also another way to identify the issue if we can see too much time is > required to store info for app after reach new_saving state > {panel:title=How issue can look like in log} > !log.png|height=250! > {panel} > h2. Solution > Created a *MultiDispatcher* class which implements the Dispatcher interface. 
> The Dispatcher creates a separate metric object called _Event metrics for > "rm-state-store"_ where we can see > - how many unhandled events are currently present in the event queue for the > specific event type > - how many events were handled for the specific event type > - average execution time for the specific event > The dispatcher has the following configs ( the placeholder is for the > dispatcher name, for example, rm-state-store ) > ||Config name||Description||Default value|| > |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel > threads should execute the parallel event execution| 4| > |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full > the execution threads will scale up to this many|8| > |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will > be destroyed after this many seconds|10| > |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the eventqueue|1 000 > 000| > |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event > queue will be logged with this frequency (if not zero) |30| > |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop > signal the dispatcher will wait this many seconds to be able to process the > incoming events before terminating them|60| > {panel:title=Example output from RM JMX api} > {noformat} > ... 
> { > "name": "Hadoop:service=ResourceManager,name=Event metrics for > rm-state-store", > "modelerType": "Event metrics for rm-state-store", > "tag.Context": "yarn", > "tag.Hostname": CENSORED > "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51, > "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0, > "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0, > "RMStateStoreEventType#STORE_APP_Current": 124, > "RMStateStoreEventType#STORE_APP_NumOps": 46, > "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25, > "RMStateStoreEventType#UPDATE_APP_Current": 31, > "RMStateStoreEventType#UPDATE_APP_NumOps": 16, > "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5, > "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31, > "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12, > "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5, > "RMStateStoreEventType#REMOVE_APP_Current": 12, > "RMStateStoreEventType#REMOVE_APP_NumOps": 3, > "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0, > "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0, > "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0, > "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0, > "RMStateStoreEventType#FENCED_Current": 0, > "RMStateStoreEventType#FENCED_NumOps": 0, > "RMStateStoreEventType#FENCED_AvgTime": 0.0, > "RMStateStoreEventType#STORE_MASTERKEY_Current": 0, > "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0, > "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0, >
[jira] [Updated] (YARN-11656) RMStateStore eventqueue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11656:
Summary: RMStateStore eventqueue blocked (was: RMStateStore event queue blocked)
[jira] [Assigned] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik reassigned YARN-11656:
---------------------------------------
Assignee: Bence Kosztolnik
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11656:
Description:

h2. Problem statement

I observed that the Yarn cluster had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where each node had 8 cores and plenty of memory (so the CPU was the bottleneck). Eventually I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher that can persist events on parallel threads
- create metric data for the RMStateStore event queue so the problem can easily be identified if it occurs on a cluster

{panel:title=Issue visible on UI2}
!issue.png|height=250!
{panel}

Another way to identify the issue is to check whether too much time is needed to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
!log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface.
The dispatcher creates a separate metrics object called _Event metrics for "rm-state-store"_ where we can see:
- how many unhandled events are currently in the event queue for a specific event type
- how many events were handled for a specific event type
- the average execution time for a specific event type

The dispatcher has the following configs (the placeholder stands for the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads execute the events|4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the execution threads scale up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds to process the incoming events before terminating them|60|

{panel:title=Example output from RM JMX API}
{noformat}
...
{
"name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
"modelerType": "Event metrics for rm-state-store",
"tag.Context": "yarn",
"tag.Hostname": CENSORED,
"RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
"RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
"RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
"RMStateStoreEventType#STORE_APP_Current": 124,
"RMStateStoreEventType#STORE_APP_NumOps": 46,
"RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
"RMStateStoreEventType#UPDATE_APP_Current": 31,
"RMStateStoreEventType#UPDATE_APP_NumOps": 16,
"RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.5,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
"RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.5,
"RMStateStoreEventType#REMOVE_APP_Current": 12,
"RMStateStoreEventType#REMOVE_APP_NumOps": 3,
"RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
"RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
"RMStateStoreEventType#FENCED_Current": 0,
"RMStateStoreEventType#FENCED_NumOps": 0,
"RMStateStoreEventType#FENCED_AvgTime": 0.0,
"RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
"RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
"RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
"RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
"RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
"RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
"RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
"RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
"RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
"RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
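For illustration, enabling these settings for the rm-state-store dispatcher in yarn-site.xml might look like the following. The property names come from the table above with the placeholder replaced by the dispatcher name; the values simply echo the listed defaults, so this fragment is a hypothetical example rather than a recommended configuration.

```xml
<!-- Hypothetical yarn-site.xml fragment; the {} placeholder from the
     table is replaced by the dispatcher name "rm-state-store". -->
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.default-pool-size</name>
  <value>4</value>
</property>
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.queue-size</name>
  <value>1000000</value>
</property>
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.monitor-seconds</name>
  <value>30</value>
</property>
<property>
  <name>yarn.dispatcher.multi-thread.rm-state-store.graceful-stop-seconds</name>
  <value>60</value>
</property>
```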
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11656:
Description:

h2. Problem statement

I observed that a YARN cluster had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where each node had 8 cores and plenty of memory (so CPU was the bottleneck). Eventually I realized the RM had an IO bottleneck and needed 1~20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted on parallel threads
- create metric data for the RMStateStore event queue, so the problem can be identified easily if it occurs on a cluster

{panel:title=Issue visible on UI2}
!issue.png|height=250!
{panel}

Another way to identify the issue is when the log shows that too much time is required to store info for an app after it reaches the NEW_SAVING state.

{panel:title=How the issue can look in the log}
!log.png|height=250!
{panel}

h2. Solution

Created a *MultiDispatcher* class which implements the Dispatcher interface. The dispatcher creates a separate metrics object called _Event metrics for "rm-state-store"_ where we can see:
- how many unhandled events are currently present in the event queue for each event type
- how many events were handled for each event type
- the average execution time for each event type

The dispatcher has the following configs (the {} placeholder is the dispatcher name, for example rm-state-store):

||Config name||Description||Default value||
|yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads execute the events|4|
|yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the number of execution threads scales up to this many|8|
|yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are destroyed after this many seconds|10|
|yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
|yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|30|
|yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds to process the incoming events before terminating them|60|

h2. Testing
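The config table above maps naturally onto a `java.util.concurrent.ThreadPoolExecutor`. The sketch below is a hypothetical illustration of that mapping, not the actual MultiDispatcher code; the class and variable names are mine, and the config keys appear only as comments.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the thread-pool dispatcher pattern described in the
// table above (hypothetical names, not the real MultiDispatcher implementation).
public class MultiDispatcherSketch {
    public static void main(String[] args) throws Exception {
        int defaultPoolSize = 4;       // yarn.dispatcher.multi-thread.{}.default-pool-size
        int maxPoolSize = 8;           // ...{}.max-pool-size
        int keepAliveSeconds = 10;     // ...{}.keep-alive-seconds
        int queueSize = 1_000_000;     // ...{}.queue-size
        int gracefulStopSeconds = 60;  // ...{}.graceful-stop-seconds

        // ThreadPoolExecutor matches the table's semantics: it grows beyond the
        // core size only when the bounded event queue is full, and idle extra
        // threads die after the keep-alive time.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            defaultPoolSize, maxPoolSize,
            keepAliveSeconds, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(queueSize));

        // Persist "events" on parallel threads instead of one blocking dispatcher thread.
        for (int i = 0; i < 16; i++) {
            final int id = i;
            pool.execute(() -> System.out.println("stored event " + id));
        }

        // Graceful stop: let queued events drain before terminating the workers.
        pool.shutdown();
        pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS);
    }
}
```

With a queue this large the pool stays at the default size in practice; the max-pool-size headroom only kicks in under the extreme backlog the issue describes.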
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11656:
Attachment: log.png
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} Also another way to identify the issue if we can see too much time is required to store info for app after reach new_saving state {panel:title=How issue can look like in log} {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. 
The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example, rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the execution threads will scale up to this many|8| |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminating them|60| h2. Testing was: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example, rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the execution threads will scale up to this many|8| |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminating them|60| h2. Testing > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project:
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. 
The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example, rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full the execution threads will scale up to this many|8| |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminating them|60| h2. Testing was: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.{}.max-pool-size|If event queue is full the execution threads will scale up to this manny|8| |yarn.dispatcher.multi-thread.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.{}.queue-size|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.{}.graceful-stop-seconds|After stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminates them|60| h2. Testing > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > >
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. 
The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.max-pool-size.{}.max-pool-size|If event queue is full the execution threads will scale up to this manny|8| |yarn.dispatcher.multi-thread.max-pool-size.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.max-pool-size.{}.queue-size|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.max-pool-size.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.max-pool-size.{}.graceful-stop-seconds|After stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminates them|60| h2. Testing was: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| h2. Testing > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > h2. Problem statement > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. 
The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.{}.max-pool-size|If event queue is full the execution threads will scale up to this manny|8| |yarn.dispatcher.multi-thread.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.{}.queue-size|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.{}.graceful-stop-seconds|After stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminates them|60| h2. Testing was: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| |yarn.dispatcher.multi-thread.max-pool-size.{}.max-pool-size|If event queue is full the execution threads will scale up to this manny|8| |yarn.dispatcher.multi-thread.max-pool-size.{}.keep-alive-seconds|Execution threads will be destroyed after this many seconds|10| |yarn.dispatcher.multi-thread.max-pool-size.{}.queue-size|Size of the eventqueue|1 000 000| |yarn.dispatcher.multi-thread.max-pool-size.{}.monitor-seconds|The size of the event queue will be logged with this frequency (if not zero) |30| |yarn.dispatcher.multi-thread.max-pool-size.{}.graceful-stop-seconds|After stop signal the dispatcher will wait this many seconds to be able to process the incoming events before terminates them|60| h2. Testing > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event The dispatcher has the following configs ( the placeholder is for the dispatcher name, for example rm-state-store ) ||Config name||Description||Default value|| |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many parallel threads should execute the parallel event execution| 4| h2. Testing was: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. 
The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event h2. Testing > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > h2. Problem statement > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). 
> To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel} > h2. Solution > Created a *MultiDispatcher* class which implements the Dispatcher interface. > The Dispatcher creates a separate metric object called _Event metrics for > "rm-state-store"_ where we can see > - how many unhandled events are currently present in the event queue for the > specific event type > - how many events were handled for the specific event type > - average execution time for the specific event > The dispatcher has the following configs ( the placeholder is for the > dispatcher name, for example rm-state-store ) > ||Config name||Description||Default value|| > |yarn.dispatcher.multi-thread.max-pool-size.{}.default-pool-size|How many > parallel threads should execute the parallel event execution|
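The MultiDispatcher patch itself is not shown in this thread. As a rough, hypothetical sketch of the idea it describes (events handed to a fixed-size thread pool, with per-event-type counters for queued and handled events), the class and method names below are illustrative and not Hadoop's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a multi-threaded dispatcher: events are executed on a
// fixed-size pool, and simple per-event-type counters mirror the "unhandled
// events in queue" and "events handled" metrics described above.
public class MultiDispatcherSketch {

    private final ExecutorService pool;
    private final Map<String, AtomicLong> queued = new ConcurrentHashMap<>();
    private final Map<String, AtomicLong> handled = new ConcurrentHashMap<>();

    public MultiDispatcherSketch(int poolSize) {
        this.pool = Executors.newFixedThreadPool(poolSize);
    }

    public void dispatch(String eventType, Runnable handler) {
        queued.computeIfAbsent(eventType, k -> new AtomicLong()).incrementAndGet();
        pool.submit(() -> {
            try {
                handler.run(); // e.g. persist the event to the state store
            } finally {
                queued.get(eventType).decrementAndGet();
                handled.computeIfAbsent(eventType, k -> new AtomicLong()).incrementAndGet();
            }
        });
    }

    public long handledCount(String eventType) {
        AtomicLong c = handled.get(eventType);
        return c == null ? 0 : c.get();
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        // 4 matches the default pool size from the config table above
        MultiDispatcherSketch d = new MultiDispatcherSketch(4);
        for (int i = 0; i < 8; i++) {
            d.dispatch("STORE_APP", () -> { });
        }
        d.shutdown();
        System.out.println(d.handledCount("STORE_APP"));
    }
}
```

A real implementation would also track average execution time and expose the counters through Hadoop's metrics system rather than plain maps.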
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement h2. I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution h2. Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event h2. Testing h2. was: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} Solution: Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > h2. Problem statement h2. > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel} > h2. Solution h2. > Created a *MultiDispatcher* class which implements the Dispatcher interface. 
> The Dispatcher creates a separate metric object called _Event metrics for > "rm-state-store"_ where we can see > - how many unhandled events are currently present in the event queue for the > specific event type > - how many events were handled for the specific event type > - average execution time for the specific event > h2. Testing h2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: h2. Problem statement I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event h2. Testing was: h2. Problem statement h2. I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} h2. Solution h2. Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event h2. Testing h2. > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > h2. Problem statement > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel} > h2. Solution > Created a *MultiDispatcher* class which implements the Dispatcher interface. 
> The Dispatcher creates a separate metric object called _Event metrics for > "rm-state-store"_ where we can see > - how many unhandled events are currently present in the event queue for the > specific event type > - how many events were handled for the specific event type > - average execution time for the specific event > h2. Testing
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} Solution: Created a *MultiDispatcher* class which implements the Dispatcher interface. The Dispatcher creates a separate metric object called _Event metrics for "rm-state-store"_ where we can see - how many unhandled events are currently present in the event queue for the specific event type - how many events were handled for the specific event type - average execution time for the specific event was: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel} > Solution: > Created a *MultiDispatcher* class which implements the Dispatcher interface. > The Dispatcher creates a separate metric object called _Event metrics for > "rm-state-store"_ where we can see > - how many unhandled events are currently present in the event queue for the > specific event type > - how many events were handled for the specific event type > - average execution time for the specific event
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250! {panel} was: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250,width=250! 
{panel} > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250! > {panel}
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png! {panel} was: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). 
To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} {panel} > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png! > {panel}
[jira] [Updated] (YARN-11656) RMStateStore event queue blocked
[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11656: Description: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png|height=250,width=250! {panel} was: I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} !issue.png! 
{panel} > RMStateStore event queue blocked > > > Key: YARN-11656 > URL: https://issues.apache.org/jira/browse/YARN-11656 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.1 >Reporter: Bence Kosztolnik >Priority: Major > Attachments: issue.png > > > I observed Yarn cluster has pending and available resources as well, but the > cluster utilization is usually around ~50%. The cluster had loaded with 200 > parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 > reduce containers configured, on a 50 nodes cluster, where each node had 8 > cores, and a lot of memory (there was cpu bottleneck). > Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to > persist a RMStateStoreEvent (using FileSystemRMStateStore). > To reduce the impact of the issue: > - create a dispatcher where events can persist in parallel threads > - create metric data for the RMStateStore event queue to be able easily to > identify the problem if occurs on a cluster > {panel:title=Issue visible on UI2} > !issue.png|height=250,width=250! > {panel}
[jira] [Created] (YARN-11656) RMStateStore event queue blocked
Bence Kosztolnik created YARN-11656: --- Summary: RMStateStore event queue blocked Key: YARN-11656 URL: https://issues.apache.org/jira/browse/YARN-11656 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.1 Reporter: Bence Kosztolnik Attachments: issue.png I observed Yarn cluster has pending and available resources as well, but the cluster utilization is usually around ~50%. The cluster had loaded with 200 parallel PI example job (from hadoop-mapreduce-examples) with 20 map and 20 reduce containers configured, on a 50 nodes cluster, where each node had 8 cores, and a lot of memory (there was cpu bottleneck). Finally, I realized the RM had some IO bottleneck and needed 1~20 seconds to persist a RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can persist in parallel threads - create metric data for the RMStateStore event queue to be able easily to identify the problem if occurs on a cluster {panel:title=Issue visible on UI2} {panel}
[jira] [Comment Edited] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
[ https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624 ] Bence Kosztolnik edited comment on YARN-11010 at 2/15/24 10:36 AM: --- I just saw this, maybe we can use this solution here as well HADOOP-18954 was (Author: JIRAUSER292672): I just so this, maybe we can use this solution here as well HADOOP-18954 > YARN ui2 hangs on the Queues page when the scheduler response contains NaN > values > - > > Key: YARN-11010 > URL: https://issues.apache.org/jira/browse/YARN-11010 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Attachments: capacity-scheduler.xml, shresponse.json > > > When the scheduler response contains NaN values for capacity and maxCapacity > the UI2 hangs on the Queues page. The console log shows the following error: > {code:java} > SyntaxError: Unexpected token N in JSON at position 666 {code} > The scheduler response: > {code:java} > "maxCapacity": NaN, > "absoluteMaxCapacity": NaN, {code} > NaN, infinity, -infinity is not valid in JSON syntax: > https://www.json.org/json-en.html > This might be related as well: YARN-10452 > > I managed to reproduce this with AQCv1, where I set the parent queue's > capacity in absolute mode, then I used percentage mode on the > leaf-queue-template. I'm not sure if this is a valid configuration, however > there is no error or warning in RM logs about any configuration error. To > trigger the issue the DominantResourceCalculator must be used. (When using > absolute mode on the leaf-queue-template this issue is not re-producible, > further details on: YARN-10922). 
> > Reproduction steps: > # Start the cluster with the attached configuration > # Check the Queues page on UI2 (it should work at this point) > # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar > pi 1 10) > # Check the Queues page on UI2 (it should not be working at this point)
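Since NaN, Infinity and -Infinity are not legal JSON numbers, one common workaround (illustrative only; the actual approach taken in HADOOP-18954 is not reproduced here) is to clamp non-finite values before they reach the JSON writer:

```java
// Illustrative sketch: replace non-finite floats with 0 before serialization,
// so the scheduler response never emits a literal NaN that breaks JSON.parse.
public class JsonSafeNumbers {

    // Float.isFinite is false for NaN, POSITIVE_INFINITY and NEGATIVE_INFINITY.
    static float sanitize(float v) {
        return Float.isFinite(v) ? v : 0f;
    }

    public static void main(String[] args) {
        System.out.println(sanitize(Float.NaN));               // 0.0
        System.out.println(sanitize(0.5f));                    // 0.5
        System.out.println(sanitize(Float.POSITIVE_INFINITY)); // 0.0
    }
}
```

Whether 0 is the right substitute depends on the field's meaning; the point is only that the serialized response must stay within JSON's number grammar.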
[jira] [Commented] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
[ https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624 ] Bence Kosztolnik commented on YARN-11010: - I just saw this, maybe we can use this solution here as well > YARN ui2 hangs on the Queues page when the scheduler response contains NaN > values > - > > Key: YARN-11010 > URL: https://issues.apache.org/jira/browse/YARN-11010 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Attachments: capacity-scheduler.xml, shresponse.json > > > When the scheduler response contains NaN values for capacity and maxCapacity > the UI2 hangs on the Queues page. The console log shows the following error: > {code:java} > SyntaxError: Unexpected token N in JSON at position 666 {code} > The scheduler response: > {code:java} > "maxCapacity": NaN, > "absoluteMaxCapacity": NaN, {code} > NaN, infinity, -infinity is not valid in JSON syntax: > https://www.json.org/json-en.html > This might be related as well: YARN-10452 > > I managed to reproduce this with AQCv1, where I set the parent queue's > capacity in absolute mode, then I used percentage mode on the > leaf-queue-template. I'm not sure if this is a valid configuration, however > there is no error or warning in RM logs about any configuration error. To > trigger the issue the DominantResourceCalculator must be used. (When using > absolute mode on the leaf-queue-template this issue is not re-producible, > further details on: YARN-10922). 
> > Reproduction steps: > # Start the cluster with the attached configuration > # Check the Queues page on UI2 (it should work at this point) > # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar > pi 1 10) > # Check the Queues page on UI2 (it should not be working at this point)
[jira] [Comment Edited] (YARN-11010) YARN ui2 hangs on the Queues page when the scheduler response contains NaN values
[ https://issues.apache.org/jira/browse/YARN-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817624#comment-17817624 ] Bence Kosztolnik edited comment on YARN-11010 at 2/15/24 9:30 AM: -- I just saw this, maybe we can use this solution here as well HADOOP-18954 was (Author: JIRAUSER292672): I just so this, maybe we can use this solution here as well > YARN ui2 hangs on the Queues page when the scheduler response contains NaN > values > - > > Key: YARN-11010 > URL: https://issues.apache.org/jira/browse/YARN-11010 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Attachments: capacity-scheduler.xml, shresponse.json > > > When the scheduler response contains NaN values for capacity and maxCapacity > the UI2 hangs on the Queues page. The console log shows the following error: > {code:java} > SyntaxError: Unexpected token N in JSON at position 666 {code} > The scheduler response: > {code:java} > "maxCapacity": NaN, > "absoluteMaxCapacity": NaN, {code} > NaN, infinity, -infinity is not valid in JSON syntax: > https://www.json.org/json-en.html > This might be related as well: YARN-10452 > > I managed to reproduce this with AQCv1, where I set the parent queue's > capacity in absolute mode, then I used percentage mode on the > leaf-queue-template. I'm not sure if this is a valid configuration, however > there is no error or warning in RM logs about any configuration error. To > trigger the issue the DominantResourceCalculator must be used. (When using > absolute mode on the leaf-queue-template this issue is not re-producible, > further details on: YARN-10922). 
> > Reproduction steps: > # Start the cluster with the attached configuration > # Check the Queues page on UI2 (it should work at this point) > # Send an example job (yarn jar hadoop-mapreduce-examples-3.4.0-SNAPSHOT.jar > pi 1 10) > # Check the Queues page on UI2 (it should not be working at this point)
[jira] [Updated] (YARN-11634) Speed-up TestTimelineClient
[ https://issues.apache.org/jira/browse/YARN-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11634: Priority: Minor (was: Major) > Speed-up TestTimelineClient > --- > > Key: YARN-11634 > URL: https://issues.apache.org/jira/browse/YARN-11634 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > > The TimelineConnector.class has a hardcoded 1-minute connection time out, > which makes the TestTimelineClient a long-running test (~15:30 min). > Decreasing the timeout to 10ms will speed up the test run (~56 sec).
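To illustrate why the hardcoded timeout dominates the test runtime, the stdlib sketch below connects to an unreachable TEST-NET address with a 10 ms connect timeout and checks that the failure arrives well under the 60-second default; the address and port are illustrative and unrelated to TimelineConnector's actual internals:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative only: a tiny connect timeout makes a doomed connection attempt
// fail in milliseconds instead of tying up the test for a full minute.
public class FastFailConnect {
    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean connected = false;
        try (Socket s = new Socket()) {
            // 203.0.113.1 is a TEST-NET-3 address that should not be reachable.
            s.connect(new InetSocketAddress("203.0.113.1", 80), 10);
            connected = true;
        } catch (IOException e) {
            // expected: timeout or unreachable, within roughly 10 ms
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // true when the attempt failed far faster than the 60 s default would allow
        System.out.println(!connected && elapsedMs < 60_000);
    }
}
```

The same reasoning explains the reported numbers: many such attempts at 60 s each add up to ~15:30 min, while 10 ms timeouts bring the run to ~56 sec.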
[jira] [Created] (YARN-11634) Speed-up TestTimelineClient
Bence Kosztolnik created YARN-11634: --- Summary: Speed-up TestTimelineClient Key: YARN-11634 URL: https://issues.apache.org/jira/browse/YARN-11634 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Bence Kosztolnik Assignee: Bence Kosztolnik The TimelineConnector class has a hardcoded 1-minute connection timeout, which makes TestTimelineClient a long-running test (~15:30 min). Decreasing the timeout to 10 ms speeds up the test run (~56 sec). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
Bence Kosztolnik created YARN-11567: --- Summary: Aggregate container launch debug artifacts automatically in case of error Key: YARN-11567 URL: https://issues.apache.org/jira/browse/YARN-11567 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Bence Kosztolnik In cases where a container fails to launch without writing to a log file, we often would want to see the artifacts captured by {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better understand the cause of the exit code. To enable this feature for every container maybe over kill, so we need a feature flag to capture these artifacts in case of errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
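The decision logic the ticket proposes can be sketched as a small predicate. Only `yarn.nodemanager.log-container-debug-info.enabled` exists today; the error-only flag and the helper below are hypothetical names for the feature being requested.

```java
public class DebugArtifactPolicy {
    // Hypothetical decision helper: `captureOnErrorOnly` models the new
    // feature flag this ticket proposes; it is not an existing YARN property.
    // Artifacts are captured when debug info is always on, or when the
    // error-only flag is set and the container exited abnormally.
    static boolean shouldCapture(boolean alwaysCapture,
                                 boolean captureOnErrorOnly,
                                 int exitCode) {
        return alwaysCapture || (captureOnErrorOnly && exitCode != 0);
    }

    public static void main(String[] args) {
        System.out.println(shouldCapture(false, true, 137)); // failed launch: true
        System.out.println(shouldCapture(false, true, 0));   // clean exit: false
    }
}
```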
[jira] [Assigned] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11567: --- Assignee: Bence Kosztolnik > Aggregate container launch debug artifacts automatically in case of error > - > > Key: YARN-11567 > URL: https://issues.apache.org/jira/browse/YARN-11567 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Minor > > When a container fails to launch without writing to a log file, we > would often want to see the artifacts captured by > {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better > understand the cause of the exit code. Enabling this feature for every > container may be overkill, so we need a feature flag to capture these > artifacts only in case of errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik resolved YARN-10345. - Resolution: Duplicate > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-10345: --- Assignee: Bence Kosztolnik (was: Prabhu Joseph) > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11420) Stabilize TestNMClient
[ https://issues.apache.org/jira/browse/YARN-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11420: --- Assignee: Bence Kosztolnik > Stabilize TestNMClient > -- > > Key: YARN-11420 > URL: https://issues.apache.org/jira/browse/YARN-11420 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > > The TestNMClient test methods can get stuck if the test container fails while > the test is expecting it to be in the running state. This can happen, for example, > if the container fails due to low memory. To fix this, the test should tolerate a > few such failures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11420) Stabilize TestNMClient
Bence Kosztolnik created YARN-11420: --- Summary: Stabilize TestNMClient Key: YARN-11420 URL: https://issues.apache.org/jira/browse/YARN-11420 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Bence Kosztolnik The TestNMClient test methods can get stuck if the test container fails while the test is expecting it to be in the running state. This can happen, for example, if the container fails due to low memory. To fix this, the test should tolerate a few such failures. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11410) Add default methods for StateMachine
[ https://issues.apache.org/jira/browse/YARN-11410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11410: --- Assignee: Bence Kosztolnik > Add default methods for StateMachine > > > Key: YARN-11410 > URL: https://issues.apache.org/jira/browse/YARN-11410 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Fix For: 3.4.0 > > > YARN-11395 added a new method to the StateMachine interface, which can > break compatibility with dependent software, so the method should be > converted to a default method, which prevents this break. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11410) Add default methods for StateMachine
Bence Kosztolnik created YARN-11410: --- Summary: Add default methods for StateMachine Key: YARN-11410 URL: https://issues.apache.org/jira/browse/YARN-11410 Project: Hadoop YARN Issue Type: Bug Reporter: Bence Kosztolnik Fix For: 3.4.0 YARN-11395 added a new method to the StateMachine interface, which can break compatibility with dependent software, so the method should be converted to a default method, which prevents this break. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
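The compatibility argument can be illustrated with a toy interface (the names below are simplified stand-ins, not the real `StateMachine` signature): an implementation written before the new method existed keeps compiling and running, because it inherits the `default` body instead of being forced to implement the new method.

```java
interface ToyStateMachine {
    String getCurrentState();

    // Added later as a default method: existing implementors such as
    // LegacyMachine below need no change and keep compiling and linking.
    default String getPreviousState() {
        return getCurrentState(); // fallback when no history is tracked
    }
}

// An "old" implementation written before getPreviousState existed.
class LegacyMachine implements ToyStateMachine {
    public String getCurrentState() {
        return "RUNNING";
    }
}

public class DefaultMethodDemo {
    public static void main(String[] args) {
        ToyStateMachine m = new LegacyMachine();
        // LegacyMachine inherits the default behaviour untouched.
        System.out.println(m.getPreviousState()); // prints RUNNING
    }
}
```

Had the method been added as abstract, `LegacyMachine` (and any third-party implementor of the interface) would fail to compile, which is exactly the break this ticket is avoiding.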
[jira] [Assigned] (YARN-11395) Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state
[ https://issues.apache.org/jira/browse/YARN-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11395: --- Assignee: Bence Kosztolnik > Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state > - > > Key: YARN-11395 > URL: https://issues.apache.org/jira/browse/YARN-11395 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Critical > > If an attempt is in the *FINAL_SAVING* state, the > *RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a conversion > error, which results in a > {code:java} > RFC6265 Cookie values may not contain character: [ ]{code} > error in the UI and in the logs as well. > RM log: > {code:java} > ... > at java.lang.Thread.run(Thread.java:750) > Caused by: java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424) > at > org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151) > at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:243) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at 
org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62) > ... 63 more > 2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: > /cluster/appattempt/appattempt_1667297151262_0247_01 > java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain > character: [ ] > at > org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136) > ...{code} > This bug was introduced by the YARN-1345 ticket, which also caused a similar > error, YARN-4411. In the case of YARN-4411 the enum mapping logic from > RMAppAttemptState to YarnApplicationAttemptState was modified like this: > - if the state is FINAL_SAVING, represent the previous state > This error can also occur for the ALLOCATED_SAVING and > LAUNCHED_UNMANAGED_SAVING states as well. > So we should modify the *createAttemptHeadRoomTable* method to be able to > handle the previously mentioned 3 states, just like in the case of YARN-4411. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11395) Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state
Bence Kosztolnik created YARN-11395: --- Summary: Resource Manager UI, cluster/appattempt/*, can not present FINAL_SAVING state Key: YARN-11395 URL: https://issues.apache.org/jira/browse/YARN-11395 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.4.0 Reporter: Bence Kosztolnik If an attempt is in the *FINAL_SAVING* state, the *RMAppAttemptBlock#createAttemptHeadRoomTable* method fails with a conversion error, which results in a {code:java} RFC6265 Cookie values may not contain character: [ ]{code} error in the UI and in the logs as well. RM log: {code:java} ... at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.FINAL_SAVING at java.lang.Enum.valueOf(Enum.java:238) at org.apache.hadoop.yarn.api.records.YarnApplicationAttemptState.valueOf(YarnApplicationAttemptState.java:27) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMAppAttemptBlock.createAttemptHeadRoomTable(RMAppAttemptBlock.java:424) at org.apache.hadoop.yarn.server.webapp.AppAttemptBlock.render(AppAttemptBlock.java:151) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) at org.apache.hadoop.yarn.webapp.View.render(View.java:243) at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:216) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.appattempt(RmController.java:62) ... 
63 more 2022-12-05 04:15:33,029 WARN org.eclipse.jetty.server.HttpChannel: /cluster/appattempt/appattempt_1667297151262_0247_01 java.lang.IllegalArgumentException: RFC6265 Cookie values may not contain character: [ ] at org.eclipse.jetty.http.Syntax.requireValidRFC6265CookieValue(Syntax.java:136) ...{code} This bug was introduced by the YARN-1345 ticket, which also caused a similar error, YARN-4411. In the case of YARN-4411 the enum mapping logic from RMAppAttemptState to YarnApplicationAttemptState was modified like this: - if the state is FINAL_SAVING, represent the previous state This error can also occur for the ALLOCATED_SAVING and LAUNCHED_UNMANAGED_SAVING states as well. So we should modify the *createAttemptHeadRoomTable* method to be able to handle the previously mentioned 3 states, just like in the case of YARN-4411. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
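The proposed handling can be sketched with simplified stand-ins for the real YARN enums (the class and enum below are illustrative, not the actual `RMAppAttemptState`/`YarnApplicationAttemptState` code): the three transient "saving" states are mapped to the previous state before any `Enum.valueOf` conversion, mirroring the YARN-4411 fix.

```java
public class AttemptStateMapper {
    // Simplified stand-ins for a few RMAppAttemptState values.
    enum RMState { LAUNCHED, RUNNING, FINAL_SAVING, ALLOCATED_SAVING, LAUNCHED_UNMANAGED_SAVING }

    // Return a name the display-side enum could accept: the transient
    // "saving" states have no counterpart there, so fall back to the
    // previous state instead of letting Enum.valueOf throw.
    static String displayState(RMState state, RMState previous) {
        switch (state) {
            case FINAL_SAVING:
            case ALLOCATED_SAVING:
            case LAUNCHED_UNMANAGED_SAVING:
                return previous.name();
            default:
                return state.name();
        }
    }

    public static void main(String[] args) {
        System.out.println(displayState(RMState.FINAL_SAVING, RMState.RUNNING)); // prints RUNNING
        System.out.println(displayState(RMState.LAUNCHED, RMState.RUNNING));     // prints LAUNCHED
    }
}
```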
[jira] [Created] (YARN-11390) TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0>
Bence Kosztolnik created YARN-11390: --- Summary: TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0> Key: YARN-11390 URL: https://issues.apache.org/jira/browse/YARN-11390 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Bence Kosztolnik Assignee: Bence Kosztolnik Sometimes TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails with the following message java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but was:<0> at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530) This can happen if the hardcoded 1s sleep in the test is not enough for a proper shutdown. To fix this issue we should poll the cluster status with a timeout and verify that the cluster reaches the expected state. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11390) TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0>
[ https://issues.apache.org/jira/browse/YARN-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11390: Description: Sometimes TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails with the following message {noformat} java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but was:<0> at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530){noformat} This can happen if the hardcoded 1s sleep in the test is not enough for a proper shutdown. To fix this issue we should poll the cluster status with a timeout and verify that the cluster reaches the expected state was: Sometimes TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails with the following message java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but was:<0> at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530) This can happen if the hardcoded 1s sleep in the test is not enough for a proper shutdown. 
To fix this issue we should poll the cluster status with a timeout and verify that the cluster reaches the expected state > TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be > 0 now expected: <1> but was: <0> > - > > Key: YARN-11390 > URL: https://issues.apache.org/jira/browse/YARN-11390 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > > Sometimes TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails > with the following message > {noformat} > java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but > was:<0> > at > org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723) > at > org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685) > at > org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530){noformat} > This can happen if the hardcoded 1s sleep in the test is not enough for > a proper shutdown. > To fix this issue we should poll the cluster status with a timeout and verify > that the cluster reaches the expected state -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
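The proposed polling can be sketched with a generic wait helper. Hadoop ships a similar test utility (`GenericTestUtils.waitFor`); the self-contained approximation below shows the idea of replacing a fixed sleep with bounded polling.

```java
import java.util.function.BooleanSupplier;

public class WaitUtil {
    // Poll `condition` every `intervalMs` until it holds, failing after
    // `timeoutMs`; this replaces a fixed Thread.sleep(1000) in the test,
    // so a slow shutdown gets more time and a fast one wastes none.
    static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("timed out waiting for condition");
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // The condition becomes true after ~50 ms; waitFor returns as soon
        // as a poll observes it, instead of always sleeping a full second.
        waitFor(() -> System.currentTimeMillis() - start >= 50, 10, 2000);
        System.out.println("condition reached");
    }
}
```

In the actual test, the condition would check the metric the assertion cares about (e.g. the shutdown-node count), so the assertion only runs once the cluster has actually reached the expected state.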
[jira] [Updated] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs
[ https://issues.apache.org/jira/browse/YARN-11356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11356: Description: This ticket is intended to fix the following CVEs in the *DataTables.net* lib, by upgrading the lib to 1.11.5 *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are vulnerable to Prototype Pollution due to an incomplete fix for [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806]. [https://nvd.nist.gov/vuln/detail/CVE-2020-28458] *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net before 1.11.3. If an array is passed to the HTML escape entities function it would not have its contents escaped. [https://nvd.nist.gov/vuln/detail/CVE-2021-23445] was: This ticket is intended to fix the following CVEs in the *DataTables.net* lib. *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are vulnerable to Prototype Pollution due to an incomplete fix for [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806]. https://nvd.nist.gov/vuln/detail/CVE-2020-28458 *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net before 1.11.3. If an array is passed to the HTML escape entities function it would not have its contents escaped. https://nvd.nist.gov/vuln/detail/CVE-2021-23445 > Upgrade DataTables to 1.11.5 to fix CVEs > > > Key: YARN-11356 > URL: https://issues.apache.org/jira/browse/YARN-11356 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.4 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > > This ticket is intended to fix the following CVEs in the *DataTables.net* > lib, by upgrading the lib to 1.11.5 > *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are > vulnerable to Prototype Pollution due to an incomplete fix for > [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806]. 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-28458] > *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net > before 1.11.3. If an array is passed to the HTML escape entities function it > would not have its contents escaped. > [https://nvd.nist.gov/vuln/detail/CVE-2021-23445] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs
Bence Kosztolnik created YARN-11356: --- Summary: Upgrade DataTables to 1.11.5 to fix CVEs Key: YARN-11356 URL: https://issues.apache.org/jira/browse/YARN-11356 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.3.4 Reporter: Bence Kosztolnik This ticket is intended to fix the following CVEs in the *DataTables.net* lib. *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are vulnerable to Prototype Pollution due to an incomplete fix for [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806]. https://nvd.nist.gov/vuln/detail/CVE-2020-28458 *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net before 1.11.3. If an array is passed to the HTML escape entities function it would not have its contents escaped. https://nvd.nist.gov/vuln/detail/CVE-2021-23445 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11356) Upgrade DataTables to 1.11.5 to fix CVEs
[ https://issues.apache.org/jira/browse/YARN-11356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11356: --- Assignee: Bence Kosztolnik > Upgrade DataTables to 1.11.5 to fix CVEs > > > Key: YARN-11356 > URL: https://issues.apache.org/jira/browse/YARN-11356 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.3.4 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > > This ticket is intended to fix the following CVEs in the *DataTables.net* lib. > *CVE-2020-28458 (HIGH severity)* - All versions of package datatables.net are > vulnerable to Prototype Pollution due to an incomplete fix for > [https://snyk.io/vuln/SNYK-JS-DATATABLESNET-598806]. > https://nvd.nist.gov/vuln/detail/CVE-2020-28458 > *CVE-2021-23445 (MEDIUM severity)* - This affects the package datatables.net > before 1.11.3. If an array is passed to the HTML escape entities function it > would not have its contents escaped. > https://nvd.nist.gov/vuln/detail/CVE-2021-23445 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11348) Deprecated property can not be unset
[ https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11348: Description: If you try to unset a deprecated property in a *CapacitySchedulerConfiguration* object, the value won't be removed. Example failing test for the *TestCapacitySchedulerConfiguration* class {noformat} @Test public void testDeprecationFeatureWorks() { final String value = "VALUE"; final String goodName = "koko"; final String depName = "dfs.nfs.exports.allowed.hosts"; final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); csConf.set(goodName, value); csConf.unset(goodName); assertNull(csConf.get(goodName)); csConf.set(depName, value); csConf.unset(depName); assertNull(csConf.get(depName)); // fails here }{noformat} was: If you try to unset a deprecated property in a {code:java} CapacitySchedulerConfiguration{code} object, the value won't be removed. Example failing test for the *TestCapacitySchedulerConfiguration* class {noformat} @Test public void testDeprecationFeatureWorks() { final String value = "VALUE"; final String goodName = "koko"; final String depName = "dfs.nfs.exports.allowed.hosts"; final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); csConf.set(goodName, value); csConf.unset(goodName); assertNull(csConf.get(goodName)); csConf.set(depName, value); csConf.unset(depName); assertNull(csConf.get(depName)); // fails here }{noformat} > Deprecated property can not be unset > > > Key: YARN-11348 > URL: https://issues.apache.org/jira/browse/YARN-11348 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bence Kosztolnik >Priority: Major > Labels: newbie > > If you try to unset a deprecated property in a > *CapacitySchedulerConfiguration* object, the value won't be removed. 
> > Example failing test for the *TestCapacitySchedulerConfiguration* class > {noformat} > @Test > public void testDeprecationFeatureWorks() { > final String value = "VALUE"; > final String goodName = "koko"; > final String depName = "dfs.nfs.exports.allowed.hosts"; > final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); > csConf.set(goodName, value); > csConf.unset(goodName); > assertNull(csConf.get(goodName)); > csConf.set(depName, value); > csConf.unset(depName); > assertNull(csConf.get(depName)); // fails here > }{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
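The likely shape of the bug and its fix can be modeled with a tiny map-backed configuration. This is an illustrative model only, not Hadoop's actual `Configuration` internals, and the replacement key name is hypothetical: the idea is that `set` on a deprecated key mirrors the value to its replacement key, so `unset` must remove both entries or `get` keeps resolving the stale mirrored value.

```java
import java.util.HashMap;
import java.util.Map;

public class MiniConf {
    private final Map<String, String> props = new HashMap<>();
    // deprecated key -> replacement key
    private final Map<String, String> deprecations = new HashMap<>();

    void addDeprecation(String oldKey, String newKey) {
        deprecations.put(oldKey, newKey);
    }

    void set(String key, String value) {
        props.put(key, value);
        String replacement = deprecations.get(key);
        if (replacement != null) {
            props.put(replacement, value); // mirrored write for deprecated keys
        }
    }

    String get(String key) {
        // Reads on a deprecated key resolve through the replacement key.
        String replacement = deprecations.get(key);
        return props.get(replacement != null ? replacement : key);
    }

    // The fix: a buggy unset would remove only `key`, leaving the mirrored
    // replacement entry behind, so `get` kept returning the old value.
    void unset(String key) {
        props.remove(key);
        String replacement = deprecations.get(key);
        if (replacement != null) {
            props.remove(replacement);
        }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.addDeprecation("old.key", "new.key"); // hypothetical key names
        conf.set("old.key", "VALUE");
        conf.unset("old.key");
        System.out.println(conf.get("old.key")); // prints null
    }
}
```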
[jira] [Updated] (YARN-11348) Deprecated property can not be unset
[ https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11348: Labels: newbie (was: ) > Deprecated property can not be unset > > > Key: YARN-11348 > URL: https://issues.apache.org/jira/browse/YARN-11348 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bence Kosztolnik >Priority: Major > Labels: newbie > > If you try to unset a deprecated property in a > {code:java} > CapacitySchedulerConfiguration{code} > object, the value won't be removed. > Example failing test for the *TestCapacitySchedulerConfiguration* class > {noformat} > @Test > public void testDeprecationFeatureWorks() { > final String value = "VALUE"; > final String goodName = "koko"; > final String depName = "dfs.nfs.exports.allowed.hosts"; > final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); > csConf.set(goodName, value); > csConf.unset(goodName); > assertNull(csConf.get(goodName)); > csConf.set(depName, value); > csConf.unset(depName); > assertNull(csConf.get(depName)); // fails here > }{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11348) Deprecated property can not be unset
[ https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik updated YARN-11348: Description: If you try to unset a deprecated property in a {code:java} CapacitySchedulerConfiguration{code} object, the value won't be removed. Example failing test for the *TestCapacitySchedulerConfiguration* class {noformat} @Test public void testDeprecationFeatureWorks() { final String value = "VALUE"; final String goodName = "koko"; final String depName = "dfs.nfs.exports.allowed.hosts"; final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); csConf.set(goodName, value); csConf.unset(goodName); assertNull(csConf.get(goodName)); csConf.set(depName, value); csConf.unset(depName); assertNull(csConf.get(depName)); // fails here }{noformat} was: If you try to unset a deprecated property in a {code:java} CapacitySchedulerConfiguration{code} object, the value won't be removed. Example failing test for the TestCapacitySchedulerConfiguration class {noformat} @Test public void testDeprecationFeatureWorks() { final String value = "VALUE"; final String goodName = "koko"; final String depName = "dfs.nfs.exports.allowed.hosts"; final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); csConf.set(goodName, value); csConf.unset(goodName); assertNull(csConf.get(goodName)); csConf.set(depName, value); csConf.unset(depName); assertNull(csConf.get(depName)); // fails here }{noformat} > Deprecated property can not be unset > > > Key: YARN-11348 > URL: https://issues.apache.org/jira/browse/YARN-11348 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bence Kosztolnik >Priority: Major > > If you try to unset a deprecated property in a > {code:java} > CapacitySchedulerConfiguration{code} > object, the value won't be removed. 
> Example failing Test for the *TestCapacitySchedulerConfiguration* class > {noformat} > @Test > public void testDeprecationFeatureWorks() { > final String value = "VALUE"; > final String goodName = "koko"; > final String depName = "dfs.nfs.exports.allowed.hosts"; > final CapacitySchedulerConfiguration csConf = createDefaultCsConf(); > csConf.set(goodName, value); > csConf.unset(goodName); > assertNull(csConf.get(goodName)); > csConf.set(depName, value); > csConf.unset(depName); > assertNull(csConf.get(depName)); // fails here > }{noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11348) Deprecated property can not be unset
[ https://issues.apache.org/jira/browse/YARN-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11348:
------------------------------------
    Summary: Deprecated property can not be unset  (was: Depricated property can not be unset)

> Deprecated property can not be unset
> ------------------------------------
>
>                 Key: YARN-11348
>                 URL: https://issues.apache.org/jira/browse/YARN-11348
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bence Kosztolnik
>            Priority: Major
>
> If you try to unset a deprecated property in a
> {code:java}
> CapacitySchedulerConfiguration{code}
> object, the value won't be removed.
> Example failing test for the TestCapacitySchedulerConfiguration class:
> {noformat}
> @Test
> public void testDeprecationFeatureWorks() {
>     final String value = "VALUE";
>     final String goodName = "koko";
>     final String depName = "dfs.nfs.exports.allowed.hosts";
>     final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
>     csConf.set(goodName, value);
>     csConf.unset(goodName);
>     assertNull(csConf.get(goodName));
>     csConf.set(depName, value);
>     csConf.unset(depName);
>     assertNull(csConf.get(depName)); // fails here
> }{noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11348) Depricated property can not be unset
Bence Kosztolnik created YARN-11348:
---------------------------------------

             Summary: Depricated property can not be unset
                 Key: YARN-11348
                 URL: https://issues.apache.org/jira/browse/YARN-11348
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Bence Kosztolnik


If you try to unset a deprecated property in a
{code:java}
CapacitySchedulerConfiguration{code}
object, the value won't be removed.
Example failing test for the TestCapacitySchedulerConfiguration class:
{noformat}
@Test
public void testDeprecationFeatureWorks() {
    final String value = "VALUE";
    final String goodName = "koko";
    final String depName = "dfs.nfs.exports.allowed.hosts";
    final CapacitySchedulerConfiguration csConf = createDefaultCsConf();
    csConf.set(goodName, value);
    csConf.unset(goodName);
    assertNull(csConf.get(goodName));
    csConf.set(depName, value);
    csConf.unset(depName);
    assertNull(csConf.get(depName)); // fails here
}{noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
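[Editor's note] The failure above is consistent with `unset` removing only the literal key, while the deprecation layer resolves a deprecated name to its replacement key on every `get`, so the value survives under the other name. Below is a minimal self-contained model of the fix idea; the `DeprecatedConf` class, its fields, and its deprecation map are hypothetical illustrations, not Hadoop's actual `Configuration` internals. The point it demonstrates: unsetting a key must also clear the key it resolves to.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical stand-in for a configuration with key deprecation. */
public class DeprecatedConf {
    private final Map<String, String> props = new HashMap<>();
    // deprecated name -> replacement name
    private final Map<String, String> deprecations = new HashMap<>();

    public void addDeprecation(String oldKey, String newKey) {
        deprecations.put(oldKey, newKey);
    }

    public void set(String key, String value) {
        // Writes through a deprecated name to its replacement key.
        props.put(deprecations.getOrDefault(key, key), value);
    }

    public String get(String key) {
        // Reads also resolve deprecated names to the replacement key.
        return props.get(deprecations.getOrDefault(key, key));
    }

    public void unset(String key) {
        // The fix idea: remove the literal key AND its resolved replacement,
        // so a deprecated name cannot keep resolving to a stale value.
        props.remove(key);
        String replacement = deprecations.get(key);
        if (replacement != null) {
            props.remove(replacement);
        }
    }
}
```

With this `unset`, both legs of the failing test above would pass: the plain key and the deprecated key each read back as null after being unset.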
[jira] [Resolved] (YARN-11344) Double checked locking in Configuration
[ https://issues.apache.org/jira/browse/YARN-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik resolved YARN-11344.
-------------------------------------
    Resolution: Abandoned

> Double checked locking in Configuration
> ---------------------------------------
>
>                 Key: YARN-11344
>                 URL: https://issues.apache.org/jira/browse/YARN-11344
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Bence Kosztolnik
>            Priority: Minor
>
> Currently the
> {code:java}
> org.apache.hadoop.conf.Configuration{code}
> class uses synchronized methods in many cases where double-checked locking would
> be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
> The class should be refactored to remove the unnecessary locking points.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11344) Double checked locking in Configuration
[ https://issues.apache.org/jira/browse/YARN-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik updated YARN-11344:
------------------------------------
    Summary: Double checked locking in Configuration  (was: Double check locking in Configuration)

> Double checked locking in Configuration
> ---------------------------------------
>
>                 Key: YARN-11344
>                 URL: https://issues.apache.org/jira/browse/YARN-11344
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Bence Kosztolnik
>            Priority: Minor
>
> Currently the
> {code:java}
> org.apache.hadoop.conf.Configuration{code}
> class uses synchronized methods in many cases where double-checked locking would
> be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
> The class should be refactored to remove the unnecessary locking points.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11344) Double check locking in Configuration
Bence Kosztolnik created YARN-11344:
---------------------------------------

             Summary: Double check locking in Configuration
                 Key: YARN-11344
                 URL: https://issues.apache.org/jira/browse/YARN-11344
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
            Reporter: Bence Kosztolnik


Currently the
{code:java}
org.apache.hadoop.conf.Configuration{code}
class uses synchronized methods in many cases where double-checked locking would
be enough, for example in the case of *getProps()* and {*}getOverlay(){*}.
The class should be refactored to remove the unnecessary locking points.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
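[Editor's note] The pattern proposed in the ticket, double-checked locking, avoids taking the monitor on every read by synchronizing only while the lazily created field is still null. Below is a minimal sketch; the class and field names mirror the ticket's `getProps()` example but are not Hadoop's actual implementation. Note that the field must be `volatile` for the pattern to be safe under the Java memory model (JSR-133 and later).

```java
import java.util.Properties;

/** Sketch of double-checked locking for a lazily created field (illustrative, not Hadoop's code). */
public class LazyProps {
    // volatile is required: it prevents a reader on the unsynchronized
    // fast path from observing a partially constructed Properties object.
    private volatile Properties properties;

    public Properties getProps() {
        Properties result = properties;          // one volatile read on the fast path
        if (result == null) {                    // first check, without the lock
            synchronized (this) {
                result = properties;
                if (result == null) {            // second check, under the lock
                    result = new Properties();
                    properties = result;
                }
            }
        }
        return result;
    }
}
```

Once the field is initialized, every subsequent call returns it via a single volatile read with no locking, which is the contention win the ticket is after; a fully `synchronized` getter would serialize all readers forever.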
[jira] [Assigned] (YARN-11216) Avoid unnecessary reconstruction of ConfigurationProperties
[ https://issues.apache.org/jira/browse/YARN-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bence Kosztolnik reassigned YARN-11216:
---------------------------------------
    Assignee: Bence Kosztolnik

> Avoid unnecessary reconstruction of ConfigurationProperties
> -----------------------------------------------------------
>
>                 Key: YARN-11216
>                 URL: https://issues.apache.org/jira/browse/YARN-11216
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>            Reporter: András Győri
>            Assignee: Bence Kosztolnik
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> ConfigurationProperties is expensive to create; however, due to its immutable
> nature it is possible to copy or share it between configuration objects
> (e.g. via a copy constructor).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
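[Editor's note] The copy-constructor idea in this ticket works because the expensive-to-build lookup structure never changes after construction, so a copy can share it by reference instead of rebuilding it. A minimal sketch under that assumption follows; the `ConfigProps` class and its fields are illustrative only, not the actual `ConfigurationProperties` API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for an expensive, immutable property index (not the actual YARN class). */
public class ConfigProps {
    private final Map<String, String> index;

    /** Expensive construction path: builds the index from raw properties. */
    public ConfigProps(Map<String, String> raw) {
        this.index = Collections.unmodifiableMap(new HashMap<>(raw));
    }

    /** Cheap copy constructor: the index is immutable, so sharing it by reference is safe. */
    public ConfigProps(ConfigProps other) {
        this.index = other.index;
    }

    public String get(String key) {
        return index.get(key);
    }
}
```

The design choice worth noting: sharing is only correct because the shared state is deeply immutable; if the index were ever mutated in place, copies would need defensive cloning or copy-on-write instead.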
[jira] [Assigned] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths
[ https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bence Kosztolnik reassigned YARN-11063: --- Assignee: Bence Kosztolnik (was: Andras Gyori) > Support auto queue creation template wildcards for arbitrary queue depths > - > > Key: YARN-11063 > URL: https://issues.apache.org/jira/browse/YARN-11063 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > With the introduction of YARN-10632, we need to support more than one > wildcard in queue templates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths
[ https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570813#comment-17570813 ] Bence Kosztolnik edited comment on YARN-11063 at 7/25/22 9:58 AM: -- I had a sync with [~gandras] , and we discussed I will fix this issue as a basic ramp-up task was (Author: JIRAUSER292672): I had a sync with [~gandras] , and we discussed I will fix this issue as basic ramp-up task > Support auto queue creation template wildcards for arbitrary queue depths > - > > Key: YARN-11063 > URL: https://issues.apache.org/jira/browse/YARN-11063 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > With the introduction of YARN-10632, we need to support more than one > wildcard in queue templates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11063) Support auto queue creation template wildcards for arbitrary queue depths
[ https://issues.apache.org/jira/browse/YARN-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570813#comment-17570813 ] Bence Kosztolnik commented on YARN-11063: - I had a sync with [~gandras] , and we discussed I will fix this issue as basic ramp-up task > Support auto queue creation template wildcards for arbitrary queue depths > - > > Key: YARN-11063 > URL: https://issues.apache.org/jira/browse/YARN-11063 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > With the introduction of YARN-10632, we need to support more than one > wildcard in queue templates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org