[ https://issues.apache.org/jira/browse/YARN-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17822185#comment-17822185 ]
ASF GitHub Bot commented on YARN-11656:
---------------------------------------

hadoop-yetus commented on PR #6569:
URL: https://github.com/apache/hadoop/pull/6569#issuecomment-1971420241

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 52s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 6 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 14s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 36m 58s | | trunk passed |
| +1 :green_heart: | compile | 9m 3s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 9m 13s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 2m 5s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 28s | | trunk passed |
| +1 :green_heart: | javadoc | 2m 4s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 59s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 4m 10s | | trunk passed |
| +1 :green_heart: | shadedclient | 40m 20s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 8m 10s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 8m 10s | | the patch passed |
| +1 :green_heart: | compile | 9m 47s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 9m 47s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 56s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn-project/hadoop-yarn: The patch generated 14 new + 71 unchanged - 1 fixed = 85 total (was 72) |
| +1 :green_heart: | mvnsite | 2m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 50s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| -1 :x: | spotbugs | 2m 10s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-common.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 40m 2s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 5m 43s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 127m 31s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 54s | | The patch does not generate ASF License warnings. |
| | | 334m 6s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common |
| | new org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor(Logger, MultiDispatcherConfig, String) invokes org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor$MultiDispatcherExecutorThread.start() At MultiDispatcherExecutor.java:org.apache.hadoop.yarn.event.multidispatcher.MultiDispatcherExecutor$MultiDispatcherExecutorThread.start() At MultiDispatcherExecutor.java:[line 54] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore |
| | hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore |
| | hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA |
| | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler |
| | hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerHealth |
| | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore |
| | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA |
| | hadoop.yarn.server.resourcemanager.TestResourceManager |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6569 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c5f4bede693b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f0788b5a11a17fa254c739702510509b0f121520 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/testReport/ |
| Max. process+thread count | 1884 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6569/5/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
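For context on the SpotBugs -1 above: the warning matches the "constructor invokes Thread.start()" pattern (SC_START_IN_CTOR), which is flagged because a thread started from a constructor can observe a partially constructed object. The sketch below is illustrative only; the class and method names are hypothetical and do not reproduce the actual MultiDispatcherExecutor code. It only shows the flagged shape and the usual remedy of starting the thread after construction completes.

{code:java}
// Illustrative only; names do not mirror the actual patch code.
public class ExecutorStartExample {

  /** Worker thread that would drain an event queue. */
  private static final class Worker extends Thread {
    @Override
    public void run() {
      // drain events here
    }
  }

  private final Worker worker = new Worker();

  public ExecutorStartExample() {
    // worker.start();  // <- SC_START_IN_CTOR: the thread could see a partially constructed object
  }

  /** Common remedy: start the thread from an explicit lifecycle hook instead of the constructor. */
  public void start() {
    worker.start();
  }

  public static void main(String[] args) {
    ExecutorStartExample example = new ExecutorStartExample();
    example.start();  // safe: the object is fully constructed at this point
  }
}
{code}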
> RMStateStore event queue blocked
> --------------------------------
>
>                 Key: YARN-11656
>                 URL: https://issues.apache.org/jira/browse/YARN-11656
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.1
>            Reporter: Bence Kosztolnik
>            Assignee: Bence Kosztolnik
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: issue.png, log.png
>
>
> h2. Problem statement
>
> I observed a YARN cluster that had both pending and available resources, yet cluster utilization hovered around ~50%. The cluster was loaded with 200 parallel Pi example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory (CPU was the bottleneck resource).
> Eventually I realized the RM had an IO bottleneck and needed 1-20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
> To reduce the impact of the issue:
> - create a dispatcher that can persist events on parallel threads
> - publish metrics for the RMStateStore event queue, so the problem is easy to identify if it occurs on a cluster
> {panel:title=Issue visible on UI2}
> !issue.png|height=250!
> {panel}
> Another way to identify the issue is to check whether storing the application info takes too long after an app reaches the NEW_SAVING state.
> {panel:title=How the issue can look in the log}
> !log.png|height=250!
> {panel}
> h2. Solution
> Created a *MultiDispatcher* class which implements the Dispatcher interface. The dispatcher registers a separate metrics object called _Event metrics for "rm-state-store"_ which shows:
> - how many unhandled events are currently in the event queue for each event type
> - how many events have been handled for each event type
> - the average execution time for each event type
> The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store):
> ||Config name||Description||Default value||
> |yarn.dispatcher.multi-thread.{}.*default-pool-size*|How many parallel threads execute the events|4|
> |yarn.dispatcher.multi-thread.{}.*max-pool-size*|If the event queue is full, the executor scales up to this many threads|8|
> |yarn.dispatcher.multi-thread.{}.*keep-alive-seconds*|Idle execution threads are destroyed after this many seconds|10|
> |yarn.dispatcher.multi-thread.{}.*queue-size*|Size of the event queue|1 000 000|
> |yarn.dispatcher.multi-thread.{}.*monitor-seconds*|The size of the event queue is logged with this frequency (if not zero)|30|
> |yarn.dispatcher.multi-thread.{}.*graceful-stop-seconds*|After the stop signal, the dispatcher waits this many seconds so incoming events can still be processed before they are terminated|60|
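> The snippet below is an illustrative sketch only (not part of the patch): it shows how these keys could be set through the standard Hadoop Configuration API, filling the {} placeholder with the rm-state-store dispatcher name from above. The values are arbitrary examples, not the defaults.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class MultiDispatcherConfigSketch {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     // "rm-state-store" fills the {} placeholder in the key names above.
>     String prefix = "yarn.dispatcher.multi-thread.rm-state-store.";
>     conf.setInt(prefix + "default-pool-size", 8);       // threads executing events in parallel
>     conf.setInt(prefix + "max-pool-size", 16);          // upper bound when the queue is full
>     conf.setInt(prefix + "keep-alive-seconds", 10);     // idle thread lifetime
>     conf.setInt(prefix + "queue-size", 1_000_000);      // event queue capacity
>     conf.setInt(prefix + "monitor-seconds", 30);        // queue-size logging frequency
>     conf.setInt(prefix + "graceful-stop-seconds", 60);  // drain window after the stop signal
>     System.out.println(conf.getInt(prefix + "default-pool-size", -1));
>   }
> }
> {code}
> In practice the same keys would normally go into yarn-site.xml; the programmatic form above is only meant to make the key naming concrete.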
> {panel:title=Example output from RM JMX api}
> {noformat}
> ...
> {
>   "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
>   "modelerType": "Event metrics for rm-state-store",
>   "tag.Context": "yarn",
>   "tag.Hostname": CENSORED,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_APP_Current": 124,
>   "RMStateStoreEventType#STORE_APP_NumOps": 46,
>   "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
>   "RMStateStoreEventType#UPDATE_APP_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
>   "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
>   "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
>   "RMStateStoreEventType#REMOVE_APP_Current": 12,
>   "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
>   "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
>   "RMStateStoreEventType#FENCED_Current": 0,
>   "RMStateStoreEventType#FENCED_NumOps": 0,
>   "RMStateStoreEventType#FENCED_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
>   "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
>   "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
>   "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
> },
> ...
> {noformat}
> {panel}
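> As an illustrative aside (not part of the patch): the bean shown above can be fetched from the ResourceManager's JMX JSON servlet. The sketch below assumes the default RM web UI port 8088 and uses RM_HOST as a placeholder host name; the bean name is taken from the example output.
> {code:java}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.HttpURLConnection;
> import java.net.URL;
> import java.net.URLEncoder;
> import java.nio.charset.StandardCharsets;
>
> public class FetchRmStateStoreMetrics {
>   public static void main(String[] args) throws Exception {
>     // Bean name taken from the example output above; RM_HOST is a placeholder.
>     String bean = URLEncoder.encode(
>         "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
>         StandardCharsets.UTF_8.name());
>     URL url = new URL("http://RM_HOST:8088/jmx?qry=" + bean);
>     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>     try (BufferedReader in = new BufferedReader(
>         new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
>       String line;
>       while ((line = in.readLine()) != null) {
>         System.out.println(line);  // raw JSON response; parse as needed
>       }
>     } finally {
>       conn.disconnect();
>     }
>   }
> }
> {code}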
> h2. Testing
> I deployed the MultiDispatcher-enabled build of YARN to the cluster and ran the following performance test:
> {code:bash}
> #!/bin/bash
> for i in {1..50};
> do
>     ssh root@$i-node-url 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
> done
> sleep 300
> for i in {1..50};
> do
>     ssh root@$i-node-url "pkill -9 -f perf" &
> done
> sleep 5
> echo "DONE"
> {code}
> Each node ran the following perf script:
> {code:bash}
> #!/bin/bash
> while true
> do
>     if [ $(ps -o pid= -u hadoop | wc -l) -le $1 ]
>     then
>         hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
>     fi
>     sleep 1
> done
> {code}
> This way, in 5 minutes (plus waiting for all jobs to finish) I could process 332 apps. When I ran the same test with the official build, it took 5 minutes just to finish the first app, and after that 221 apps were finished. I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not find any problems with the implementation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org