Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil via wrote:
> On 20.04.24 at 22:25, Stefan Weil wrote:
> > On 16.04.24 at 14:17, Stefan Weil wrote:
> > > On 16.04.24 at 14:10, Peter Maydell wrote:
> > > >
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > >
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > >
> > > I'll have a look.
> >
> > Short summary:
> >
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
>
> Hi Paolo,
>
> I need help from someone who knows the CI and the build and test
> framework better.
>
> Peter reported intermittent timeouts for the cross-i686-tci job, causing
> it to fail. I can reproduce such timeouts locally, but noticed that they
> are not limited to TCI. The GitLab CI also shows other examples, such as
> this job:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
>
> I think the timeouts are caused by running too many parallel processes
> during testing.
>
> The CI uses parallel builds:
>
>     make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

Note that this command runs both the compile and test phases of the job.
Overcommitting CPUs for the compile phase is a good idea to keep CPUs busy
while another process is waiting on I/O, and is almost always safe to do.
Overcommitting CPUs for the test phase is less helpful and can cause a
variety of problems, as you say.

> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
>
>     /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild \
>         -t 1 --num-processes 9 --print-errorlogs
>
> Since the host can only handle 8 parallel threads, 9 threads might
> already cause some tests to run non-deterministically.

In contributor forks, GitLab CI will be using the public shared runners.
These are Google Cloud VMs, which only have 2 vCPUs. In the primary QEMU
repo, we have a custom runner registered that uses Azure-based VMs; I'm
not sure offhand what resources we have configured for them. The important
thing is that the CI speed you see in your fork repo is not necessarily a
match for the CI speed in the QEMU upstream repo.

> But if some of the individual tests also use multithreading (according
> to my tests they do so with at least 3 or 4 threads), things get even
> worse. Then there are up to 4 * 9 = 36 threads competing to run on the
> available 8 cores.
>
> In this scenario timeouts are expected and can occur randomly.
>
> In my tests setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing the
> runtime.
>
> Could we run all tests with `--num-processes 1`?

The question is what impact that has on the overall job execution time. A
lot of our jobs are already quite long, which is bad for the turnaround
time of CI testing. Reliable CI is arguably the #1 priority though,
otherwise developers cease trusting it. We need to find the balance
between avoiding timeouts and having the shortest practical job time.

The TCI job you show above came out at 22 minutes, which is not our worst
job, so there is some scope for allowing it to run longer with less
parallelism.

Timeouts for individual tests are a relatively recent change to QEMU in:

    commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
    Author: Daniel P. Berrangé
    Date:   Fri Dec 15 08:03:57 2023 +0100

        mtest2make: stop disabling meson test timeouts

Read the full commit message of that for the background rationale, but
especially this paragraph:

    The main risk of this change is that the individual test timeouts
    might be too short to allow completion in high load scenarios. Thus,
    there is likely to be some short term pain where we have to bump the
    timeouts for certain tests to make them reliable enough. The
    preceding few patches raised the timeouts for all failures that were
    immediately apparent in GitLab CI.

which highlights the problem you're looking at. The expectation was that
we would need to bump the timeouts for various tests until we get to the
point where they reliably run in GitLab CI, in both forks and upstream.

With regards,
Daniel

--
|: https://berrange.com      -o-  https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o-          https://fstop138.berrange.com   :|
|: https://entangle-photo.org -o-  https://www.instagram.com/dberrange    :|
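The separation Daniel describes — overcommit CPUs for the compile phase, cap them for the test phase — could look something like the following. This is a hypothetical sketch, not QEMU's actual CI script: the halving heuristic for test jobs and the variable names are my own assumptions, and `NPROC` is hardcoded to the value `nproc` reported in the failing job.

```shell
#!/bin/sh
# Hypothetical sketch: decouple build parallelism from test parallelism.
NPROC=8                      # value `nproc` reported on the CI runner

# Overcommit for the compile phase: one extra job keeps a CPU busy while
# another process waits on I/O. Almost always safe.
BUILD_JOBS=$((NPROC + 1))

# Cap the test phase: individual qtests may spawn several threads each,
# so one meson test process per CPU already oversubscribes the machine.
# Halving is an assumed heuristic, not a measured optimum.
TEST_JOBS=$((NPROC / 2))
[ "$TEST_JOBS" -ge 1 ] || TEST_JOBS=1

echo "make -j$BUILD_JOBS all check-build"
echo "meson test --no-rebuild -t 1 --num-processes $TEST_JOBS --print-errorlogs"
```

On the 8-CPU runner from the log this would build with `-j9` but run only 4 test processes in parallel, trading some job wall time for fewer spurious timeouts.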
Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)
On 20.04.24 at 22:25, Stefan Weil wrote:
> On 16.04.24 at 14:17, Stefan Weil wrote:
> > On 16.04.24 at 14:10, Peter Maydell wrote:
> > >
> > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > failures due to jobs timing out.
> > [...]
> > > Some of these timeouts are very high -- no test should be taking
> > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > to me that there's some kind of intermittent deadlock going on.
> > >
> > > Can somebody who cares about TCI investigate, please, and track
> > > down whatever this is?
> >
> > I'll have a look.
>
> Short summary:
>
> The "persistent intermittent failures due to jobs timing out" are not
> related to TCI: they also occur if the same tests are run with the
> normal TCG. I suggest that the CI tests should run single threaded.

Hi Paolo,

I need help from someone who knows the CI and the build and test framework
better.

Peter reported intermittent timeouts for the cross-i686-tci job, causing
it to fail. I can reproduce such timeouts locally, but noticed that they
are not limited to TCI. The GitLab CI also shows other examples, such as
this job:

https://gitlab.com/qemu-project/qemu/-/jobs/6700955287

I think the timeouts are caused by running too many parallel processes
during testing.

The CI uses parallel builds:

    make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

It looks like `nproc` returns 8, and make runs with 9 threads. `meson
test` uses the same value to run 9 test processes in parallel:

    /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild \
        -t 1 --num-processes 9 --print-errorlogs

Since the host can only handle 8 parallel threads, 9 threads might already
cause some tests to run non-deterministically.

But if some of the individual tests also use multithreading (according to
my tests they do so with at least 3 or 4 threads), things get even worse.
Then there are up to 4 * 9 = 36 threads competing to run on the available
8 cores.

In this scenario timeouts are expected and can occur randomly.

In my tests setting --num-processes to a lower value not only avoided
timeouts but also reduced the processing overhead without increasing the
runtime.

Could we run all tests with `--num-processes 1`?

Thanks,
Stefan
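The oversubscription arithmetic in the message above can be made explicit. This is just a worked illustration: the 9 comes from the `--num-processes` value in the CI log, and the 4 threads per test is the upper bound observed in Stefan's local tests.

```shell
#!/bin/sh
# Worst-case thread count when meson's test-level parallelism multiplies
# with per-test threading (numbers taken from the message above).
CORES=8             # what `nproc` reported on the runner
MESON_PROCS=9       # --num-processes passed to `meson test`
THREADS_PER_TEST=4  # observed upper bound for individual tests

WORST_CASE=$((MESON_PROCS * THREADS_PER_TEST))
echo "$WORST_CASE threads competing for $CORES cores"
# prints: 36 threads competing for 8 cores
```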
Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
On 16.04.24 at 14:17, Stefan Weil wrote:
> On 16.04.24 at 14:10, Peter Maydell wrote:
> >
> > The cross-i686-tci job is flaky again, with persistent intermittent
> > failures due to jobs timing out.
> [...]
> > Some of these timeouts are very high -- no test should be taking
> > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > to me that there's some kind of intermittent deadlock going on.
> >
> > Can somebody who cares about TCI investigate, please, and track
> > down whatever this is?
>
> I'll have a look.

Short summary:

The "persistent intermittent failures due to jobs timing out" are not
related to TCI: they also occur if the same tests are run with the normal
TCG. I suggest that the CI tests should run single threaded.

But let's look at the details of my results.

I ran `time make test` in different scenarios on the rather old and not so
performant VM which I typically use for QEMU builds. I did not restrict
the tests to selected architectures as is done in the QEMU CI tests.
Therefore I had more tests, which all ended successfully:

    Ok:              848
    Expected Fail:   0
    Fail:            0
    Unexpected Pass: 0
    Skipped:         68
    Timeout:         0

---

1st test with normal TCG:

    nohup time ../configure --enable-modules --disable-werror &&
    nohup time make -j4 && nohup time make test

[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata 633476maxresident)k
1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps

---

2nd test with TCI:

    nohup time ../configure --enable-modules --disable-werror
    --enable-tcg-interpreter && nohup time make -j4 && nohup time make test

[...]
Full log written to /home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata 633012maxresident)k
32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps

---

So the total test time with TCI was 26:50.51 minutes, while for the normal
test it was 21:49.78 minutes.

These 10 single tests had the longest duration:

1st test with normal TCG:

     94/916 qtest-arm / qtest-arm/qom-test                    373.41s
     99/916 qtest-aarch64 / qtest-aarch64/qom-test            398.43s
    100/916 qtest-i386 / qtest-i386/bios-tables-test          188.06s
    103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test      228.33s
    106/916 qtest-aarch64 / qtest-aarch64/migration-test      201.15s
    119/916 qtest-i386 / qtest-i386/migration-test            253.58s
    126/916 qtest-x86_64 / qtest-x86_64/migration-test        266.66s
    143/916 qtest-arm / qtest-arm/test-hmp                    101.72s
    144/916 qtest-aarch64 / qtest-aarch64/test-hmp            113.10s
    163/916 qtest-arm / qtest-arm/aspeed_smc-test             256.92s

2nd test with TCI:

     68/916 qtest-arm / qtest-arm/qom-test                    375.35s
     82/916 qtest-aarch64 / qtest-aarch64/qom-test            403.50s
     93/916 qtest-i386 / qtest-i386/bios-tables-test          192.22s
     99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test    379.92s
    100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test      240.19s
    103/916 qtest-aarch64 / qtest-aarch64/migration-test      223.49s
    106/916 qtest-ppc64 / qtest-ppc64/pxe-test                418.42s
    113/916 qtest-i386 / qtest-i386/migration-test            284.96s
    118/916 qtest-arm / qtest-arm/aspeed_smc-test             271.10s
    119/916 qtest-x86_64 / qtest-x86_64/migration-test        287.36s

---

Summary:

TCI is not much slower than the normal TCG. Surprisingly it was even
faster for tests 99 and 103. For other tests like test 106, TCI was about
half as fast as normal TCG, but in summary it is not "factors" slower. A
total test time of 26:50 minutes is also not so bad compared with the
21:49 minutes of the normal TCG.

No single test (including subtests) with TCI exceeded 10 minutes; the
longest one was well below that margin at 418 seconds.

---

The tests above were running on x86_64, and I could not reproduce
timeouts. The GitLab CI tests were running on i686 and used different
configure options. Therefore I made additional tests with 32 bit builds in
a chroot environment (Debian GNU/Linux bullseye i386) with the original
configure options. As expected that reduced the number of tests to 250.
All tests passed with `make test`:

3rd test with normal TCG:

    Ok:              250
    Expected Fail:   0
    Fail:            0
    Unexpected Pass: 0
    Skipped:         8
    Timeout:         0

Full log written to /root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata 609180maxresident)k
28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps

4th test with TCI:

    Ok:              250
    Expected Fail:   0
    Fail:            0
    Unexpected Pass: 0
    Skipped:         8
    Timeout:         0

Full log
Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
On 16.04.24 at 14:10, Peter Maydell wrote:
> The cross-i686-tci job is flaky again, with persistent intermittent
> failures due to jobs timing out.
> [...]
> Some of these timeouts are very high -- no test should be taking
> 10 minutes, even given TCI and a slowish CI runner -- which suggests
> to me that there's some kind of intermittent deadlock going on.
>
> Can somebody who cares about TCI investigate, please, and track
> down whatever this is?

I'll have a look.

Regards
Stefan
cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.

https://gitlab.com/qemu-project/qemu/-/issues/2285 has the details with
links to 8 CI jobs in the last week or so with timeouts, typically
something like:

     16/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp          TIMEOUT 240.17s  killed by signal 15 SIGTERM
     73/258 qemu:qtest+qtest-ppc / qtest-ppc/boot-serial-test          TIMEOUT 360.14s  killed by signal 15 SIGTERM
     78/258 qemu:qtest+qtest-i386 / qtest-i386/ide-test                TIMEOUT  60.09s  killed by signal 15 SIGTERM
    253/258 qemu:softfloat+softfloat-ops / fp-test-mul                 TIMEOUT  30.11s  killed by signal 15 SIGTERM
    254/258 qemu:softfloat+softfloat-ops / fp-test-div                 TIMEOUT  30.25s  killed by signal 15 SIGTERM
    255/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test    TIMEOUT 480.23s  killed by signal 15 SIGTERM
    257/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test  TIMEOUT 610.10s  killed by signal 15 SIGTERM

but not always those exact tests. This isn't the first time this CI job
for TCI has been flaky either.

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?

(My alternate suggestion is that we mark TCI as deprecated in 9.1 and
drop it entirely, if nobody cares enough about it...)

thanks
-- PMM