Re: Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

2024-04-24 Thread Daniel P. Berrangé
On Wed, Apr 24, 2024 at 06:27:58PM +0200, Stefan Weil wrote:
> Am 20.04.24 um 22:25 schrieb Stefan Weil:
> > Am 16.04.24 um 14:17 schrieb Stefan Weil:
> > > Am 16.04.24 um 14:10 schrieb Peter Maydell:
> > > 
> > > > The cross-i686-tci job is flaky again, with persistent intermittent
> > > > failures due to jobs timing out.
> > > [...]
> > > > Some of these timeouts are very high -- no test should be taking
> > > > 10 minutes, even given TCI and a slowish CI runner -- which suggests
> > > > to me that there's some kind of intermittent deadlock going on.
> > > > 
> > > > Can somebody who cares about TCI investigate, please, and track
> > > > down whatever this is?
> > > 
> > > I'll have a look.
> > 
> > Short summary:
> > 
> > The "persistent intermittent failures due to jobs timing out" are not
> > related to TCI: they also occur if the same tests are run with the
> > normal TCG. I suggest that the CI tests should run single threaded.
> 
> Hi Paolo,
> 
> I need help from someone who knows the CI and the build and test framework
> better.
> 
> Peter reported intermittent timeouts for the cross-i686-tci job, causing it
> to fail. I can reproduce such timeouts locally, but noticed that they are
> not limited to TCI. The GitLab CI also shows other examples, such as this
> job:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/6700955287
> 
> I think the timeouts are caused by running too many parallel processes
> during testing.
> 
> The CI uses parallel builds:
> 
> make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

Note that this command runs both the compile and test phases of
the job. Overcommitting CPUs for the compile phase is a good
idea to keep CPUs busy while another process is waiting on
I/O, and is almost always safe to do.

Overcommitting CPUs for the test phase is less helpful and
can cause a variety of problems, as you say.
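
As a rough illustration (not the actual .gitlab-ci.yml, and assuming
MTESTARGS is still the hook that the mtest2make-generated rules pass
through to `meson test`), the two phases could be given different
parallelism along these lines:

  # overcommit CPUs for the compile phase only
  make -j$(expr $(nproc) + 1) all check-build

  # cap the test phase at one test process per CPU
  make $MAKE_CHECK_ARGS MTESTARGS="--num-processes $(nproc)"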

> 
> It looks like `nproc` returns 8, and make runs with 9 threads.
> `meson test` uses the same value to run 9 test processes in parallel:
> 
> /builds/qemu-project/qemu/build/pyvenv/bin/meson test  --no-rebuild -t 1
> --num-processes 9 --print-errorlogs
> 
> Since the host can only handle 8 parallel threads, 9 threads might already
> cause some tests to run non-deterministically.

In contributor forks, GitLab CI will be using the public shared
runners. These are Google Cloud VMs, which only have 2 vCPUs.

In the primary QEMU repo, we have a custom runner registered
that uses Azure-based VMs. I'm not sure offhand what resources
we have configured for them.

The important thing there is that what you see for CI speed in
your fork repo is not necessarily a match for CI speed in the
QEMU upstream repo.

> 
> But if some of the individual tests also use multithreading (according to my
> tests they do so with at least 3 or 4 threads), things get even worse. Then
> there are up to 4 * 9 = 36 threads competing to run on the available 8
> cores.
> 
> In this scenario timeouts are expected and can occur randomly.
> 
> In my tests setting --num-processes to a lower value not only avoided
> timeouts but also reduced the processing overhead without increasing the
> runtime.
> 
> Could we run all tests with `--num-processes 1`?

The question is what impact that has on the overall job execution
time. A lot of our jobs are already quite long, which is bad for
the turnaround time of CI testing. Reliable CI, though, is arguably
the #1 priority, otherwise developers cease trusting it.
We need to find the balance between avoiding timeouts and having
the shortest practical job time. The TCI job you show above came
out at 22 minutes, which is not our worst job, so there is some
scope for allowing it to run longer with less parallelism.

Timeouts for individual tests are a relatively recent change to
QEMU in:

  commit 4156325cd3c205ce77b82de2c48127ccacddaf5b
  Author: Daniel P. Berrangé 
  Date:   Fri Dec 15 08:03:57 2023 +0100

mtest2make: stop disabling meson test timeouts

Read the full commit message of that for the background rationale,
but especially this paragraph:

The main risk of this change is that the individual test timeouts might
be too short to allow completion in high load scenarios. Thus, there is
likely to be some short term pain where we have to bump the timeouts for
certain tests to make them reliable enough. The preceding few patches
raised the timeouts for all failures that were immediately apparent
in GitLab CI.

which highlights the problem you're looking at.

The expectation was that we would need to bump the timeouts for various
tests until we get to the point where they run reliably in GitLab CI,
in both forks and upstream.
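
For anyone wanting to chase one of these locally, a single flaky test
can be re-run on its own. An untested sketch (the test name is taken
from the CI log earlier in this thread; meson's -t flag is a timeout
multiplier, and 0 disables the per-test timeouts entirely, which is
what we used to pass before that commit):

  cd build
  pyvenv/bin/meson test --no-rebuild --num-processes 1 -t 3 \
      --print-errorlogs 'qtest-aarch64/migration-test'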

With regards,
Daniel
-- 
|: https://berrange.com       -o-  https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o-         https://fstop138.berrange.com :|
|: https://entangle-photo.org -o-  https://www.instagram.com/dberrange :|




Timeouts in CI jobs (was: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?)

2024-04-24 Thread Stefan Weil

Am 20.04.24 um 22:25 schrieb Stefan Weil:

Am 16.04.24 um 14:17 schrieb Stefan Weil:

Am 16.04.24 um 14:10 schrieb Peter Maydell:


The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.

[...]

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?


I'll have a look.


Short summary:

The "persistent intermittent failures due to jobs timing out" are not 
related to TCI: they also occur if the same tests are run with the 
normal TCG. I suggest that the CI tests should run single threaded.


Hi Paolo,

I need help from someone who knows the CI and the build and test 
framework better.


Peter reported intermittent timeouts for the cross-i686-tci job, causing 
it to fail. I can reproduce such timeouts locally, but noticed that they 
are not limited to TCI. The GitLab CI also shows other examples, such as 
this job:


https://gitlab.com/qemu-project/qemu/-/jobs/6700955287

I think the timeouts are caused by running too many parallel processes 
during testing.


The CI uses parallel builds:

make -j$(expr $(nproc) + 1) all check-build $MAKE_CHECK_ARGS

It looks like `nproc` returns 8, and make runs with 9 threads.
`meson test` uses the same value to run 9 test processes in parallel:

/builds/qemu-project/qemu/build/pyvenv/bin/meson test  --no-rebuild -t 1 
 --num-processes 9 --print-errorlogs


Since the host can only handle 8 parallel threads, 9 threads might 
already cause some tests to run non-deterministically.


But if some of the individual tests also use multithreading (according 
to my tests they do so with at least 3 or 4 threads), things get even 
worse. Then there are up to 4 * 9 = 36 threads competing to run on the 
available 8 cores.


In this scenario timeouts are expected and can occur randomly.

In my tests setting --num-processes to a lower value not only avoided 
timeouts but also reduced the processing overhead without increasing the 
runtime.
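
For illustration, that means re-running the CI's own command with a
smaller value, e.g. (the value 2 here is just an example):

  /builds/qemu-project/qemu/build/pyvenv/bin/meson test --no-rebuild -t 1 \
      --num-processes 2 --print-errorlogs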


Could we run all tests with `--num-processes 1`?

Thanks,
Stefan




Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

2024-04-20 Thread Stefan Weil

Am 16.04.24 um 14:17 schrieb Stefan Weil:

Am 16.04.24 um 14:10 schrieb Peter Maydell:


The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.

[...]

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?


I'll have a look.


Short summary:

The "persistent intermittent failures due to jobs timing out" are not 
related to TCI: they also occur if the same tests are run with the 
normal TCG. I suggest that the CI tests should run single threaded.


But let's have a look at the details of my results.

I ran `time make test` in different scenarios on the rather old
and not very performant VM which I typically use for QEMU builds. I did
not restrict the tests to selected architectures as is done in the
QEMU CI tests. Therefore I had more tests, all of which completed successfully:


Ok: 848
Expected Fail:  0
Fail:   0
Unexpected Pass:0
Skipped:68
Timeout:0

---

1st test with normal TCG

`nohup time ../configure --enable-modules --disable-werror && nohup time 
make -j4 && nohup time make test`

[...]
Full log written to 
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
2296.08user 1525.02system 21:49.78elapsed 291%CPU (0avgtext+0avgdata 
633476maxresident)k

1730448inputs+14140528outputs (11668major+56827263minor)pagefaults 0swaps

---

2nd test with TCI

`nohup time ../configure --enable-modules --disable-werror 
--enable-tcg-interpreter && nohup time make -j4 && nohup time make test`

[...]
Full log written to 
/home/stefan/src/gitlab/qemu-project/qemu/bin/ndebug/x86_64-linux-gnu/meson-logs/testlog.txt
3766.74user 1521.38system 26:50.51elapsed 328%CPU (0avgtext+0avgdata 
633012maxresident)k

32768inputs+14145080outputs (3033major+56121586minor)pagefaults 0swaps

---

So the total test time with TCI was 26:50.51 minutes while for the 
normal test it was 21:49.78 minutes.


These 10 individual tests had the longest durations:

1st test with normal TCG

  94/916 qtest-arm / qtest-arm/qom-test 373.41s
  99/916 qtest-aarch64 / qtest-aarch64/qom-test 398.43s
100/916 qtest-i386 / qtest-i386/bios-tables-test   188.06s
103/916 qtest-x86_64 / qtest-x86_64/bios-tables-test   228.33s
106/916 qtest-aarch64 / qtest-aarch64/migration-test   201.15s
119/916 qtest-i386 / qtest-i386/migration-test 253.58s
126/916 qtest-x86_64 / qtest-x86_64/migration-test 266.66s
143/916 qtest-arm / qtest-arm/test-hmp 101.72s
144/916 qtest-aarch64 / qtest-aarch64/test-hmp 113.10s
163/916 qtest-arm / qtest-arm/aspeed_smc-test  256.92s

2nd test with TCI

  68/916 qtest-arm / qtest-arm/qom-test 375.35s
  82/916 qtest-aarch64 / qtest-aarch64/qom-test 403.50s
  93/916 qtest-i386 / qtest-i386/bios-tables-test   192.22s
  99/916 qtest-aarch64 / qtest-aarch64/bios-tables-test 379.92s
100/916 qtest-x86_64 / qtest-x86_64/bios-tables-test   240.19s
103/916 qtest-aarch64 / qtest-aarch64/migration-test   223.49s
106/916 qtest-ppc64 / qtest-ppc64/pxe-test 418.42s
113/916 qtest-i386 / qtest-i386/migration-test 284.96s
118/916 qtest-arm / qtest-arm/aspeed_smc-test  271.10s
119/916 qtest-x86_64 / qtest-x86_64/migration-test 287.36s

---

Summary:

TCI is not much slower than normal TCG. Surprisingly it was even
faster for tests 99 and 103. For other tests, like test 106, TCI was
about half as fast as normal TCG, but overall it is not slower by
"factors". A total test time of 26:50 minutes is also not bad compared
with the 21:49 minutes of normal TCG.


No single test (including subtests) with TCI exceeded 10 minutes; the
longest one, at 418 seconds, was well below that limit.


---

The tests above were run on x86_64, and I could not reproduce the
timeouts. The GitLab CI tests run on i686 and use different
configure options. Therefore I ran additional tests with 32-bit builds
in a chroot environment (Debian GNU/Linux bullseye, i386) using the
original configure options. As expected, that reduced the number of tests
to 250. All tests passed with `make test`:


3rd test with normal TCG

Ok: 250
Expected Fail:  0
Fail:   0
Unexpected Pass:0
Skipped:8
Timeout:0

Full log written to 
/root/qemu/bin/ndebug/i586-linux-gnu/meson-logs/testlog.txt
855.30user 450.53system 6:45.57elapsed 321%CPU (0avgtext+0avgdata 
609180maxresident)k

28232inputs+4772968outputs (64944major+8328814minor)pagefaults 0swaps


4th test with TCI

Ok: 250
Expected Fail:  0
Fail:   0
Unexpected Pass:0
Skipped:8
Timeout:0

Full log 

Re: cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

2024-04-16 Thread Stefan Weil

Am 16.04.24 um 14:10 schrieb Peter Maydell:


The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.

[...]

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?



I'll have a look.

Regards

Stefan




cross-i686-tci CI job is flaky again (timeouts): can somebody who cares about TCI investigate?

2024-04-16 Thread Peter Maydell
The cross-i686-tci job is flaky again, with persistent intermittent
failures due to jobs timing out.
https://gitlab.com/qemu-project/qemu/-/issues/2285 has the details
with links to 8 CI jobs in the last week or so with timeouts, typically
something like:

 16/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp
TIMEOUT 240.17s   killed by signal 15 SIGTERM
 73/258 qemu:qtest+qtest-ppc / qtest-ppc/boot-serial-test
TIMEOUT 360.14s   killed by signal 15 SIGTERM
 78/258 qemu:qtest+qtest-i386 / qtest-i386/ide-test
TIMEOUT  60.09s   killed by signal 15 SIGTERM
253/258 qemu:softfloat+softfloat-ops / fp-test-mul
TIMEOUT  30.11s   killed by signal 15 SIGTERM
254/258 qemu:softfloat+softfloat-ops / fp-test-div
TIMEOUT  30.25s   killed by signal 15 SIGTERM
255/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
TIMEOUT 480.23s   killed by signal 15 SIGTERM
257/258 qemu:qtest+qtest-aarch64 / qtest-aarch64/bios-tables-test
TIMEOUT 610.10s   killed by signal 15 SIGTERM

but not always those exact tests. This isn't the first time
this CI job for TCI has been flaky either.

Some of these timeouts are very high -- no test should be taking
10 minutes, even given TCI and a slowish CI runner -- which suggests
to me that there's some kind of intermittent deadlock going on.

Can somebody who cares about TCI investigate, please, and track
down whatever this is?

(My alternate suggestion is that we mark TCI as deprecated in 9.1
and drop it entirely, if nobody cares enough about it...)

thanks
-- PMM