Bug#1005889: dbus: flaky autopkgtest on ppc64el: dbus/integration/transient-services.sh.test
Hi,

On 21-02-2022 20:03, Paul Gevers wrote:
>> I have a possible patch which I'll upload soon. Would you be able to
>> schedule several consecutive runs on the affected hardware to make sure
>> it's really fixed? 10 runs should be enough for a reasonable level of
>> confidence.
>
> Sure, but anybody (with Salsa credentials) can schedule those jobs. Just
> hitting the retry button will do. Results should be fast too as they are
> scheduled with higher prio.

11 runs happened. None of them failed.

Paul
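For a rough sense of what 11 clean runs mean, here is a small back-of-the-envelope sketch. It assumes the failure rate before the fix was about 25% (the figure estimated elsewhere in this thread from the ci.debian.net page) and that runs are independent; both are assumptions, and the function name is made up for illustration.

```python
def chance_of_all_passing(failure_rate, runs):
    """Probability that `runs` independent runs all pass by luck alone,
    given a per-run failure probability of `failure_rate`."""
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    # 0.25 is the approximate pre-fix failure rate assumed above.
    for runs in (10, 11):
        p = chance_of_all_passing(0.25, runs)
        print(f"{runs} runs: {p:.1%} chance of no failure if nothing was fixed")
```

Under those assumptions, an unfixed test would get through 11 consecutive runs only about 4% of the time, so the result is reasonably strong evidence for the fix but not proof.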
Bug#1005889: dbus: flaky autopkgtest on ppc64el: dbus/integration/transient-services.sh.test
Hi Simon,

On 21-02-2022 12:10, Simon McVittie wrote:
> Is there anything unusual about the ppc64el CI-runners compared with
> other architectures? (For example: lots of CPUs, few CPUs, lots of RAM,
> less RAM, lots of I/O bandwidth, running on tmpfs, using qemu, using lxc,
> running many tests in parallel, ...)

Our ppc64el runners are quite similar in terms of CPU, RAM etc. to most
of our amd64/i386/arm64 workers. The one difference I noticed is that
they seem to run in a virtual environment:

    debian@ci-worker-ppc64el-01:~$ lspci
    00:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
    00:02.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
    00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
    00:04.0 Communication controller: Red Hat, Inc. Virtio console
    00:05.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
    00:06.0 VGA compatible controller: Device 1234: (rev 02)

> From https://ci.debian.net/packages/d/dbus/testing/ppc64el/ it looks like
> this is failing about 25% of the time, does that match your experience?

I was judging entirely from this page, so yes.

>> Bail out! /run/user/1000/dbus-1/services is not a directory
>
> My best guess at the root cause for this is that when
> gnome-desktop-testing-runner schedules lots of unit tests in a
> newly-opened user session, if the integration test for transient services
> happens to be one of the first ones to be run, then the session
> dbus-daemon will not necessarily have been started by systemd socket
> activation just yet. If the test runner has a large number of CPU cores,
> then that makes it more likely that the test will win the race with the
> dbus-daemon, resulting in failure.

I don't experience our ppc64el hosts as extremely fast, but who knows.

> I have a possible patch which I'll upload soon. Would you be able to
> schedule several consecutive runs on the affected hardware to make sure
> it's really fixed? 10 runs should be enough for a reasonable level of
> confidence.

Sure, but anybody (with Salsa credentials) can schedule those jobs. Just
hitting the retry button will do. Results should be fast too as they are
scheduled with higher prio.

Paul
Bug#1005889: dbus: flaky autopkgtest on ppc64el: dbus/integration/transient-services.sh.test
On Wed, 16 Feb 2022 at 21:15:16 +0100, Paul Gevers wrote:
> I looked at the results of the autopkgtest of your package on ppc64el
> because it was showing up as a regression for the upload of glibc. I
> noticed that the test has been failing regularly since the beginning of
> February this year. The failure is always the same (so far), and it
> happens on several of our hosts.

Is there anything unusual about the ppc64el CI-runners compared with
other architectures? (For example: lots of CPUs, few CPUs, lots of RAM,
less RAM, lots of I/O bandwidth, running on tmpfs, using qemu, using lxc,
running many tests in parallel, ...)

From https://ci.debian.net/packages/d/dbus/testing/ppc64el/ it looks like
this is failing about 25% of the time, does that match your experience?

From the timing you mention, I think this was probably triggered by
systemd having added a Recommends on dbus-user-session in libpam-systemd.
Previously, this test would have been skipped if dbus was tested in a
relatively minimal container.

> Bail out! /run/user/1000/dbus-1/services is not a directory

My best guess at the root cause for this is that when
gnome-desktop-testing-runner schedules lots of unit tests in a
newly-opened user session, if the integration test for transient services
happens to be one of the first ones to be run, then the session
dbus-daemon will not necessarily have been started by systemd socket
activation just yet. If the test runner has a large number of CPU cores,
then that makes it more likely that the test will win the race with the
dbus-daemon, resulting in failure.

I have a possible patch which I'll upload soon. Would you be able to
schedule several consecutive runs on the affected hardware to make sure
it's really fixed? 10 runs should be enough for a reasonable level of
confidence.

    smcv
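To illustrate the kind of race guessed at above: one generic way for a test to tolerate a directory that appears slightly late is to poll for it with a deadline instead of checking once. The following is only a minimal sketch in Python under that assumption. It is not the patch that was uploaded, and `wait_for_directory`, its timeout and its polling interval are hypothetical names and values chosen for illustration. It also assumes the dbus-daemon does eventually start on its own; if nothing connects to its socket, waiting alone would not help.

```python
import os
import time


def wait_for_directory(path, timeout=10.0, interval=0.1):
    """Poll until `path` is a directory or `timeout` seconds have elapsed.

    Returns True if the directory appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isdir(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Path taken from the failure message in the log; the uid will differ
    # on other systems.
    services_dir = "/run/user/1000/dbus-1/services"
    if not wait_for_directory(services_dir):
        print(f"Bail out! {services_dir} is not a directory")
```

A bounded wait like this keeps the test from hanging forever while still giving a slow-starting daemon some slack.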
Bug#1005889: dbus: flaky autopkgtest on ppc64el: dbus/integration/transient-services.sh.test
Source: dbus
Version: 1.12.20-3
Severity: serious
X-Debbugs-CC: debian...@lists.debian.org
User: debian...@lists.debian.org
Usertags: flaky

Dear maintainer(s),

I looked at the results of the autopkgtest of your package on ppc64el
because it was showing up as a regression for the upload of glibc. I
noticed that the test has been failing regularly since the beginning of
February this year. The failure is always the same (so far), and it
happens on several of our hosts.

Because the unstable-to-testing migration software now blocks on
regressions in testing, flaky tests, i.e. tests that flip between passing
and failing without changes to the list of installed packages, are
causing people unrelated to your package to spend time on these tests.

Don't hesitate to reach out if you need help and some more information
from our infrastructure.

Paul

https://ci.debian.net/packages/d/dbus/testing/ppc64el/

https://ci.debian.net/data/autopkgtest/testing/ppc64el/d/dbus/18861487/log.gz

Running test: dbus-debug-build/integration/transient-services.sh_with_config.test
1..2
Bail out! /run/user/1000/dbus-1/services is not a directory
FAIL: dbus-debug-build/integration/transient-services.sh_with_config.test (Child process exited with code 1)
[...]
SUMMARY: total=88; passed=87; skipped=0; failed=1; user=81.8s; system=55.2s; maxrss=23680
FAIL: dbus-debug-build/integration/transient-services.sh_with_config.test (Child process exited with code 1)