Bug#1005889: dbus: flaky autopkgtest on ppc64el: dbus/integration/transient-services.sh.test

2022-02-22 Thread Paul Gevers

Hi,

On 21-02-2022 20:03, Paul Gevers wrote:

>> I have a possible patch which I'll upload soon. Would you be able to
>> schedule several consecutive runs on the affected hardware to make
>> sure it's really fixed? 10 runs should be enough for a reasonable level
>> of confidence.
>
> Sure, but anybody (with Salsa credentials) can schedule those jobs. Just
> hitting the retry button will do. Results should be fast too as they are
> scheduled with higher prio.


11 runs happened. None of them failed.
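
As a rough sanity check on what a clean streak tells us: if the test were still failing about 25% of the time (the rate estimated earlier in this thread), the chance of a run of passes by luck alone can be computed as below. This is only a sketch; it assumes the runs are independent, and the function name is made up for illustration.

```python
def chance_of_clean_streak(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass at the given failure rate."""
    return (1.0 - failure_rate) ** runs


# 0.25 is the approximate failure rate read off the CI page; 10 and 11 are
# the run counts mentioned in this thread.
for runs in (10, 11):
    print(runs, round(chance_of_clean_streak(0.25, runs), 4))
# prints:
# 10 0.0563
# 11 0.0422
```

So 11 consecutive passes would happen by chance only about 4% of the time if nothing had changed, which is reasonable (not absolute) confidence.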

Paul





2022-02-21 Thread Paul Gevers

Hi Simon,

On 21-02-2022 12:10, Simon McVittie wrote:

> Is there anything unusual about the ppc64el CI-runners compared with other
> architectures? (For example: lots of CPUs, few CPUs, lots of RAM, less RAM,
> lots of I/O bandwidth, running on tmpfs, using qemu, using lxc, running
> many tests in parallel, ...)


Our ppc64el runners are quite similar in terms of CPU, RAM etc. to most
of our amd64/i386/arm64 workers. The one difference I noticed is that
they seem to run in a virtual environment:

debian@ci-worker-ppc64el-01:~$ lspci
00:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:02.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
00:04.0 Communication controller: Red Hat, Inc. Virtio console
00:05.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
00:06.0 VGA compatible controller: Device 1234: (rev 02)
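
The reasoning behind that conclusion can be sketched in a few lines of Python. This is a heuristic under my own assumptions: the function name `looks_virtualized` and the keyword list are invented for illustration, and seeing "Virtio" or "QEMU" devices in `lspci` output suggests a QEMU-style guest without proving it.

```python
def looks_virtualized(lspci_output: str) -> bool:
    """Heuristic: True if any lspci line names a Virtio or QEMU device."""
    keywords = ("virtio", "qemu")  # assumed markers of a virtual machine
    return any(
        keyword in line.lower()
        for line in lspci_output.splitlines()
        for keyword in keywords
    )


# Two of the lines from the lspci output above.
sample = """\
00:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
"""
print(looks_virtualized(sample))  # prints True
```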


> From https://ci.debian.net/packages/d/dbus/testing/ppc64el/ it looks like
> this is failing about 25% of the time, does that match your experience?


I was totally judging from this page, so yes.


>> Bail out! /run/user/1000/dbus-1/services is not a directory
>
> My best guess at the root cause for this is that when
> gnome-desktop-testing-runner schedules lots of unit tests in a
> newly-opened user session, if the integration test for transient
> services happens to be one of the first ones to be run, then the session
> dbus-daemon will not necessarily have been started by systemd socket
> activation just yet. If the test runner has a large number of CPU cores,
> then that makes it more likely that the test will win the race with the
> dbus-daemon, resulting in failure.


I don't experience our ppc64el hosts as extremely fast, but who knows.


> I have a possible patch which I'll upload soon. Would you be able to
> schedule several consecutive runs on the affected hardware to make
> sure it's really fixed? 10 runs should be enough for a reasonable level
> of confidence.


Sure, but anybody (with Salsa credentials) can schedule those jobs. Just 
hitting the retry button will do. Results should be fast too as they are 
scheduled with higher prio.


Paul





2022-02-21 Thread Simon McVittie
On Wed, 16 Feb 2022 at 21:15:16 +0100, Paul Gevers wrote:
> I looked at the results of the autopkgtest of your package on ppc64el because
> it was showing up as a regression for the upload of glibc. I noticed that
> the test regularly fails since the beginning of February this year. The
> failure is always the same (so far), and it happens on multiple of our
> hosts.

Is there anything unusual about the ppc64el CI-runners compared with other
architectures? (For example: lots of CPUs, few CPUs, lots of RAM, less RAM,
lots of I/O bandwidth, running on tmpfs, using qemu, using lxc, running
many tests in parallel, ...)

From https://ci.debian.net/packages/d/dbus/testing/ppc64el/ it looks like
this is failing about 25% of the time, does that match your experience?

From the timing you mention, I think this was probably triggered by
systemd having added a Recommends on dbus-user-session in libpam-systemd.
Previously, this test would have been skipped if dbus was tested in a
relatively minimal container.

> Bail out! /run/user/1000/dbus-1/services is not a directory

My best guess at the root cause for this is that when
gnome-desktop-testing-runner schedules lots of unit tests in a
newly-opened user session, if the integration test for transient
services happens to be one of the first ones to be run, then the session
dbus-daemon will not necessarily have been started by systemd socket
activation just yet. If the test runner has a large number of CPU cores,
then that makes it more likely that the test will win the race with the
dbus-daemon, resulting in failure.
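
To make the race concrete, here is a minimal sketch of one way a check could tolerate a late-starting daemon: poll for the directory with a deadline instead of testing it once. This is not the patch mentioned in this thread (its contents are not shown here). The helper name, the timeout values, and the use of `XDG_RUNTIME_DIR` are assumptions for illustration, and polling only helps if something else actually causes the dbus-daemon to start.

```python
import os
import time


def wait_for_directory(path: str, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until `path` is a directory or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isdir(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage: build the path from XDG_RUNTIME_DIR, falling back to the
# path seen in the failure message.
runtime_dir = os.environ.get("XDG_RUNTIME_DIR", "/run/user/1000")
services_dir = os.path.join(runtime_dir, "dbus-1", "services")
if not wait_for_directory(services_dir, timeout=1.0):
    print(f"Bail out! {services_dir} is not a directory")
```

A single immediate check is the limiting case `timeout=0`, which is where a fast test can lose out to a daemon that has not started yet.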

I have a possible patch which I'll upload soon. Would you be able to
schedule several consecutive runs on the affected hardware to make
sure it's really fixed? 10 runs should be enough for a reasonable level
of confidence.

smcv




2022-02-16 Thread Paul Gevers

Source: dbus
Version: 1.12.20-3
Severity: serious
X-Debbugs-CC: debian...@lists.debian.org
User: debian...@lists.debian.org
Usertags: flaky

Dear maintainer(s),

I looked at the results of the autopkgtest of your package on ppc64el
because it was showing up as a regression for the upload of glibc. I 
noticed that the test regularly fails since the beginning of February 
this year. The failure is always the same (so far), and it happens on 
multiple of our hosts.


Because the unstable-to-testing migration software now blocks on
regressions in testing, flaky tests, i.e. tests that flip between
passing and failing without changes to the list of installed packages,
are causing people unrelated to your package to spend time on these
tests.
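
To illustrate that definition of "flaky", here is a small sketch that flags any set of installed packages for which the same test both passed and failed. The data layout, the keys, and the function name are invented for this example; it is not how the migration software or ci.debian.net actually classifies results.

```python
from collections import defaultdict


def flaky_package_sets(history):
    """Return package-set keys that saw both a pass and a fail.

    `history` is an iterable of (package_set_key, passed) pairs, where the key
    stands for one exact list of installed packages.
    """
    outcomes = defaultdict(set)
    for package_set_key, passed in history:
        outcomes[package_set_key].add(bool(passed))
    return sorted(key for key, seen in outcomes.items() if seen == {True, False})


# Made-up history: "set-a" flips between pass and fail, "set-b" always passes.
history = [("set-a", True), ("set-a", False), ("set-b", True), ("set-b", True)]
print(flaky_package_sets(history))  # prints ['set-a']
```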

Don't hesitate to reach out if you need help and some more information
from our infrastructure.

Paul

https://ci.debian.net/packages/d/dbus/testing/ppc64el/

https://ci.debian.net/data/autopkgtest/testing/ppc64el/d/dbus/18861487/log.gz

Running test: dbus-debug-build/integration/transient-services.sh_with_config.test

1..2
Bail out! /run/user/1000/dbus-1/services is not a directory
FAIL: dbus-debug-build/integration/transient-services.sh_with_config.test (Child process exited with code 1)


[...]

SUMMARY: total=88; passed=87; skipped=0; failed=1; user=81.8s; system=55.2s; maxrss=23680
FAIL: dbus-debug-build/integration/transient-services.sh_with_config.test (Child process exited with code 1)

