Bug#985617: glibc: flaky autopkgtest on most architectures
On 2021-04-25 10:39, Simon McVittie wrote:
> On Sun, 25 Apr 2021 at 10:14:51 +0100, Simon McVittie wrote:
> > On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> > > On 25-04-2021 01:55, Aurelien Jarno wrote:
> > > > It appears that all the failures are related to containers. I have been
> > > > able to reproduce the issue with a bullseye kernel, which defaults to
> > > > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > > > still use a buster kernel (at least in the case of this build log).
>
> Looking at support/test-container.c, it seems that these tests will
> automatically be skipped (FAIL_UNSUPPORTED) on a kernel that restricts
> userns creation (like buster), and will be run (and perhaps fail)
> on a kernel that does not (like bullseye). So it is not necessarily
> a *regression* that they fail - they might just never have been tried
> before we started using bullseye kernels.
>
> The brute-force approach to making the autopkgtest not be flaky would be
> to make these tests FAIL_UNSUPPORTED unconditionally, which will result
> in the same coverage we would have had on buster kernels. Obviously it
> would be better if they could be made to pass, but some reliable testing
> is better than none.
>
> These tests seem to be failing here (support/test-container.c:1095):
>
>     execvp (new_child_proc[0], new_child_proc);
>
>     /* Or don't run the child?  */
>     FAIL_EXIT1 ("Unable to exec %s\n", new_child_proc[0]);
>
> It would be useful if this printed strerror(errno) at least, so that we
> can see whether it's ENOENT or EACCES or something else.
>
> Perhaps the test support code is not copying/mounting everything that needs
> to be copied/mounted into the container's filesystem? More debug logging in
> support/test-container.c would probably be helpful here - perhaps even
> running 'find . -ls' in the new_root_path before chrooting into it?

Yes, this is exactly the problem.

This is due to the patch any/local-rtlddir-cross.diff, which removes a
snippet of code installing the ld.so symlink. Instead, the symlink is
installed in an ugly way in debian/rules.d/build.mk. Both can be dropped
to make things work. However, I am not sure what the consequences are for
cross builds, which also use the same code from build.mk anyway. I am
currently investigating.

Aurelien

-- 
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
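A missing or dangling dynamic loader symlink inside the container root is one way for execvp to fail even though the test binary itself is present. As a rough sketch only (classify_path and the path_state enum are hypothetical names, not part of glibc's support code), a check like the following could be run on the loader path below new_root_path before chrooting:

```c
#include <sys/stat.h>

/* Hypothetical helper for illustration; not taken from
   support/test-container.c.  */
enum path_state { PATH_OK, PATH_MISSING, PATH_DANGLING_SYMLINK };

/* Classify PATH: is anything there at all, and if it is a symlink,
   does its target exist?  */
static enum path_state
classify_path (const char *path)
{
  struct stat st;

  /* lstat does not follow symlinks, so failure means no such name.  */
  if (lstat (path, &st) != 0)
    return PATH_MISSING;

  /* stat follows symlinks, so failure here means the target is absent.  */
  if (S_ISLNK (st.st_mode) && stat (path, &st) != 0)
    return PATH_DANGLING_SYMLINK;

  return PATH_OK;
}
```

Which path to check depends on the architecture's loader name, so that is left to the caller here.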
Bug#985617: glibc: flaky autopkgtest on most architectures
On Sun, 25 Apr 2021 at 10:14:51 +0100, Simon McVittie wrote:
> On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> > On 25-04-2021 01:55, Aurelien Jarno wrote:
> > > It appears that all the failures are related to containers. I have been
> > > able to reproduce the issue with a bullseye kernel, which defaults to
> > > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > > still use a buster kernel (at least in the case of this build log).

Looking at support/test-container.c, it seems that these tests will
automatically be skipped (FAIL_UNSUPPORTED) on a kernel that restricts
userns creation (like buster), and will be run (and perhaps fail)
on a kernel that does not (like bullseye). So it is not necessarily
a *regression* that they fail - they might just never have been tried
before we started using bullseye kernels.

The brute-force approach to making the autopkgtest not be flaky would be
to make these tests FAIL_UNSUPPORTED unconditionally, which will result
in the same coverage we would have had on buster kernels. Obviously it
would be better if they could be made to pass, but some reliable testing
is better than none.

These tests seem to be failing here (support/test-container.c:1095):

    execvp (new_child_proc[0], new_child_proc);

    /* Or don't run the child?  */
    FAIL_EXIT1 ("Unable to exec %s\n", new_child_proc[0]);

It would be useful if this printed strerror(errno) at least, so that we
can see whether it's ENOENT or EACCES or something else.

Perhaps the test support code is not copying/mounting everything that needs
to be copied/mounted into the container's filesystem? More debug logging in
support/test-container.c would probably be helpful here - perhaps even
running 'find . -ls' in the new_root_path before chrooting into it?

    smcv
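The suggestion to include strerror(errno) in the exec failure message could look roughly like the sketch below. It is a standalone illustration, not a patch against support/test-container.c: format_exec_error and exec_or_report are hypothetical names, and the real code would presumably keep using FAIL_EXIT1.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build a diagnostic for a failed exec that
   includes the strerror text and the numeric errno value.  */
static int
format_exec_error (char *buf, size_t len, const char *prog, int err)
{
  return snprintf (buf, len, "Unable to exec %s: %s (errno %d)",
                   prog, strerror (err), err);
}

/* Try to exec ARGV[0].  execvp only returns on failure; in that case
   print the reason to stderr and return the saved errno.  */
static int
exec_or_report (char *const argv[])
{
  execvp (argv[0], argv);

  /* Save errno first: later library calls may overwrite it.  */
  int saved_errno = errno;
  char msg[512];

  format_exec_error (msg, sizeof msg, argv[0], saved_errno);
  fprintf (stderr, "%s\n", msg);
  return saved_errno;
}
```

With output of this shape, the log would distinguish ENOENT (for example a missing file or loader) from EACCES (for example a permission problem) directly.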
Bug#985617: glibc: flaky autopkgtest on most architectures
On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> On 25-04-2021 01:55, Aurelien Jarno wrote:
> > It appears that all the failures are related to containers. I have been
> > able to reproduce the issue with a bullseye kernel, which defaults to
> > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > still use a buster kernel (at least in the case of this build log).
>
> That's correct, all workers run stable except s390x.
>
> > Could it be that kernel.unprivileged_userns_clone is enabled on some of
> > the runners?
>
> If I want to make our workers equal, I guess
> changing them all to the default sounds sane, right? Do you know if the
> default is different for buster and bullseye?

The default was kernel.unprivileged_userns_clone=0 in buster kernels and
was switched to kernel.unprivileged_userns_clone=1 in bullseye kernels.

References:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898446
https://salsa.debian.org/kernel-team/linux/-/commit/a381917851e762684ebe28e04c5ae0d8be7f42c7

If you want a quick way to get consistent behaviour, installing the
bubblewrap package from bullseye (but not buster-backports!) installs
a sysctl.d fragment to set kernel.unprivileged_userns_clone=1 even on
older kernels.

    smcv
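For anyone wanting to check which setting a given machine has from C rather than with cat, a minimal sketch follows. The function names are hypothetical; the file path passed in would be /proc/sys/kernel/unprivileged_userns_clone, the same file queried on the workers elsewhere in this thread. How to interpret a missing file is left to the caller.

```c
#include <stdio.h>

/* Hypothetical helper: parse the text of the sysctl file.
   Returns 0 or 1 on success, -1 if the text is anything else.  */
static int
parse_userns_clone (const char *text)
{
  int value;

  if (sscanf (text, "%d", &value) != 1)
    return -1;
  return (value == 0 || value == 1) ? value : -1;
}

/* Read the setting from PATH.  Returns 0 or 1, or -1 if the file
   cannot be opened or does not contain 0 or 1.  */
static int
read_userns_clone (const char *path)
{
  FILE *f = fopen (path, "r");
  char line[32];
  int result = -1;

  if (f == NULL)
    return -1;
  if (fgets (line, sizeof line, f) != NULL)
    result = parse_userns_clone (line);
  fclose (f);
  return result;
}
```

A caller would use read_userns_clone ("/proc/sys/kernel/unprivileged_userns_clone") and treat 1 as "unprivileged user namespaces allowed", matching the bullseye default described above.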
Bug#985617: glibc: flaky autopkgtest on most architectures
Hi Aurelien,

On 25-04-2021 01:55, Aurelien Jarno wrote:
> It appears that all the failures are related to containers. I have been
> able to reproduce the issue with a bullseye kernel, which defaults to
> kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> still use a buster kernel (at least in the case of this build log).

That's correct, all workers run stable except s390x.

> Could it be that kernel.unprivileged_userns_clone is enabled on some of
> the runners? It doesn't seem to be the case for all the runners as the
> autopkgtest ran successfully for the latest glibc upload.

paul@mulciber ~/debian-maint/ci.d.n-config $ rake -j40 run:workers
# Enter command to run (use arrow keys for history):
$ cat /proc/sys/kernel/unprivileged_userns_clone
[]
ci-worker-armhf-01: 0
ci-worker13: 1
ci-worker-s390x-01: 1
ci-worker12: 0
ci-worker11: 0
ci-worker03: 0
ci-worker05: 0
ci-worker-i386-04: 1
ci-worker-i386-01: 1
ci-worker-i386-03: 1
ci-worker06: 0
ci-worker01: 1
ci-worker09: 0
ci-worker07: 0
ci-worker-i386-02: 0
ci-worker02: 0
ci-worker10: 0
ci-worker-ppc64el-02: 0
ci-worker-ppc64el-04: 0
ci-worker04: 0
ci-worker08: 0
ci-worker-arm64-04: 0
ci-worker-ppc64el-03: 0
ci-worker-arm64-07: 1
ci-worker-arm64-02: 0
ci-worker-arm64-05: 0
ci-worker-arm64-06: 0
ci-worker-arm64-03: 0
ci-worker-arm64-11: 0
ci-worker-arm64-08: 0
ci-worker-arm64-09: 1
ci-worker-arm64-10: 0
ci-worker-ppc64el-01: 0

[Note: some ci-workerXX are i386 workers, most are amd64].

> In any case, as it is reproducible with the bullseye kernel, this
> definitely needs a fix.

Thanks for working on this.

If I want to make our workers equal, I guess changing them all to the
default sounds sane, right? Do you know if the default is different for
buster and bullseye? If so, does it make sense to already go with the
bullseye default?

Paul
Bug#985617: glibc: flaky autopkgtest on most architectures
On 2021-04-23 17:47, Paul Gevers wrote:
> Hi Aurelien,
>
> On 23-04-2021 14:49, Aurelien Jarno wrote:
> > Nope, unfortunately it seems the mail didn't reach me or the mailing
> > list, maybe it was too big?
>
> It did reach the BTS. I guess size may have been a factor yes, the log
> can be picked up in the BTS.

Yes, I confirm it is archived in the BTS.

It appears that all the failures are related to containers. I have been
able to reproduce the issue with a bullseye kernel, which defaults to
kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
still use a buster kernel (at least in the case of this build log).

Could it be that kernel.unprivileged_userns_clone is enabled on some of
the runners? It doesn't seem to be the case for all the runners as the
autopkgtest ran successfully for the latest glibc upload.

In any case, as it is reproducible with the bullseye kernel, this
definitely needs a fix.

Regards,
Aurelien

-- 
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Bug#985617: glibc: flaky autopkgtest on most architectures
Hi Aurelien,

On 23-04-2021 14:49, Aurelien Jarno wrote:
> Nope, unfortunately it seems the mail didn't reach me or the mailing
> list, maybe it was too big?

It did reach the BTS. I guess size may have been a factor yes, the log
can be picked up in the BTS.

Paul
Bug#985617: glibc: flaky autopkgtest on most architectures
On 2021-04-22 22:26, Paul Gevers wrote:
> Hi Aurelien,
>
> On Mon, 22 Mar 2021 19:54:22 +0100 Paul Gevers wrote:
> > Hi Aurelien,
> >
> > On 21-03-2021 00:03, Aurelien Jarno wrote:
> > > Yes, could you please provide a full log? I am not able to reproduce the
> > > issue locally nor on barriere.d.o, so I have no idea what fails.
> >
> > Please find attached a full log of a failure.
> >
> > Please let me know if I need to try to get more info.
>
> Did you see this reply? Did it help?

Nope, unfortunately it seems the mail didn't reach me or the mailing
list, maybe it was too big?

Regards,
Aurelien

-- 
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Bug#985617: glibc: flaky autopkgtest on most architectures
Hi Aurelien,

On Mon, 22 Mar 2021 19:54:22 +0100 Paul Gevers wrote:
> Hi Aurelien,
>
> On 21-03-2021 00:03, Aurelien Jarno wrote:
> > Yes, could you please provide a full log? I am not able to reproduce the
> > issue locally nor on barriere.d.o, so I have no idea what fails.
>
> Please find attached a full log of a failure.
>
> Please let me know if I need to try to get more info.

Did you see this reply? Did it help?

Paul
Bug#985617: glibc: flaky autopkgtest on most architectures
Hi Aurelien,

On 21-03-2021 00:03, Aurelien Jarno wrote:
> Yes, could you please provide a full log? I am not able to reproduce the
> issue locally nor on barriere.d.o, so I have no idea what fails.

Of course, as soon as you try to reproduce it, it doesn't work. I had
5 runs on arm64 which all succeeded. I'm wondering if this flakiness
comes from activities in parallel runs. I'll try again tomorrow.

Paul
Bug#985617: glibc: flaky autopkgtest on most architectures
control: tag -1 + moreinfo

On 2021-03-20 21:05, Paul Gevers wrote:
> Source: glibc
> Version: 2.31-9
> Severity: serious
> Tags: sid bullseye
> X-Debbugs-CC: debian...@lists.debian.org
> User: debian...@lists.debian.org
> Usertags: flaky
>
> Dear maintainer(s),
>
> Your package has an autopkgtest, great. However, I looked into
> the history of your autopkgtest [1] and I noticed it fails regularly
> lately. Unfortunately, the log of glibc is so long that it gets
> truncated on the ci.d.n infrastructure, so I can't copy any useful log.

This looks suspicious, as there has been no new upload since the
toolchain freeze. Has anything changed in the autopkgtest infrastructure
lately?

> Because the unstable-to-testing migration software now blocks on
> regressions in testing, flaky tests, i.e. tests that flip between
> passing and failing without changes to the list of installed packages,
> are causing people unrelated to your package to spend time on these
> tests.
>
> Please get in touch if you need a full log, I can try to generate one.

Yes, could you please provide a full log? I am not able to reproduce the
issue locally nor on barriere.d.o, so I have no idea what fails.

Thanks,
Aurelien

-- 
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Processed: Re: Bug#985617: glibc: flaky autopkgtest on most architectures
Processing control commands:

> tag -1 + moreinfo
Bug #985617 [src:glibc] glibc: flaky autopkgtest on most architectures
Added tag(s) moreinfo.

-- 
985617: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985617
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#985617: glibc: flaky autopkgtest on most architectures
Source: glibc
Version: 2.31-9
Severity: serious
Tags: sid bullseye
X-Debbugs-CC: debian...@lists.debian.org
User: debian...@lists.debian.org
Usertags: flaky

Dear maintainer(s),

Your package has an autopkgtest, great. However, I looked into
the history of your autopkgtest [1] and I noticed it fails regularly
lately. Unfortunately, the log of glibc is so long that it gets
truncated on the ci.d.n infrastructure, so I can't copy any useful log.

Because the unstable-to-testing migration software now blocks on
regressions in testing, flaky tests, i.e. tests that flip between
passing and failing without changes to the list of installed packages,
are causing people unrelated to your package to spend time on these
tests.

Please get in touch if you need a full log, I can try to generate one.

Paul

https://ci.debian.net/packages/g/glibc/testing/amd64/
https://ci.debian.net/packages/g/glibc/testing/arm64/
https://ci.debian.net/packages/g/glibc/testing/armhf/
https://ci.debian.net/packages/g/glibc/testing/i386/