Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-25 Thread Aurelien Jarno
On 2021-04-25 10:39, Simon McVittie wrote:
> On Sun, 25 Apr 2021 at 10:14:51 +0100, Simon McVittie wrote:
> > On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> > > On 25-04-2021 01:55, Aurelien Jarno wrote:
> > > > It appears that all the failures are related to containers. I have been
> > > > able to reproduce the issue with a bullseye kernel, which defaults to
> > > > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > > > still use a buster kernel (at least in the case of this build log).
> 
> Looking at support/test-container.c, it seems that these tests will
> automatically be skipped (FAIL_UNSUPPORTED) on a kernel that restricts
> userns creation (like buster), and will be run (and perhaps fail)
> on a kernel that does not (like bullseye). So it is not necessarily
> a *regression* that they fail - they might just never have been tried
> before we started using bullseye kernels.
> 
> The brute-force approach to making the autopkgtest not be flaky would be
> to make these tests FAIL_UNSUPPORTED unconditionally, which will result
> in the same coverage we would have had on buster kernels. Obviously it
> would be better if they could be made to pass, but some reliable testing
> is better than none.
> 
> These tests seem to be failing here (support/test-container.c:1095):
> 
>   execvp (new_child_proc[0], new_child_proc);
> 
>   /* Or don't run the child?  */
>   FAIL_EXIT1 ("Unable to exec %s\n", new_child_proc[0]);
> 
> It would be useful if this printed strerror(errno) at least, so that we
> can see whether it's ENOENT or EACCES or something else.
> 
> Perhaps the test support code is not copying/mounting everything that needs
> to be copied/mounted into the container's filesystem? More debug logging in
> support/test-container.c would probably be helpful here - perhaps even
> running 'find . -ls' in the new_root_path before chrooting into it?

Yes, this is exactly the problem. It is caused by the patch
any/local-rtlddir-cross.diff, which removes the snippet of code that
installs the ld.so symlink. The symlink is instead created in an ugly
way in debian/rules.d/build.mk. Dropping both makes things work fine.
However, I am not sure what the consequences are for cross builds, which
also use the same code from build.mk anyway. I am currently
investigating.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-25 Thread Simon McVittie
On Sun, 25 Apr 2021 at 10:14:51 +0100, Simon McVittie wrote:
> On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> > On 25-04-2021 01:55, Aurelien Jarno wrote:
> > > It appears that all the failures are related to containers. I have been
> > > able to reproduce the issue with a bullseye kernel, which defaults to
> > > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > > still use a buster kernel (at least in the case of this build log).

Looking at support/test-container.c, it seems that these tests will
automatically be skipped (FAIL_UNSUPPORTED) on a kernel that restricts
userns creation (like buster), and will be run (and perhaps fail)
on a kernel that does not (like bullseye). So it is not necessarily
a *regression* that they fail - they might just never have been tried
before we started using bullseye kernels.

The brute-force approach to making the autopkgtest not be flaky would be
to make these tests FAIL_UNSUPPORTED unconditionally, which will result
in the same coverage we would have had on buster kernels. Obviously it
would be better if they could be made to pass, but some reliable testing
is better than none.

These tests seem to be failing here (support/test-container.c:1095):

  execvp (new_child_proc[0], new_child_proc);

  /* Or don't run the child?  */
  FAIL_EXIT1 ("Unable to exec %s\n", new_child_proc[0]);

It would be useful if this printed strerror(errno) at least, so that we
can see whether it's ENOENT or EACCES or something else.

Perhaps the test support code is not copying/mounting everything that needs
to be copied/mounted into the container's filesystem? More debug logging in
support/test-container.c would probably be helpful here - perhaps even
running 'find . -ls' in the new_root_path before chrooting into it?

smcv



Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-25 Thread Simon McVittie
On Sun, 25 Apr 2021 at 08:11:48 +0200, Paul Gevers wrote:
> On 25-04-2021 01:55, Aurelien Jarno wrote:
> > It appears that all the failures are related to containers. I have been
> > able to reproduce the issue with a bullseye kernel, which defaults to
> > kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> > still use a buster kernel (at least in the case of this build log).
> 
> That's correct, all workers run stable except s390x.
> 
> > Could it be that kernel.unprivileged_userns_clone is enabled on some of
> > the runners?
>
> If I want to make our workers equal, I guess
> changing them all to the default sounds sane, right? Do you know if the
> default is different for buster and bullseye?

The default was kernel.unprivileged_userns_clone=0 in buster kernels and
was switched to kernel.unprivileged_userns_clone=1 in bullseye kernels.

References:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898446
https://salsa.debian.org/kernel-team/linux/-/commit/a381917851e762684ebe28e04c5ae0d8be7f42c7

If you want a quick way to get consistent behaviour, installing the
bubblewrap package from bullseye (but not buster-backports!) installs
a sysctl.d fragment to set kernel.unprivileged_userns_clone=1 even on
older kernels.

smcv



Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-25 Thread Paul Gevers
Hi Aurelien,

On 25-04-2021 01:55, Aurelien Jarno wrote:
> It appears that all the failures are related to containers. I have been
> able to reproduce the issue with a bullseye kernel, which defaults to
> kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
> still use a buster kernel (at least in the case of this build log).

That's correct, all workers run stable except s390x.

> Could it be that kernel.unprivileged_userns_clone is enabled on some of
> the runners? It doesn't seem to be the case for all the runners, as the
> autopkgtest ran successfully for the latest glibc upload.

paul@mulciber ~/debian-maint/ci.d.n-config $ rake -j40 run:workers
# Enter command to run (use arrow keys for history):
$ cat /proc/sys/kernel/unprivileged_userns_clone
[]
  ci-worker-armhf-01: 0
 ci-worker13: 1
  ci-worker-s390x-01: 1
 ci-worker12: 0
 ci-worker11: 0
 ci-worker03: 0
 ci-worker05: 0
   ci-worker-i386-04: 1
   ci-worker-i386-01: 1
   ci-worker-i386-03: 1
 ci-worker06: 0
 ci-worker01: 1
 ci-worker09: 0
 ci-worker07: 0
   ci-worker-i386-02: 0
 ci-worker02: 0
 ci-worker10: 0
ci-worker-ppc64el-02: 0
ci-worker-ppc64el-04: 0
 ci-worker04: 0
 ci-worker08: 0
  ci-worker-arm64-04: 0
ci-worker-ppc64el-03: 0
  ci-worker-arm64-07: 1
  ci-worker-arm64-02: 0
  ci-worker-arm64-05: 0
  ci-worker-arm64-06: 0
  ci-worker-arm64-03: 0
  ci-worker-arm64-11: 0
  ci-worker-arm64-08: 0
  ci-worker-arm64-09: 1
  ci-worker-arm64-10: 0
ci-worker-ppc64el-01: 0

[Note: some ci-workerXX are i386 workers, most are amd64].

> In any case, as it is reproducible with the bullseye kernel, this
> definitely needs a fix.

Thanks for working on this. If I want to make our workers equal, I guess
changing them all to the default sounds sane, right? Do you know if the
default is different for buster and bullseye? If so, does it make sense
to already go with the bullseye default?

Paul





Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-24 Thread Aurelien Jarno
On 2021-04-23 17:47, Paul Gevers wrote:
> Hi Aurelien,
> 
> On 23-04-2021 14:49, Aurelien Jarno wrote:
> > Nope, unfortunately it seems the mail didn't reach me or the mailing
> > list, maybe it was too big?
> 
> It did reach the BTS. I guess size may have been a factor, yes; the log
> can be picked up from the BTS.

Yes, I confirm it is archived in the BTS.

It appears that all the failures are related to containers. I have been
able to reproduce the issue with a bullseye kernel, which defaults to
kernel.unprivileged_userns_clone=1. It seems the autopkgtest runners
still use a buster kernel (at least in the case of this build log).

Could it be that kernel.unprivileged_userns_clone is enabled on some of
the runners? It doesn't seem to be the case for all the runners, as the
autopkgtest ran successfully for the latest glibc upload.

In any case, as it is reproducible with the bullseye kernel, this
definitely needs a fix.

Regards,
Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net




Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-23 Thread Paul Gevers
Hi Aurelien,

On 23-04-2021 14:49, Aurelien Jarno wrote:
> Nope, unfortunately it seems the mail didn't reach me or the mailing
> list, maybe it was too big?

It did reach the BTS. I guess size may have been a factor, yes; the log
can be picked up from the BTS.

Paul





Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-23 Thread Aurelien Jarno
On 2021-04-22 22:26, Paul Gevers wrote:
> Hi Aurelien,
> 
> On Mon, 22 Mar 2021 19:54:22 +0100 Paul Gevers  wrote:
> > Hi Aurelien,
> > 
> > On 21-03-2021 00:03, Aurelien Jarno wrote:
> > > Yes, could you please provide a full log? I am not able to reproduce the
> > > issue locally nor on barriere.d.o, so I have no idea what fails.
> > 
> > Please find attached a full log of a failure.
> > 
> > Please let me know if I need to try to get more info.
> 
> Did you see this reply? Did it help?

Nope, unfortunately it seems the mail didn't reach me or the mailing
list, maybe it was too big?

Regards,
Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net




Bug#985617: glibc: flaky autopkgtest on most architectures

2021-04-22 Thread Paul Gevers
Hi Aurelien,

On Mon, 22 Mar 2021 19:54:22 +0100 Paul Gevers  wrote:
> Hi Aurelien,
> 
> On 21-03-2021 00:03, Aurelien Jarno wrote:
> > Yes, could you please provide a full log? I am not able to reproduce the
> > issue locally nor on barriere.d.o, so I have no idea what fails.
> 
> Please find attached a full log of a failure.
> 
> Please let me know if I need to try to get more info.

Did you see this reply? Did it help?

Paul





Bug#985617: glibc: flaky autopkgtest on most architectures

2021-03-21 Thread Paul Gevers
Hi Aurelien,

On 21-03-2021 00:03, Aurelien Jarno wrote:
> Yes, could you please provide a full log? I am not able to reproduce the
> issue locally nor on barriere.d.o, so I have no idea what fails.

Of course, when you try to reproduce it, it doesn't happen. I had 5 runs
on arm64, which all succeeded. I'm wondering if this flakiness comes
from activity in parallel runs.

I'll try again tomorrow.

Paul





Bug#985617: glibc: flaky autopkgtest on most architectures

2021-03-20 Thread Aurelien Jarno
control: tag -1 + moreinfo

On 2021-03-20 21:05, Paul Gevers wrote:
> Source: glibc
> Version: 2.31-9
> Severity: serious
> Tags: sid bullseye
> X-Debbugs-CC: debian...@lists.debian.org
> User: debian...@lists.debian.org
> Usertags: flaky
> 
> Dear maintainer(s),
> 
> Your package has an autopkgtest, great. However, I looked into
> the history of your autopkgtest [1] and I noticed it has been failing
> regularly lately. Unfortunately, the log of glibc is so long that it gets
> truncated on the ci.d.n infrastructure, so I can't copy any useful log.

This looks suspicious, as there has been no new upload since the
toolchain freeze. Has anything changed in the autopkgtest infrastructure
lately?

> Because the unstable-to-testing migration software now blocks on
> regressions in testing, flaky tests, i.e. tests that flip between
> passing and failing without changes to the list of installed packages,
> are causing people unrelated to your package to spend time on these
> tests.
> 
> Please get in touch if you need a full log, I can try to generate one.

Yes, could you please provide a full log? I am not able to reproduce the
issue locally nor on barriere.d.o, so I have no idea what fails.

Thanks,
Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net




Processed: Re: Bug#985617: glibc: flaky autopkgtest on most architectures

2021-03-20 Thread Debian Bug Tracking System
Processing control commands:

> tag -1 + moreinfo
Bug #985617 [src:glibc] glibc: flaky autopkgtest on most architectures
Added tag(s) moreinfo.

-- 
985617: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985617
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#985617: glibc: flaky autopkgtest on most architectures

2021-03-20 Thread Paul Gevers
Source: glibc
Version: 2.31-9
Severity: serious
Tags: sid bullseye
X-Debbugs-CC: debian...@lists.debian.org
User: debian...@lists.debian.org
Usertags: flaky

Dear maintainer(s),

Your package has an autopkgtest, great. However, I looked into
the history of your autopkgtest [1] and I noticed it has been failing
regularly lately. Unfortunately, the log of glibc is so long that it gets
truncated on the ci.d.n infrastructure, so I can't copy any useful log.

Because the unstable-to-testing migration software now blocks on
regressions in testing, flaky tests, i.e. tests that flip between
passing and failing without changes to the list of installed packages,
are causing people unrelated to your package to spend time on these
tests.

Please get in touch if you need a full log, I can try to generate one.

Paul

[1] https://ci.debian.net/packages/g/glibc/testing/amd64/
https://ci.debian.net/packages/g/glibc/testing/arm64/
https://ci.debian.net/packages/g/glibc/testing/armhf/
https://ci.debian.net/packages/g/glibc/testing/i386/



