bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-17 Thread Mark H Weaver
Marius Bakke  writes:

> Leo Famulari  writes:
>
>> On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
>>> Next, it probably makes sense to test 5.8.9 with the above commit
>>> reverted.  Would someone like to try it?  If that works, we can avoid
>>> the bisection, resume kernel updates, and revert this change across all
>>> of our kernels until a better solution is found.
>>
>> I'll do this tonight.
>
> A fix for this is available in mainline:
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ce2a530cc9865a894454b2e40eba5957a60e1a
>
> (via )

Thank you, Marius!  I see that this patch has been included in the
recently released linux-libre-5.8.10.  Assuming that it fixes our
problem (which I suspect it does), we can simply update our kernels and
close this bug.

Leo, would you like to update our kernels?  I won't be able to get to it
today, and I'm looking to offload that job anyway.

  Mark





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-17 Thread Danny Milosavljevic
Hi Leo,

On Thu, 17 Sep 2020 10:05:44 -0400
Leo Famulari  wrote:

> On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
> > > On Wed, Sep 16, 2020 at 04:52:45PM +0200, Danny Milosavljevic wrote:  
> > >> commit 692d0626557451c4b557397f20b7394b612d0289
> > >> Author: Christoph Hellwig 
> > >> Date:   Tue Sep 1 11:59:41 2020 +0200
> > >> 
> > >> block: fix locking in bdev_del_partition  
> 
> > Next, it probably makes sense to test 5.8.9 with the above commit
> > reverted.  Would someone like to try it?  If that works, we can avoid
> > the bisection, resume kernel updates, and revert this change across all
> > of our kernels until a better solution is found.  
> 
> Using linux-libre 5.8.9 with that commit reverted, `guix system vm` does
> work again.

Thanks for testing it!

What the commit does is lock the drive whose partition table is being
modified.  That sounds like a good idea in general.  But parted does not
expect the error code ENOMEM and thus misjudges the situation.

See also:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ce2a530cc9865a894454b2e40eba5957a60e1a

for the fix that goes into Linux 5.9.

We could also patch parted to also accept ENOMEM in addition to ENXIO if we
wanted to.  I wouldn't (it diverges from what upstream is doing--too much work).


pgp54szCUvtu9.pgp
Description: OpenPGP digital signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-17 Thread Leo Famulari
On Thu, Sep 17, 2020 at 10:40:07AM +0200, Marius Bakke wrote:
> Leo Famulari  writes:
> 
> > On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
> >> Next, it probably makes sense to test 5.8.9 with the above commit
> >> reverted.  Would someone like to try it?  If that works, we can avoid
> >> the bisection, resume kernel updates, and revert this change across all
> >> of our kernels until a better solution is found.
> >
> > I'll do this tonight.
> 
> A fix for this is available in mainline:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ce2a530cc9865a894454b2e40eba5957a60e1a
> 
> (via )

What is the recommended way to report this to the stable maintainers?


signature.asc
Description: PGP signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-17 Thread Leo Famulari
On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
> > On Wed, Sep 16, 2020 at 04:52:45PM +0200, Danny Milosavljevic wrote:
> >> commit 692d0626557451c4b557397f20b7394b612d0289
> >> Author: Christoph Hellwig 
> >> Date:   Tue Sep 1 11:59:41 2020 +0200
> >> 
> >> block: fix locking in bdev_del_partition

> Next, it probably makes sense to test 5.8.9 with the above commit
> reverted.  Would someone like to try it?  If that works, we can avoid
> the bisection, resume kernel updates, and revert this change across all
> of our kernels until a better solution is found.

Using linux-libre 5.8.9 with that commit reverted, `guix system vm` does
work again.





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-17 Thread Marius Bakke
Leo Famulari  writes:

> On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
>> Next, it probably makes sense to test 5.8.9 with the above commit
>> reverted.  Would someone like to try it?  If that works, we can avoid
>> the bisection, resume kernel updates, and revert this change across all
>> of our kernels until a better solution is found.
>
> I'll do this tonight.

A fix for this is available in mainline:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88ce2a530cc9865a894454b2e40eba5957a60e1a

(via )


signature.asc
Description: PGP signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Leo Famulari
On Wed, Sep 16, 2020 at 09:36:49PM -0400, Mark H Weaver wrote:
> Next, it probably makes sense to test 5.8.9 with the above commit
> reverted.  Would someone like to try it?  If that works, we can avoid
> the bisection, resume kernel updates, and revert this change across all
> of our kernels until a better solution is found.

I'll do this tonight.





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Mark H Weaver
Leo Famulari  writes:

> On Wed, Sep 16, 2020 at 04:52:45PM +0200, Danny Milosavljevic wrote:
>> commit 692d0626557451c4b557397f20b7394b612d0289
>> Author: Christoph Hellwig 
>> Date:   Tue Sep 1 11:59:41 2020 +0200
>> 
>> block: fix locking in bdev_del_partition
>> 
>> [ Upstream commit 08fc1ab6d748ab1a690fd483f41e2938984ce353 ]
>> 
>> We need to hold the whole device bd_mutex to protect against
>> other thread concurrently deleting out partition before we get
>> to it, and thus causing a use after free.
>> 
>> Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
>> Reported-by: syzbot+6448f3c229bc52b82...@syzkaller.appspotmail.com
>> Signed-off-by: Christoph Hellwig 
>> Signed-off-by: Jens Axboe 
>> Signed-off-by: Sasha Levin 
>> 
>> ?
>
> Do you think that's the faulty commit? I won't have time to test it
> today.

Looks like a good bet to me.  Many thanks to Leo for testing 5.8.9, and
to Danny for identifying the likely culprit.

Next, it probably makes sense to test 5.8.9 with the above commit
reverted.  Would someone like to try it?  If that works, we can avoid
the bisection, resume kernel updates, and revert this change across all
of our kernels until a better solution is found.

Mark





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Leo Famulari
On Wed, Sep 16, 2020 at 04:52:45PM +0200, Danny Milosavljevic wrote:
> commit 692d0626557451c4b557397f20b7394b612d0289
> Author: Christoph Hellwig 
> Date:   Tue Sep 1 11:59:41 2020 +0200
> 
> block: fix locking in bdev_del_partition
> 
> [ Upstream commit 08fc1ab6d748ab1a690fd483f41e2938984ce353 ]
> 
> We need to hold the whole device bd_mutex to protect against
> other thread concurrently deleting out partition before we get
> to it, and thus causing a use after free.
> 
> Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
> Reported-by: syzbot+6448f3c229bc52b82...@syzkaller.appspotmail.com
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 
> 
> ?

Do you think that's the faulty commit? I won't have time to test it
today.


signature.asc
Description: PGP signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Danny Milosavljevic
On Wed, 16 Sep 2020 16:52:45 +0200
Danny Milosavljevic  wrote:

> commit 692d0626557451c4b557397f20b7394b612d0289
> Author: Christoph Hellwig 
> Date:   Tue Sep 1 11:59:41 2020 +0200
> 
> block: fix locking in bdev_del_partition
> 
> [ Upstream commit 08fc1ab6d748ab1a690fd483f41e2938984ce353 ]
> 
> We need to hold the whole device bd_mutex to protect against
> other thread concurrently deleting out partition before we get
> to it, and thus causing a use after free.
> 
> Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
> Reported-by: syzbot+6448f3c229bc52b82...@syzkaller.appspotmail.com
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 
> 

int bdev_del_partition(struct block_device *bdev, int partno)
{
struct block_device *bdevp;
struct hd_struct *part = NULL;
int ret;

bdevp = bdget_disk(bdev->bd_disk, partno);
if (!bdevp)
return -ENOMEM; <--

...
}

struct block_device *bdget_disk(struct gendisk *disk, int partno)
{
struct hd_struct *part;
struct block_device *bdev = NULL;

part = disk_get_part(disk, partno);
if (part)
bdev = bdget(part_devt(part));
disk_put_part(part);

return bdev;
}

struct block_device *bdget(dev_t dev)
{
struct block_device *bdev;
struct inode *inode;

inode = iget5_locked(blockdev_superblock, hash(dev),
bdev_test, bdev_set, );

if (!inode)
return NULL; <
[...]
}


pgp2dJ7uUWHk3.pgp
Description: OpenPGP digital signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Danny Milosavljevic
commit 692d0626557451c4b557397f20b7394b612d0289
Author: Christoph Hellwig 
Date:   Tue Sep 1 11:59:41 2020 +0200

block: fix locking in bdev_del_partition

[ Upstream commit 08fc1ab6d748ab1a690fd483f41e2938984ce353 ]

We need to hold the whole device bd_mutex to protect against
other thread concurrently deleting out partition before we get
to it, and thus causing a use after free.

Fixes: cddae808aeb7 ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82...@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 

?


pgp9BQ2bhPlBt.pgp
Description: OpenPGP digital signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Leo Famulari
On Tue, Sep 15, 2020 at 07:06:28PM -0400, Leo Famulari wrote:
> I will try to reproduce the bug with 5.8.9 now. I will try the bisection
> if time permits.

It also fails with 5.8.9.





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-16 Thread Ludovic Courtès
Hi,

Danny Milosavljevic  skribis:

> environment variable `PATH' set to 
> `/gnu/store/j3jlpncfqvykkq6sx7h4ly1rdcr2a8qq'
> creating partition table with 2 partitions (20.0 MiB, 40.0 MiB)...
> Error: Partition(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
> 17, .

[...]

>   1. :
>   program: "parted"
>   arguments: ("--script" "/dev/vda" "mklabel" "msdos" "mkpart" "primary" 
> "e)
>   exit-status: 1
>   term-signal: #f
>   stop-signal: #f

The code in question in Parted:

--8<---cut here---start->8---
if (!add_partition (disk, part)) {
ok[i - 1] = 0;
errnums[i - 1] = errno;
}

[…]

char *bad_part_list = NULL;
/* now warn about any errors */
for (i = 1; i <= lpn; i++) {
if (ok[i - 1] || errnums[i - 1] == ENXIO)
continue;
if (bad_part_list == NULL) {
bad_part_list = malloc (lpn * 5);
if (!bad_part_list)
goto cleanup;
bad_part_list[0] = 0;
}
sprintf (bad_part_list + strlen (bad_part_list), "%d, ", i);
}
if (bad_part_list == NULL)
ret = 1;
else {
bad_part_list[strlen (bad_part_list) - 2] = 0;
if (ped_exception_throw (
PED_EXCEPTION_ERROR,
PED_EXCEPTION_IGNORE_CANCEL,
_("Partition(s) %s on %s have been written, but we have 
"
  "been unable to inform the kernel of the change, "
  "probably because it/they are in use.  As a result, "
  "the old partition(s) will remain in use.  You "
  "should reboot now before making further changes."),
bad_part_list, disk->dev->path) == PED_EXCEPTION_IGNORE)
ret = 1;
free (bad_part_list);
}
--8<---cut here---end--->8---

With the patch below, I strace’d ‘parted’, which gives:

--8<---cut here---start->8---
$ make check-system TESTS=basic

[…]

ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, 
length=0, pno=253, devname="", volname=""}}) = -1 ENOMEM (Cannot allocate 
memory)
ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, 
length=0, pno=254, devname="", volname=""}}) = -1 ENOMEM (Cannot allocate 
memory)
ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, 
length=0, pno=255, devname="", volname=""}}) = -1 ENOMEM (Cannot allocate 
memory)
ioctl(3, BLKPG, {op=BLKPG_DEL_PARTITION, flags=0, datalen=152, data={start=0, 
length=0, pno=256, devname="", volname=""}}) = -1 ENOMEM (Cannot allocate 
memory)
write(2, "Error", 5Error)= 5
write(2, ": ", 2: )   = 2
write(2, "Partition(s) 1, 2, 3, 4, 5, 6, 7"..., 495Partition(s) 1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64 on 
/dev/vda have been written, but we have been unable to inform the kernel of the 
change, probably because it/they are in use.  As a result, the old partition(s) 
will remain in use.  You should reboot now before making further changes.) = 495
write(2, "\n", 1
)   = 1
--8<---cut here---end--->8---

So I threw more virtual RAM at it:

diff --git a/gnu/system/vm.scm b/gnu/system/vm.scm
@@ -446,6 +450,7 @@ system that is passed to 'populate-root-file-system'."
  #:bootloader-installer
  #+(bootloader-installer bootloader)))
#:system system
+   #:memory-size 1024
#:make-disk-image? #t
#:disk-image-size disk-image-size
#:disk-image-format disk-image-format

… but that doesn’t help.

Ideas?

Thanks,
Ludo’.

diff --git a/gnu/build/vm.scm b/gnu/build/vm.scm
index 287d099f79..e793b5b518 100644
--- a/gnu/build/vm.scm
+++ b/gnu/build/vm.scm
@@ -297,7 +297,7 @@ actual /dev name based on DEVICE."
  partition-size)
 partitions)
", "))
-  (apply invoke "parted" "--script"
+  (apply invoke "strace" "parted" "--script"
  device "mklabel" label-type
  (options partitions offset))
 
diff --git a/gnu/packages/linux.scm b/gnu/packages/linux.scm
index 59ffb334e0..72fb3ca49d 100644
--- a/gnu/packages/linux.scm
+++ b/gnu/packages/linux.scm
@@ -349,15 +349,15 @@ corresponding UPSTREAM-SOURCE (an origin), using the given DEBLOB-SCRIPTS."
 
 

bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-15 Thread Leo Famulari
On Tue, Sep 15, 2020 at 06:34:21PM -0400, Mark H Weaver wrote:
> That's useful information, but we should not stay frozen on the 5.8.7
> kernel for much longer.  5.8.8 contains many bug fixes, some of which
> might fix potentially exploitable flaws.
> 
> It would be useful to know if this problem still occurs with 5.8.9,
> which has since come out.  If so, we should do a bisection between 5.8.7
> and 5.8.8 to find out which upstream commit introduced the problem.
> 
> Would anyone like to investigate this further?

I will try to reproduce the bug with 5.8.9 now. I will try the bisection
if time permits.





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-15 Thread Mark H Weaver
Hi,

Danny Milosavljevic  writes:

> On Mon, 14 Sep 2020 15:26:52 +0200
> Mathieu Othacehe  wrote:
>
>> Anything special with your hardware? KVM support disabled maybe?
>
> The culprit had been the Linux kernel update to 5.8.8.
>
> After downgrading to 5.8.7 it works just fine--no other changes done.
>
> Previously, I had tried also to add wipefs -a before the parted--that
> hadn't helped either.

That's useful information, but we should not stay frozen on the 5.8.7
kernel for much longer.  5.8.8 contains many bug fixes, some of which
might fix potentially exploitable flaws.

It would be useful to know if this problem still occurs with 5.8.9,
which has since come out.  If so, we should do a bisection between 5.8.7
and 5.8.8 to find out which upstream commit introduced the problem.

Would anyone like to investigate this further?

Mark





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-14 Thread Danny Milosavljevic
Hi Mathieu,

On Mon, 14 Sep 2020 15:26:52 +0200
Mathieu Othacehe  wrote:

> Anything special with your hardware? KVM support disabled maybe?

The culprit had been the Linux kernel update to 5.8.8.

After downgrading to 5.8.7 it works just fine--no other changes done.

Previously, I had tried also to add wipefs -a before the parted--that
hadn't helped either.


pgpTyZmxu0s3A.pgp
Description: OpenPGP digital signature


bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-14 Thread Mathieu Othacehe


Hello Danny,

>   1. :
>   program: "parted"
>   arguments: ("--script" "/dev/vda" "mklabel" "msdos" "mkpart" "primary" 
> "e)

So it looks like the parted script failed inside the VM, while running
"initialize-partition-table".

This work fine both on the build farm and on my machine, which makes the
debug harder :(.

Anything special with your hardware? KVM support disabled maybe?

Thanks,

Mathieu





bug#43344: "basic" system tests fail (and all the other ones) on guix master

2020-09-11 Thread Danny Milosavljevic
Hi,

as of guix master commit 0fb974be9c3e1e22a2145c9c602c44cd10cef2b0 all system
tests, including "basic", fail:

$ guix environment --pure guix --ad-hoc git guile-readline guile-json nano 
guile-zlib guile-lzlib
(env)$ make TESTS=basic check-system
loading '/gnu/store/s3limrgxj4pd6b4psra66phary2nmqx4-linux-vm-loader'...
[2.776050] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
environment variable `PATH' set to `/gnu/store/j3jlpncfqvykkq6sx7h4ly1rdcr2a8qq'
creating partition table with 2 partitions (20.0 MiB, 40.0 MiB)...
Error: Partition(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, .
Backtrace:
   3 (primitive-load "/gnu/store/liacrhaiab35pcfzdqsa71fvp7f?")
In ./gnu/build/vm.scm:
   470:21  2 (initialize-hard-disk "/dev/vda" #:bootloader-package _ ?)
300:2  1 (initialize-partition-table "/dev/vda" (#<@ build-log 29023 1
 @ build-log 29023 1
?@ build-log 29023 1
?@ build-log 29023 1
?@ build-log 29023 1
)@ build-log 29023 1
 @ build-log 29023 1
?@ build-log 29023 1
?@ build-log 29023 1
?@ build-log 29023 1
)@ build-log 29023 1
@ build-log 29023 1

In ./guix/build/utils.scm:
654:6  0 (invoke _ . _)

./guix/build/utils.scm:654:6: In procedure invoke:
ERROR:
  1. :
  program: "parted"
  arguments: ("--script" "/dev/vda" "mklabel" "msdos" "mkpart" "primary" "e)
  exit-status: 1
  term-signal: #f
  stop-signal: #f
[3.446031] Unregister pv shared memory for cpu 0
[3.447989] reboot: Restarting system
[3.449167] reboot: machine restart
Backtrace: 
   1 (primitive-load "/gnu/store/1xp94hddh88szwkbqyb49jirmf5?")
In ./gnu/build/vm.scm:
   198:12  0 (load-in-linux-vm _ #:output _ #:qemu _ #:memory-size _ ?)

./gnu/build/vm.scm:198:12: In procedure load-in-linux-vm:
guest VM code exited with a non-zero status 256
note: keeping build directory `/tmp/guix-build-qemu-image.drv-0'
builder for `/gnu/store/ks5w1bwcm2rg1fq5s9hc78grxpnm0wpf-qemu-image.drv' failed1
build of /gnu/store/ks5w1bwcm2rg1fq5s9hc78grxpnm0wpf-qemu-image.drv failed
View build log at '/var/log/guix/drvs/ks/5w1bwcm2rg1fq5s9hc78grxpnm0wpf-qemu-im.
cannot build derivation `/gnu/store/8vhwchcxlns64nnqr13p0liw745s6pp8-run-vm.sh.t
cannot build derivation `/gnu/store/672vqgm8mx0a6k2iwsf13q73gmk5vdc4-basic.drv't
guix build: error: build of `/gnu/store/672vqgm8mx0a6k2iwsf13q73gmk5vdc4-basic.d
make: *** [Makefile:6007: check-system] Error 1
dannym@bayfront ~/src/guix-master/guix [env]$

The same happens on my laptop.

./pre-inst-env guix describe
Git checkout:
  repository: /home/dannym/src/guix-master/guix
  branch: master
  commit: 0fb974be9c3e1e22a2145c9c602c44cd10cef2b0

(ssh://dan...@git.sv.gnu.org/srv/git/guix.git)

I've tried to add a call to wipefs before parted--it doesn't help.

Before the problem surfaced I saw a qemu major update.


pgpwa_KQkNr7l.pgp
Description: OpenPGP digital signature