Bug#982459: mdadm examine corrupts host ext4

2022-12-05 Thread Diederik de Haas
Control: retitle -1 mdadm --examine in chroot without /dev mounted corrupts host's filesystem
Control: found -1 5.10.127-2
Control: fixed -1 5.18.2-1~bpo11+1

On Tuesday, 2 August 2022 11:03:09 CET Chris Hofstaedtler wrote:
> Control: reassign -1 src:linux

On 10 Feb 2021 14:29:52 +0100 Patrick Cernko  wrote:
> $MDADM --examine --scan --config=partitions
> 
> If I run this command in a chroot on a machine with md0 as host's root 
> filesystem WITHOUT mounting /proc, /sys and /dev in the chroot, mdadm 
> CORRUPTS the host's root filesystem (/dev/md0 with ext4 filesystem 
> format). I can reproduce this problem every time I do this. 
> 
> Kernel: Linux 5.4.78.1.amd64-smp (SMP w/4 CPU cores)
> Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_USER, 
> TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE

Patrick: AFAICT, that is not a Debian (provided) kernel.
Are or were you able to reproduce this issue with a Debian kernel?
If so, which (exact) version?

> * Håkan T Johansson  [220801 19:31]:
> > On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:
> > > I can't see a difference that should matter from userspace.
> > > 
> > > I have stared a bit at the kernel code... there have been quite some
> > > changes and fixes in this area. Which kernel version were you
> > > running when testing this?
> > > 
> > > Could you retry on something >= 5.9? I.e. some version with patch
> > > 08fc1ab6d748ab1a690fd483f41e2938984ce353.
> > 
> > I believe that I was running 5.10 (bullseye).

Håkan: IIUC, the bug occurs with the 5.10.127-2 kernel.
If you try it with the most recent 5.10 kernel, does the issue still occur?
If we have a 'good' and a 'bad' 5.10 kernel, that would make it easier to
narrow down in which commit it was fixed.

> > It looks like 5.18 (from backports) does not show the issue!  (i.e. works)
> > 
> > host:
> > linux-image-5.18.0-0.bpo.1-amd64  5.18.2-1~bpo11+1
> > 
> > [bug still occurs with]
> > host:
> > linux-image-5.10.0-16-amd64   5.10.127-2

Updated the bug accordingly.

> > This time I did get some dmesg BUG output as well (attached).

For reference [dmesg 1]:
[mån aug  1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 
0010
[mån aug  1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug  1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug  1 15:53:08 2022] PGD 0 P4D 0 
[mån aug  1 15:53:08 2022] Oops:  [#1] SMP PTI
[mån aug  1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P   
OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug  1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 
2850/0T7971, BIOS A04 09/22/2005
[mån aug  1 15:53:08 2022] RIP: 0010:__ext4_journal_get_write_access+0x29/0x120 
[ext4]
[mån aug  1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 
41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 
30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug  1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug  1 15:53:08 2022] RAX:  RBX: 9d1b94505480 RCX: 
9d1bc52e5e38
[mån aug  1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: 
c096feb0
[mån aug  1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 
0001
[mån aug  1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 
9d1bc13782d8
[mån aug  1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 
9d1bc13782d8
[mån aug  1 15:53:08 2022] FS:  7fed5ecb1840() 
GS:9d1cd7c8() knlGS:
[mån aug  1 15:53:08 2022] CS:  0010 DS:  ES:  CR0: 80050033
[mån aug  1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 
06e0
[mån aug  1 15:53:08 2022] Call Trace:
[mån aug  1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug  1 15:53:08 2022]  ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug  1 15:53:08 2022]  evict+0xd1/0x1a0
[mån aug  1 15:53:08 2022]  __dentry_kill+0xe4/0x180
[mån aug  1 15:53:08 2022]  dput+0x149/0x2f0
[mån aug  1 15:53:08 2022]  __fput+0xe4/0x240
[mån aug  1 15:53:08 2022]  task_work_run+0x65/0xa0
[mån aug  1 15:53:08 2022]  exit_to_user_mode_prepare+0x111/0x120
[mån aug  1 15:53:08 2022]  syscall_exit_to_user_mode+0x28/0x140
[mån aug  1 15:53:08 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug  1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77
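
To compare backtraces from separate oopses, the function names under
"Call Trace:" can be pulled out of the dmesg text. The following is only a
sketch: the regular expression is an assumption based on the line format of
the trace quoted above, and the helper name call_trace_symbols is made up
for illustration.

```python
import re

# Matches frame lines such as
#   "[mån aug  1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]"
# Assumption: every frame has the form "symbol+0xOFFSET/0xSIZE" after the
# bracketed timestamp, optionally preceded by "?" for unreliable frames.
FRAME_RE = re.compile(
    r"^\[[^\]]*\]\s+\??\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace_symbols(dmesg_text):
    """Return the function names listed after the first 'Call Trace:' line."""
    symbols = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if not in_trace:
            if "Call Trace:" in line:
                in_trace = True
            continue
        match = FRAME_RE.match(line)
        if match is None:
            break  # the first non-frame line ends the trace
        symbols.append(match.group(1))
    return symbols
```

Running this on the dmesg output of two occurrences and comparing the
returned lists shows quickly whether both crashes went through the same
code path.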

> > I also noticed that the BUG: report in dmesg does not happen directly
> > when doing 'mdadm --examine --scan --config=partitions'.  It rather
> > occurs when some activity happens on the host filesystem, e.g.
> > a 'touch /root/a' command.
> > 
> > I have tried with both kernels several times, and it was repeatable that
> > 5.10 got stuck while 5.18 does not show issues.

Repeatable is good :-)
If you have a minimal set of steps to reproduce the issue, can you share that?

> If you have the time, maybe trying the various kernel 

Bug#982459: mdadm examine corrupts host ext4

2022-08-02 Thread Chris Hofstaedtler
Control: reassign -1 src:linux

Dear Håkan,

thanks for reporting back and testing!

* Håkan T Johansson  [220801 19:31]:
> On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:
> 
> > I can't see a difference that should matter from userspace.
> > 
> > I have stared a bit at the kernel code... there have been quite some
> > changes and fixes in this area. Which kernel version were you
> > running when testing this?
> > 
> > Could you retry on something >= 5.9? I.e. some version with patch
> >   08fc1ab6d748ab1a690fd483f41e2938984ce353.
> 
> I believe that I was running 5.10 (bullseye).
> 
> It looks like 5.18 (from backports) does not show the issue!  (i.e. works)

Okay, I think we are now clearly in "this is not an mdadm bug per
se" territory (-> reassigning to src:linux).

[..]
>   This time I did get some dmesg BUG output as well (attached).
>   It does not seem to be the same backtrace on two occurrences.
> 
>   I also noticed that the BUG: report in dmesg does not happen directly
>   when doing 'mdadm --examine --scan --config=partitions'.  It rather
>   occurs when some activity happens on the host filesystem, e.g.
>   a 'touch /root/a' command.
> 
> host:
>   linux-image-5.18.0-0.bpo.1-amd64  5.18.2-1~bpo11+1
> 
>   (did not re-install anything else, except upgraded zfs, also from
>   backports (since pure bullseye would not compile with 5.18))
> 
>   Does not exhibit the problem.
> 
> I have tried with both kernels several times, and it was repeatable that
> 5.10 got stuck while 5.18 does not show issues.

It's good that this now works in 5.18. However I'm not sure how we
should find the commit fixing this - in 5.14 lots of block layer
code was shuffled around/refactored.

If you have the time, maybe trying the various kernel versions
between 5.10 and 5.18 would be a good start. If they are not in
backports anymore, they should still be at
  http://snapshot.debian.org/package/linux/
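
Trying the versions in between does not have to be linear. A minimal
sketch of a bisection over an ordered list of kernel versions, assuming the
oldest entry shows the bug, the newest does not, and the behaviour changes
exactly once in between. The function name first_fixed and the is_fixed
callback are hypothetical; in practice is_fixed means booting that kernel
and running the test by hand.

```python
def first_fixed(versions, is_fixed):
    """Return the first entry of `versions` for which is_fixed() is true.

    Assumes versions[0] is known bad, versions[-1] is known good, and
    there is a single bad-to-good transition somewhere in between.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] bad, versions[hi] good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(versions[mid]):
            hi = mid  # the fix landed at mid or earlier
        else:
            lo = mid  # the fix landed after mid
    return versions[hi]
```

With n candidate versions this needs about log2(n) test boots instead of
n, which matters when every test can damage the root filesystem.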

> Reminder: to get the issue, /dev/ should not be mounted in the chroot.
> With /dev/ mounted, 5.10 also works.

I'll see if I can repro this on 5.10, but need to find a box first.

Best,
Chris


Bug#982459: mdadm examine corrupts host ext4

2022-08-01 Thread Håkan T Johansson


On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:


I can't see a difference that should matter from userspace.

I have stared a bit at the kernel code... there have been quite some
changes and fixes in this area. Which kernel version were you
running when testing this?

Could you retry on something >= 5.9? I.e. some version with patch
   08fc1ab6d748ab1a690fd483f41e2938984ce353.


Dear Chris,

I believe that I was running 5.10 (bullseye).

It looks like 5.18 (from backports) does not show the issue!  (i.e. works)

Some more details:

I have now tried again:

host:
  linux-image-5.10.0-16-amd64   5.10.127-2
  mdadm 4.2-1~bpo11+1
chroot:
  mdadm 4.1-11


  This time I did get some dmesg BUG output as well (attached).
  It does not seem to be the same backtrace on two occurrences.

  I also noticed that the BUG: report in dmesg does not happen directly
  when doing 'mdadm --examine --scan --config=partitions'.  It rather
  occurs when some activity happens on the host filesystem, e.g.
  a 'touch /root/a' command.

host:
  linux-image-5.18.0-0.bpo.1-amd64  5.18.2-1~bpo11+1

  (did not re-install anything else, except upgraded zfs, also from
  backports (since pure bullseye would not compile with 5.18))

  Does not exhibit the problem.

I have tried with both kernels several times, and it was repeatable that 
5.10 got stuck while 5.18 does not show issues.


Reminder: to get the issue, /dev/ should not be mounted in the chroot.
With /dev/ mounted, 5.10 also works.
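
Since the problem only shows up without /dev mounted, a script running
inside the chroot could check for that before calling mdadm at all. A
minimal sketch, assuming that a missing mount point is the condition to
guard against; the function name unmounted_pseudo_filesystems is made up
for illustration, and os.path.ismount is the only library call used.

```python
import os


def unmounted_pseudo_filesystems(root="/", names=("proc", "sys", "dev")):
    """Return those of root/proc, root/sys and root/dev that are not mount points.

    os.path.ismount() also returns False for paths that do not exist,
    so a missing directory is reported the same way as an unmounted one.
    """
    return [
        name
        for name in names
        if not os.path.ismount(os.path.join(root, name))
    ]
```

Called with the default root="/" from inside the chroot, an empty list
means all three are mounted; anything else is a reason to stop before
running 'mdadm --examine --scan --config=partitions'.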

Best regards,
Håkan

[mån aug  1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 0010
[mån aug  1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug  1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug  1 15:53:08 2022] PGD 0 P4D 0 
[mån aug  1 15:53:08 2022] Oops:  [#1] SMP PTI
[mån aug  1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P   
OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug  1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 
2850/0T7971, BIOS A04 09/22/2005
[mån aug  1 15:53:08 2022] RIP: 
0010:__ext4_journal_get_write_access+0x29/0x120 [ext4]
[mån aug  1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 
41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 
30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug  1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug  1 15:53:08 2022] RAX:  RBX: 9d1b94505480 RCX: 
9d1bc52e5e38
[mån aug  1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: 
c096feb0
[mån aug  1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 
0001
[mån aug  1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 
9d1bc13782d8
[mån aug  1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 
9d1bc13782d8
[mån aug  1 15:53:08 2022] FS:  7fed5ecb1840() 
GS:9d1cd7c8() knlGS:
[mån aug  1 15:53:08 2022] CS:  0010 DS:  ES:  CR0: 80050033
[mån aug  1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 
06e0
[mån aug  1 15:53:08 2022] Call Trace:
[mån aug  1 15:53:08 2022]  ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug  1 15:53:08 2022]  ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug  1 15:53:08 2022]  evict+0xd1/0x1a0
[mån aug  1 15:53:08 2022]  __dentry_kill+0xe4/0x180
[mån aug  1 15:53:08 2022]  dput+0x149/0x2f0
[mån aug  1 15:53:08 2022]  __fput+0xe4/0x240
[mån aug  1 15:53:08 2022]  task_work_run+0x65/0xa0
[mån aug  1 15:53:08 2022]  exit_to_user_mode_prepare+0x111/0x120
[mån aug  1 15:53:08 2022]  syscall_exit_to_user_mode+0x28/0x140
[mån aug  1 15:53:08 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug  1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77
[mån aug  1 15:53:08 2022] Code: 44 00 00 48 8b 15 19 a1 0c 00 f7 d8 64 89 02 
b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 
05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 e9 a0 0c 00 f7 d8 64 89 02 b8
[mån aug  1 15:53:08 2022] RSP: 002b:7ffd50452818 EFLAGS: 0202 
ORIG_RAX: 0003
[mån aug  1 15:53:08 2022] RAX:  RBX: 55dab4578910 RCX: 
7fed5eea2d77
[mån aug  1 15:53:08 2022] RDX: 7fed5ef6e8a0 RSI:  RDI: 
0006
[mån aug  1 15:53:08 2022] RBP:  R08:  R09: 
7fed5ef6dbe0
[mån aug  1 15:53:08 2022] R10: 006f R11: 0202 R12: 
7fed5ef6f4a0
[mån aug  1 15:53:08 2022] R13:  R14:  R15: 
0001
[mån aug  1 15:53:08 2022] Modules linked in: msr autofs4 nfsd auth_rpcgss 
nfsv3 nfs_acl nfs lockd grace sunrpc nfs_ssc fscache xt_mac xt_length xt_recent 
xt_multiport xt_tcpudp xt_state xt_conntrack 

Bug#982459: mdadm examine corrupts host ext4

2022-07-30 Thread Chris Hofstaedtler
Hi Håkan,

* Håkan T Johansson  [220730 23:43]:
> I have now tried with the mdadm 4.2~rc2-2 installed in both the chroot
> environment (tried only that first), and also the host system.
> Unfortunately, the host / fs is still affected when running
> 'update-initramfs -u', when /dev is not mounted.
[..]

> is kind of readable, though, then I'm lost.

I can't see a difference that should matter from userspace.

I have stared a bit at the kernel code... there have been quite some
changes and fixes in this area. Which kernel version were you
running when testing this?

Could you retry on something >= 5.9? I.e. some version with patch
08fc1ab6d748ab1a690fd483f41e2938984ce353.

Thanks,
Chris