Bug#982459: mdadm examine corrupts host ext4
Control: retitle -1 mdadm --examine in chroot without /dev mounted corrupts host's filesystem
Control: found -1 5.10.127-2
Control: fixed -1 5.18.2-1~bpo11+1

On Tuesday, 2 August 2022 11:03:09 CET Chris Hofstaedtler wrote:
> Control: reassign -1 src:linux

On 10 Feb 2021 14:29:52 +0100 Patrick Cernko wrote:
> $MDADM --examine --scan --config=partitions
>
> If I run this command in a chroot on a machine with md0 as host's root
> filesystem WITHOUT mounting /proc, /sys and /dev in the chroot, mdadm
> CORRUPTS the host's root filesystem (/dev/md0 with ext4 filesystem
> format). I can reproduce this problem every time I do this.
>
> Kernel: Linux 5.4.78.1.amd64-smp (SMP w/4 CPU cores)
> Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_USER,
> TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE

Patrick: AFAICT, that is not a Debian (provided) kernel.
Are or were you able to reproduce this issue with a Debian kernel?
If so, which (exact) version?

> * Håkan T Johansson [220801 19:31]:
> > On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:
> > > I can't see a difference that should matter from userspace.
> > >
> > > I have stared a bit at the kernel code... there have been quite some
> > > changes and fixes in this area. Which kernel version were you
> > > running when testing this?
> > >
> > > Could you retry on something >= 5.9? I.e. some version with patch
> > > 08fc1ab6d748ab1a690fd483f41e2938984ce353.
> >
> > I believe that I was running 5.10 (bullseye).

Håkan: IIUC, the bug occurs with the 5.10.127-2 kernel. If you try it with
the most recent 5.10 kernel, does the issue still occur?
If we have a 'good' and a 'bad' 5.10 kernel, that would make it easier to
narrow down in which commit it was fixed.

> > It looks like 5.18 (from backports) does not show the issue! (i.e. works)
> >
> > host:
> > linux-image-5.18.0-0.bpo.1-amd64 5.18.2-1~bpo11+1
> >
> > [bug still occurs with]
> > host:
> > linux-image-5.10.0-16-amd64 5.10.127-2

Updated the bug accordingly.
> > This time I did get some dmesg BUG output as well (attached).

For reference [dmesg 1]:
[mån aug 1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 0010
[mån aug 1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug 1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug 1 15:53:08 2022] PGD 0 P4D 0
[mån aug 1 15:53:08 2022] Oops: [#1] SMP PTI
[mån aug 1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug 1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 2850/0T7971, BIOS A04 09/22/2005
[mån aug 1 15:53:08 2022] RIP: 0010:__ext4_journal_get_write_access+0x29/0x120 [ext4]
[mån aug 1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug 1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug 1 15:53:08 2022] RAX: RBX: 9d1b94505480 RCX: 9d1bc52e5e38
[mån aug 1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: c096feb0
[mån aug 1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 0001
[mån aug 1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 9d1bc13782d8
[mån aug 1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 9d1bc13782d8
[mån aug 1 15:53:08 2022] FS: 7fed5ecb1840() GS:9d1cd7c8() knlGS:
[mån aug 1 15:53:08 2022] CS: 0010 DS: ES: CR0: 80050033
[mån aug 1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 06e0
[mån aug 1 15:53:08 2022] Call Trace:
[mån aug 1 15:53:08 2022] ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug 1 15:53:08 2022] ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug 1 15:53:08 2022] evict+0xd1/0x1a0
[mån aug 1 15:53:08 2022] __dentry_kill+0xe4/0x180
[mån aug 1 15:53:08 2022] dput+0x149/0x2f0
[mån aug 1 15:53:08 2022] __fput+0xe4/0x240
[mån aug 1 15:53:08 2022] task_work_run+0x65/0xa0
[mån aug 1 15:53:08 2022] exit_to_user_mode_prepare+0x111/0x120
[mån aug 1 15:53:08 2022] syscall_exit_to_user_mode+0x28/0x140
[mån aug 1 15:53:08 2022] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug 1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77

> > I also noticed that the BUG: report in dmesg does not happen directly
> > when doing 'mdadm --examine --scan --config=partitions'. It rather
> > occurs when some activity happens on the host filesystem, e.g.
> > a 'touch /root/a' command.
> >
> > I have tried with both kernels several times, and it was repeatable that
> > 5.10 got stuck while 5.18 does not show issues.

Repeatable is good :-)
If you have a minimal set of steps to reproduce the issue, can you share that?

> If you have the time, maybe trying the various kernel
Bug#982459: mdadm examine corrupts host ext4
Control: reassign -1 src:linux

Dear Håkan,

thanks for reporting back and testing!

* Håkan T Johansson [220801 19:31]:
> On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:
>
> > I can't see a difference that should matter from userspace.
> >
> > I have stared a bit at the kernel code... there have been quite some
> > changes and fixes in this area. Which kernel version were you
> > running when testing this?
> >
> > Could you retry on something >= 5.9? I.e. some version with patch
> > 08fc1ab6d748ab1a690fd483f41e2938984ce353.
>
> I believe that I was running 5.10 (bullseye).
>
> It looks like 5.18 (from backports) does not show the issue! (i.e. works)

Okay, I think we are now clearly in "this is not an mdadm bug per se"
territory (-> reassigning to src:linux).

[..]
> This time I did get some dmesg BUG output as well (attached).
> It does not seem to be the same backtrace on two occurrences.
>
> I also noticed that the BUG: report in dmesg does not happen directly
> when doing 'mdadm --examine --scan --config=partitions'. It rather
> occurs when some activity happens on the host filesystem, e.g.
> a 'touch /root/a' command.
>
> host:
> linux-image-5.18.0-0.bpo.1-amd64 5.18.2-1~bpo11+1
>
> (did not re-install anything else, except upgraded zfs, also from
> backports (since pure bullseye would not compile with 5.18))
>
> Does not exhibit the problem.
>
> I have tried with both kernels several times, and it was repeatable that
> 5.10 got stuck while 5.18 does not show issues.

It's good that this now works in 5.18. However, I'm not sure how we
should find the commit fixing this - in 5.14 lots of block layer
code was shuffled around/refactored.

If you have the time, maybe trying the various kernel versions
between 5.10 and 5.18 would be a good start. If they are not in
backports anymore, they should still be at
http://snapshot.debian.org/package/linux/

> Reminder: to get the issue, /dev/ should not be mounted in the chroot.
> With /dev/ mounted, 5.10 also works.
I'll see if I can repro this on 5.10, but need to find a box first.

Best,
Chris
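
The suggestion above to try kernel versions between 5.10 and 5.18 amounts to a bisection over an ordered list of kernels. A minimal sketch in Python, assuming the bug was fixed at a single point and not reintroduced afterwards. The function name `first_good`, the `is_bad` callback and the candidate list are hypothetical placeholders; only the two endpoints (5.10 bad, 5.18 good) come from this thread, and in practice `is_bad` means booting that kernel and running the reproducer by hand.

```python
from typing import Callable, Sequence


def first_good(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is False.

    Assumes versions[0] is known bad, versions[-1] is known good, and
    that every version before the fix is bad and every later one good.
    """
    lo, hi = 0, len(versions) - 1  # lo: known bad, hi: known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            lo = mid  # fix landed after mid
        else:
            hi = mid  # fix landed at or before mid
    return versions[hi]


# Hypothetical candidate list; which packages actually exist on
# snapshot.debian.org has to be checked separately.
candidates = ["5.10", "5.11", "5.12", "5.13", "5.14", "5.15", "5.16", "5.17", "5.18"]
```

With nine candidates this needs at most three test boots between the two known endpoints, which matters when each test risks damaging the host filesystem.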
Bug#982459: mdadm examine corrupts host ext4
On Sun, 31 Jul 2022, Chris Hofstaedtler wrote:

> I can't see a difference that should matter from userspace.
>
> I have stared a bit at the kernel code... there have been quite some
> changes and fixes in this area. Which kernel version were you
> running when testing this?
>
> Could you retry on something >= 5.9? I.e. some version with patch
> 08fc1ab6d748ab1a690fd483f41e2938984ce353.

Dear Chris,

I believe that I was running 5.10 (bullseye).

It looks like 5.18 (from backports) does not show the issue! (i.e. works)

Some more details:

I have now tried again:

host:
linux-image-5.10.0-16-amd64 5.10.127-2
mdadm 4.2-1~bpo11+1

chroot:
mdadm 4.1-11

Some more details:

This time I did get some dmesg BUG output as well (attached).
It does not seem to be the same backtrace on two occurrences.

I also noticed that the BUG: report in dmesg does not happen directly
when doing 'mdadm --examine --scan --config=partitions'. It rather
occurs when some activity happens on the host filesystem, e.g.
a 'touch /root/a' command.

host:
linux-image-5.18.0-0.bpo.1-amd64 5.18.2-1~bpo11+1

(did not re-install anything else, except upgraded zfs, also from
backports (since pure bullseye would not compile with 5.18))

Does not exhibit the problem.

I have tried with both kernels several times, and it was repeatable that
5.10 got stuck while 5.18 does not show issues.

Reminder: to get the issue, /dev/ should not be mounted in the chroot.
With /dev/ mounted, 5.10 also works.

Best regards,
Håkan

[mån aug 1 15:53:08 2022] BUG: kernel NULL pointer dereference, address: 0010
[mån aug 1 15:53:08 2022] #PF: supervisor read access in kernel mode
[mån aug 1 15:53:08 2022] #PF: error_code(0x) - not-present page
[mån aug 1 15:53:08 2022] PGD 0 P4D 0
[mån aug 1 15:53:08 2022] Oops: [#1] SMP PTI
[mån aug 1 15:53:08 2022] CPU: 2 PID: 284256 Comm: cron Tainted: P OE 5.10.0-16-amd64 #1 Debian 5.10.127-2
[mån aug 1 15:53:08 2022] Hardware name: Dell Computer Corporation PowerEdge 2850/0T7971, BIOS A04 09/22/2005
[mån aug 1 15:53:08 2022] RIP: 0010:__ext4_journal_get_write_access+0x29/0x120 [ext4]
[mån aug 1 15:53:08 2022] Code: 00 0f 1f 44 00 00 41 57 41 56 41 89 f6 41 55 41 54 49 89 d4 55 48 89 cd 53 48 83 ec 10 48 89 3c 24 e8 ab d7 bb e1 48 8b 45 30 <4c> 8b 78 10 4d 85 ff 74 2f 49 8b 87 e0 00 00 00 49 8b 9f 88 03 00
[mån aug 1 15:53:08 2022] RSP: 0018:ae27c059fd60 EFLAGS: 00010246
[mån aug 1 15:53:08 2022] RAX: RBX: 9d1b94505480 RCX: 9d1bc52e5e38
[mån aug 1 15:53:08 2022] RDX: 9d1bc13782d8 RSI: 0c14 RDI: c096feb0
[mån aug 1 15:53:08 2022] RBP: 9d1bc52e5e38 R08: 9d1be04d5230 R09: 0001
[mån aug 1 15:53:08 2022] R10: 9d1bc985f000 R11: 001d R12: 9d1bc13782d8
[mån aug 1 15:53:08 2022] R13: 9d1be04d5000 R14: 0c14 R15: 9d1bc13782d8
[mån aug 1 15:53:08 2022] FS: 7fed5ecb1840() GS:9d1cd7c8() knlGS:
[mån aug 1 15:53:08 2022] CS: 0010 DS: ES: CR0: 80050033
[mån aug 1 15:53:08 2022] CR2: 0010 CR3: 0001a46d8000 CR4: 06e0
[mån aug 1 15:53:08 2022] Call Trace:
[mån aug 1 15:53:08 2022] ext4_orphan_del+0x23f/0x290 [ext4]
[mån aug 1 15:53:08 2022] ext4_evict_inode+0x31f/0x630 [ext4]
[mån aug 1 15:53:08 2022] evict+0xd1/0x1a0
[mån aug 1 15:53:08 2022] __dentry_kill+0xe4/0x180
[mån aug 1 15:53:08 2022] dput+0x149/0x2f0
[mån aug 1 15:53:08 2022] __fput+0xe4/0x240
[mån aug 1 15:53:08 2022] task_work_run+0x65/0xa0
[mån aug 1 15:53:08 2022] exit_to_user_mode_prepare+0x111/0x120
[mån aug 1 15:53:08 2022] syscall_exit_to_user_mode+0x28/0x140
[mån aug 1 15:53:08 2022] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[mån aug 1 15:53:08 2022] RIP: 0033:0x7fed5eea2d77
[mån aug 1 15:53:08 2022] Code: 44 00 00 48 8b 15 19 a1 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 e9 a0 0c 00 f7 d8 64 89 02 b8
[mån aug 1 15:53:08 2022] RSP: 002b:7ffd50452818 EFLAGS: 0202 ORIG_RAX: 0003
[mån aug 1 15:53:08 2022] RAX: RBX: 55dab4578910 RCX: 7fed5eea2d77
[mån aug 1 15:53:08 2022] RDX: 7fed5ef6e8a0 RSI: RDI: 0006
[mån aug 1 15:53:08 2022] RBP: R08: R09: 7fed5ef6dbe0
[mån aug 1 15:53:08 2022] R10: 006f R11: 0202 R12: 7fed5ef6f4a0
[mån aug 1 15:53:08 2022] R13: R14: R15: 0001
[mån aug 1 15:53:08 2022] Modules linked in: msr autofs4 nfsd auth_rpcgss nfsv3 nfs_acl nfs lockd grace sunrpc nfs_ssc fscache xt_mac xt_length xt_recent xt_multiport xt_tcpudp xt_state xt_conntrack
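
Since the report notes that the backtrace is not the same on two occurrences, it may help to reduce each oops to its list of frames before comparing them. A rough sketch in Python, assuming frames have the `symbol+0xoff/0xsize [module]` shape seen in the log above. The names `FRAME` and `call_trace` are illustrative and not part of any existing tool.

```python
import re
from typing import List, Optional, Tuple

# Matches frames such as "ext4_orphan_del+0x23f/0x290 [ext4]" or
# "evict+0xd1/0x1a0"; the module suffix is optional.
FRAME = re.compile(
    r"\b([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:[ \t]+\[(\w+)\])?"
)


def call_trace(dmesg_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (symbol, module) pairs found after the 'Call Trace:' marker."""
    _, marker, tail = dmesg_text.partition("Call Trace:")
    if not marker:
        return []  # no trace in this text
    return [(m.group(1), m.group(2)) for m in FRAME.finditer(tail)]
```

Two oopses can then be compared by comparing the returned lists, which ignores timestamps and register contents.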
Bug#982459: mdadm examine corrupts host ext4
Hi Håkan,

* Håkan T Johansson [220730 23:43]:
> I have now tried with the mdadm 4.2~rc2-2 installed in both the chroot
> environment (tried only that first), and also the host system.
> Unfortunately, the host / fs is still affected when running
> 'update-initramfs -u', when /dev is not mounted.
[..]
> is kind of readable, though, then I'm lost.

I can't see a difference that should matter from userspace.

I have stared a bit at the kernel code... there have been quite some
changes and fixes in this area. Which kernel version were you
running when testing this?

Could you retry on something >= 5.9? I.e. some version with patch
08fc1ab6d748ab1a690fd483f41e2938984ce353.

Thanks,
Chris
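
Because the problem is only reported when /dev (or, in the original report, /proc, /sys and /dev) is not mounted inside the chroot, a wrapper could check for those mounts before running mdadm or update-initramfs there. This is a precaution sketch in Python, not a fix for the underlying kernel issue. The names `missing_mounts` and `REQUIRED` are hypothetical, and the parsing assumes the usual whitespace-separated layout of /proc/self/mounts without escaped spaces in paths.

```python
import posixpath
from typing import List

# Directories the reports say must be mounted inside the chroot.
REQUIRED = ("dev", "proc", "sys")


def missing_mounts(chroot: str, mounts_text: str) -> List[str]:
    """List which of <chroot>/dev, /proc and /sys are not mount points.

    mounts_text is the content of /proc/self/mounts as read on the host.
    """
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Second field is the mount point.
            mounted.add(posixpath.normpath(fields[1]))
    root = posixpath.normpath(chroot)
    return [name for name in REQUIRED if posixpath.join(root, name) not in mounted]
```

A caller would read /proc/self/mounts on the host, pass the chroot path, and refuse to continue if the returned list is non-empty.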