[Bug 2075110] Re: md: nvme over tcp with a striped underlying md raid device leads to data corruption

Matthew Ruffell Wed, 07 Aug 2024 22:15:25 -0700

Performing verification for Noble.

I started a n2-standard-2 instance on Google cloud, running Noble.


I installed 6.8.0-39-generic from -updates, rebooted, and followed the 
instructions in the
testcase.

$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size)

Having a look at dmesg:

unknown: run blktests md/001 at 2024-08-08 04:26:39
root[1982]: run blktests md/001
kernel: brd: module loaded
(udev-worker)[1987]: dm-0: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/dm-0' failed with exit code 1.
kernel: Key type psk registered
kernel: nvmet: adding nsid 1 to subsystem blktests-subsystem-1
kernel: nvmet_tcp: enabling port 0 (127.0.0.1:4420)
kernel: nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for 
NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
kernel: nvme nvme1: creating 2 I/O queues.
kernel: nvme nvme1: mapped 2/0/0 default/read/poll queues.
kernel: nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, 
hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
(udev-worker)[2018]: nvme1n1: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/nvme1n1' failed with exit code 1.
(udev-worker)[2018]: md127: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/md127' failed with exit code 1.
kernel: md/raid1:md127: active with 1 out of 2 mirrors
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 50 at net/core/skbuff.c:6995 
skb_splice_from_iter+0x139/0x370
kernel: Modules linked in: nvme_tcp nvmet_tcp nvmet nvme_keyring brd raid1 
cfg80211 8021q garp mrp stp llc binfmt_misc nls_iso8859_1 intel_rapl_msr 
intel_rapl_common intel_uncore_frequency_common isst_if_common nfit 
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic 
ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl 
pvpanic_mmio pvpanic nvme psmouse i2c_piix4 input_leds mac_hid serio_raw 
dm_multipath nvme_fabrics nvme_core nvme_auth efi_pstore nfnetlink dmi_sysfs 
virtio_rng ip_tables x_tables autofs4
kernel: CPU: 0 PID: 50 Comm: kworker/0:1H Not tainted 6.8.0-39-generic 
#39-Ubuntu
kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 06/27/2024
kernel: Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
kernel: RIP: 0010:skb_splice_from_iter+0x139/0x370
kernel: Code: 39 e1 48 8b 53 08 49 0f 47 cc 49 89 cd f6 c2 01 0f 85 c0 01 00 00 
66 90 48 89 da 48 8b 12 80 e6 08 0f 84 8e 00 00 00 4d 89 fe <0f> 0b 49 c7 c0 fb 
ff ff ff 48 8b 85 68 ff ff ff 41 01 46 70 41 01
kernel: RSP: 0018:ffffbd92001b3a30 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: fffff5f1c48d9b40 RCX: 0000000000001000
kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: RBP: ffffbd92001b3ad8 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000020e8
kernel: R13: 0000000000001000 R14: ffff96834b496400 R15: ffff96834b496400
kernel: FS:  0000000000000000(0000) GS:ffff968477c00000(0000) 
knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007507bcfe5f84 CR3: 000000010b49c002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? show_regs+0x6d/0x80
kernel:  ? __warn+0x89/0x160
kernel:  ? skb_splice_from_iter+0x139/0x370
kernel:  ? report_bug+0x17e/0x1b0
kernel:  ? handle_bug+0x51/0xa0
kernel:  ? exc_invalid_op+0x18/0x80
kernel:  ? asm_exc_invalid_op+0x1b/0x20
kernel:  ? skb_splice_from_iter+0x139/0x370
kernel:  tcp_sendmsg_locked+0x352/0xd70
kernel:  ? tcp_push+0x159/0x190
kernel:  ? tcp_sendmsg_locked+0x9c4/0xd70
kernel:  tcp_sendmsg+0x2c/0x50
kernel:  inet_sendmsg+0x42/0x80
kernel:  sock_sendmsg+0x118/0x150
kernel:  nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
kernel:  nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
kernel:  nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
kernel:  process_one_work+0x16c/0x350
kernel:  worker_thread+0x306/0x440
kernel:  ? _raw_spin_unlock_irqrestore+0x11/0x60
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xef/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x44/0x70
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---
kernel: nvme nvme1: failed to send request -5
kernel: nvme nvme1: I/O tag 111 (106f) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
kernel: nvme nvme1: starting error recovery
kernel: block nvme1n1: no usable path - requeuing I/O
kernel: nvme nvme1: Reconnecting in 10 seconds...

blktests md/001 hangs the system, in this particular scenario.

I then restarted the instance, enabled -proposed2, and installed
6.8.0-41-generic:

6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC
2024

I can now run md/001 multiple times, and it passes within a second each time.
The hang now longer occurs.

$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime    ...  0.441s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.441s  ...  0.405s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.405s  ...  0.410s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.410s  ...  0.429s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.429s  ...  0.408s
    
dmesg has:

unknown: run blktests md/001 at 2024-08-08 05:02:40
root[2377]: run blktests md/001
kernel: brd: module loaded
kernel: nvmet: adding nsid 1 to subsystem blktests-subsystem-1
kernel: nvmet_tcp: enabling port 0 (127.0.0.1:4420)
kernel: nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for 
NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
kernel: nvme nvme1: creating 2 I/O queues.
kernel: nvme nvme1: mapped 2/0/0 default/read/poll queues.
kernel: nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, 
hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
kernel: md/raid1:md127: active with 1 out of 2 mirrors
kernel: md127: detected capacity change from 0 to 2093056
kernel: md127: detected capacity change from 2093056 to 0
kernel: md: md127 stopped.
kernel: nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
multipathd[190]: nvme1n1: path already removed
kernel: brd: module unloaded
sudo[2342]: pam_unix(sudo:session): session closed for user root

The 6.8.0-41-generic kernel in -proposed2 fixes the issue. Happy to mark
verified for Noble.

** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2075110

Title:
  md: nvme over tcp with a striped underlying md raid device leads to
  data corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2075110/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2075110] Re: md: nvme over tcp with a striped underlying md raid device leads to data corruption

Reply via email to