** Description changed:

+ BugLink: https://bugs.launchpad.net/bugs/2139322
+
+ [Impact]
+
  After enabling mlx5 OVS hardware offload on the 6.8 kernel, we see
  several different issues in our production environment. They only
  occur under real, heavy workloads.

  Issue 1, general protection fault:

  [75202.650580] general protection fault, probably for non-canonical address 0x9cad655f9b42c237: 0000 [#1] PREEMPT SMP NOPTI
  [75202.661464] CPU: 29 PID: 0 Comm: swapper/29 Kdump: loaded Not tainted 6.8.0-51-generic #52~22.04.1-Ubuntu
  [75202.671039] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS 2.15.2 04/02/2024
  [75202.678701] RIP: 0010:kmalloc_trace+0xd7/0x360
  [75202.683158] Code: 83 78 10 00 48 8b 38 0f 84 36 02 00 00 48 85 ff 0f 84 2d 02 00 00 41 8b 44 24 28 49 8b 9c 24 b8 00 00 00 49 8b 34 24 48 01 f8 <48> 33 18 48 89 c1 48 89 f8 48 0f c9 48 31 cb 48 8d 8a 00 20 00 00
  [75202.701933] RSP: 0018:ffffabfc19a08990 EFLAGS: 00010282
  [75202.707166] RAX: 9cad655f9b42c237 RBX: 1c00e25717636e48 RCX: 0000000000000000
  [75202.714310] RDX: 000000bec1e5c01d RSI: 000000000003b980 RDI: 9cad655f9b42c1b7
  [75202.721449] RBP: ffffabfc19a089e0 R08: 0000000000000000 R09: 0000000000000000
  [75202.728593] R10: ffffabfc19a08a00 R11: 0000000000000000 R12: ffff94db00050c00
  [75202.735735] R13: 0000000000000920 R14: 00000000000000d8 R15: 0000000000000000
  [75202.742876] FS: 0000000000000000(0000) GS:ffff95da7cc80000(0000) knlGS:0000000000000000
  [75202.750971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [75202.756722] CR2: 00007a5f6af90010 CR3: 0000010263b44002 CR4: 0000000000f70ef0
  [75202.763866] PKRU: 55555554
  [75202.766581] Call Trace:
  [75202.769033] <IRQ>
  [75202.771053] ? show_regs+0x6d/0x80
  [75202.774483] ? die_addr+0x37/0xa0
  [75202.777807] ? exc_general_protection+0x1db/0x480
  [75202.782525] ? asm_exc_general_protection+0x27/0x30
  [75202.787412] ? kmalloc_trace+0xd7/0x360
  [75202.791261] ? flow_offload_alloc+0x64/0x120 [nf_flow_table]
  [75202.796938] flow_offload_alloc+0x64/0x120 [nf_flow_table]
  [75202.802431] ? nf_conntrack_in+0x113/0x360 [nf_conntrack]
  [75202.807846] ? flow_offload_alloc+0x64/0x120 [nf_flow_table]
  [75202.813517] tcf_ct_flow_table_process_conn+0xc2/0x1e0 [act_ct]
  [75202.819444] tcf_ct_act+0x6c8/0xae0 [act_ct]
  [75202.823726] tcf_action_exec+0xbc/0x190
  [75202.827571] __tcf_classify+0xcb/0x1f0
  [75202.831332] tcf_classify+0xff/0x260
  [75202.834920] tc_run+0xa3/0x110
  [75202.837987] __netif_receive_skb_core.constprop.0+0x459/0xf90
  [75202.843744] ? dev_gro_receive+0xc0/0x350
  [75202.847763] ? srso_alias_return_thunk+0x5/0xfbef5
  [75202.852565] ? napi_gro_receive+0x73/0x220
  [75202.856675] __netif_receive_skb_list_core+0xfd/0x250
  [75202.861736] netif_receive_skb_list_internal+0x1a3/0x2d0
  [75202.867056] ? srso_alias_return_thunk+0x5/0xfbef5
  [75202.871858] ? mlx5e_rx_cq_process_basic_cqe_comp+0x2f7/0x310 [mlx5_core]
  [75202.878752] napi_complete_done+0x74/0x1c0
  [75202.882855] mlx5e_napi_poll+0x190/0x7b0 [mlx5_core]
  [75202.887911] __napi_poll+0x33/0x200
  [75202.891753] net_rx_action+0x181/0x2e0
  [75202.895849] handle_softirqs+0xdb/0x340
  [75202.900027] __irq_exit_rcu+0xd9/0x100
  [75202.904103] irq_exit_rcu+0xe/0x20
  [75202.907828] common_interrupt+0xa4/0xb0
  [75202.911983] </IRQ>
  [75202.914387] <TASK>
  [75202.916786] asm_common_interrupt+0x27/0x40
  [75202.921258] RIP: 0010:mwait_idle+0x50/0x80

  This is caused by a use-after-free in the slab (kmalloc-256).

-
  Issue 2, soft lockup:

  [148720.717134] watchdog: BUG: soft lockup - CPU#3 stuck for 7923s! [swapper/3:0]
- [148720.725207] Modules linked in: act_csum act_pedit act_tunnel_key vhost_net vhost tap vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd xt_CT xt_tcpudp nft_compat nf_tables veth
+ [148720.725207] Modules linked in: act_csum act_pedit act_tunnel_key vhost_net vhost tap vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd xt_CT xt_tcpudp nft_compat nf_tables veth act_ct nf_flow_table nf_conntrack_netlink nvme_fabrics nvme_keyring xfs dm_crypt act_skbedit act_vlan act_mirred cls_matchall geneve ip6_udp_tunnel udp_tunnel nfnetlink_cttimeout nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat 8021q garp mrp stp llc bonding sunrpc binfmt_misc nls_iso8859_1 mlx5_vdpa vringh vhost_iotlb vdpa intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass rapl dell_wmi video ledtrig_audio sparse_keymap dell_smbios dcdbas dell_wmi_descriptor wmi_bmof ipmi_ssif ccp ptdma k10temp acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler mac_hid dm_service_time sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov
  [148720.725328] async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c mlx5_ib ib_uverbs macsec ib_core ses enclosure raid1 raid0 bcache mlx5_core crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel mlxfw mpt3sas sha256_ssse3 nvme psample ahci sha1_ssse3 raid_class tg3 nvme_core tls libahci xhci_pci mgag200 nvme_auth scsi_transport_sas i2c_algo_bit pci_hyperv_intf i2c_piix4 xhci_pci_renesas wmi aesni_intel crypto_simd cryptd
  [148720.725385] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G L 6.8.0-57-generic #59~22.04.1-Ubuntu
  [148720.725388] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS 2.16.3 09/10/2024
  [148720.725390] RIP: 0010:flow_offload_hash_cmp+0x1f/0x40 [nf_flow_table]
  [148720.725398] Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 8b 47 08 ba 32 00 00 00 48 8d 7e 08 48 89 c6 48 89 e5 e8 62 4a b6 fa 5d <85> c0 0f 95 c0 0f b6 c0 31 d2 31 f6 31 ff e9 b9 3b ee fa 66 66 2e
  [148720.725401] RSP: 0018:ffffad9f403fc928 EFLAGS: 00000246
  [148720.725404] RAX: 0000000000000004 RBX: ffff8a8f9a3c3a40 RCX: 0000000000000000
  [148720.725406] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  [148720.725409] RBP: ffffad9f403fc990 R08: 0000000000000000 R09: 000000000000003c
  [148720.725411] R10: 000000000000003c R11: 0000000000000000 R12: ffff89b49b080000
  [148720.725413] R13: 0000000000000000 R14: ffff89b49b09e6b8 R15: ffff89b2ba69ea58
  [148720.725415] FS: 0000000000000000(0000) GS:ffff8a8f3bf80000(0000) knlGS:0000000000000000
  [148720.725417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [148720.725419] CR2: 000056c0ae793900 CR3: 000000021d904002 CR4: 0000000000f70ef0
  [148720.725421] PKRU: 55555554
  [148720.725423] Call Trace:
  [148720.725426] <IRQ>
  [148720.725428] ? show_regs+0x6d/0x80
  [148720.725435] ? watchdog_timer_fn+0x206/0x290
  [148720.725441] ? __pfx_watchdog_timer_fn+0x10/0x10
  [148720.725445] ? __hrtimer_run_queues+0x112/0x2a0
  [148720.725450] ? srso_alias_return_thunk+0x5/0xfbef5
  [148720.725457] ? hrtimer_interrupt+0xf6/0x250
  [148720.725462] ? __sysvec_apic_timer_interrupt+0x51/0x120
  [148720.725467] ? sysvec_apic_timer_interrupt+0x3b/0xd0
  [148720.725473] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
  [148720.725479] ? flow_offload_hash_cmp+0x1f/0x40 [nf_flow_table]
  [148720.725484] ? flow_offload_lookup+0xb2/0x180 [nf_flow_table]
  [148720.725491] tcf_ct_flow_table_lookup.isra.0+0x244/0x6b0 [act_ct]
  [148720.725494] ? srso_alias_return_thunk+0x5/0xfbef5
  [148720.725499] ? ovs_dp_process_packet+0x1af/0x220 [openvswitch]
  [148720.725518] tcf_ct_act+0x23d/0xae0 [act_ct]
  [148720.725524] tcf_action_exec+0xbc/0x190
  [148720.725531] __tcf_classify+0xcb/0x1f0
  [148720.725535] tcf_classify+0xff/0x260
  [148720.725539] tc_run+0xa3/0x110
  [148720.725543] ? srso_alias_return_thunk+0x5/0xfbef5
  [148720.725547] __netif_receive_skb_core.constprop.0+0x459/0xf90
  [148720.725552] ? dev_gro_receive+0x150/0x350
  [148720.725557] ? srso_alias_return_thunk+0x5/0xfbef5
  [148720.725560] ? napi_gro_receive+0x73/0x220
  [148720.725564] __netif_receive_skb_list_core+0xfd/0x250
  [148720.725569] netif_receive_skb_list_internal+0x1a3/0x2d0
  [148720.725573] ? srso_alias_return_thunk+0x5/0xfbef5
  [148720.725578] ? mlx5e_rx_cq_process_basic_cqe_comp+0x2f7/0x310 [mlx5_core]
  [148720.725688] napi_complete_done+0x74/0x1c0
  [148720.725693] mlx5e_napi_poll+0x190/0x7b0 [mlx5_core]
  [148720.725782] __napi_poll+0x33/0x200
  [148720.725786] net_rx_action+0x181/0x2e0
  [148720.725792] handle_softirqs+0xdb/0x340
  [148720.725799] __irq_exit_rcu+0xd9/0x100
  [148720.725802] irq_exit_rcu+0xe/0x20

  Before the soft lockup, we see error messages from mlx5, e.g.:

  [486111.016058] mlx5_core 0000:41:00.1 ens3f1: NETDEV WATCHDOG: CPU: 119: transmit queue 0 timed out 17547 ms
  [486111.025773] mlx5_core 0000:41:00.1 ens3f1: TX timeout detected
  [486111.031726] mlx5_core 0000:41:00.1 ens3f1: TX timeout on queue: 0, SQ: 0x11d0, CQ: 0x1487, SQ Cons: 0xae7a SQ Prod: 0xaec3, usecs since last trans: 17562000
  [486111.045845] mlx5_core 0000:41:00.1 ens3f1: EQ 0x7: Cons = 0x8ac57014, irqn = 0x5f5

-
  Kernel cmdline:

  GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200n8 nvme_core.multipath=0 amd_iommu=on iommu=pt probe_vf=0 transparent_hugepage=never hugepagesz=1G hugepages=1536 default_hugepagesz=1G"
+
+ [Fix]
+
+ This upstream commit fixes it:
+
+ commit 03428ca5cee9f0792edc996c06ce4514816af1fb
+ Author: Florian Westphal <[email protected]>
+ Date: Tue Jan 14 00:50:36 2025 +0100
+
+ netfilter: conntrack: rework offload nf_conn timeout extension logic
+
+ This patch fixes a ct use-after-free and a stuck-packet issue, which
+ should be related to the two call traces above.
+
+
+ [Test Plan]
+
+ This issue can only be reproduced in our production environment, with an mlx5 NIC and OVS hw-offload enabled.
+ We need to run the kernel in that environment for a few weeks to confirm it is fixed.
+
+ [Where problems could occur]
+
+ The patch makes sure to take a refcount on the ct and to test the offload bits, which should prevent the ct from being used after it is removed.
+ It also modifies the flow offload teardown logic; if anything is wrong there, OVS flow offload might break.
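For readers unfamiliar with the pattern named under [Where problems could occur], here is a minimal userspace sketch of "take a reference only if the object is still alive, then test a status bit before touching it". It is not the kernel patch and does not use kernel APIs: `struct demo_conn`, `DEMO_OFFLOAD_BIT`, and all `demo_conn_*` functions are hypothetical names made up for illustration, and C11 atomics stand in for the kernel's refcount and bit helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit; stands in for an "offloaded" status bit. */
#define DEMO_OFFLOAD_BIT 0UL

/* Hypothetical stand-in for a tracked connection (not struct nf_conn). */
struct demo_conn {
    atomic_uint refcnt;     /* 0 means the entry is already being freed */
    atomic_ulong status;    /* bit flags */
    unsigned long timeout;  /* value we only touch while holding a ref */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool demo_conn_get_unless_zero(struct demo_conn *c)
{
    unsigned int old = atomic_load(&c->refcnt);

    while (old != 0) {
        /* On failure the CAS reloads 'old', so the loop re-checks it. */
        if (atomic_compare_exchange_weak(&c->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken with demo_conn_get_unless_zero(). */
static void demo_conn_put(struct demo_conn *c)
{
    unsigned int prev = atomic_fetch_sub(&c->refcnt, 1);

    assert(prev > 0); /* dropping a reference we never held is a bug */
    (void)prev;
}

/*
 * Extend the timeout only when (a) a reference could be taken and
 * (b) the offload bit is still set. Returns true if it was extended.
 */
static bool demo_conn_extend_timeout(struct demo_conn *c,
                                     unsigned long new_timeout)
{
    bool extended = false;

    if (!demo_conn_get_unless_zero(c))
        return false; /* entry is going away; do not touch it */

    if (atomic_load(&c->status) & (1UL << DEMO_OFFLOAD_BIT)) {
        c->timeout = new_timeout;
        extended = true;
    }

    demo_conn_put(c);
    return extended;
}
```

The point of the sketch is the ordering: the entry is dereferenced only after the conditional reference succeeds, so an entry whose count has already reached zero is never written to.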
-- 
https://bugs.launchpad.net/bugs/2139322

Title:
  Enable mlx5 ovs hardware offload causes multiple issues
