[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi Jens,

I highly recommend you go through the pain of upgrading the kernel on your GPU cluster to something modern, like 4.15.0-91-generic. There were quite a few regressions around the 4.15.0-56 to 4.15.0-58 mark, as we merged a lot of upstream stable patches at that time. 4.15.0-91 is pretty stable these days, and you can probably leave the cluster on that kernel long term.

In this bug, the fix landed in the mlx5_core driver, which is a kernel module. Kernel modules are only compatible with the kernel they were compiled for, since Linux does not have a stable ABI (binary interface) for modules. So this isn't as easy as just copying over a fixed kernel module. The kmod package doesn't actually have any kernel modules in it, just the blacklists and things defined in /etc/modules-load.d and /etc/modprobe.d.

Nvidia drivers should be built with dkms, and *should* work without too much hassle. I know that theory doesn't always align with reality though.

Anyway, I recommend you upgrade to a newer kernel on your GPU cluster.

Thanks,
Matthew

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1840854

Title: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840854/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
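The point about modules being tied to the kernel they were built for can be checked by hand. The following is a minimal sketch, not an official tool: it assumes the module's `vermagic` string starts with the kernel release it was compiled against (as printed by `modinfo -F vermagic`), and the helper names are hypothetical. The kernel's real loader checks more than this (the full vermagic string and, where enabled, symbol versions), so a match here is necessary rather than sufficient.

```python
import subprocess


def vermagic_release(vermagic: str) -> str:
    """Return the kernel release a module was built for.

    Assumption: the first whitespace-separated token of the vermagic
    string is the kernel release, e.g. '4.15.0-91-generic'.
    """
    tokens = vermagic.split()
    return tokens[0] if tokens else ""


def module_matches_kernel(vermagic: str, running_release: str) -> bool:
    """True if the module was built for exactly the given kernel release."""
    return vermagic_release(vermagic) == running_release


def read_vermagic(module_or_path: str) -> str:
    """Ask modinfo for the vermagic field of a module name or .ko file path."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module_or_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, a module built for 4.15.0-91-generic (illustrative vermagic `"4.15.0-91-generic SMP mod_unload"`) would not match a running 4.15.0-58-generic kernel, which is why copying the .ko file across is not expected to work.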
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi Jeff,

hmm, I didn't get notified by launchpad about your answer :(((. Anyway, I tried another machine with 4.15.0-91-generic and indeed, it seems to be fixed. The problem now is that our GPU machines are running 4.15.0-58-generic and cannot be upgraded, because all the nvidia stuff is very picky and we do not have the time to upgrade the cluster to a new kernel version.

Therefore the question: is this just a driver problem? Copying over the kmod from 4.15.0-91 probably isn't a problem, I guess. Or rebuilding the kmod for 4.15.0-58 may work out of the box as well...
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi Jens,

As the fix landed in 4.15.0-59, I would expect you to still see the issue in 4.15.0-58. The current Bionic GA kernel is 4.15.0.91.83 in -updates. You should try an updated kernel and see if that resolves the issue.
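As a rough illustration of the version reasoning above, here is a small sketch that decides whether a Bionic kernel release string is at or past 4.15.0-59. The function names and the `FIXED_IN` constant are made up for this example; the comparison only makes sense within the 4.15.0 series, so other series are rejected rather than guessed at.

```python
import re

# First Bionic kernel ABI reported in this thread to carry the fix.
FIXED_IN = (4, 15, 0, 59)


def parse_release(release: str) -> tuple:
    """Parse 'MAJOR.MINOR.PATCH-ABI[-flavour]' into a tuple of four ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release: str, fixed_in: tuple = FIXED_IN) -> bool:
    """True if a kernel of the same upstream series is at or past fixed_in."""
    version = parse_release(release)
    if version[:3] != fixed_in[:3]:
        # A different upstream series needs its own check; do not guess.
        raise ValueError(f"{release!r} is not a {fixed_in[:3]} series kernel")
    return version >= fixed_in
```

Feeding it the output of `uname -r` on each node (for instance `has_fix("4.15.0-58-generic")`) gives a quick inventory of which machines are still on an affected kernel.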
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
We use 'Linux kino6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux' (Ubuntu 18.04.3 LTS) and constantly see 'hw csum failure' messages:

[ +28.297139] kino6_0: hw csum failure
[ +0.003607] CPU: 12 PID: 0 Comm: swapper/12 Tainted: P O 4.15.0-58-generic #64-Ubuntu
[ +0.03] Hardware name: GIGABYTE G291-281-00/MG51-G21-00, BIOS R06 11/19/2019
[ +0.01] Call Trace:
[ +0.02]
[ +0.11] dump_stack+0x63/0x8b
[ +0.08] netdev_rx_csum_fault+0x38/0x40
[ +0.03] __skb_checksum_complete+0xbc/0xd0
[ +0.05] nf_ip_checksum+0xc3/0xf0
[ +0.18] tcp_error+0x162/0x1c0 [nf_conntrack]
[ +0.06] ? ttwu_do_wakeup+0x1e/0x140
[ +0.11] nf_conntrack_in+0x14f/0x500 [nf_conntrack]
[ +0.07] ? csum_partial_ext+0x9/0x10
[ +0.07] ? __skb_checksum+0x6b/0x300
[ +0.06] ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
[ +0.05] nf_hook_slow+0x48/0xc0
[ +0.04] ? skb_send_sock+0x50/0x50
[ +0.05] ip_rcv+0x2fa/0x360
[ +0.03] ? inet_del_offload+0x40/0x40
[ +0.04] __netif_receive_skb_core+0x432/0xb40
[ +0.03] ? update_curr+0xf2/0x1d0
[ +0.04] ? tcp4_gro_receive+0x137/0x1a0
[ +0.03] __netif_receive_skb+0x18/0x60
[ +0.03] ? __netif_receive_skb+0x18/0x60
[ +0.03] netif_receive_skb_internal+0x45/0xe0
[ +0.04] napi_gro_receive+0xc5/0xf0
[ +0.36] mlx5e_handle_rx_cqe_mpwrq+0x465/0x860 [mlx5_core]
[ +0.28] mlx5e_poll_rx_cq+0xd1/0x8b0 [mlx5_core]
[ +0.25] mlx5e_napi_poll+0x9d/0x290 [mlx5_core]
[ +0.04] net_rx_action+0x140/0x3a0
[ +0.05] __do_softirq+0xe4/0x2d4
[ +0.06] irq_exit+0xc5/0xd0
[ +0.03] do_IRQ+0x8a/0xe0
[ +0.03] common_interrupt+0x8c/0x8c
[ +0.02]
[ +0.05] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
[ +0.02] RSP: 0018:ad7a00283e68 EFLAGS: 0246 ORIG_RAX: ffdd
[ +0.04] RAX: 89f13fd22840 RBX: 02273da74e91 RCX: 001f
[ +0.02] RDX: 02273da74e91 RSI: feba65558937 RDI:
[ +0.02] RBP: ad7a00283ea8 R08: 0004 R09: 00022080
[ +0.01] R10: ad7a00283e38 R11: 07e0dcda6658 R12: cd5a00503298
[ +0.02] R13: 0003 R14: b3f72e78 R15:
[ +0.04] ? cpuidle_enter_state+0x97/0x2f0
[ +0.03] cpuidle_enter+0x17/0x20
[ +0.05] call_cpuidle+0x23/0x40
[ +0.04] do_idle+0x18c/0x1f0
[ +0.04] cpu_startup_entry+0x73/0x80
[ +0.05] start_secondary+0x1ab/0x200
[ +0.05] secondary_startup_64+0xa5/0xb0
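To get a feel for how often the failure fires and on which interface, the log can be summarised with a few lines of Python. This is only a sketch: it assumes each event appears in the log as `<interface>: hw csum failure`, as in the trace above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the interface name printed immediately before the message,
# e.g. "kino6_0: hw csum failure".
CSUM_FAILURE_RE = re.compile(r"(\S+): hw csum failure")


def count_csum_failures(log_text: str) -> Counter:
    """Count 'hw csum failure' events per network interface in log text."""
    return Counter(match.group(1) for match in CSUM_FAILURE_RE.finditer(log_text))
```

The text can come from `dmesg` output or from /var/log/kern.log read into a string; the result maps each interface name to the number of failures seen.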
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi @mohamadh,

Are you using the in-tree drivers that Ubuntu supplies? I have seen some customers using out-of-date, out-of-tree MOFED drivers from the Mellanox website who were able to reproduce this on newer kernels like 4.15.0-66. Updating to newer MOFED drivers or reverting to the normal Ubuntu-supplied driver seemed to fix things for them.

If you are using the normal Ubuntu kernel with inbuilt drivers, can you open a new bug report and include your /var/log/kern.log with the call trace?

We can't simply ship a set of upstream patches untested, and every patch needs to fix a specific problem, so we will work with you to find the patch that fixes things.

Thanks,
Matthew
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi,

I see that the issue still reproduces on kernels newer than 4.15.0-69. To fix it, all of the following upstream patches are needed:

net/mlx5e: Rx, Fix checksum calculation for new hardware --> db849faa9bef993a1379dc510623f750a72fa7ce
net/mlx5e: Rx, Check ip headers sanity --> 0318a7b7fcad9765931146efa7ca3a034194737c
net/mlx5e: Rx, Fixup skb checksum for packets with tail padding --> 0aa1d18615c163f92935b806dcaff9157645233a
net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded --> 5d0bb3bac4b9f6c22280b04545626fdfd99edc6b
mlx5: fix get_ip_proto() --> ef6fcd455278c2be3032a346cc66d9dd9866b787
net/mlx5e: Allow reporting of checksum unnecessary --> b856df28f9230a47669efbdd57896084caadb2b3
net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets --> fe1dc069990c1f290ef6b99adb46332c03258f38
net/mlx5e: Set ECN for received packets using CQE indication --> f007c13d4ad62f494c83897eda96437005df4a91
net/mlx5e: Add likely to the common RX checksum flow --> 63a612f984a1fae040ab6f1c6a0f1fdcdf1954b8
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets --> f938daeee95eb36ef6b431bf054a5cc6cdada112
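One way to see which of these patches a given kernel package already carries is to search its changelog for the patch subjects. The sketch below assumes the changelog text quotes each backported patch by its upstream subject line, which is not guaranteed; upstream commit hashes are not used because backports are cherry-picks with different hashes. `missing_patches` is a hypothetical helper name.

```python
# Subjects of the upstream patches listed in this comment.
PATCH_SUBJECTS = [
    "net/mlx5e: Rx, Fix checksum calculation for new hardware",
    "net/mlx5e: Rx, Check ip headers sanity",
    "net/mlx5e: Rx, Fixup skb checksum for packets with tail padding",
    "net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded",
    "mlx5: fix get_ip_proto()",
    "net/mlx5e: Allow reporting of checksum unnecessary",
    "net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets",
    "net/mlx5e: Set ECN for received packets using CQE indication",
    "net/mlx5e: Add likely to the common RX checksum flow",
    "net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets",
]


def missing_patches(changelog_text: str, subjects=PATCH_SUBJECTS) -> list:
    """Return the patch subjects that do not appear in the changelog text."""
    return [subject for subject in subjects if subject not in changelog_text]
```

An empty result suggests every patch is mentioned; a non-empty one is only a hint, since a subject may have been reworded during backporting.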
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
The patches can be found in Eoan as well, so closing this as Fix Released.

** Changed in: linux (Ubuntu)
Status: Incomplete => Fix Released
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Hi Matthew,

I was going through some kernel bugs and it looks like this one has already been released, but it didn't get updated automatically because its LP number is not directly mentioned in the changelog. Marking it as Fix Released.

Hope this helps!

cheers,
Mauricio

** Changed in: linux (Ubuntu Bionic)
Status: Fix Committed => Fix Released
[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
The customer installed 4.15.0-59 from -proposed on a machine with Mellanox Ethernet CX4LX cards, using the mlx5_core kernel module. Checksums are now calculated correctly and the kernel splat does not occur when the devices are brought up. Marking this as verified.

** Tags added: verification-done-bionic