[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2020-04-13 Thread Matthew Ruffell
Hi Jens,

I highly recommend going through the pain of upgrading the kernel on your
GPU cluster to something modern, like 4.15.0-91-generic. There were quite
a few regressions around the 4.15.0-56 to 4.15.0-58 mark, as we merged a
lot of upstream stable patches at that time.

4.15.0-91 is pretty stable these days, and you can probably leave the
cluster on that kernel long term.

In this bug, the fix landed in the mlx5_core driver, which is a kernel
module. Kernel modules are only compatible with the kernel that they
were compiled for, since Linux does not provide a stable in-kernel ABI
(binary interface) for modules.

So, this isn't as easy as just copying over a fixed kernel module. The
kmod package doesn't actually contain any kernel modules, just the
blacklists and things defined in /etc/modules-load.d and /etc/modprobe.d.
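
As a rough illustration of the point above, the sketch below compares the
"vermagic" string recorded in a module with the running kernel release. It
assumes the first token of vermagic is the kernel release the module was
built for; the function names and the sample strings are made up for
illustration, and the kernel's own load-time checks go further than
comparing this one string.

```python
import subprocess


def vermagic_release(vermagic: str) -> str:
    """Return the first token of a vermagic string (the kernel release)."""
    fields = vermagic.split()
    return fields[0] if fields else ""


def module_matches_kernel(vermagic: str, running_release: str) -> bool:
    """True if the module was built for exactly the given kernel release."""
    return vermagic_release(vermagic) == running_release


def installed_vermagic(module: str = "mlx5_core") -> str:
    """Ask modinfo for the vermagic field of an installed module."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Illustrative strings only: a module built for -91 against a -58 kernel.
print(module_matches_kernel("4.15.0-91-generic SMP mod_unload",
                            "4.15.0-58-generic"))
```

On a real machine, module_matches_kernel(installed_vermagic(),
platform.release()) would do the same comparison for the installed
mlx5_core module (with "import platform" added).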

NVIDIA drivers should be built with DKMS, and *should* work without too
much hassle. I know that theory doesn't always align with reality,
though.

Anyway, I recommend you upgrade to a newer kernel on your GPU cluster.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1840854

Title:
  mlx5_core reports hardware checksum error for padded packets on
  Mellanox NICs

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840854/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2020-04-12 Thread Jens Elkner
Hi Jeff,

Hmm, I didn't get notified by Launchpad about your answer :(. Anyway, I
tried another machine with 4.15.0-91-generic and indeed, it seems to be
fixed there.

The problem now is that our GPU machines are running 4.15.0-58-generic
and cannot be upgraded, because all the NVIDIA stuff is very picky and we
do not have the time to upgrade the cluster to a new kernel version.

Hence the question: is this just a driver problem? I guess copying over
the kmod from 4.15.0-91 probably isn't a problem. Or rebuilding the kmod
for 4.15.0-58 may work out of the box as well...

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2020-03-16 Thread Jeff Lane
Hi Jens,
As the fix landed in 4.15.0-59, I would expect that you would still see
issues in 4.15.0-58. The current Bionic GA kernel is 4.15.0.91.83 in
-updates. You should try an updated kernel and see if that resolves the issue.
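
The version reasoning above can be sketched as a small check on the output
of "uname -r". This is only a sketch: it assumes a Bionic 4.15 release
string of the form 4.15.0-NN-flavour, takes 59 as the first build carrying
the fix (as stated in this message), and returns None for any other kernel
series rather than guessing. The function names are hypothetical.

```python
import re

# Assumption taken from the message above: the fix landed in 4.15.0-59.
FIXED_ABI = 59

# Matches Bionic 4.15 release strings such as "4.15.0-58-generic".
_RELEASE = re.compile(r"^4\.15\.0-(\d+)-")


def bionic_abi(release):
    """Return the ABI number (58 in 4.15.0-58-generic), or None if no match."""
    match = _RELEASE.match(release)
    return int(match.group(1)) if match else None


def predates_fix(release):
    """True if older than the fixed build, None for other kernel series."""
    abi = bionic_abi(release)
    if abi is None:
        return None
    return abi < FIXED_ABI


print(predates_fix("4.15.0-58-generic"))
print(predates_fix("4.15.0-91-generic"))
```

A False result only means the kernel is at or past the build named here; it
says nothing about out-of-tree drivers replacing the in-tree mlx5_core.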

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2020-03-16 Thread Jens Elkner
We use 'Linux kino6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux' (Ubuntu 18.04.3 LTS) and see
'hw csum failure' messages all the time:

[ +28.297139] kino6_0: hw csum failure
[  +0.003607] CPU: 12 PID: 0 Comm: swapper/12 Tainted: P   O 4.15.0-58-generic #64-Ubuntu
[  +0.03] Hardware name: GIGABYTE G291-281-00/MG51-G21-00, BIOS R06 11/19/2019
[  +0.01] Call Trace:
[  +0.02]  
[  +0.11]  dump_stack+0x63/0x8b
[  +0.08]  netdev_rx_csum_fault+0x38/0x40
[  +0.03]  __skb_checksum_complete+0xbc/0xd0
[  +0.05]  nf_ip_checksum+0xc3/0xf0
[  +0.18]  tcp_error+0x162/0x1c0 [nf_conntrack]
[  +0.06]  ? ttwu_do_wakeup+0x1e/0x140
[  +0.11]  nf_conntrack_in+0x14f/0x500 [nf_conntrack]
[  +0.07]  ? csum_partial_ext+0x9/0x10
[  +0.07]  ? __skb_checksum+0x6b/0x300
[  +0.06]  ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
[  +0.05]  nf_hook_slow+0x48/0xc0
[  +0.04]  ? skb_send_sock+0x50/0x50
[  +0.05]  ip_rcv+0x2fa/0x360
[  +0.03]  ? inet_del_offload+0x40/0x40
[  +0.04]  __netif_receive_skb_core+0x432/0xb40
[  +0.03]  ? update_curr+0xf2/0x1d0
[  +0.04]  ? tcp4_gro_receive+0x137/0x1a0
[  +0.03]  __netif_receive_skb+0x18/0x60
[  +0.03]  ? __netif_receive_skb+0x18/0x60
[  +0.03]  netif_receive_skb_internal+0x45/0xe0
[  +0.04]  napi_gro_receive+0xc5/0xf0
[  +0.36]  mlx5e_handle_rx_cqe_mpwrq+0x465/0x860 [mlx5_core]
[  +0.28]  mlx5e_poll_rx_cq+0xd1/0x8b0 [mlx5_core]
[  +0.25]  mlx5e_napi_poll+0x9d/0x290 [mlx5_core]
[  +0.04]  net_rx_action+0x140/0x3a0
[  +0.05]  __do_softirq+0xe4/0x2d4
[  +0.06]  irq_exit+0xc5/0xd0
[  +0.03]  do_IRQ+0x8a/0xe0
[  +0.03]  common_interrupt+0x8c/0x8c
[  +0.02]  
[  +0.05] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
[  +0.02] RSP: 0018:ad7a00283e68 EFLAGS: 0246 ORIG_RAX: 
ffdd
[  +0.04] RAX: 89f13fd22840 RBX: 02273da74e91 RCX: 001f
[  +0.02] RDX: 02273da74e91 RSI: feba65558937 RDI: 
[  +0.02] RBP: ad7a00283ea8 R08: 0004 R09: 00022080
[  +0.01] R10: ad7a00283e38 R11: 07e0dcda6658 R12: cd5a00503298
[  +0.02] R13: 0003 R14: b3f72e78 R15: 
[  +0.04]  ? cpuidle_enter_state+0x97/0x2f0
[  +0.03]  cpuidle_enter+0x17/0x20
[  +0.05]  call_cpuidle+0x23/0x40
[  +0.04]  do_idle+0x18c/0x1f0
[  +0.04]  cpu_startup_entry+0x73/0x80
[  +0.05]  start_secondary+0x1ab/0x200
[  +0.05]  secondary_startup_64+0xa5/0xb0
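
To get a feel for how often this happens per interface, a log can be
scanned for the first line of each trace. The sketch below assumes the
message format shown above ("<interface>: hw csum failure"); the function
name is hypothetical and the sample lines are taken from this report.

```python
import re
from collections import Counter

# Matches lines such as "[ +28.297139] kino6_0: hw csum failure".
_CSUM = re.compile(r"(\S+): hw csum failure")


def count_csum_failures(lines):
    """Count 'hw csum failure' messages per interface name."""
    counts = Counter()
    for line in lines:
        match = _CSUM.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[ +28.297139] kino6_0: hw csum failure",
    "[  +0.003607] CPU: 12 PID: 0 Comm: swapper/12 Tainted: P   O",
]
print(count_csum_failures(sample))
```

Any iterable of lines works, so an open log file (for example
/var/log/kern.log) can be passed directly.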

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2019-12-01 Thread Matthew Ruffell
Hi @mohamadh,

Are you using the in-tree drivers that Ubuntu supplies?

I have seen some customers using out-of-tree MOFED drivers from the
Mellanox website that were out of date, and they were able to reproduce
this on newer kernels like 4.15.0-66. Updating to newer MOFED drivers or
reverting to the normal Ubuntu-supplied driver seemed to fix things for
them.

If you are using the normal Ubuntu kernel with the in-tree drivers, can
you open a new bug report and include your /var/log/kern.log with the
call trace?

We can't simply ship a set of upstream patches untested, and every patch
needs to fix a specific problem, so we will work with you to find the
patch that fixes things.

Thanks,
Matthew

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2019-12-01 Thread Mohammad Heib
Hi,
I see that the issue still reproduces on newer kernels (> 4.15.0-69).
To fix it, all of the following upstream patches are needed:

net/mlx5e: Rx, Fix checksum calculation for new hardware
  --> db849faa9bef993a1379dc510623f750a72fa7ce
net/mlx5e: Rx, Check ip headers sanity
  --> 0318a7b7fcad9765931146efa7ca3a034194737c
net/mlx5e: Rx, Fixup skb checksum for packets with tail padding
  --> 0aa1d18615c163f92935b806dcaff9157645233a
net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded
  --> 5d0bb3bac4b9f6c22280b04545626fdfd99edc6b
mlx5: fix get_ip_proto()
  --> ef6fcd455278c2be3032a346cc66d9dd9866b787
net/mlx5e: Allow reporting of checksum unnecessary
  --> b856df28f9230a47669efbdd57896084caadb2b3
net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets
  --> fe1dc069990c1f290ef6b99adb46332c03258f38
net/mlx5e: Set ECN for received packets using CQE indication
  --> f007c13d4ad62f494c83897eda96437005df4a91
net/mlx5e: Add likely to the common RX checksum flow
  --> 63a612f984a1fae040ab6f1c6a0f1fdcdf1954b8
net/mlx5e: CHECKSUM_COMPLETE offload for VLAN/QinQ packets
  --> f938daeee95eb36ef6b431bf054a5cc6cdada112

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2019-10-03 Thread Po-Hsu Lin
The patches can be found in Eoan as well, so closing this as Fix Released.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Fix Released

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2019-09-23 Thread Mauricio Faria de Oliveira
Hi Matthew,

I was going through some kernel bugs, and it looks like this one has
already been released but just didn't get updated automatically, as its
LP number is not directly mentioned in the changelog.

Marking it as Fix Released.
Hope this helps!

cheers,
Mauricio

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released

[Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

2019-08-20 Thread Matthew Ruffell
The customer installed 4.15.0-59 from -proposed on a machine with
Mellanox Ethernet CX4LX cards, using the mlx5_core kernel module.

Checksums are now calculated correctly and the kernel splat no longer
occurs when the devices are brought up.

Marking this as verified.

** Tags added: verification-done-bionic
