** Description changed:

+ [ Impact ]
+
+ Due to limitations in how Netplan handles SR-IOV devices, features such as
+ VF-LAG and Scalable Functions couldn't be used. Certain configuration steps
+ must happen in a very specific order and Netplan fails to perform the setup correctly.
+
+ This SRU addresses the following two problems:
+
+ 1) Failure to activate Mellanox VF-LAG -
+ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018
+
+ Netplan must wait until the backend finishes adding interfaces to the bond
+ and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
+ the driver.
+
+ See also https://bugs.launchpad.net/netplan/+bug/2083008
+
+ This problem is fixed by enforcing the correct ordering in the configuration process
+ and monitoring the driver state until it reports ready (or the wait times out).
+
+ 2) Impossibility to set the embedded switch mode without Virtual
+ Functions - https://bugs.launchpad.net/netplan/+bug/2020409
+
+ Netplan wouldn't allow setting the e-switch mode without having Virtual Functions
+ defined in the YAML. Setting the e-switch mode should be allowed independently of
+ the existence of Virtual Functions.
+ This problem prevents the use of Scalable Functions without SR-IOV.
+
+
+ [ Test Plan ]
+
+ To reproduce the problems addressed by this SRU one needs to
+ have access to Mellanox network interfaces that support SR-IOV.
+
+ In this particular case we'll need help from the bug reporters (https://bugs.launchpad.net/netplan/+bug/2083008)
+ to install and test the new netplan.io version in production.
+
+ The fixes for problem 1) described above were already verified and
+ solved the problem (more details in https://bugs.launchpad.net/netplan/+bug/2083008).
+
+ The fixes for problem 2) were tested on real hardware when they were implemented
+ (see https://github.com/canonical/netplan/pull/454 for details) but still need to be
+ tested on Ubuntu 22.04.
+
+ * detailed instructions on how to reproduce the bug
+
+ Problem 1)
+
+ A configuration file that looks like the one below can be used
+ to test the fix.
+
+ After booting the system with this configuration, the Mellanox driver
+ should report the LAG state as "active".
+ It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state
+
+ network:
+   version: 2
+   ethernets:
+     ens4f0np0:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+
+     ens4f1np1:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+
+   bonds:
+     bond0:
+       interfaces:
+         - ens4f0np0
+         - ens4f1np1
+       parameters:
+         mode: active-backup
+
+ Problem 2)
+
+ A configuration like the one below can be used to test if the e-switch mode
+ can be set to "switchdev" without Virtual Functions:
+
+ network:
+   version: 2
+   ethernets:
+     enp3s0f0np0:
+       match:
+         macaddress: 98:03:9b:c3:ef:ba
+       mtu: 9000
+       set-name: enp3s0f0np0
+       embedded-switch-mode: switchdev
+     enp3s0f1np1:
+       match:
+         macaddress: 98:03:9b:c3:ef:bb
+       mtu: 9000
+       set-name: enp3s0f1np1
+       embedded-switch-mode: switchdev
+
+ After applying the configuration, the e-switch mode can be checked with
+ the devlink tool. For example:
+
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.0
+ pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.1
+ pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic
+
+ [ Where problems could occur ]
+
+ These changes should affect only SR-IOV related scenarios.
+ Undetected problems could cause Netplan to fail to configure the device
+ and Virtual Functions wouldn't be created anymore.
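
The devlink check from the Problem 2) test plan can be scripted when several ports need verifying. The following is only a sketch: `parse_eswitch_line` and `eswitch_mode` are hypothetical helper names, and the parser assumes the output keeps the "device: key value key value ..." shape seen in the sample output of this description.

```python
import subprocess


def parse_eswitch_line(line):
    """Split one line of `devlink dev eswitch show` output.

    Assumes the shape shown in the test plan, e.g.
    "pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic".
    Returns (device, {attribute: value}).
    """
    device, sep, rest = line.strip().partition(": ")
    if not sep:
        raise ValueError(f"unexpected devlink output: {line!r}")
    tokens = rest.split()
    # Attributes come as alternating key/value tokens.
    return device, dict(zip(tokens[0::2], tokens[1::2]))


def eswitch_mode(device):
    """Run devlink for one device (needs root and real hardware)."""
    result = subprocess.run(
        ["devlink", "dev", "eswitch", "show", device],
        capture_output=True, text=True, check=True,
    )
    _, attrs = parse_eswitch_line(result.stdout.splitlines()[0])
    return attrs.get("mode")
```

On a test machine, `eswitch_mode("pci/0000:03:00.0") == "switchdev"` would then confirm the expected mode for the first port in the example.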
+
+ [ Other Info ]
+
+ Related work:
+
+ https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
+ https://bugs.launchpad.net/netplan/+bug/2020409
+ https://github.com/canonical/netplan/pull/439
+ https://github.com/canonical/netplan/pull/454
+
+
+ ---- Original bug description ----
+
  During system initialization there is a specific sequence that must be followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:

  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
- Make sure all VFs are unbound prior to VF LAG activation or deactivation
+ Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver prior to the VF lag being ready.

  A sysfs knob has recently been added to the driver [0] and we should monitor it before attempting to rebind the driver:

- $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
+ $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel and we should probably backport it to the Jammy 5.15 kernel.

  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900
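
The "monitor the knob before rebinding" step described above could look roughly like the sketch below. It is not the code from the Netplan fix: the function names, the 30 second default timeout and the polling interval are assumptions made for illustration. Only the debugfs path layout and the "active" value come from this bug description.

```python
import time
from pathlib import Path


def read_lag_state(pci_addr, debugfs_root="/sys/kernel/debug"):
    """Return the content of mlx5/<pci_addr>/lag/state, or None if unreadable."""
    path = Path(debugfs_root) / "mlx5" / pci_addr / "lag" / "state"
    try:
        return path.read_text().strip()
    except OSError:
        # The file is missing on kernels without the knob, or debugfs
        # may not be mounted or readable by the caller.
        return None


def wait_for_lag_active(pci_addr, timeout=30.0, interval=0.5,
                        debugfs_root="/sys/kernel/debug"):
    """Poll the LAG state until it reads "active" or the timeout expires.

    Returns True if the driver reported "active" in time, False otherwise.
    The timeout and interval defaults are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        if read_lag_state(pci_addr, debugfs_root) == "active":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would rebind the VFs only after `wait_for_lag_active("0000:08:00.0")` returns True, and decide separately what to do on a timeout (for example, log a warning and rebind anyway).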
** Summary changed:

- [mlx5] Intermittent VF-LAG activation failure
+ [SRU][mlx5] Intermittent VF-LAG activation failure

https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure
