** Description changed:

+ [ Impact ]
+ 
+ Due to limitations in how Netplan handles SR-IOV devices, features such as
+ VF-LAG and Scalable Functions could not be used. Certain configuration steps
+ must happen in a very specific order, and Netplan fails to perform the setup
+ correctly.
+ 
+ This SRU addresses the following two problems:
+ 
+ 1) Failure to activate Mellanox VF-LAG -
+ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018
+ 
+ Netplan must wait until the backend finishes adding interfaces to the Bond
+ and the Mellanox driver reports the VF-LAG feature as "active" before
+ binding VFs to the driver.
+ 
+ See also https://bugs.launchpad.net/netplan/+bug/2083008
+ 
+ This problem is fixed by introducing proper ordering in the configuration
+ process and monitoring the driver state until it reports as ready (or times
+ out).
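
The "monitor until ready or timed out" step can be illustrated with a minimal Python sketch. It is not Netplan's actual implementation: the function names, the default timeout and the polling interval are assumptions chosen for illustration. Only the "active" value and the debugfs lag/state file come from this report.

```python
import time
from pathlib import Path


def read_lag_state(state_file: Path) -> str:
    """Return the stripped contents of the lag/state file, or "" if unreadable."""
    try:
        return state_file.read_text().strip()
    except OSError:
        # File missing (older kernel, debugfs not mounted) or not readable.
        return ""


def wait_for_lag_active(state_file: Path, timeout: float = 30.0,
                        interval: float = 0.5) -> bool:
    """Poll state_file until it reads "active" or the timeout expires.

    The timeout and interval defaults are hypothetical values.
    """
    deadline = time.monotonic() + timeout
    while True:
        if read_lag_state(state_file) == "active":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass something like Path("/sys/kernel/debug/mlx5/0000:08:00.0/lag/state") and rebind the VFs only when the function returns True. Using time.monotonic() keeps the deadline unaffected by wall-clock adjustments during boot.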
+ 
+ 2) Inability to set the embedded switch mode without Virtual
+ Functions - https://bugs.launchpad.net/netplan/+bug/2020409
+ 
+ Netplan wouldn't allow setting the e-switch mode without having Virtual
+ Functions defined in the YAML. Setting the e-switch mode should be allowed
+ independently of the existence of Virtual Functions.
+ This problem prevents the use of Scalable Functions without SR-IOV.
+ 
+ 
+ [ Test Plan ]
+ 
+ To reproduce the problems addressed by this SRU one needs to
+ have access to Mellanox network interfaces that support SR-IOV.
+ 
+ In this particular case we'll need help from the bug reporters
+ (https://bugs.launchpad.net/netplan/+bug/2083008)
+ to install and test the new netplan.io version in production.
+ 
+ The fixes for problem 1) described above were already verified and
+ solved the problem (more details at
+ https://bugs.launchpad.net/netplan/+bug/2083008).
+ 
+ The fixes for problem 2) were tested on real hardware when they were
+ implemented (see https://github.com/canonical/netplan/pull/454 for details)
+ but still need to be tested on Ubuntu 22.04.
+ 
+  * Detailed instructions on how to reproduce the bug
+ 
+ Problem 1)
+ 
+ A configuration file that looks like the one below can be used
+ to test the fix.
+ 
+ After booting the system with this configuration, the Mellanox driver
+ should report the LAG state as "active".
+ It can be checked in the debugfs file:
+ /sys/kernel/debug/mlx5/{pci_addr}/lag/state
+ 
+ network:
+   version: 2
+   ethernets:
+     ens4f0np0:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+ 
+     ens4f1np1:
+       virtual-function-count: 16
+       embedded-switch-mode: switchdev
+       delay-virtual-functions-rebind: true
+ 
+   bonds:
+     bond0:
+       interfaces:
+         - ens4f0np0
+         - ens4f1np1
+       parameters:
+         mode: active-backup
+ 
+ Problem 2)
+ 
+ A configuration like the one below can be used to test whether the e-switch
+ mode can be set to "switchdev" without Virtual Functions:
+ 
+ network:
+   version: 2
+   ethernets:
+     enp3s0f0np0:
+       match:
+         macaddress: 98:03:9b:c3:ef:ba
+       mtu: 9000
+       set-name: enp3s0f0np0
+       embedded-switch-mode: switchdev
+     enp3s0f1np1:
+       match:
+         macaddress: 98:03:9b:c3:ef:bb
+       mtu: 9000
+       set-name: enp3s0f1np1
+       embedded-switch-mode: switchdev
+ 
+ After applying the configuration, the e-switch mode can be checked with
+ the devlink tool. For example:
+ 
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.0
+ pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
+ root@node-laveran:~# devlink dev eswitch show pci/0000:03:00.1
+ pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic
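
For an automated check, the devlink output shown above could be parsed with a short Python sketch like the following. The helper name parse_eswitch_mode is hypothetical, and the parser assumes the "key value" pair layout seen in the sample output above; other devlink versions may print different attributes.

```python
def parse_eswitch_mode(devlink_output: str) -> dict:
    """Map each device handle to its e-switch mode.

    Expects text in the form printed by `devlink dev eswitch show`, e.g.
    "pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic".
    Lines without a "handle: attributes" shape (shell prompts, blanks) are skipped.
    """
    modes = {}
    for line in devlink_output.splitlines():
        handle, sep, attrs = line.strip().partition(": ")
        if not sep:
            continue
        tokens = attrs.split()
        # Attributes are printed as "key value" pairs, e.g. "mode switchdev".
        pairs = dict(zip(tokens[0::2], tokens[1::2]))
        if "mode" in pairs:
            modes[handle] = pairs["mode"]
    return modes


sample = """\
pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic
pci/0000:03:00.1: mode switchdev inline-mode none encap-mode basic
"""
modes = parse_eswitch_mode(sample)
assert all(mode == "switchdev" for mode in modes.values())
```

The test passes when every physical function listed in the YAML reports "switchdev" even though no virtual-function-count is set.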
+ 
+ [ Where problems could occur ]
+ 
+ These changes should affect only SR-IOV related scenarios.
+ Undetected problems could cause Netplan to fail to configure the device,
+ in which case Virtual Functions would no longer be created.
+ 
+ [ Other Info ]
+ 
+ Related work:
+ 
+ https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
+ https://bugs.launchpad.net/netplan/+bug/2020409
+ https://github.com/canonical/netplan/pull/439
+ https://github.com/canonical/netplan/pull/454
+ 
+ 
+ ---- Original bug description ----
+ 
  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.
  
  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
-                            Make sure all VFs are unbound prior to VF LAG activation or deactivation
+                            Make sure all VFs are unbound prior to VF LAG activation or deactivation
  
  This is caused by rebinding the driver prior to the VF lag being ready.
  
  A debugfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:
  
-     $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
+     $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
  
  The kernel feature is available in the upcoming Kinetic 5.19 kernel and
  we should probably backport it to the Jammy 5.15 kernel.
  
  0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

** Summary changed:

- [mlx5] Intermittent VF-LAG activation failure
+ [SRU][mlx5] Intermittent VF-LAG activation failure

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure
