[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-xilinx-zynqmp/5.15.0-1029.33 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-done-jammy-linux-xilinx-zynqmp'. If the problem still exists, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp' to 'verification-failed-jammy-linux-xilinx-zynqmp'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-xilinx-zynqmp-v2 verification-needed-jammy-linux-xilinx-zynqmp

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]

  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, a TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]

  * Initially, Intel engineers proposed a workaround that disables LAG initialization [1]. This change was tested in an environment where the issue is easily reproduced. After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

  [Test Plan]

  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster, or after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]

  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]

  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
  - machine installed by MAAS. No issues during installation, but at that time the bond is not formed yet; later, when Linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after OpenStack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping

  ---
  ProblemType: Bug
  AlsaDevices:
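
The watchdog message quoted in the description has a fixed shape, so affected nodes and bond legs can be found by scanning captured kernel logs for it. The following is a minimal sketch, not part of the original report: it assumes the log text has already been saved to a string (for example from `dmesg` or `journalctl -k`), the function names `find_tx_timeouts` and `count_by_interface` are hypothetical, and the regular expression is modelled only on the single message quoted above.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted in the report:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text, driver="ice"):
    """Return (interface, queue) pairs for watchdog timeouts of one driver."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


def count_by_interface(hits):
    """Count timeouts per interface to see which bond leg is affected."""
    return Counter(iface for iface, _queue in hits)


if __name__ == "__main__":
    # Hypothetical sample log; only the watchdog message text of the first
    # line comes from the report, timestamps and other lines are made up.
    sample = "\n".join([
        "[ 9876.1] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
        "[ 9877.2] some unrelated kernel message",
        "[ 9990.3] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 12 timed out",
    ])
    hits = find_tx_timeouts(sample)
    print(hits)
    print(count_by_interface(hits))
```

Filtering on the driver name keeps timeouts from other NICs out of the result, which matters on hosts that mix E810 ports with other adapters.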
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-nvidia-tegra-igx/5.15.0-1010.10 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-tegra-igx' to 'verification-done-jammy-linux-nvidia-tegra-igx'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra-igx' to 'verification-failed-jammy-linux-nvidia-tegra-igx'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-nvidia-tegra-igx-v2 verification-needed-jammy-linux-nvidia-tegra-igx
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-nvidia-tegra-5.15/5.15.0-1023.23~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-focal-linux-nvidia-tegra-5.15' to 'verification-done-focal-linux-nvidia-tegra-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-nvidia-tegra-5.15' to 'verification-failed-focal-linux-nvidia-tegra-5.15'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-focal-linux-nvidia-tegra-5.15-v2 verification-needed-focal-linux-nvidia-tegra-5.15
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-nvidia-tegra/5.15.0-1023.23 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to 'verification-done-jammy-linux-nvidia-tegra'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra' to 'verification-failed-jammy-linux-nvidia-tegra'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-nvidia-tegra-v2 verification-needed-jammy-linux-nvidia-tegra
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-nvidia-6.5/6.5.0-1014.14 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-6.5' to 'verification-done-jammy-linux-nvidia-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-6.5' to 'verification-failed-jammy-linux-nvidia-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-nvidia-6.5-v2 verification-needed-jammy-linux-nvidia-6.5
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-gcp-fips/5.15.0-1055.63+fips2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results.

If the problem is solved, change the tag 'verification-needed-jammy-linux-gcp-fips' to 'verification-done-jammy-linux-gcp-fips'. If the problem still exists, change the tag 'verification-needed-jammy-linux-gcp-fips' to 'verification-failed-jammy-linux-gcp-fips'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-gcp-fips-v2 verification-needed-jammy-linux-gcp-fips
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-aws-fips/5.15.0-1056.61+fips1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws-fips' to 'verification-done-jammy-linux-aws-fips'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws-fips' to 'verification-failed-jammy-linux-aws-fips'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-aws-fips-v2 verification-needed-jammy-linux-aws-fips

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]

  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last up to a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]

  * Initially, Intel engineers proposed a workaround that disables LAG initialization [1]. This change has been tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose proper capabilities.

  [Test Plan]

  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster, or after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]

  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that an NVM with this capability will be released at some point. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]

  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface.
  - machine installed by maas. No issues during installation, but at that time the bond is not formed yet; later, when linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping

  ---
  ProblemType: Bug
  AlsaDevices: total 0 crw-rw 1 root
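For anyone verifying a -proposed kernel against this bug, one way to check for the symptom is to scan saved kernel log output for the watchdog message quoted in the report. The sketch below is only an illustration: the regular expression is modeled on that single quoted message, the function names are hypothetical, and the sample lines are made up for the example rather than taken from the attached logs.

```python
import re
from collections import Counter

# Pattern modeled on the one message quoted in the bug report; the exact
# wording printed by other kernel versions is an assumption and may differ.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return an (interface, driver, queue) tuple for each watchdog message."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match["iface"], match["driver"], int(match["queue"])))
    return hits


def count_by_interface(hits):
    """Count watchdog timeouts per network interface."""
    return Counter(iface for iface, _driver, _queue in hits)


# Hypothetical sample input; only the message text itself comes from the report.
sample = [
    "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
    "some unrelated kernel message",
    "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 12 timed out",
]

hits = find_tx_timeouts(sample)
per_interface = count_by_interface(hits)
print(hits)
print(per_interface)
```

In practice the lines would come from a saved copy of the kernel log; an empty result over a long Tempest run would be consistent with the fix working, though it does not prove it, since the report notes that reproduction is intermittent.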
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-aws/5.15.0-1056.61 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws' to 'verification-done-jammy-linux-aws'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws' to 'verification-failed-jammy-linux-aws'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-aws-v2 verification-needed-jammy-linux-aws
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-raspi/5.15.0-1048.51 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-raspi' to 'verification-done-jammy-linux-raspi'. If the problem still exists, change the tag 'verification-needed-jammy-linux-raspi' to 'verification-failed-jammy-linux-raspi'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-raspi-v2 verification-needed-jammy-linux-raspi
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-kvm/5.15.0-1052.57 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-kvm' to 'verification-done-jammy-linux-kvm'. If the problem still exists, change the tag 'verification-needed-jammy-linux-kvm' to 'verification-failed-jammy-linux-kvm'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-kvm-v2 verification-needed-jammy-linux-kvm
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-oracle/5.15.0-1053.59 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-oracle' to 'verification-done-jammy-linux-oracle'. If the problem still exists, change the tag 'verification-needed-jammy-linux-oracle' to 'verification-failed-jammy-linux-oracle'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-oracle-v2 verification-needed-jammy-linux-oracle
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-intel-iotg/5.15.0-1050.56 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-intel-iotg' to 'verification-done-jammy-linux-intel-iotg'. If the problem still exists, change the tag 'verification-needed-jammy-linux-intel-iotg' to 'verification-failed-jammy-linux-intel-iotg'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-intel-iotg-v2 verification-needed-jammy-linux-intel-iotg
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug was fixed in the package linux - 5.15.0-100.110

---------------
linux (5.15.0-100.110) jammy; urgency=medium

  * jammy/linux: 5.15.0-100.110 -proposed tracker (LP: #2052616)

  * i915 regression introduced with 5.5 kernel (LP: #2044131)
    - drm/i915: Skip some timing checks on BXT/GLK DSI transcoders

  * Audio balancing setting doesn't work with the cirrus codec (LP: #2051050)
    - ALSA: hda/cs8409: Suppress vmaster control for Dolphin models

  * partproke is broken on empty loopback device (LP: #2049689)
    - block: Move checking GENHD_FL_NO_PART to bdev_add_partition()

  * CVE-2023-0340
    - vhost: use kzalloc() instead of kmalloc() followed by memset()

  * CVE-2023-51780
    - atm: Fix Use-After-Free in do_vcc_ioctl

  * CVE-2023-6915
    - ida: Fix crash in ida_free when the bitmap is empty

  * CVE-2024-0646
    - net: tls, update curr on splice as well

  * CVE-2024-0565
    - smb: client: fix OOB in receive_encrypted_standard()

  * CVE-2023-51781
    - appletalk: Fix Use-After-Free in atalk_ioctl

  * Jammy update: v5.15.143 upstream stable release (LP: #2050858)
    - vdpa/mlx5: preserve CVQ vringh index
    - hrtimers: Push pending hrtimers away from outgoing CPU earlier
    - i2c: designware: Fix corrupted memory seen in the ISR
    - netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
    - tg3: Move the [rt]x_dropped counters to tg3_napi
    - tg3: Increment tx_dropped in tg3_tso_bug()
    - kconfig: fix memory leak from range properties
    - drm/amdgpu: correct chunk_ptr to a pointer to chunk.
    - platform/x86: asus-wmi: Adjust tablet/lidflip handling to use enum
    - platform/x86: asus-wmi: Add support for ROG X13 tablet mode
    - platform/x86: asus-wmi: Simplify tablet-mode-switch probing
    - platform/x86: asus-wmi: Simplify tablet-mode-switch handling
    - platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
    - of: dynamic: Fix of_reconfig_get_state_change() return value documentation
    - platform/x86: wmi: Allow duplicate GUIDs for drivers that use struct wmi_driver
    - platform/x86: wmi: Skip blocks with zero instances
    - ipv6: fix potential NULL deref in fib6_add()
    - octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
    - octeontx2-af: Check return value of nix_get_nixlf before using nixlf
    - hv_netvsc: rndis_filter needs to select NLS
    - r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
    - r8152: Add RTL8152_INACCESSIBLE checks to more loops
    - r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
    - r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
    - r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
    - mlxbf-bootctl: correctly identify secure boot with development keys
    - platform/mellanox: Add null pointer checks for devm_kasprintf()
    - platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
    - arcnet: restoring support for multiple Sohard Arcnet cards
    - net: stmmac: fix FPE events losing
    - octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
    - i40e: Fix unexpected MFS warning message
    - net: bnxt: fix a potential use-after-free in bnxt_init_tc
    - ionic: fix snprintf format length warning
    - ionic: Fix dim work handling in split interrupt mode
    - ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
    - net: hns: fix fake link up on xge port
    - octeontx2-af: Update Tx link register range
    - netfilter: nf_tables: validate family when identifying table via handle
    - netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
    - tcp: do not accept ACK of bytes we never sent
    - bpf: sockmap, updating the sg structure should also update curr
    - psample: Require 'CAP_NET_ADMIN' when joining "packets" group
    - net: add missing kdoc for struct genl_multicast_group::flags
    - drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
    - tee: optee: Fix supplicant based device enumeration
    - RDMA/hns: Fix unnecessary err return when using invalid congest control algorithm
    - RDMA/irdma: Do not modify to SQD on error
    - RDMA/irdma: Add wait for suspend on SQD
    - arm64: dts: rockchip: Expand reg size of vdec node for RK3399
    - RDMA/rtrs-srv: Do not unconditionally enable irq
    - RDMA/rtrs-clt: Start hb after path_up
    - RDMA/rtrs-srv: Check return values while processing info request
    - RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
    - RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
    - RDMA/rtrs-clt: Fix the max_send_wr setting
    - RDMA/rtrs-clt: Remove the warnings for req in_use check
    - RDMA/bnxt_re: Correct module description string
    - hwmon: (acpi_power_meter) Fix 4.29 MW bug
    - hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
    - ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
    - RDMA/core: Fix umem iterator
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-hwe-6.5/6.5.0-25.25~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-hwe-6.5' to 'verification-done-jammy-linux-hwe-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-hwe-6.5' to 'verification-failed-jammy-linux-hwe-6.5'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-hwe-6.5-v2 verification-needed-jammy-linux-hwe-6.5

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Released
Status in linux source package in Noble: In Progress

Bug description:

  [Impact]
  * The issue causes a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]
  * Initially, Intel engineers proposed a workaround that disables LAG initialization [1]. This change was tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

  [Test Plan]
  * To reproduce the issue, a cluster of more than 20 nodes was used with Ceph-based storage. The problem could sometimes manifest while deploying a cluster, or after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
  4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
  - machine installed by maas. No issues during installation, but at that time the bond is not formed yet; later, when linux is booted, the bond is formed and works without issues for a while
  - it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
  - after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping

  ---
  ProblemType: Bug
  AlsaDevices: total 0 crw-rw 1 root audio
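When checking a host for the symptom described in the report, the kernel log can be scanned for the quoted watchdog message. The following is only a minimal sketch: the function name `find_tx_timeouts` is hypothetical, and the pattern is modeled solely on the one message quoted in the description ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"). Other kernel versions may word the message differently, so the pattern may need adjusting.

```python
import re

# Pattern modeled on the single message quoted in the bug description:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# Assumption: other kernels may format this line differently.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text, driver="ice"):
    """Return (interface, queue) pairs for watchdog timeouts of one driver."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


if __name__ == "__main__":
    # The sample line reuses the message text from the report.
    sample = "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
    print(find_tx_timeouts(sample))
```

The log text itself could come from any source (for example a saved copy of the kernel log); how it is collected is left out of the sketch on purpose.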
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-aws/6.5.0-1015.15 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-aws' to 'verification-done-mantic-linux-aws'. If the problem still exists, change the tag 'verification-needed-mantic-linux-aws' to 'verification-failed-mantic-linux-aws'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-mantic-linux-aws-v2 verification-needed-mantic-linux-aws

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Released
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-azure/6.5.0-1016.16 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-azure' to 'verification-done-mantic-linux-azure'. If the problem still exists, change the tag 'verification-needed-mantic-linux-azure' to 'verification-failed-mantic-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-mantic-linux-azure-v2 verification-needed-mantic-linux-azure

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Released
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Tags removed: verification-needed-mantic-linux
** Tags added: verification-done-mantic-linux

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
LP update:

Mantic update:
Due to the lack of a reproduction environment, I performed the following regression test:

1. Setup:
   nic: 2-port E810-C, both interfaces set up in bonding
   kernel: 6.5.0-25-generic

2. Test cases:
   0) verified that the code from the change is used during driver init
   a) stress traffic for 12h (multiple streams of iperf (tcp))
   b) interface (if) up/down during stress traffic
   c) driver reload during stress traffic
   Look for any issues related to traffic processing, look for tx_hangs

3. Result:
   No issues were detected during test execution.

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
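The stress-traffic step of the regression test above ("multiple streams of iperf (tcp)" for 12h) could be driven by a command line built as in the following sketch. The report does not state which options, server address, or number of streams were actually used, so everything here is an assumption for illustration: `iperf_client_command` is a hypothetical helper, `192.0.2.10` is a documentation address, and 8 streams is an arbitrary choice. Only the standard iperf client options `-c` (server), `-P` (parallel streams) and `-t` (duration in seconds) are used.

```python
import shlex


def iperf_client_command(server, hours, streams, binary="iperf"):
    """Build an iperf client command for a long multi-stream TCP run.

    TCP is iperf's default protocol, so no protocol flag is added.
    """
    if hours <= 0 or streams <= 0:
        raise ValueError("hours and streams must be positive")
    seconds = int(hours * 3600)
    return [binary, "-c", server, "-P", str(streams), "-t", str(seconds)]


if __name__ == "__main__":
    # Hypothetical values: server address and stream count are not from the report.
    cmd = iperf_client_command("192.0.2.10", hours=12, streams=8)
    print(shlex.join(cmd))
```

Returning the command as a list keeps it suitable for passing to a process launcher without shell quoting concerns; `shlex.join` is used only for display.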
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux-ibm-gt-fips/5.15.0-1055.58+fips1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-ibm-gt-fips' to 'verification-done-jammy-linux-ibm-gt-fips'. If the problem still exists, change the tag 'verification-needed-jammy-linux-ibm-gt-fips' to 'verification-failed-jammy-linux-ibm-gt-fips'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-ibm-gt-fips-v2 verification-needed-jammy-linux-ibm-gt-fips

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Hi Roxana,

Mantic verification is still not finished. I did some touch tests without stress traffic. I'm trying to get my hands on an E810 device to finish testing; I'll update the ticket once it's done. Wishful ETA: EOW 09.

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Hi Robert!

Thanks for testing this on jammy. I marked the tag as verified ('verification-done-jammy-linux') to reflect that.

Could you share the results from mantic? We need to release this next week and we need confirmation that this works as expected. If your test looks fine, please remove the 'verification-needed-mantic-linux' tag and add 'verification-done-mantic-linux'.

Thanks!

** Tags removed: verification-needed-jammy-linux
** Tags added: verification-done-jammy-linux

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Jammy update:

Due to the lack of a reproduction environment, I have been performing the following regression test:

1. Setup:
   nic: 2-port E810-XXV, both interfaces set up in bonding
   kernel: 5.15.0-100-generic
2. Test cases:
   0) verified that code from the change is used during driver init
   a) stress traffic for 48h (multiple streams of iperf (tcp))
   b) if up/down during stress traffic
   c) pf reset during stress traffic
   Look for any issues related to traffic processing, look for tx_hangs
3. Result:
   No issues have been detected during test execution

Mantic tests in progress.
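The regression test above looks for tx_hangs, and the original report quotes the kernel log line that signals one ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"). As a rough aid for anyone repeating that check, here is a minimal sketch that scans kernel log text for such lines. It assumes the log is available as plain text lines (for example, saved `dmesg` output); the function names `find_tx_timeouts` and `summarize` are hypothetical and not part of any tool mentioned in this bug.

```python
import re
from collections import Counter

# Matches the watchdog line as quoted in this bug report, e.g.
# "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out".
# The pattern is not anchored, so a timestamp prefix or trailing text is fine.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines, driver="ice"):
    """Return a list of (interface, queue) pairs for watchdog timeouts
    reported by the given driver (hypothetical helper)."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


def summarize(hits):
    """Count timeouts per interface (hypothetical helper)."""
    return Counter(iface for iface, _queue in hits)


# Small demonstration on in-memory sample lines; only the first line is
# taken from this report, the others are made up for illustration.
sample_log = [
    "[12345.678] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
    "[12346.000] bond0: link status definitely down for interface enp161s0f0",
    "[12347.000] NETDEV WATCHDOG: eth9 (igb): transmit queue 3 timed out",
]
for iface, count in summarize(find_tx_timeouts(sample_log)).items():
    print(f"{iface}: {count} TX timeout(s)")
```

To run it against a real log, the same function can be fed the lines of a saved log file, e.g. `find_tx_timeouts(open("dmesg.txt"))`, where `dmesg.txt` is whatever file the kernel log was saved to. An empty result only means no watchdog line was logged; it does not prove the hang cannot occur.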
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux/6.5.0-25.25 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux' to 'verification-done-mantic-linux'. If the problem still exists, change the tag 'verification-needed-mantic-linux' to 'verification-failed-mantic-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-mantic-linux-v2 verification-needed-mantic-linux
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Houston, we have a problem...

This bug is notoriously difficult to reproduce. The only environment that presented it is now in production and will not be available for testing anymore. This means that the fix cannot be tested unless someone can suggest a new way of reproducing the bug.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This bug is awaiting verification that the linux/5.15.0-100.110 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux' to 'verification-done-jammy-linux'. If the problem still exists, change the tag 'verification-needed-jammy-linux' to 'verification-failed-jammy-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

** Tags added: kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Yes, HWE kernels based on the Jammy/Mantic/Noble kernels should get this fix automatically when the GA versions get released.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Thanks a lot, Heitor!

With no mention of a new package fixing this, I did not correlate that to any patch to the kernel. Will this be fixed in the HWE kernel as well, then?
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@christian-rhomann "Fix committed" here means that the patches have been merged into Ubuntu's kernel tree for that specific release. The patch Robert submitted is the one from upstream, not the test patch from the comments here. E.g. for Jammy: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?h=master-next&id=fc26d7737e3a

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
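For anyone verifying a kernel against this bug, the symptom to look for is the log message quoted in the bug description ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"). The following is a minimal sketch of how saved kernel log text could be scanned for it. The regular expression is modeled only on that one quoted line, and the function name `find_tx_timeouts` is hypothetical, not part of any Ubuntu or Intel tooling.

```python
import re

# Pattern modeled on the single message quoted in this bug report:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# Other kernels may append extra fields after "timed out"; search() tolerates that.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text, driver="ice"):
    """Return (interface, queue) pairs for TX timeouts reported by `driver`."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits
```

The text to scan could come from a saved `dmesg` or journal dump. An empty result only means that this particular message was not logged in the captured window.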
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Changed in: linux (Ubuntu Mantic)
   Status: In Progress => Fix Committed

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: Fix Committed
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Robert thanks for keeping this bug alive and updated!

1) More debug info required?
@Robert, reading your post https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/50 again, I am wondering if you asked me to provide more debug info with NVM 4.4 on my E810 NICs? Would this help in any way?

2) @smb changed this bug to "Fix Committed" for Jammy. Is this really the correct state? As @Andre said in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/41, just manually commenting out some lines in the ice kernel module is "not a fix".

3) Will the two "fixes" you referred to in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/52 make it to any kernel other than 6.8, either by Intel or by Ubuntu applying them there? Otherwise, I am wondering if and when 6.8, once out, will be made available as HWE for Jammy.

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Switching status for Noble to In Progress. The target kernel for Noble is 6.8 (which includes the fix), but it is not out yet; the status will be changed once 6.8 is introduced.

** Changed in: linux (Ubuntu Noble)
   Status: Invalid => In Progress

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Fix already included in 6.8

** Changed in: linux (Ubuntu Noble)
   Status: In Progress => Invalid

Status in linux package in Ubuntu: Invalid
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: Invalid
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Hey Christian,

Intel proposed change [1], which targets this problem, and based on our testing it does in fact solve it. This change is currently being added to Ubuntu kernels.

I'm also keeping an eye on [2], but right now I don't yet see a "business need" to incorporate it into the Ubuntu kernel. That patch further limits the problematic part of the code by adding (in addition to the NVM caps check) verification based on the DDP package.

[1] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
[2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20240122/039100.html

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress
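For readers following the two Intel patches referenced in the message above, here is a minimal Python model of the gating logic as this thread describes it: LAG initialization is skipped unless the NVM advertises the capability, and the second patch adds a DDP package check on top. This is only a sketch of the decision and not driver code. The ice driver itself is written in C, and every name below (`AdapterState`, `should_init_lag`, the field names) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdapterState:
    # Hypothetical fields; the real driver tracks capabilities differently.
    nvm_reports_sriov_lag: bool      # NVM exposes the SR-IOV LAG capability
    ddp_supports_lag: bool = True    # loaded DDP package supports LAG


def should_init_lag(state, check_ddp=False):
    """Decide whether LAG support should be initialized.

    check_ddp=False models the first patch (NVM capability check only);
    check_ddp=True models the additional DDP-based check of the second patch.
    """
    if not state.nvm_reports_sriov_lag:
        return False
    if check_ddp and not state.ddp_supports_lag:
        return False
    return True
```

Under this model, an adapter whose NVM does not expose the capability (the bug description says this is the case for CVL4.2 and older images) never reaches LAG initialization, which matches the effect the workaround achieved by disabling it unconditionally.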
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Stefan Could you kindly elaborate on the "Fix Committed"? Was there any change to the kernel that would fix this issue? Is this fixed with the 4.40 NVM from Intel? Reading Robert's post (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/50) again, it seems that he is only guessing that something was fixed by Intel.

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Changed in: linux (Ubuntu Jammy)
   Status: In Progress => Fix Committed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: Fix Committed
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress

Bug description:

[Impact]
 * The issue causes a transmit hang on E810 ports with bonding enabled.
 * Based on the provided logs, a TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
 * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

[Fix]
 * Initially, Intel engineers proposed a workaround that disables LAG initialization [1]. This change has been tested in an environment where reproduction is easily achieved. After multiple iterations, no reproduction has been observed.
 * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

[Test Plan]
 * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying the cluster, or after the cluster was already deployed, during the Tempest test run.
 * The issue could appear on a random node, making reproduction hard to achieve.
 * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

[Where problems could occur]
 * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
 * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that an NVM with this capability will be released at some point. Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

[1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
[2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html 4d50fcdc2476eef94c14c6761073af5667bb43b6

[Other Info]
 * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from the mainline kernel from before patch [2] was added.
 * Original description of the case below:

I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

Details:
- hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
- tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
- using a bond over the two ports of the same card, at 25Gbps to two different switches; the bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
- machine installed by maas. No issues during installation, but at that time the bond is not formed yet; later, when linux is booted, the bond is formed and works without issues for a while
- it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
- one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
- after some time you can see kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
- the switch does log that the bond is flapping

---
ProblemType: Bug
AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 12 20:05 seq crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckResult: pass
CloudArchitecture: x86_64
CloudID: none
CloudName: none
CloudPlatform: none
CloudSubPlatform: config
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-08-22 (24 days ago)
InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release
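
For anyone triaging a node against the symptom in the original description, the watchdog message quoted there can be counted per interface and queue. The following is only a minimal sketch: the line format is taken from the single message quoted in the report, and the function name `count_tx_timeouts` is hypothetical. Collecting the log lines (for example from the kernel log) is left to the caller.

```python
import re
from collections import Counter

# Pattern modelled on the message quoted in the report:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# Anything the kernel prints after "timed out" is ignored.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_tx_timeouts(log_lines, driver="ice"):
    """Count watchdog TX timeouts per (interface, queue) for one driver."""
    hits = Counter()
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            key = (match.group("iface"), int(match.group("queue")))
            hits[key] += 1
    return hits


if __name__ == "__main__":
    sample = [
        "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
        "some unrelated kernel message",
    ]
    for (iface, queue), n in count_tx_timeouts(sample).items():
        print(f"{iface} queue {queue}: {n} timeout(s)")
```

A count that keeps growing on one leg of the bond would match the "one of the legs of the bond freezes" behaviour described above, but the sketch itself makes no claim about the cause.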
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Also affects: linux (Ubuntu Noble)
   Importance: Medium
   Assignee: Robert Malz (rmalz)
   Status: In Progress

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: In Progress
Status in linux source package in Mantic: In Progress
Status in linux source package in Noble: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Changed in: linux (Ubuntu)
   Status: Invalid => Confirmed

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
   Assignee: (unassigned) => Robert Malz (rmalz)

** Changed in: linux (Ubuntu Jammy)
   Assignee: (unassigned) => Robert Malz (rmalz)

** Changed in: linux (Ubuntu Mantic)
   Assignee: (unassigned) => Robert Malz (rmalz)

Status in linux package in Ubuntu: In Progress
Status in linux source package in Jammy: In Progress
Status in linux source package in Mantic: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Mantic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
   Status: New => In Progress

** Changed in: linux (Ubuntu Mantic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Mantic)
   Status: New => In Progress

** Changed in: linux (Ubuntu)
   Status: Confirmed => Invalid

Status in linux package in Ubuntu: Invalid
Status in linux source package in Jammy: In Progress
Status in linux source package in Mantic: In Progress
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Hey @Christian,

1a) No need. AQ 0x000A returns the NVM capabilities regardless of the configuration applied (it is issued during driver init).

1b) That's the point: I noticed you upgraded to 4.3, which I currently don't have access to, and I wanted to verify the capabilities on 4.3. NVM caps should be similar for the same NVM version within a single head of family, so the values I have access to would be the same as the ones you had on 4.2 (meaning there is no point in collecting those).

1c) No, any "recent" kernel/driver version supports enabling debug logs by adding the dyndbg=+p param to the module. We only care about the logs that are retrieved from the NVM and printed with debug flags.

2) Based on recent patches from Intel, the issue is caused by performing LAG-related operations without proper support from the NVM. Release notes do not always list every feature change, so there is a possibility that 4.4 introduced the sriov_lag capability, but I cannot verify it.

Worst case scenario: NVM 4.4 introduces the sriov_lag capability, so the patches recently added to the upstream kernel have no effect, and the issue still reproduces. In this scenario there is currently no 'workaround' for it.

Best case scenario: NVM 4.4 introduces the sriov_lag capability and the issue no longer reproduces. In this scenario no additional patches to the driver are required.

Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Robert, first of all, thanks a lot for pursuing this issue!

1) I can certainly provide the debugging info. May I ask whether ...
 a) the system in question needs to have an active LAG (LACP) for this to be helpful? We switched to active-backup on all our machines because of this very issue.
 b) this requires FW version 4.20? All our machines currently run 4.30 already.
 c) this requires a certain kernel / ice driver version?

2) FW 4.40 is out now [1]. There seem to be no fixes related to LAG / LACP, though there are some regarding SR-IOV. But I guess you are convinced the issue is not within the FW, but rather in the ice driver?

[1] - https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series.html

Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Christian,

Can you verify the device capabilities returned from 0x000A, looking for SRIOV LAG? I have attached a script, "parse_aq_0xA.py". You need to load the driver with dyndbg=+p and replace the buffer in the script.

Note: the buffer has to come from CQ CMD: opcode 0x000A

Expected result:
  (...)
  resp cap: 0x92 -- this is the capability we are looking for
  resp maj_ver: 0x1
  resp min_ver: 0x0
  resp number: 0x1 -- this is the value we want to check
  resp logical_id: 0x0
  resp phys_id: 0x0

If it is set to 0x1, patch [1] will disable the LAG handler and simplify the initialization logic to something like what is shown in comment #40.

The buffer available in the script comes from a CVL4.20 NVM.

[1] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html

Status in linux package in Ubuntu: Confirmed
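
The check requested above can be sketched roughly as follows. This is not the attached parse_aq_0xA.py, whose contents are not shown in this thread. It assumes each capability element in the AQ 0x000A response is a 32-byte little-endian record (16-bit cap, 8-bit major and minor version, then 32-bit number, logical_id and phys_id, followed by 16 reserved bytes), and the names `CAP_ELEM`, `find_cap` and `format_elem` are hypothetical. Only the capability ID 0x92 and the printed field names come from the comment.

```python
import struct
from typing import Iterator, NamedTuple, Optional

# ASSUMPTION: layout of one capability element in the 0x000A response:
# u16 cap, u8 maj_ver, u8 min_ver, u32 number, u32 logical_id,
# u32 phys_id, 16 reserved bytes; little-endian, 32 bytes in total.
CAP_ELEM = struct.Struct("<HBBIII16x")
LAG_CAP_ID = 0x92  # capability ID named in the comment


class CapElem(NamedTuple):
    cap: int
    maj_ver: int
    min_ver: int
    number: int
    logical_id: int
    phys_id: int


def iter_caps(buf: bytes) -> Iterator[CapElem]:
    """Yield every complete capability element found in buf."""
    usable = len(buf) - len(buf) % CAP_ELEM.size  # drop a trailing partial element
    for fields in CAP_ELEM.iter_unpack(buf[:usable]):
        yield CapElem(*fields)


def find_cap(buf: bytes, cap_id: int = LAG_CAP_ID) -> Optional[CapElem]:
    """Return the first element whose cap field equals cap_id, if any."""
    for elem in iter_caps(buf):
        if elem.cap == cap_id:
            return elem
    return None


def format_elem(elem: CapElem) -> str:
    """Render an element using the field names from the expected result."""
    return "\n".join(f"resp {name}: {hex(value)}"
                     for name, value in elem._asdict().items())


if __name__ == "__main__":
    # Synthetic buffer for illustration only, not data read from a real NVM.
    buf = CAP_ELEM.pack(0x92, 0x1, 0x0, 0x1, 0x0, 0x0)
    elem = find_cap(buf)
    print(format_elem(elem) if elem else "capability 0x92 not reported")
```

With a real buffer, the value to compare against the comment is the `number` field of the 0x92 element.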
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Script to verify AQ 0x000A capabilities

** Attachment added: "parse_aq_0xA.py"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+attachment/5736421/+files/parse_aq_0xA.py

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]
  * Issue is causing a transmit hang on E810 ports with bonding enabled.
  * Based on the provided logs, the TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
  * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.

  [Fix]
  * Initially, a workaround has been proposed by Intel engineers to disable LAG initialization [1].
  This change has been tested in an environment where reproduction is easily achieved.
  After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.

  [Test Plan]
  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster or after the cluster was already deployed during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.

  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that an NVM with this capability will be released at some point.
  Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * Issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel from before patch [2] was added.
  * Original description of the case below:

  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface.
  - machine installed by maas.
No issues during installation, but at that time bond is not formed yet, later when linux is booted, the bond is formed and works without issues for a while - it works for about 2 to 3 hours fine, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing) - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out, ping to random external hosts start losing every second packet - after some time you can see on the kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace - the switch does log that the bond is flapping --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 12 20:05 seq crw-rw 1 root audio 116, 33 Sep 12 20:05 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CRDA: N/A CasperMD5CheckResult: pass CloudArchitecture: x86_64 CloudID: none CloudName: none CloudPlatform: none CloudSubPlatform: config DistroRelease: Ubuntu 22.04 InstallationDate: Installed on 2023-08-22 (24 days ago) InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810) IwConfig: Error: [Errno 2] No such file or
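For readers who cannot open the attachment, the kind of check it performs might look roughly like the sketch below. This is not the attached parse_aq_0xA.py. It assumes the raw response buffer of admin queue command 0x000A is already available as bytes, that each capability element is 32 bytes long with the layout noted in the comments, and that the LAG capability ID and bit values are the ones recalled from patch [2]. All of these should be checked against the ice driver source before relying on the result.

```python
import struct

# Assumed element layout (32 bytes, little-endian):
# cap (u16), major_ver (u8), minor_ver (u8), number (u32),
# logical_id (u32), phys_id (u32), two reserved u64 fields.
CAP_ELEM = struct.Struct("<HBBIIIQQ")

# Assumed values, recalled from patch [2]; verify against the kernel source.
CAP_FW_LAG_SUPPORT = 0x0092
BIT_ROCEV2_LAG = 0x01
BIT_SRIOV_LAG = 0x02


def parse_caps(buf: bytes) -> dict:
    """Map capability ID -> 'number' field for each complete element in buf."""
    caps = {}
    # Ignore a trailing partial element, if any.
    usable = len(buf) - len(buf) % CAP_ELEM.size
    for fields in CAP_ELEM.iter_unpack(buf[:usable]):
        cap, number = fields[0], fields[3]
        caps[cap] = number
    return caps


def lag_support(buf: bytes) -> dict:
    """Report which LAG features the capability list advertises."""
    number = parse_caps(buf).get(CAP_FW_LAG_SUPPORT, 0)
    return {
        "roce_lag": bool(number & BIT_ROCEV2_LAG),
        "sriov_lag": bool(number & BIT_SRIOV_LAG),
    }
```

If the capability element is absent, both flags come back False, which would correspond to the case where patch [2] skips LAG initialization.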
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Description changed:

+ [Impact]
+ * Issue is causing a transmit hang on E810 ports with bonding enabled.
+ * Based on the provided logs, the TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
+ * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.
+ 
+ [Fix]
+ * Initially, a workaround has been proposed by Intel engineers to disable LAG initialization [1].
+ This change has been tested in an environment where reproduction is easily achieved.
+ After multiple iterations, no reproduction has been observed.
+ * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose the proper capabilities.
+ 
+ [Test Plan]
+ * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster or after the cluster was already deployed during the Tempest test run.
+ * The issue could appear on a random node, making reproduction hard to achieve.
+ * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.
+ 
+ [Where problems could occur]
+ * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
+ * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that an NVM with this capability will be released at some point.
+ Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.
+ 
+ [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
+ [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html 4d50fcdc2476eef94c14c6761073af5667bb43b6
  
- I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.
+ [Other Info]
+ * Issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel from before patch [2] was added.
+ * Original description of the case below:
+ 
+ 
+ 
+ I'm having issues with an Intel E810-XXV card on a Dell server under
+ Ubuntu Jammy.
  
  Details:
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface.
  - machine installed by maas.
No issues during installation, but at that time bond is not formed yet, later when linux is booted, the bond is formed and works without issues for a while - it works for about 2 to 3 hours fine, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing) - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out, ping to random external hosts start losing every second packet - after some time you can see on the kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace - the switch does log that the bond is flapping - --- + --- ProblemType: Bug AlsaDevices: - total 0 - crw-rw 1 root audio 116, 1 Sep 12 20:05 seq - crw-rw 1 root audio 116, 33 Sep 12 20:05 timer + total 0 + crw-rw 1 root audio 116, 1 Sep 12 20:05 seq + crw-rw 1 root audio 116, 33 Sep 12 20:05 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CRDA: N/A CasperMD5CheckResult: pass CloudArchitecture: x86_64 CloudID: none CloudName: none CloudPlatform: none CloudSubPlatform: config DistroRelease: Ubuntu 22.04 InstallationDate: Installed on 2023-08-22 (24 days ago) InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810) IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. PowerEdge R7515 Package: linux (not installed) PciMultimedia: - + ProcFB: 0 mgag200drmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic
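Because the hang can appear on a random node hours after deployment, it may help to scan kernel logs across the cluster for the watchdog message quoted in the description. The following is a minimal sketch, assuming the log text has already been collected as lines. The function names are illustrative, and the pattern is based only on the message format shown in this report.

```python
import re
from collections import Counter

# Matches e.g.
# "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines, driver="ice"):
    """Return (interface, queue) pairs for watchdog timeouts of one driver."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


def count_by_iface(hits):
    """Count timeouts per interface to spot the bond leg that freezes."""
    return Counter(iface for iface, _queue in hits)
```

The input could come from `dmesg` or `journalctl -k` output saved on each node. Other lines are ignored, so the whole log can be passed in unfiltered.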
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
FWIW, we updated our NICs to 4.30 as they were individually purchased and not part of pre-built servers and also have this issue. So in essence the issue also exists with the latest firmware. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/2036239 Title: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out Status in linux package in Ubuntu: Confirmed Bug description: I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy. Details: - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02) - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results. - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface. - machine installed by maas. 
No issues during installation, but at that time bond is not formed yet, later when linux is booted, the bond is formed and works without issues for a while - it works for about 2 to 3 hours fine, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing) - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out, ping to random external hosts start losing every second packet - after some time you can see on the kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace - the switch does log that the bond is flapping --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 12 20:05 seq crw-rw 1 root audio 116, 33 Sep 12 20:05 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CRDA: N/A CasperMD5CheckResult: pass CloudArchitecture: x86_64 CloudID: none CloudName: none CloudPlatform: none CloudSubPlatform: config DistroRelease: Ubuntu 22.04 InstallationDate: Installed on 2023-08-22 (24 days ago) InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810) IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. 
PowerEdge R7515 Package: linux (not installed) PciMultimedia: ProcFB: 0 mgag200drmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116 RelatedPackageVersions: linux-restricted-modules-5.15.0-83-generic N/A linux-backports-modules-5.15.0-83-generic N/A linux-firmware 20220329.git681281e4-0ubuntu3.18 RfKill: Error: [Errno 2] No such file or directory: 'rfkill' Tags: jammy uec-images Uname: Linux 5.15.0-83-generic x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: N/A _MarkForUpload: True dmi.bios.date: 07/27/2023 dmi.bios.release: 2.12 dmi.bios.vendor: Dell Inc. dmi.bios.version: 2.12.4 dmi.board.name: 0J91V2 dmi.board.vendor: Dell Inc. dmi.board.version: A01 dmi.chassis.type: 23 dmi.chassis.vendor: Dell Inc. dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515: dmi.product.family: PowerEdge dmi.product.name: PowerEdge R7515 dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515 dmi.sys.vendor: Dell Inc. 
--- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 15 03:13 seq crw-rw 1 root audio 116, 33 Sep 15 03:13 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied Cannot stat file /proc/323635/fd/10: Permission denied CRDA: N/A CasperMD5CheckResult: unknown CloudArchitecture: x86_64 CloudID: maas CloudName: maas CloudPlatform: maas CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/) DistroRelease: Ubuntu 22.04 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. PowerEdge R7525
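When trying to reproduce the problem, it is worth confirming that a test host really uses the bond settings from the report (LACP, layer3+4 hashing, fast timeout). Here is a sketch that reads the bonding driver's sysfs attributes. The bond name passed in is hypothetical, and the `sysfs_root` parameter exists only so the function can be exercised against a scratch directory.

```python
from pathlib import Path

# Settings described in the report: LACP (802.3ad), layer3+4 hash, fast timeout.
EXPECTED = {
    "mode": "802.3ad",
    "xmit_hash_policy": "layer3+4",
    "lacp_rate": "fast",
}


def read_bond_settings(bond, sysfs_root="/sys/class/net"):
    """Read selected bonding attributes for one bond interface."""
    base = Path(sysfs_root) / bond / "bonding"
    settings = {}
    for key in EXPECTED:
        # Each file holds "<name> <numeric id>", e.g. "802.3ad 4".
        fields = (base / key).read_text().split()
        settings[key] = fields[0] if fields else ""
    settings["slaves"] = (base / "slaves").read_text().split()
    return settings


def mismatches(settings):
    """Return the attributes that differ from the reported configuration."""
    return {
        key: settings.get(key)
        for key, wanted in EXPECTED.items()
        if settings.get(key) != wanted
    }
```

For example, `mismatches(read_bond_settings("bond0"))` returns an empty dict when the host matches the reported setup, with `bond0` standing in for whatever the bond is called locally.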
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Yeah, I knew about the 4.30 update on Intel's website, but it is not available in Dell's tools yet, and the customer did not want to (potentially) void their warranty, so I did not try it. That is something to keep in mind while we debug this.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Andre, I successfully installed it on the machines with the NVMUpdate64 tool. That is on HPE machines.

$ sudo ./nvmupdate64e

Intel(R) Ethernet NVM Update Tool
NVMUpdate version 1.39.56.8
Copyright(C) 2013 - 2023 Intel Corporation.

WARNING: To avoid damage to your device, do not stop the update or reboot or power off the system during this update.
Inventory in progress. Please wait [**+...]

Num Description                          Ver.(hex)   DevId S:B    Status
=== ==================================== =========== ===== ====== ==============
01) Intel(R) Ethernet Network Adapter    4.48(4.30)   159B 00:016 Up to date
    E810-XXV-2
02) Intel(R) Ethernet Network Adapter    N/A(N/A)     1521 00:072 Update not
    I350-T4 for OCP NIC 3.0                                       available

Tool execution completed with the following status: An error occurred accessing the device.
Press any key to exit.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
@Bartosz

$ ethtool -i enp65s0f0 |grep firmware-version
firmware-version: 4.20 0x8001784b 22.0.9

This is the latest firmware supported by Dell. You will find 4.30 available on Intel's website, but it is not available yet through Dell's firmware tools.
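To compare firmware levels across many nodes, the `firmware-version` line from `ethtool -i` can be parsed mechanically. The sketch below makes two assumptions. It treats the first field as a hexadecimal major.minor NVM version, which fits the NVMUpdate inventory in this thread listing 4.48(4.30) under "Ver.(hex)". It also labels the third field as an option ROM version, which is a guess about the ice driver's output format.

```python
import re

FW_RE = re.compile(
    r"^firmware-version:\s*(?P<nvm>[0-9a-fA-F]+\.[0-9a-fA-F]+)\s+"
    r"(?P<eetrack>0x[0-9a-fA-F]+)\s+(?P<orom>\d+\.\d+\.\d+)\s*$",
    re.MULTILINE,
)


def parse_fw_version(ethtool_output):
    """Extract the NVM version, EETRACK ID and third field, or None."""
    match = FW_RE.search(ethtool_output)
    if match is None:
        return None
    # Assumption: both NVM version components are printed in hexadecimal.
    major, minor = (int(part, 16) for part in match.group("nvm").split("."))
    return {
        "nvm": (major, minor),
        "eetrack": int(match.group("eetrack"), 16),
        "orom": match.group("orom"),  # assumed to be the option ROM version
    }
```

Tuples compare element by element, so `parse_fw_version(out)["nvm"] < (4, 0x30)` is one way to flag hosts still below the 4.30 image discussed here.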
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
What is the card's firmware version? $ ethtool -i |grep firmware-version -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/2036239 Title: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out Status in linux package in Ubuntu: Confirmed Bug description: I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy. Details: - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02) - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results. - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface. - machine installed by maas. No issues during installation, but at that time bond is not formed yet, later when linux is booted, the bond is formed and works without issues for a while - it works for about 2 to 3 hours fine, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing) - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out, ping to random external hosts start losing every second packet - after some time you can see on the kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace - the switch does log that the bond is flapping --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 12 20:05 seq crw-rw 1 root audio 116, 33 Sep 12 20:05 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or 
directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CRDA: N/A CasperMD5CheckResult: pass CloudArchitecture: x86_64 CloudID: none CloudName: none CloudPlatform: none CloudSubPlatform: config DistroRelease: Ubuntu 22.04 InstallationDate: Installed on 2023-08-22 (24 days ago) InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810) IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. PowerEdge R7515 Package: linux (not installed) PciMultimedia: ProcFB: 0 mgag200drmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116 RelatedPackageVersions: linux-restricted-modules-5.15.0-83-generic N/A linux-backports-modules-5.15.0-83-generic N/A linux-firmware 20220329.git681281e4-0ubuntu3.18 RfKill: Error: [Errno 2] No such file or directory: 'rfkill' Tags: jammy uec-images Uname: Linux 5.15.0-83-generic x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: N/A _MarkForUpload: True dmi.bios.date: 07/27/2023 dmi.bios.release: 2.12 dmi.bios.vendor: Dell Inc. dmi.bios.version: 2.12.4 dmi.board.name: 0J91V2 dmi.board.vendor: Dell Inc. dmi.board.version: A01 dmi.chassis.type: 23 dmi.chassis.vendor: Dell Inc. dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515: dmi.product.family: PowerEdge dmi.product.name: PowerEdge R7515 dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515 dmi.sys.vendor: Dell Inc. 
--- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 15 03:13 seq crw-rw 1 root audio 116, 33 Sep 15 03:13 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.5 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied Cannot stat file /proc/323635/fd/10: Permission denied CRDA: N/A CasperMD5CheckResult: unknown CloudArchitecture: x86_64 CloudID: maas CloudName: maas CloudPlatform: maas CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/) DistroRelease: Ubuntu 22.04 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. PowerEdge R7525 NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair Package: linux (not installed) PciMultimedia: ProcFB: 0
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
I have tried this (patches suggested in comment #40) and the problem seems to have gone away. It may be too soon to say but my test scenario (which never gave me a false negative before) finished without issues. Of course this is not a 'fix', so I'm curious to see what the OP has to say about this result.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
1) Andre, after I switched to active-backup the issue is gone (so far). But yeah, we are looking for a reproducer as well. It's hard to narrow down a random issue like this, likely for Intel too.

2) But I just received an email from an Intel developer with a suggested change to the driver to narrow down the issue further. I quote ...

--- cut ---
Could you edit the file (from the kernel source tree base) drivers/net/ethernet/intel/ice/ice_lag.c? Find the functions ice_init_lag() and ice_deinit_lag(), then add the lines "return 0;" and "return;" respectively to the beginning of the functions. The patch would look something like this:

  * Memory will be freed in ice_deinit_lag
  */
 int ice_init_lag(struct ice_pf *pf)
 {
     struct device *dev = ice_pf_to_dev(pf);
     struct ice_lag *lag;
     struct ice_vsi *vsi;
     int err;

+    return 0;
     pf->lag = kzalloc(sizeof(*lag), GFP_KERNEL);
     if (!pf->lag)
         return -ENOMEM;
     lag = pf->lag;
 ………
  * This function is meant to only be called on driver remove/shutdown
  */
 void ice_deinit_lag(struct ice_pf *pf)
 {
     struct ice_lag *lag;

+    return;
     lag = pf->lag;

Then re-build the driver and try to reproduce the problem?
--- cut ---

So in essence I believe this just skips offloading the bonding / LACP to the HW. I will set this up on one or two of our machines to test. Would you please also try this on your systems?
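When comparing runs between 802.3ad and active-backup, or with and without the suggested driver change, it can help to record which bonding mode and leg states a host actually had at the time. The following is a minimal sketch, not part of the thread: it parses the text of the kernel's /proc/net/bonding/<bond> status file. The bond name bond0 in the usage note and the second port name enp161s0f1 are assumptions; only enp161s0f0 appears in the report.

```python
from typing import Dict, Optional, Tuple


def parse_bond_status(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (bonding mode, {slave interface: MII status}) from bonding status text."""
    mode: Optional[str] = None
    slaves: Dict[str, str] = {}
    current: Optional[str] = None  # slave section we are currently inside, if any

    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" line appears before any slave
            # section and is skipped because current is still None.
            slaves[current] = value
    return mode, slaves
```

On a live host this could be called as parse_bond_status(open("/proc/net/bonding/bond0").read()), with bond0 replaced by the real bond name. Note that a leg affected by this bug may still report its MII status as up, so this only documents the configuration under test; it does not detect the hang.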
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Hi Christian,

In my tests, I saw the same issues with active-backup too. Do you know a way to reproduce this issue? I'm having a hard time finding a consistent reproducer. Currently I need to deploy a complete openstack and run a set of load tests on it, and eventually the problem shows up, but it takes many hours and does not happen on all hosts. It would be much easier to have just one machine and trigger the issue in some other way.

Also, this is fixed upstream: some changes between 1.8.x and 1.9.x of the upstream source drivers fixed the problem (they are at 1.12.x now, so it has been fixed for quite a while). The problem is that whatever the fix is, it has not been imported to kernels 5.15 (jammy GA), 6.2 (jammy HWE), 6.5 (mantic GA). I could not reliably test upstream mainline 6.6 because there is no ubuntu currently shipping this package and the pure upstream kernel breaks a lot of stuff in ubuntu.

I mention this because of your post in the intel mailing list. They will probably not be able to help much. Let me know if you find a consistent reproducer.
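Since reproduction takes many hours and does not hit every host, a small log scanner can at least pinpoint when and where the hang occurred. The following is a minimal sketch, not something posted in the thread: the regular expression is modelled on the single message quoted in the bug description ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"), and the names TxTimeout and find_tx_timeouts are hypothetical.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Modelled on the message quoted in the bug description:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
# search() is used later so that timestamps before the message and any
# text after "timed out" do not prevent a match.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


class TxTimeout(NamedTuple):
    iface: str
    driver: str
    queue: int


def find_tx_timeouts(lines: Iterable[str], driver: str = "ice") -> Iterator[TxTimeout]:
    """Yield one TxTimeout per watchdog message logged for the given driver."""
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            yield TxTimeout(
                match.group("iface"), match.group("driver"), int(match.group("queue"))
            )

# Possible use (assumption): iterate over sys.stdin with the kernel log
# piped in, for example from `journalctl -k` or `dmesg`.
```

Run periodically across all hosts in a test deployment, this would show which machines hit the timeout and on which interface and queue, which may help when looking for a pattern.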
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
I ran into this issue on 22.04 LTS (using HWE kernel 6.2) on a 100G dual-port E810 NIC. It also occurs with LACP only; active-backup works without issues.

To bring this more to the attention of the driver devs, I posted to the intel-wired-lan ML: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231120/038096.html
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Removing LACP bonding (using just one interface without any kind of bonding) seemed to help; I'm not seeing the issue anymore. Still testing.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Disabling TSO on both legs of the bond on all hosts did not help. After 2h30min of working well, it happened again.
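For anyone who wants to repeat the TSO experiment, keeping in mind that it did not resolve the problem here, the step can be sketched as follows. This is an illustration under assumptions, not the reporter's actual procedure: it builds `ethtool -K <iface> tso off` command lines for each bond leg. The interface name enp161s0f0 comes from the kernel log in the bug description; enp161s0f1 is an assumed name for the second port, and the helper names are hypothetical.

```python
import subprocess
from typing import List, Sequence

# enp161s0f0 appears in the bug description; enp161s0f1 is an assumption
# for the second port of the same card.
BOND_LEGS = ["enp161s0f0", "enp161s0f1"]


def tso_off_commands(ifaces: Sequence[str]) -> List[List[str]]:
    """Build one `ethtool -K <iface> tso off` argv list per interface."""
    return [["ethtool", "-K", iface, "tso", "off"] for iface in ifaces]


def apply_commands(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print the commands, or run them (requires root) when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(list(argv), check=True)
```

Calling apply_commands(tso_off_commands(BOND_LEGS)) only prints the commands; passing dry_run=False would execute them. The setting does not persist across reboots unless it is also added to the host's network configuration.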
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Got a suggestion to try disabling TSO, which helped in similar cases (same queue timeout error) with the e1000e driver. Will report back soon.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
https://www.mail-archive.com/e1000-de...@lists.sourceforge.net/msg12747.html

similar issue

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  - using a bond over the two ports of the same card, at 25Gbps to two different switches, bond is using LACP with hash layer3+4 and fast timeout. But I believe the bug is not directly related to bonding as the problem seems to be in the interface.
  - machine installed by maas. No issues during installation, but at that time bond is not formed yet, later when linux is booted, the bond is formed and works without issues for a while
  - it works for about 2 to 3 hours fine, then the issue starts (may or may not be related to network load, but it seems that it is triggered by some tests that I run after openstack finishes installing)
  - one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out, ping to random external hosts start losing every second packet
  - after some time you can see on the kernel log messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack trace
  - the switch does log that the bond is flapping

  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116, 1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags: jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116, 1 Sep 15 03:13 seq
   crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:
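
For anyone trying to confirm the same symptom on their own hosts, here is a minimal sketch that scans kernel log text for the watchdog message quoted in the description and counts hits per interface. The regular expression is derived only from the message format shown in this report; the function names `find_tx_timeouts` and `count_by_interface` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Pattern based on the message quoted in this report, e.g.
# "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out".
# Assumption: other kernels keep this prefix; anything after "timed out" is ignored.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return one (interface, driver, queue) tuple per watchdog message."""
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


def count_by_interface(events):
    """Count watchdog events per interface name."""
    return Counter(iface for iface, _driver, _queue in events)


# Two lines copied from the HWE kernel log in this thread.
sample = (
    "[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out\n"
    "[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 "
    "dev_watchdog+0x21f/0x230\n"
)
events = find_tx_timeouts(sample)
print(events)
print(count_by_interface(events))
```

The functions take plain text, so the output of `dmesg` or the contents of an attached CurrentDmesg.txt can be passed in directly.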
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
I have not tested without the bond, but I believe this issue probably is not directly related to the fact that the interface is bonded, which would mean removing the bond will not help. While I will try to test this if possible (depends on customer doing reconfiguration of switch side), I appreciate any suggestion or workaround that could unblock the deployment.
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
I added logs from a machine that I'm not sure was affected (infra01), adding more logs below for the one that is certainly affected (cloud002).

** Description changed:

+ ---
+ ProblemType: Bug
+ AlsaDevices:
+  total 0
+  crw-rw 1 root audio 116, 1 Sep 15 03:13 seq
+  crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
+ AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
+ ApportVersion: 2.20.11-0ubuntu82.5
+ Architecture: amd64
+ ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
+ AudioDevicesInUse:
+  Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
+  Cannot stat file /proc/323635/fd/10: Permission denied
+ CRDA: N/A
+ CasperMD5CheckResult: unknown
+ CloudArchitecture: x86_64
+ CloudID: maas
+ CloudName: maas
+ CloudPlatform: maas
+ CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
+ DistroRelease: Ubuntu 22.04
+ IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
+ MachineType: Dell Inc. PowerEdge R7525
+ NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
+ Package: linux (not installed)
+ PciMultimedia:
+
+ ProcFB: 0 mgag200drmfb
+ ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
+ ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
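
Since the report describes one leg of the LACP bond freezing while the switch logs flapping, a small sketch like the following might help capture per-slave state when the problem occurs. It parses the text format of the Linux bonding driver's status file (`/proc/net/bonding/<bond>`) and keeps each slave's `MII Status` and `Link Failure Count`. The bond name `bond0` is an assumption (the report does not name the bond), and `parse_bond_slaves` and `read_bond` are hypothetical helper names.

```python
from pathlib import Path


def parse_bond_slaves(text):
    """Map each slave interface to its MII status and link failure count.

    Expects the "Key: value" text exposed by the bonding driver under
    /proc/net/bonding/. Fields before the first "Slave Interface" line
    describe the bond itself and are skipped.
    """
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is not None and key in ("MII Status", "Link Failure Count"):
            current[key] = value
    return slaves


def read_bond(name="bond0"):
    """Read and parse the status file for a bond (name is an assumption)."""
    return parse_bond_slaves(Path("/proc/net/bonding", name).read_text())


# Illustrative input only; the values here are made up, not taken from the report.
sample_status = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up

Slave Interface: enp161s0f0
MII Status: up
Link Failure Count: 0

Slave Interface: enp161s0f1
MII Status: up
Link Failure Count: 2
"""
for iface, fields in parse_bond_slaves(sample_status).items():
    print(iface, fields)
```

Running `read_bond()` periodically and logging the result alongside the kernel log would show whether the link failure count rises around the time of the watchdog messages.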
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
** Tags added: apport-collected jammy uec-images

** Description changed:

+ ---
+ ProblemType: Bug
+ AlsaDevices:
+  total 0
+  crw-rw 1 root audio 116, 1 Sep 12 20:05 seq
+  crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
+ AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
+ ApportVersion: 2.20.11-0ubuntu82.5
+ Architecture: amd64
+ ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
+ AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
+ CRDA: N/A
+ CasperMD5CheckResult: pass
+ CloudArchitecture: x86_64
+ CloudID: none
+ CloudName: none
+ CloudPlatform: none
+ CloudSubPlatform: config
+ DistroRelease: Ubuntu 22.04
+ InstallationDate: Installed on 2023-08-22 (24 days ago)
+ InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
+ IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
+ MachineType: Dell Inc. PowerEdge R7515
+ Package: linux (not installed)
+ PciMultimedia:
+
+ ProcFB: 0 mgag200drmfb
+ ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
+ ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
+ RelatedPackageVersions:
+  linux-restricted-modules-5.15.0-83-generic N/A
+  linux-backports-modules-5.15.0-83-generic N/A
+  linux-firmware 20220329.git681281e4-0ubuntu3.18
+ RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
+ Tags: jammy uec-images
+ Uname: Linux 5.15.0-83-generic x86_64
+ UpgradeStatus: No upgrade log present (probably fresh install)
+ UserGroups: N/A
+ _MarkForUpload: True
+ dmi.bios.date: 07/27/2023
+ dmi.bios.release: 2.12
+ dmi.bios.vendor: Dell Inc.
+ dmi.bios.version: 2.12.4
+ dmi.board.name: 0J91V2
+ dmi.board.vendor: Dell Inc.
+ dmi.board.version: A01
+ dmi.chassis.type: 23
+ dmi.chassis.vendor: Dell Inc.
+ dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
+ dmi.product.family: PowerEdge
+ dmi.product.name: PowerEdge R7515
+ dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
+ dmi.sys.vendor: Dell Inc.

** Attachment added: "CurrentDmesg.txt"
   https://bugs.launchpad.net/bugs/2036239/+attachment/5701312/+files/CurrentDmesg.txt
[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
This is the log from the HWE kernel:

[33219.508873] [ cut here ]
[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
[33219.508940] Modules linked in: sch_ingress nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi kvm_amd video ledtrig_audio nls_iso8859_1 irdma sparse_keymap kvm i40e irqbypass dell_smbios dcdbas ib_uverbs rapl dell_wmi_descriptor wmi_bmof ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops
[33219.509051] reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect sysimgblt crc32_pclmul bcache polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ahci xhci_pci cryptd ice tg3 libahci drm megaraid_sas i2c_piix4 xhci_pci_renesas nvme_common wmi
[33219.509114] CPU: 48 PID: 0 Comm: swapper/48 Tainted: P O 6.2.0-32-generic #32~22.04.1-Ubuntu
[33219.509116] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230
[33219.509122] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 66 83 78 01 01 e8 56 00 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 08 4f e4 b7 48 89 c2 e8 61 df 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[33219.509123] RSP: 0018:b42719fd0e70 EFLAGS: 00010246
[33219.509125] RAX: RBX: 9bd91b3e74c8 RCX:
[33219.509126] RDX: RSI: RDI:
[33219.509127] RBP: b42719fd0e98 R08: R09:
[33219.509128] R10: R11: R12: 9bd91b3e7000
[33219.509129] R13: 9bd91b3e741c R14: 0023 R15:
[33219.509130] FS: () GS:9b573de0() knlGS:
[33219.509132] CS: 0010 DS: ES: CR0: 80050033
[33219.509133] CR2: 55fd64034000 CR3: 010273ae2004 CR4: 00770ee0
[33219.509135] PKRU: 5554
[33219.509135] Call Trace:
[33219.509137]
[33219.509140] ? show_regs+0x72/0x90
[33219.509145] ? dev_watchdog+0x21f/0x230
[33219.509147] ? __warn+0x8d/0x160
[33219.509151] ? dev_watchdog+0x21f/0x230
[33219.509154] ? report_bug+0x1bb/0x1d0
[33219.509158] ? handle_bug+0x46/0x90
[33219.509162] ? exc_invalid_op+0x19/0x80
[33219.509165] ? asm_exc_invalid_op+0x1b/0x20
[33219.509171] ? dev_watchdog+0x21f/0x230
[33219.509174] ? __pfx_dev_watchdog+0x10/0x10
[33219.509176] call_timer_fn+0x2c/0x160
[33219.509180] ? __pfx_dev_watchdog+0x10/0x10
[33219.509182] __run_timers.part.0+0x1fb/0x2b0
[33219.509185] ? ktime_get+0x46/0xc0
[33219.509187] ? __pfx_tick_sched_timer+0x10/0x10
[33219.509191] ? native_apic_msr_write+0x46/0x70
[33219.509194] ? lapic_next_event+0x20/0x30
[33219.509197] ? clockevents_program_event+0xb5/0x140
[33219.509200] run_timer_softirq+0x2a/0x60
[33219.509202] __do_softirq+0xdd/0x330
[33219.509205] ? hrtimer_interrupt+0x12b/0x250
[33219.509208] __irq_exit_rcu+0xa2/0xd0
[33219.509210] irq_exit_rcu+0xe/0x20
[33219.509212] sysvec_apic_timer_interrupt+0x96/0xb0
[33219.509215]
[33219.509216]
[33219.509216] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[33219.509219] RIP: 0010:mwait_idle+0x55/0x90
[33219.509222] Code: 31 d2 48 89 d1 65 48 8b 04 25 40 18 03 00 0f 01 c8 48 8b 00 a8 08 75 14 eb 07 0f 00 2d 24 d2 35 00 31 c0 48 89 c1 fb 0f 01 c9 06 fb 0f 1f 44 00 00 65 48 8b 04 25 40 18 03 00 f0 80 60 02 df
[33219.509224] RSP: 0018:b42700587e80 EFLAGS: 0246
[33219.509225] RAX: RBX: 9ad9ccd999c0 RCX:
[33219.509226] RDX: RSI: RDI:
[33219.509227] RBP: b42700587e80 R08: R09:
[33219.509229] R10: R11: R12:
[33219.509230] R13: R14: R15: