[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-04-26 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-xilinx-zynqmp/5.15.0-1029.33 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results.
If the problem is solved, change the tag
'verification-needed-jammy-linux-xilinx-zynqmp' to
'verification-done-jammy-linux-xilinx-zynqmp'. If the problem still
exists, change the tag 'verification-needed-jammy-linux-xilinx-zynqmp'
to 'verification-failed-jammy-linux-xilinx-zynqmp'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-xilinx-zynqmp-v2 
verification-needed-jammy-linux-xilinx-zynqmp

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * The issue causes a transmit (TX) hang on E810 ports with bonding enabled.
   * Based on the provided logs, the TX hang can last as long as a couple of
     minutes, but in most cases the network recovers after the ice driver
     performs a PF reset (the TX hang handler routine).
   * Originally, the issue was observed during Tempest tests on a newly
     created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, Intel engineers proposed a workaround: disabling LAG
    initialization [1]. This change was tested in an environment where the
    issue is easily reproduced; after multiple iterations, no reproduction
    was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG
    initialization if the NVM does not expose the proper capabilities.
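
  The gating idea behind patch [2] can be pictured with a small model. This
  is only an illustrative sketch in Python, not the driver code (the ice
  driver is written in C). The `NvmCaps` class, its `sriov_lag` field, and
  the `should_init_lag` function are hypothetical names chosen for this
  example and do not come from the patch.

```python
from dataclasses import dataclass


@dataclass
class NvmCaps:
    """Hypothetical stand-in for the capabilities reported by the NVM."""
    sriov_lag: bool = False


def should_init_lag(caps: NvmCaps) -> bool:
    """Allow LAG initialization only when the NVM advertises the capability."""
    return caps.sriov_lag


# An NVM image that does not expose the capability: LAG init is skipped.
print(should_init_lag(NvmCaps(sriov_lag=False)))
# A hypothetical NVM image that does expose it: LAG init proceeds.
print(should_init_lag(NvmCaps(sriov_lag=True)))
```

  The point of the check is that the decision depends on what the firmware
  reports rather than on whether a bond happens to be configured.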
  
  [Test Plan]
  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based
    storage was used. The problem could manifest either while deploying the
    cluster or, once it was deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to
    achieve.
  * Multiple stress tests on a single host with a similar configuration did
    not trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the
    issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
    capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM
    with this capability will be released. Although the issue is potentially
    caused by using features without proper FW support [2], we want to take
    a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the
    ice driver backported from a mainline kernel that predates patch [2].
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by MAAS. No issues during installation, but at
  that time the bond is not formed yet; later, when Linux is booted, the
  bond is formed and works without issues for a while

  - it works fine for about 2 to 3 hours, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after OpenStack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
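
  To spot the symptom in a saved kernel log, the watchdog message quoted
  above can be matched with a regular expression. This is a minimal sketch:
  the pattern is derived only from the single message shown in this report,
  the function name `find_tx_timeouts` is made up for illustration, and the
  timestamps in the sample lines are invented.

```python
import re

# Pattern based on the message quoted in this report:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Yield (interface, driver, queue) for each watchdog timeout line."""
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            yield match["iface"], match["driver"], int(match["queue"])


# Sample input; the timestamps are invented for this example.
sample = [
    "[ 9876.543210] NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
    "[ 9876.600000] some unrelated kernel message",
]
for iface, driver, queue in find_tx_timeouts(sample):
    print(iface, driver, queue)
```

  The same function can be fed the lines of a saved `dmesg` or journal dump
  to list which interfaces and queues were affected.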
  ---
  ProblemType: Bug
  AlsaDevices:
   

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-04-15 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-nvidia-tegra-igx/5.15.0-1010.10 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results.
If the problem is solved, change the tag
'verification-needed-jammy-linux-nvidia-tegra-igx' to
'verification-done-jammy-linux-nvidia-tegra-igx'. If the problem still
exists, change the tag
'verification-needed-jammy-linux-nvidia-tegra-igx' to
'verification-failed-jammy-linux-nvidia-tegra-igx'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-nvidia-tegra-igx-v2 
verification-needed-jammy-linux-nvidia-tegra-igx


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-04-02 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-nvidia-tegra-5.15/5.15.0-1023.23~20.04.1 kernel in -proposed
solves the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-focal-linux-nvidia-tegra-5.15' to
'verification-done-focal-linux-nvidia-tegra-5.15'. If the problem still
exists, change the tag
'verification-needed-focal-linux-nvidia-tegra-5.15' to
'verification-failed-focal-linux-nvidia-tegra-5.15'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-focal-linux-nvidia-tegra-5.15-v2 
verification-needed-focal-linux-nvidia-tegra-5.15


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-04-02 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-nvidia-tegra/5.15.0-1023.23 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results.
If the problem is solved, change the tag
'verification-needed-jammy-linux-nvidia-tegra' to
'verification-done-jammy-linux-nvidia-tegra'. If the problem still
exists, change the tag 'verification-needed-jammy-linux-nvidia-tegra'
to 'verification-failed-jammy-linux-nvidia-tegra'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-nvidia-tegra-v2 
verification-needed-jammy-linux-nvidia-tegra


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-26 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-nvidia-6.5/6.5.0-1014.14 kernel in -proposed solves the problem.
Please test the kernel and update this bug with the results. If the
problem is solved, change the tag
'verification-needed-jammy-linux-nvidia-6.5' to
'verification-done-jammy-linux-nvidia-6.5'. If the problem still
exists, change the tag 'verification-needed-jammy-linux-nvidia-6.5' to
'verification-failed-jammy-linux-nvidia-6.5'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-nvidia-6.5-v2 
verification-needed-jammy-linux-nvidia-6.5


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-26 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-gcp-fips/5.15.0-1055.63+fips2 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results.
If the problem is solved, change the tag
'verification-needed-jammy-linux-gcp-fips' to
'verification-done-jammy-linux-gcp-fips'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-gcp-fips' to
'verification-failed-jammy-linux-gcp-fips'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-gcp-fips-v2 
verification-needed-jammy-linux-gcp-fips


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-19 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-aws-fips/5.15.0-1056.61+fips1 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results.
If the problem is solved, change the tag
'verification-needed-jammy-linux-aws-fips' to
'verification-done-jammy-linux-aws-fips'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-aws-fips' to
'verification-failed-jammy-linux-aws-fips'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-aws-fips-v2 
verification-needed-jammy-linux-aws-fips

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * Issue is causing transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, TX hang can last for even a couple of 
minutes, but in most scenarios, the network will be recovered after the ice 
driver performs a PF reset (TX hang handler routine).
   * Originally, the issue was observed during Tempest tests on a newly 
created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, Intel engineers proposed a workaround that disables
LAG initialization [1].
This change was tested in an environment where the issue is
easily reproduced.
After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG initialization
if the NVM does not expose the proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, a cluster of over 20 nodes with Ceph-based
storage was used. The problem could sometimes manifest while deploying a cluster or,
after the cluster was already deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to
achieve.
  * Multiple stress tests on a single host with a similar configuration did not
trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the
issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this
capability will be released.
Although the issue is potentially caused by using features without proper FW
support [2], we want to take a closer look once NVMs with proper support are
introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
   4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice
driver backported from a mainline kernel predating patch [2].
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by MAAS. No issues during installation, but at
  that time the bond is not formed yet; later, when Linux is booted, the bond
  is formed and works without issues for a while

  - it works fine for about 2 to 3 hours, then the issue starts (may or
  may not be related to network load, but it seems to be triggered
  by some tests that I run after OpenStack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time, the kernel log shows messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
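
  To spot the symptom described above in a saved kernel log, a small filter
  along these lines may help. It is a minimal sketch: the pattern is modeled
  only on the message quoted in this report (other kernel versions may word it
  differently), and the function name find_tx_timeouts and the second sample
  line are hypothetical.

```python
import re

# Pattern modeled on the message quoted in this report; this is an assumption,
# not a guaranteed format across kernel versions.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, queue) for each watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match["iface"], match["driver"], int(match["queue"])))
    return hits


if __name__ == "__main__":
    sample = [
        "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
        "some unrelated kernel message (hypothetical)",
    ]
    for iface, driver, queue in find_tx_timeouts(sample):
        print(f"{iface} ({driver}): queue {queue}")
```

  Feeding it the lines of a captured kernel log gives a quick list of which
  interfaces and queues hit the watchdog.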
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-14 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-aws/5.15.0-1056.61
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-jammy-linux-aws' to
'verification-done-jammy-linux-aws'. If the problem still exists, change
the tag 'verification-needed-jammy-linux-aws' to
'verification-failed-jammy-linux-aws'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-aws-v2 
verification-needed-jammy-linux-aws

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-07 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-raspi/5.15.0-1048.51
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-jammy-linux-raspi' to
'verification-done-jammy-linux-raspi'. If the problem still exists, change
the tag 'verification-needed-jammy-linux-raspi' to
'verification-failed-jammy-linux-raspi'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-raspi-v2 
verification-needed-jammy-linux-raspi

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-07 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-kvm/5.15.0-1052.57
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-jammy-linux-kvm' to
'verification-done-jammy-linux-kvm'. If the problem still exists, change
the tag 'verification-needed-jammy-linux-kvm' to
'verification-failed-jammy-linux-kvm'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-kvm-v2 
verification-needed-jammy-linux-kvm

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-07 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-oracle/5.15.0-1053.59
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-jammy-linux-oracle' to
'verification-done-jammy-linux-oracle'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-oracle' to
'verification-failed-jammy-linux-oracle'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-oracle-v2 
verification-needed-jammy-linux-oracle

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-07 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-intel-iotg/5.15.0-1050.56 kernel in -proposed solves the problem.
Please test the kernel and update this bug with the results. If the problem
is solved, change the tag 'verification-needed-jammy-linux-intel-iotg' to
'verification-done-jammy-linux-intel-iotg'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-intel-iotg' to
'verification-failed-jammy-linux-intel-iotg'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-intel-iotg-v2 
verification-needed-jammy-linux-intel-iotg

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-07 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 5.15.0-100.110

---
linux (5.15.0-100.110) jammy; urgency=medium

  * jammy/linux: 5.15.0-100.110 -proposed tracker (LP: #2052616)

  * i915 regression introduced with 5.5 kernel (LP: #2044131)
- drm/i915: Skip some timing checks on BXT/GLK DSI transcoders

  * Audio balancing setting doesn't work with the cirrus codec (LP: #2051050)
- ALSA: hda/cs8409: Suppress vmaster control for Dolphin models

  * partproke is broken on empty loopback device (LP: #2049689)
- block: Move checking GENHD_FL_NO_PART to bdev_add_partition()

  * CVE-2023-0340
- vhost: use kzalloc() instead of kmalloc() followed by memset()

  * CVE-2023-51780
- atm: Fix Use-After-Free in do_vcc_ioctl

  * CVE-2023-6915
- ida: Fix crash in ida_free when the bitmap is empty

  * CVE-2024-0646
- net: tls, update curr on splice as well

  * CVE-2024-0565
- smb: client: fix OOB in receive_encrypted_standard()

  * CVE-2023-51781
- appletalk: Fix Use-After-Free in atalk_ioctl

  * Jammy update: v5.15.143 upstream stable release (LP: #2050858)
- vdpa/mlx5: preserve CVQ vringh index
- hrtimers: Push pending hrtimers away from outgoing CPU earlier
- i2c: designware: Fix corrupted memory seen in the ISR
- netfilter: ipset: fix race condition between swap/destroy and kernel side
  add/del/test
- tg3: Move the [rt]x_dropped counters to tg3_napi
- tg3: Increment tx_dropped in tg3_tso_bug()
- kconfig: fix memory leak from range properties
- drm/amdgpu: correct chunk_ptr to a pointer to chunk.
- platform/x86: asus-wmi: Adjust tablet/lidflip handling to use enum
- platform/x86: asus-wmi: Add support for ROG X13 tablet mode
- platform/x86: asus-wmi: Simplify tablet-mode-switch probing
- platform/x86: asus-wmi: Simplify tablet-mode-switch handling
- platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
- of: dynamic: Fix of_reconfig_get_state_change() return value documentation
- platform/x86: wmi: Allow duplicate GUIDs for drivers that use struct
  wmi_driver
- platform/x86: wmi: Skip blocks with zero instances
- ipv6: fix potential NULL deref in fib6_add()
- octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
- octeontx2-af: Check return value of nix_get_nixlf before using nixlf
- hv_netvsc: rndis_filter needs to select NLS
- r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
- r8152: Add RTL8152_INACCESSIBLE checks to more loops
- r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
- r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
- r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
- mlxbf-bootctl: correctly identify secure boot with development keys
- platform/mellanox: Add null pointer checks for devm_kasprintf()
- platform/mellanox: Check devm_hwmon_device_register_with_groups() return
  value
- arcnet: restoring support for multiple Sohard Arcnet cards
- net: stmmac: fix FPE events losing
- octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
- i40e: Fix unexpected MFS warning message
- net: bnxt: fix a potential use-after-free in bnxt_init_tc
- ionic: fix snprintf format length warning
- ionic: Fix dim work handling in split interrupt mode
- ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
- net: hns: fix fake link up on xge port
- octeontx2-af: Update Tx link register range
- netfilter: nf_tables: validate family when identifying table via handle
- netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
- tcp: do not accept ACK of bytes we never sent
- bpf: sockmap, updating the sg structure should also update curr
- psample: Require 'CAP_NET_ADMIN' when joining "packets" group
- net: add missing kdoc for struct genl_multicast_group::flags
- drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
- tee: optee: Fix supplicant based device enumeration
- RDMA/hns: Fix unnecessary err return when using invalid congest control
  algorithm
- RDMA/irdma: Do not modify to SQD on error
- RDMA/irdma: Add wait for suspend on SQD
- arm64: dts: rockchip: Expand reg size of vdec node for RK3399
- RDMA/rtrs-srv: Do not unconditionally enable irq
- RDMA/rtrs-clt: Start hb after path_up
- RDMA/rtrs-srv: Check return values while processing info request
- RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
- RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
- RDMA/rtrs-clt: Fix the max_send_wr setting
- RDMA/rtrs-clt: Remove the warnings for req in_use check
- RDMA/bnxt_re: Correct module description string
- hwmon: (acpi_power_meter) Fix 4.29 MW bug
- hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
- ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
- RDMA/core: Fix umem iterator 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-06 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-hwe-6.5/6.5.0-25.25~22.04.1 kernel in -proposed solves the problem.
Please test the kernel and update this bug with the results. If the
problem is solved, change the tag
'verification-needed-jammy-linux-hwe-6.5' to
'verification-done-jammy-linux-hwe-6.5'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-hwe-6.5' to
'verification-failed-jammy-linux-hwe-6.5'.


If verification is not done within 5 working days from today, this fix
will be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-hwe-6.5-v2 
verification-needed-jammy-linux-hwe-6.5

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * The issue causes a transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, a TX hang can last up to a couple of
     minutes, but in most scenarios the network recovers after the ice
     driver performs a PF reset (TX hang handler routine).
   * The issue was originally observed during Tempest tests on a newly
     created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, Intel engineers proposed a workaround that disables
    LAG initialization [1].
    This change was tested in an environment where the issue is
    easily reproduced.
    After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG
    initialization if the NVM does not expose the proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, a cluster of over 20 nodes with Ceph-based
    storage was used. The problem could sometimes manifest while deploying
    the cluster or, after the cluster was already deployed, during the
    Tempest test run.
  * The issue could appear on a random node, making reproduction hard to
    achieve.
  * Multiple stress tests on a single host with a similar configuration
    did not trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the
    issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
    capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM
    with this capability will be released.
    Although the issue is potentially caused by using features without
    proper FW support [2], we want to take a closer look once NVMs with
    proper support are introduced.

  [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the
    ice driver backported from a mainline kernel that predates patch [2].
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
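
  The per-leg symptoms described above (a frozen leg, a flapping bond) can
  also be checked from the host side. The following is only a minimal
  sketch, not part of the original report: it parses text in the format the
  Linux bonding driver exposes under /proc/net/bonding/<bond> and collects
  the "MII Status" and "Link Failure Count" fields for each slave. The
  function name parse_bond_slaves and the sample text (including the second
  interface name) are made up for illustration.

```python
from typing import Dict


def parse_bond_slaves(text: str) -> Dict[str, Dict[str, str]]:
    """Collect per-slave link fields from /proc/net/bonding/<bond> text."""
    slaves: Dict[str, Dict[str, str]] = {}
    current = None  # fields of the slave section we are currently inside
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is not None and key in ("MII Status", "Link Failure Count"):
            current[key] = value
    return slaves


if __name__ == "__main__":
    # Hypothetical sample; on a real host read /proc/net/bonding/bond0 instead.
    sample = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: enp161s0f0
MII Status: down
Link Failure Count: 3

Slave Interface: enp161s0f1
MII Status: up
Link Failure Count: 0
"""
    for name, fields in parse_bond_slaves(sample).items():
        print(name, fields)
```

  A rising "Link Failure Count" on one slave while the other stays at zero
  would be consistent with a single leg flapping, but it does not by itself
  identify the cause.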
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-06 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-aws/6.5.0-1015.15
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-mantic-linux-aws' to
'verification-done-mantic-linux-aws'. If the problem still exists,
change the tag 'verification-needed-mantic-linux-aws' to
'verification-failed-mantic-linux-aws'.


If verification is not done within 5 working days from today, this fix
will be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-mantic-linux-aws-v2 
verification-needed-mantic-linux-aws

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-06 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux-azure/6.5.0-1016.16
kernel in -proposed solves the problem. Please test the kernel and
update this bug with the results. If the problem is solved, change the
tag 'verification-needed-mantic-linux-azure' to
'verification-done-mantic-linux-azure'. If the problem still exists,
change the tag 'verification-needed-mantic-linux-azure' to
'verification-failed-mantic-linux-azure'.


If verification is not done within 5 working days from today, this fix
will be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-mantic-linux-azure-v2 
verification-needed-mantic-linux-azure

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Released
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-01 Thread Robert Malz
** Tags removed: verification-needed-mantic-linux
** Tags added: verification-done-mantic-linux

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * The issue causes a transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, a TX hang can last up to a couple of
     minutes, but in most scenarios the network recovers after the ice
     driver performs a PF reset (TX hang handler routine).
   * The issue was originally observed during Tempest tests on a newly
     created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, Intel engineers proposed a workaround that disables
    LAG initialization [1].
    This change was tested in an environment where the issue is
    easily reproduced.
    After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG
    initialization if the NVM does not expose the proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, a cluster of over 20 nodes with Ceph-based
    storage was used. The problem could sometimes manifest while deploying
    the cluster or, after the cluster was already deployed, during the
    Tempest test run.
  * The issue could appear on a random node, making reproduction hard to
    achieve.
  * Multiple stress tests on a single host with a similar configuration
    did not trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the
    issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
    capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM
    with this capability will be released.
    Although the issue is potentially caused by using features without
    proper FW support [2], we want to take a closer look once NVMs with
    proper support are introduced.

  [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the
    ice driver backported from a mainline kernel that predates patch [2].
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-03-01 Thread Robert Malz
LP update:

Mantic update:
Due to the lack of a reproduction environment, I performed the following
regression test:
1. Setup:
   nic: 2-port E810-C
        both interfaces set up in bonding
   kernel: 6.5.0-25-generic
2. Test cases:
   0) verified that the code from the change is used during driver init
   a) stress traffic for 12h (multiple iperf streams, TCP)
   b) interface up/down during stress traffic
   c) driver reload during stress traffic
   In each case, look for any issues related to traffic processing and
   for tx_hangs.
3. Result: No issues were detected during test execution.
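
For the "look for tx_hangs" step, one possible way to automate the check is
to scan captured kernel log text for the watchdog message quoted in the bug
description. The following is only a sketch under stated assumptions, not
the procedure actually used in the test above: the function name
find_tx_timeouts, the TxTimeout type, and the sample log line (including
its timestamp) are made up for illustration.

```python
import re
from typing import List, NamedTuple

# Matches e.g. "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


class TxTimeout(NamedTuple):
    iface: str
    driver: str
    queue: int


def find_tx_timeouts(log_text: str, driver: str = "ice") -> List[TxTimeout]:
    """Return all TX watchdog timeouts for the given driver found in log_text."""
    hits = []
    for match in WATCHDOG_RE.finditer(log_text):
        if match.group("driver") == driver:
            hits.append(
                TxTimeout(
                    match.group("iface"),
                    match.group("driver"),
                    int(match.group("queue")),
                )
            )
    return hits


if __name__ == "__main__":
    # Hypothetical sample; in practice pass saved kernel log output instead.
    sample = (
        "[ 9000.000000] NETDEV WATCHDOG: enp161s0f0 (ice): "
        "transmit queue 166 timed out\n"
        "[ 9000.000100] unrelated message\n"
    )
    for hit in find_tx_timeouts(sample):
        print(hit.iface, hit.queue)
```

An empty result after a stress run only shows that no watchdog message was
logged in the captured text; it does not prove the absence of other traffic
issues.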

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-29 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-ibm-gt-fips/5.15.0-1055.58+fips1 kernel in -proposed solves the
problem. Please test the kernel and update this bug with the results. If
the problem is solved, change the tag
'verification-needed-jammy-linux-ibm-gt-fips' to
'verification-done-jammy-linux-ibm-gt-fips'. If the problem still
exists, change the tag 'verification-needed-jammy-linux-ibm-gt-fips' to
'verification-failed-jammy-linux-ibm-gt-fips'.


If verification is not done within 5 working days from today, this fix
will be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-ibm-gt-fips-v2 
verification-needed-jammy-linux-ibm-gt-fips

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-27 Thread Robert Malz
Hi Roxana,
Mantic verification is still not finished.
I did some touch tests without stress traffic.
I'm trying to get my hands on an E810 device to finish testing; I'll update
the ticket once it's done.
Wishful ETA: EOW 09.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * The issue causes a transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, a TX hang can last up to a couple of
     minutes, but in most scenarios the network recovers after the ice
     driver performs a PF reset (TX hang handler routine).
   * The issue was originally observed during Tempest tests on a newly
     created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, Intel engineers proposed a workaround that disables
    LAG initialization [1].
    This change was tested in an environment where the issue is
    easily reproduced.
    After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG
    initialization if the NVM does not expose the proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, over a 20-node cluster was used with Ceph-based 
storage. The problem could sometimes manifest while deploying a cluster or 
after the cluster was already deployed during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to 
achieve.
  * Multiple stress tests on single host with similar configuration did not 
trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the 
issue. This handler is not implemented in 20.04
  * CVL4.2 and older NVM images for E810 does not expose SRIOV LAG 
capabilities (CVL4.3 wasn't checked) meaning at some point NVM with this 
capability will be released.
Although potentialy issue is caused by using features without proper FW 
support [2], we want to take a closer look once NVMs with proper support are 
introduced.

  [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * Issue could be reproduced on custom 6.2 jammy-hwe kernel with ice 
driver backported from mainline kernel from before patch [2] was added.
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-26 Thread Roxana Nicolescu
Hi Robert! Thanks for testing this on jammy. I marked the tag as verified 
('verification-done-jammy-linux') to reflect that.
Could you share the results from mantic? We need to release this next week and
we need confirmation that this works as expected.
If your test looks fine, please remove the 'verification-needed-mantic-linux'
tag and add 'verification-done-mantic-linux'.
Thanks!

** Tags removed: verification-needed-jammy-linux
** Tags added: verification-done-jammy-linux


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-13 Thread Robert Malz
Jammy update:
Due to the lack of a reproduction environment, I have been performing the
following regression test:
1. Setup:
   nic: 2-port E810-XXV, both interfaces set up in bonding
   kernel: 5.15.0-100-generic
2. Test cases:
   0) verified that the code from the change is used during driver init
   a) stress traffic for 48h (multiple TCP iperf streams)
   b) interface up/down during stress traffic
   c) PF reset during stress traffic
   For each case, look for any issues related to traffic processing and for
   tx_hangs.
3. Result: no issues were detected during test execution.
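
The tx_hang check in the test cases above can be partly automated by scanning
the kernel log for the watchdog message quoted in the bug description. The
following is only a minimal Python sketch, not the procedure actually used in
this test: the function name `find_tx_timeouts` is hypothetical, the regular
expression is derived solely from the single message quoted in this bug, and
the timestamps and the second line in the sample log are made up for
illustration.

```python
import re

# Pattern derived from the message quoted in this bug:
#   "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
# Assumption: other kernels may append extra text after "timed out";
# search() tolerates that as well as any leading timestamp.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text, driver="ice"):
    """Return (interface, queue) pairs for each watchdog timeout of `driver`."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


if __name__ == "__main__":
    # Hypothetical sample log; only the watchdog text itself comes from the bug.
    sample_log = (
        "[ 9000.000000] NETDEV WATCHDOG: enp161s0f0 (ice): "
        "transmit queue 166 timed out\n"
        "[ 9000.100000] some unrelated kernel message\n"
    )
    for iface, queue in find_tx_timeouts(sample_log):
        print(f"{iface} queue {queue}")
```

In practice the log text would come from something like the output of `dmesg`
saved to a file; an empty result only means that no watchdog message matched
the pattern, not that the driver is free of other problems.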

Mantic tests in progress.


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-08 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux/6.5.0-25.25 kernel in
-proposed solves the problem. Please test the kernel and update this bug
with the results. If the problem is solved, change the tag
'verification-needed-mantic-linux' to 'verification-done-mantic-linux'.
If the problem still exists, change the tag
'verification-needed-mantic-linux' to 'verification-failed-mantic-linux'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-mantic-linux-v2 verification-needed-mantic-linux


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-08 Thread Andre Ruiz
Houston, we have a problem...

This bug is notoriously difficult to reproduce. The only environment
that presented it is now in production and will not be available for
testing anymore. Which means that this cannot be tested, unless anyone
can suggest a new way of reproducing it.


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-08 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux/5.15.0-100.110 kernel
in -proposed solves the problem. Please test the kernel and update this
bug with the results. If the problem is solved, change the tag
'verification-needed-jammy-linux' to 'verification-done-jammy-linux'. If
the problem still exists, change the tag
'verification-needed-jammy-linux' to 'verification-failed-jammy-linux'.


If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-jammy-linux-v2 verification-needed-jammy-linux


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-01 Thread Heitor Alves de Siqueira
Yes, HWE kernels based on the Jammy/Mantic/Noble kernels should get this
fix automatically when the GA versions get released.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * Issue is causing transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, TX hang can last for even a couple of 
minutes, but in most scenarios, the network will be recovered after the ice 
driver performs a PF reset (TX hang handler routine).
   * Originally, the issue was observed during Tempest tests on a newly 
created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, a workaround has been proposed by Intel engineers to disable 
LAG initialization [1].
This change has been tested in an environment where reproduction is 
easily achieved.
After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization 
if NVM does not expose proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, over a 20-node cluster was used with Ceph-based 
storage. The problem could sometimes manifest while deploying a cluster or 
after the cluster was already deployed during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to 
achieve.
  * Multiple stress tests on single host with similar configuration did not 
trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the 
issue. This handler is not implemented in 20.04
  * CVL4.2 and older NVM images for E810 does not expose SRIOV LAG 
capabilities (CVL4.3 wasn't checked) meaning at some point NVM with this 
capability will be released.
Although potentialy issue is caused by using features without proper FW 
support [2], we want to take a closer look once NVMs with proper support are 
introduced.

  [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * Issue could be reproduced on custom 6.2 jammy-hwe kernel with ice 
driver backported from mainline kernel from before patch [2] was added.
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
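
When checking many nodes for the symptom above, the watchdog message can be picked out of kernel logs with a small script. The following is a minimal sketch, assuming the message format quoted in this report; the function name `find_tx_timeouts` and the sample log lines are made up for illustration.

```python
import re

# Pattern modelled on the message quoted in this report:
#   NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match["iface"], match["driver"], int(match["queue"])))
    return hits


# Illustrative input only: the first line reuses the message from this report,
# the second is an unrelated placeholder line.
sample = [
    "kernel: NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
    "kernel: some unrelated message",
]
print(find_tx_timeouts(sample))
```

On a live system the input lines could come from `dmesg` or `journalctl -k` output saved to a file.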
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-02-01 Thread Christian Rohmann
Thx a lot, Heitor! With no mention of a new package fixing this, I did not 
correlate that to any patch to the kernel.
Will this be fixed in the HWE kernel as well, then?

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-31 Thread Heitor Alves de Siqueira
@christian-rhomann "Fix committed" here means that the patches have been merged 
into Ubuntu's kernel tree for that specific release. The patch Robert submitted 
is the one from upstream, not the test patch from the comments here.
E.g. for Jammy:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/commit/?h=master-next&id=fc26d7737e3a

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-31 Thread Stefan Bader
** Changed in: linux (Ubuntu Mantic)
   Status: In Progress => Fix Committed

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  Fix Committed
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-31 Thread Christian Rohmann
@Robert thanks for keeping this bug alive and updated!

1) More debug info required?

@Robert, reading your post 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/50 again,
I am wondering if you asked me to provide more debug info with NVM 4.4 on my 
E810 NICs? Would this help in any way?

2) @smb changed this bug to "Fix Committed" for Jammy - is this really the 
correct state?
As @Andre said in 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/41, just 
manually commenting out some lines in the ice kernel module is "not a fix".

3) Will the two "fixes" you referred to in
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/52
make it to any other kernel than 6.8? Either by Intel or by Ubuntu
applying them there? Otherwise I am wondering if and when 6.8 will be,
once out, made available as HWE for Jammy?

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-30 Thread Robert Malz
Switching status for Noble to In Progress.
The target release for Noble is 6.8 (which includes the fix), but it's not out yet; 
the status will be changed once 6.8 is introduced.

** Changed in: linux (Ubuntu Noble)
   Status: Invalid => In Progress

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-29 Thread Robert Malz
Fix already included in 6.8

** Changed in: linux (Ubuntu Noble)
   Status: In Progress => Invalid

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  Invalid


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-25 Thread Robert Malz
Hey Christian, Intel proposed change [1],
which targets this problem, and based on our testing it does in fact solve the 
problem.
This change has now been added to the Ubuntu kernels.

I'm also keeping an eye on [2], but right now I don't yet see a "business need" to 
incorporate it into the Ubuntu kernel.
This patch further limits the problematic part of the code by adding (in addition 
to the NVM caps check) verification based on the DDP package.

[1] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
[2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20240122/039100.html
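
As a rough illustration of how the two checks mentioned above relate to each other, here is a minimal Python sketch. The names (`NvmCaps`, `sriov_lag` as a field, `ddp_supports_lag`, `should_init_lag`) are hypothetical and are not taken from the ice driver source; the sketch only models the decision as described in this thread, where [1] gates LAG initialization on the NVM capability and [2] would additionally require support from the DDP package.

```python
from dataclasses import dataclass


@dataclass
class NvmCaps:
    # Hypothetical stand-in for the capabilities reported by the NVM.
    sriov_lag: bool


def should_init_lag(caps: NvmCaps, ddp_supports_lag: bool, with_ddp_check: bool) -> bool:
    """Decide whether LAG initialization would run.

    with_ddp_check=False models change [1] (NVM caps check only);
    with_ddp_check=True models [1] plus the extra DDP verification from [2].
    """
    if not caps.sriov_lag:
        # Change [1]: skip LAG init when the NVM does not expose the capability.
        return False
    if with_ddp_check and not ddp_supports_lag:
        # Change [2]: additionally require support from the loaded DDP package.
        return False
    return True
```

With only [1] applied, an NVM that exposes the capability always leads to LAG initialization; with [2] on top, the DDP package has to agree as well.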

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
   * The issue causes a transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, the TX hang can last as long as a couple of
     minutes, but in most scenarios the network recovers after the ice
     driver performs a PF reset (TX hang handler routine).
   * Originally, the issue was observed during Tempest tests on a newly
     created OpenStack cluster, resulting in a lack of certification.

  [Fix]
   * Initially, Intel engineers proposed a workaround that disables
     LAG initialization [1].
     This change was tested in an environment where reproduction is
     easily achieved.
     After multiple iterations, no reproduction was observed.
   * Shortly after, Intel proposed a patch [2] to disable LAG initialization
     if the NVM does not expose the proper capabilities.

  [Test Plan]
   * To reproduce the issue, a cluster of over 20 nodes with Ceph-based
     storage was used. The problem could sometimes manifest while deploying
     the cluster, or after the cluster was already deployed, during the
     Tempest test run.
   * The issue could appear on a random node, making reproduction hard to
     achieve.
   * Multiple stress tests on a single host with a similar configuration did
     not trigger a reproduction.

  [Where problems could occur]
   * All ice drivers with ice_lag_event_handler registered can expose the
     issue. This handler is not implemented in 20.04.
   * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG
     capabilities (CVL4.3 wasn't checked), meaning at some point an NVM with
     this capability will be released.
     Although the issue is potentially caused by using features without
     proper FW support [2], we want to take a closer look once NVMs with
     proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
        4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the
    ice driver backported from a mainline kernel from before patch [2] was
    added.
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
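
For anyone trying to confirm the same symptom, the watchdog message quoted above can be picked out of saved kernel log text with a short script. The following is a minimal Python sketch, not part of the original report; the function name `find_tx_timeouts` is hypothetical, and the pattern is based only on the message format quoted in this description (any extra text a kernel appends after "timed out" is ignored).

```python
import re

# Matches the message quoted in the report. Anything a kernel prints
# after "timed out" is ignored by this pattern.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, queue) for each watchdog timeout line."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            hits.append((m.group("iface"), m.group("driver"), int(m.group("queue"))))
    return hits


if __name__ == "__main__":
    sample = [
        "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
        "some unrelated kernel message",
    ]
    print(find_tx_timeouts(sample))
```

Feeding it the lines of a saved kernel log gives a quick list of which interface and queue timed out, which can then be compared against the bond flaps logged by the switch.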
  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-25 Thread Christian Rohmann
@Stefan Could you kindly elaborate on the "Fix Committed"? Was there any
change to the kernel that would fix this issue? Is this fixed with the 4.40
NVM from Intel?

Reading Robert's post
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/50)
again, it seems that he is only guessing that something was fixed
by Intel.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-24 Thread Stefan Bader
** Changed in: linux (Ubuntu Jammy)
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Committed
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-12 Thread Heitor Alves de Siqueira
** Also affects: linux (Ubuntu Noble)
   Importance: Medium
 Assignee: Robert Malz (rmalz)
   Status: In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  In Progress
Status in linux source package in Mantic:
  In Progress
Status in linux source package in Noble:
  In Progress

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-12 Thread Heitor Alves de Siqueira
** Changed in: linux (Ubuntu)
   Status: Invalid => Confirmed

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Robert Malz (rmalz)

** Changed in: linux (Ubuntu Jammy)
 Assignee: (unassigned) => Robert Malz (rmalz)

** Changed in: linux (Ubuntu Mantic)
 Assignee: (unassigned) => Robert Malz (rmalz)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  In Progress
Status in linux source package in Mantic:
  In Progress

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-12 Thread Stefan Bader
** Also affects: linux (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Mantic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Jammy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Jammy)
   Status: New => In Progress

** Changed in: linux (Ubuntu Mantic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Mantic)
   Status: New => In Progress

** Changed in: linux (Ubuntu)
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Jammy:
  In Progress
Status in linux source package in Mantic:
  In Progress

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-10 Thread Robert Malz
Hey @Christian,
1a) No need. AQ 0x000A returns the NVM capabilities regardless of the
configuration applied (it is done during driver init).
1b) That's the point. I noticed you upgraded to 4.3, which I currently don't
have access to, and I wanted to verify the capabilities on 4.3. NVM caps should
be similar for the same NVM version in a single head of family, so the values I
had access to would be the same as the ones you had on 4.2 (meaning there is no
point in collecting these).
1c) No, any "recent" kernel/driver version supports enabling debug logs by
adding the dyndbg=+p param to the module. We only care about the logs that are
retrieved from the NVM and printed with debug flags.

2) Based on the recent patches from Intel, the issue is caused by performing
LAG-related operations without proper support from the NVM. Release notes do
not always mention every feature change, so it is possible that 4.4 introduced
the sriov_lag capability, but I cannot verify it.
The worst-case scenario is that NVM 4.4 introduces the sriov_lag capability,
meaning the patches recently added to the upstream kernel have no effect, and
the issue still reproduces. In that scenario there is currently no
'workaround' for it.
The best-case scenario is that NVM 4.4 introduces the sriov_lag capability and
the issue no longer reproduces.
In that scenario no additional patches to the driver are required.
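
To make the cases in point 2 easier to compare, here is a small Python sketch that maps the two unknowns about a future NVM onto the outcomes described above. The function name and the wording of the returned strings are hypothetical. The first branch (capability not exposed) is not spelled out in this comment; it is inferred from the behaviour of the capability check discussed in this bug.

```python
def nvm_outcome(exposes_sriov_lag: bool, issue_reproduces: bool) -> str:
    """Map the two unknowns about a future NVM to the cases discussed above."""
    if not exposes_sriov_lag:
        # Inferred case: the capability check keeps LAG initialization
        # disabled, as on the NVM versions tested so far.
        return "LAG stays disabled by the capability check"
    if issue_reproduces:
        # Capability present, so the check no longer disables LAG.
        return "worst case: the check has no effect and there is no workaround"
    return "best case: no additional driver patches required"
```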

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]
   * Issue is causing transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, TX hang can last for even a couple of 
minutes, but in most scenarios, the network will be recovered after the ice 
driver performs a PF reset (TX hang handler routine).
   * Originally, the issue was observed during Tempest tests on a newly 
created OpenStack cluster, resulting in a lack of certification.
  
  [Fix]
  * Initially, a workaround has been proposed by Intel engineers to disable 
LAG initialization [1].
This change has been tested in an environment where reproduction is 
easily achieved.
After multiple iterations, no reproduction has been observed.
  * Shortly after, Intel proposed a patch [2] to disable LAG initialization 
if NVM does not expose proper capabilities.
  
  [Test Plan]
  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based 
storage was used. The problem could sometimes manifest while deploying the 
cluster or, once the cluster was deployed, during the Tempest test run.
  * The issue could appear on a random node, making reproduction hard to 
achieve.
  * Multiple stress tests on single host with similar configuration did not 
trigger a reproduction.
  
  [Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the 
issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG 
capabilities (CVL4.3 wasn't checked), meaning at some point an NVM with this 
capability will be released.
Although the issue is potentially caused by using features without proper FW 
support [2], we want to take a closer look once NVMs with proper support are 
introduced.

  [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6

  [Other Info]
  * Issue could be reproduced on custom 6.2 jammy-hwe kernel with ice 
driver backported from mainline kernel from before patch [2] was added.
  * Original description of the case below:
  
  

  I'm having issues with an Intel E810-XXV card on a Dell server under
  Ubuntu Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-04 Thread Christian Rohmann
@Robert, 
first thanks a lot for pursuing this issue!

1) I certainly can provide the debugging info. May I ask if ...

  a) the system in question would need to have an active LAG (LACP) for
this to be helpful? We did switch to active-backup on all our machines
due to this very issue.

  b) this requires FW version 4.20? All our machines currently run 4.30
already.

  c) this requires a certain kernel / ice driver version?


2) FW 4.40 is out now [1]. There seem to be no fixes related to LAG / LACP,
though there are some regarding SR-IOV. But I guess you are convinced the
issue is not within the FW, but rather in the ice driver?


[1] - 
https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series.html

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-04 Thread Robert Malz
@Christian,
Can you verify your device capabilities returned from 0x000A, looking for SRIOV 
LAG?
I have attached a script, "parse_aq_0xA.py". You need to load the driver with 
dyndbg=+p and replace the buffer in the script.
Note: the buffer has to come from CQ CMD: opcode 0x000A
Expected result:
(...)
resp cap: 0x92 -- this is the capability we are looking for
resp maj_ver: 0x1
resp min_ver: 0x0
resp number: 0x1   -- this is the value we want to check
resp logical_id: 0x0
resp phys_id: 0x0

If it is set to 0x1, patch [1] will disable the LAG handler and simplify the 
initialization logic to something like in comment #40.
The buffer available in the script comes from the CVL4.20 NVM.
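
For readers without the attachment at hand, the decoding step can be 
approximated as below. This is only a rough sketch and not the attached 
parse_aq_0xA.py: it assumes each capability element in the 0x000A response 
buffer is 32 bytes, little-endian (u16 cap, u8 maj_ver, u8 min_ver, 
u32 number, u32 logical_id, u32 phys_id, 16 reserved bytes). The names ELEM, 
parse_caps and find_cap are hypothetical, and the sample buffer is made up 
for illustration.

```python
import struct

# Assumed element layout (not taken from the attachment): 32 bytes, little-endian.
ELEM = struct.Struct("<HBBIII16x")
CAP_LAG = 0x0092  # capability ID shown as "resp cap: 0x92" above


def parse_caps(buf: bytes):
    """Split a raw response buffer into capability elements."""
    fields = ("cap", "maj_ver", "min_ver", "number", "logical_id", "phys_id")
    usable = len(buf) - len(buf) % ELEM.size  # ignore a trailing partial element
    return [dict(zip(fields, ELEM.unpack_from(buf, off)))
            for off in range(0, usable, ELEM.size)]


def find_cap(buf: bytes, cap_id: int = CAP_LAG):
    """Return the first element with the given capability ID, or None."""
    return next((e for e in parse_caps(buf) if e["cap"] == cap_id), None)


# Made-up two-element buffer, for illustration only (not real NVM data).
sample = ELEM.pack(0x0005, 1, 0, 0, 0, 0) + ELEM.pack(CAP_LAG, 1, 0, 1, 0, 0)
elem = find_cap(sample)
if elem is not None:
    for name, value in elem.items():
        print(f"resp {name}: {value:#x}")
```

To use it on a real system, replace sample with the bytes copied from the 
debug log, for example via bytes.fromhex() on the hex dump with offsets 
stripped.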

[1] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-04 Thread Robert Malz
Script to verify AQ 0x000A capabilities

** Attachment added: "parse_aq_0xA.py"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/+attachment/5736421/+files/parse_aq_0xA.py

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2024-01-04 Thread Robert Malz
** Description changed:

+ [Impact]
+  * Issue is causing transmit hang on E810 ports with bonding enabled.
+  * Based on the provided logs, TX hang can last for even a couple of 
minutes, but in most scenarios, the network will be recovered after the ice 
driver performs a PF reset (TX hang handler routine).
+  * Originally, the issue was observed during Tempest tests on a newly 
created OpenStack cluster, resulting in a lack of certification.
+ 
+ [Fix]
+ * Initially, a workaround has been proposed by Intel engineers to disable 
LAG initialization [1].
+   This change has been tested in an environment where reproduction is 
easily achieved.
+   After multiple iterations, no reproduction has been observed.
+ * Shortly after, Intel proposed a patch [2] to disable LAG initialization 
if NVM does not expose proper capabilities.
+ 
+ [Test Plan]
+ * To reproduce the issue, over a 20-node cluster was used with Ceph-based 
storage. The problem could sometimes manifest while deploying a cluster or 
after the cluster was already deployed during the Tempest test run.
+ * The issue could appear on a random node, making reproduction hard to 
achieve.
+ * Multiple stress tests on single host with similar configuration did not 
trigger a reproduction.
+ 
+ [Where problems could occur]
+ * All ice drivers with ice_lag_event_handler registered can expose the 
issue. This handler is not implemented in 20.04
+ * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG 
capabilities (CVL4.3 wasn't checked), meaning at some point an NVM with this 
capability will be released.
+   Although the issue is potentially caused by using features without proper FW 
support [2], we want to take a closer look once NVMs with proper support are 
introduced.
+   
+ [1] - 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
+ [2] - 
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html
 4d50fcdc2476eef94c14c6761073af5667bb43b6
  
- I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.
+ [Other Info]
+ * Issue could be reproduced on custom 6.2 jammy-hwe kernel with ice 
driver backported from mainline kernel from before patch [2] was added.
+ * Original description of the case below:
+ 
+ 
+ 
+ I'm having issues with an Intel E810-XXV card on a Dell server under
+ Ubuntu Jammy.
  
  Details:
  
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)
  
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  
  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as the
  problem seems to be in the interface.
  
  - machine installed by maas. No issues during installation, but at that
  time bond is not formed yet, later when linux is booted, the bond is
  formed and works without issues for a while
  
  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered by
  some tests that I run after openstack finishes installing)
  
  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet
  
  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace
  
  - the switch does log that the bond is flapping
- --- 
+ ---
  ProblemType: Bug
  AlsaDevices:
-  total 0
-  crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
-  crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
+  total 0
+  crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
+  crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
-  
+ 
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-07 Thread Christian Rohmann
FWIW, we updated our NICs to 4.30 as they were individually purchased
and not part of pre-built servers and also have this issue.

So in essence the issue also exists with the latest firmware.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: 
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with 
exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-07 Thread Andre Ruiz
Yeah, I knew about the 4.30 update on Intel's website, but it is not available 
in Dell's tools yet and the customer did not want to (potentially) void their 
warranty, so I did not try it. That is something to keep in mind while we 
debug this.

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-07 Thread Bartosz Woronicz
@Andre,

I successfully installed it on the machines with the NVMUpdate64 tool.
This was on HPE machines.


$ sudo ./nvmupdate64e 

Intel(R) Ethernet NVM Update Tool
NVMUpdate version 1.39.56.8
Copyright(C) 2013 - 2023 Intel Corporation.


WARNING: To avoid damage to your device, do not stop the update or reboot or 
power off the system during this update.
Inventory in progress. Please wait [**+...]


Num Description                          Ver.(hex)    DevId S:B    Status
=== ==================================== ============ ===== ====== ==========
01) Intel(R) Ethernet Network Adapter    4.48(4.30)   159B  00:016 Up to date
    E810-XXV-2
02) Intel(R) Ethernet Network Adapter    N/A(N/A)     1521  00:072 Update not
    I350-T4 for OCP NIC 3.0                                        available


Tool execution completed with the following status: An error occurred accessing 
the device.
Press any key to exit.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
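
A minimal sketch of how the watchdog messages described above could be pulled out of a kernel log. The function name `watchdog_timeouts` is hypothetical, and the pattern only assumes the message wording quoted in this report ("NETDEV WATCHDOG: <interface> (ice): transmit queue <N> timed out"); other kernel versions may word it differently.

```shell
#!/bin/sh
# Print "<interface> <queue>" for every ice TX watchdog line read from stdin.
# Lines that do not match the quoted message format are ignored.
watchdog_timeouts() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (ice): transmit queue \([0-9]*\) timed out.*/\1 \2/p'
}

# Possible use on an affected host (reading the kernel log may need root):
#   dmesg | watchdog_timeouts | sort | uniq -c
```

Counting the hits per interface and queue makes it easier to see whether the timeouts stay on one leg of the bond.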
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: 
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw 1 root audio 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-06 Thread Andre Ruiz
@Bartosz

$ ethtool -i enp65s0f0 |grep firmware-version
firmware-version: 4.20 0x8001784b 22.0.9

This is the latest firmware supported by Dell. You will find 4.3
available on the Intel website, but it is not yet available through the
Dell firmware tools.
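
For comparing several hosts, the same check can be wrapped in a small helper. This is only a sketch: the function name `fw_version` is hypothetical, and it assumes the `firmware-version:` field appears in `ethtool -i` output as shown above.

```shell
#!/bin/sh
# Print the value of the firmware-version field from `ethtool -i` output
# read on stdin (everything after the "firmware-version:" label).
fw_version() {
    sed -n 's/^firmware-version:[[:space:]]*//p'
}

# Possible use; replace enp65s0f0 with the port under test:
#   ethtool -i enp65s0f0 | fw_version
```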


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-06 Thread Bartosz Woronicz
What is the card's firmware version?

$ ethtool -i <interface> | grep firmware-version

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: 
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with 
exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-12-04 Thread Andre Ruiz
I have tried this (patches suggested in comment #40) and the problem
seems to have gone away. It may be too soon to say but my test scenario
(which never gave me a false negative before) finished without issues.

Of course this is not a 'fix', so I'm curious to see what the OP has to
say about this result.


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-11-29 Thread Christian Rohmann
1) Andre, after I switched to active-backup the issue is gone (so far).
But yeah, we are looking for a reproducer as well. It's hard to narrow
down some random issue - also likely for Intel.

2) But I just received an email from an Intel developer with a suggested
change to the driver to narrow down the issue further. I quote ...

--- cut ---

Could you edit the file (from kernel source tree base)
drivers/net/ethernet/intel/ice/ice_lag.c.
Then find the functions ice_init_lag() and ice_deinit_lag().

Then add this line to the beginning of the functions:

return 0; and return; respectively.


the patch nomenclature would look something like this:


  * Memory will be freed in ice_deinit_lag
  */
 int ice_init_lag(struct ice_pf *pf)
 {
        struct device *dev = ice_pf_to_dev(pf);
        struct ice_lag *lag;
        struct ice_vsi *vsi;
        int err;

+       return 0;
        pf->lag = kzalloc(sizeof(*lag), GFP_KERNEL);
        if (!pf->lag)
                return -ENOMEM;
        lag = pf->lag;

………


  * This function is meant to only be called on driver remove/shutdown
  */
 void ice_deinit_lag(struct ice_pf *pf)
 {
        struct ice_lag *lag;

+       return;
        lag = pf->lag;

Then re-build the driver and try to reproduce the problem?

--- cut ---


So in essence I believe this just skips offloading the bonding / LACP to the HW.
I will set this up on one or two of our machines to test. Would you please also 
try this on your systems?


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-11-21 Thread Andre Ruiz
Hi Christian

In my tests, I saw the same issues with active-backup too.

Do you know a way to reproduce this issue? I'm having a hard time
finding a consistent reproducer. Currently I need to deploy a complete
openstack and run a set of load tests on it; eventually the problem
shows up, but it takes many hours and does not happen on all hosts.

It would be much easier to have just one machine and trigger the issue
in some other way.

Also, this is fixed upstream: some change between 1.8.x and 1.9.x of the
upstream source drivers fixed the problem (they are at 1.12.x now, so it
has been fixed for quite a while). The problem is that whatever the
fix is, it has not been imported into kernels 5.15 (jammy GA), 6.2 (jammy
HWE), or 6.5 (mantic GA). I could not reliably test upstream mainline 6.6
because no Ubuntu release currently ships that kernel and the pure
upstream kernel breaks a lot of stuff in Ubuntu.

I mention this because of your post in the intel mailing list. They will
probably not be able to help much.

Let me know if you find a consistent reproducer.


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-11-21 Thread Christian Rohmann
I ran into this issue on 22.04 LTS (using HWE kernel 6.2) on a 100G dual-port
E810 NIC.
For me it also occurs only with LACP; active-backup works without issues.

To bring this more to the attention of the driver devs, I posted to the
intel-wired-lan ML:
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231120/038096.html
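
When comparing LACP and active-backup hosts, it can help to record which mode a bond is actually running and whether a member port has gone down. The sketch below assumes the usual layout of the bonding driver's status file under /proc/net/bonding/; the function names `bond_mode` and `bond_ports` and the bond name `bond0` are illustrative, not taken from this thread.

```shell
#!/bin/sh
# Print the bonding mode from /proc/net/bonding/<bond> content on stdin.
bond_mode() {
    sed -n 's/^Bonding Mode:[[:space:]]*//p'
}

# Print "<port> <MII status>" for each member port.
# The bond-level "MII Status" line comes before the first
# "Slave Interface" line, so it is skipped while no port name is set.
bond_ports() {
    awk -F': ' '
        /^Slave Interface:/ { port = $2; next }
        /^MII Status:/ && port != "" { print port, $2; port = "" }
    '
}

# Possible use (bond0 is an assumed bond name):
#   bond_mode  < /proc/net/bonding/bond0
#   bond_ports < /proc/net/bonding/bond0
```

Both helpers read from stdin, so they can also be run against a saved copy of the status file from an affected host.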

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
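
  A minimal sketch of how the watchdog messages quoted above could be
  pulled out of a kernel log. The regular expression is modelled only on
  the single message quoted in this report; the function name is
  illustrative and not part of any tool mentioned in this bug.

```python
import re
import sys

# Pattern modelled on the message quoted in this report:
#   "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"
# Assumption: other affected hosts log the same wording.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return (interface, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match["iface"], match["driver"], int(match["queue"])))
    return hits


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for iface, driver, queue in find_tx_timeouts(sys.stdin):
        print(f"{iface} ({driver}): tx queue {queue} timed out")
```

  Saved under a hypothetical name such as find_tx_timeouts.py, it could be
  fed from the kernel log with `journalctl -k | python3 find_tx_timeouts.py`.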
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: 
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with 
exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-20 Thread Andre Ruiz
Removing LACP bonding (using just one interface without any kind of bonding)
seems to have helped; I'm not seeing the issue anymore. Still testing.

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
Disabling TSO on both legs of the bond in all hosts did not help. After 2h30min 
working well, it happened again.
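
For anyone repeating this test, a minimal sketch of how TSO could be turned
off on both legs. The thread does not record the exact commands that were
run, so this is an assumption: it uses the standard `ethtool -K <iface> tso
off` form, and enp161s0f1 is an assumed name for the second port (only
enp161s0f0 appears in the report).

```python
import subprocess


def tso_off_commands(interfaces):
    """Build one `ethtool -K <iface> tso off` argv list per interface."""
    return [["ethtool", "-K", iface, "tso", "off"] for iface in interfaces]


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # enp161s0f0 is taken from the report; enp161s0f1 is an assumed name
    # for the second leg of the bond.
    apply(tso_off_commands(["enp161s0f0", "enp161s0f1"]))
```

The default is a dry run that only prints the commands. The setting does
not persist across reboots unless it is also added to the network
configuration.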

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
I got a suggestion to try disabling TSO, which helped in similar cases (same
queue timeout error) with the e1000e driver. I will report back soon.

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
https://www.mail-archive.com/e1000-de...@lists.sourceforge.net/msg12747.html

similar issue

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2036239

Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.

  Details:

  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)

  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.

  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as
  the problem seems to be in the interface.

  - machine installed by maas. No issues during installation, but at
  that time bond is not formed yet, later when linux is booted, the bond
  is formed and works without issues for a while

  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered
  by some tests that I run after openstack finishes installing)

  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet

  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace

  - the switch does log that the bond is flapping
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic 
root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: 
dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
   crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse:
   Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with 
exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
   Cannot stat file /proc/323635/fd/10: Permission denied
  CRDA: N/A
  CasperMD5CheckResult: unknown
  CloudArchitecture: x86_64
  CloudID: maas
  CloudName: maas
  CloudPlatform: maas
  CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
  DistroRelease: Ubuntu 22.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7525
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  Package: linux (not installed)
  PciMultimedia:
  

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
I have not tested without the bond, but I believe this issue is probably not
directly related to the interface being bonded, in which case removing the
bond will not help. I will try to test this if possible (it depends on the
customer reconfiguring the switch side). In the meantime, I would appreciate
any suggestion or workaround that could unblock the deployment.

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
I added logs from a machine that I'm not sure was affected (infra01). I'm
adding more logs below for the one that is certainly affected (cloud002).


** Description changed:

  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu 
Jammy.
  
  Details:
  
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)
  
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  
  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as the
  problem seems to be in the interface.
  
  - machine installed by maas. No issues during installation, but at that
  time bond is not formed yet, later when linux is booted, the bond is
  formed and works without issues for a while
  
  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered by
  some tests that I run after openstack finishes installing)
  
  - one of the legs of the bond freezes and everything that would go to
  that lag is discarded, in and out, ping to random external hosts start
  losing every second packet
  
  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace
  
  - the switch does log that the bond is flapping
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
   crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  CRDA: N/A
  CasperMD5CheckResult: pass
  CloudArchitecture: x86_64
  CloudID: none
  CloudName: none
  CloudPlatform: none
  CloudSubPlatform: config
  DistroRelease: Ubuntu 22.04
  InstallationDate: Installed on 2023-08-22 (24 days ago)
  InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release 
amd64 (20230810)
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  MachineType: Dell Inc. PowerEdge R7515
  Package: linux (not installed)
  PciMultimedia:
   
  ProcFB: 0 mgag200drmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
  ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-83-generic N/A
   linux-backports-modules-5.15.0-83-generic  N/A
   linux-firmware 20220329.git681281e4-0ubuntu3.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
  Tags:  jammy uec-images
  Uname: Linux 5.15.0-83-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: N/A
  _MarkForUpload: True
  dmi.bios.date: 07/27/2023
  dmi.bios.release: 2.12
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.12.4
  dmi.board.name: 0J91V2
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
  dmi.product.family: PowerEdge
  dmi.product.name: PowerEdge R7515
  dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
  dmi.sys.vendor: Dell Inc.
+ --- 
+ ProblemType: Bug
+ AlsaDevices:
+  total 0
+  crw-rw 1 root audio 116,  1 Sep 15 03:13 seq
+  crw-rw 1 root audio 116, 33 Sep 15 03:13 timer
+ AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
+ ApportVersion: 2.20.11-0ubuntu82.5
+ Architecture: amd64
+ ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
+ AudioDevicesInUse:
+  Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
+  Cannot stat file /proc/323635/fd/10: Permission denied
+ CRDA: N/A
+ CasperMD5CheckResult: unknown
+ CloudArchitecture: x86_64
+ CloudID: maas
+ CloudName: maas
+ CloudPlatform: maas
+ CloudSubPlatform: seed-dir (http://10.3.4.7:5248/MAAS/metadata/)
+ DistroRelease: Ubuntu 22.04
+ IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
+ MachineType: Dell Inc. PowerEdge R7525
+ NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
+ Package: linux (not installed)
+ PciMultimedia:
+  
+ ProcFB: 0 mgag200drmfb
+ ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.2.0-32-generic root=UUID=9b437790-e6e2-4a2e-af79-5b13fee932af ro
+ ProcVersionSignature: Ubuntu 6.2.0-32.32~22.04.1-generic 6.2.16
+ 

[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
** Tags added: apport-collected jammy uec-images

** Description changed:

  
  I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.
  
  Details:
  
  - hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet
  Controller E810-XXV for SFP (rev 02)
  
  - tested with both GA and HWE kernels (`5.15.0-83-generic #92` and
  `6.2.0-32-generic #32~22.04.1-Ubuntu`) with the same results.
  
  - using a bond over the two ports of the same card, at 25Gbps to two
  different switches, bond is using LACP with hash layer3+4 and fast
  timeout. But I believe the bug is not directly related to bonding as the
  problem seems to be in the interface.
  
  - machine installed by maas. No issues during installation, but at that
  time bond is not formed yet, later when linux is booted, the bond is
  formed and works without issues for a while
  
  - it works for about 2 to 3 hours fine, then the issue starts (may or
  may not be related to network load, but it seems that it is triggered by
  some tests that I run after openstack finishes installing)
  
  - one of the legs of the bond freezes and everything that would go to
  that leg is discarded, in and out; pings to random external hosts start
  losing every second packet
  
  - after some time you can see on the kernel log messages about "NETDEV
  WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" and a stack
  trace
  
  - the switch does log that the bond is flapping
+ --- 
+ ProblemType: Bug
+ AlsaDevices:
+  total 0
+  crw-rw 1 root audio 116,  1 Sep 12 20:05 seq
+  crw-rw 1 root audio 116, 33 Sep 12 20:05 timer
+ AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
+ ApportVersion: 2.20.11-0ubuntu82.5
+ Architecture: amd64
+ ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
+ AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
+ CRDA: N/A
+ CasperMD5CheckResult: pass
+ CloudArchitecture: x86_64
+ CloudID: none
+ CloudName: none
+ CloudPlatform: none
+ CloudSubPlatform: config
+ DistroRelease: Ubuntu 22.04
+ InstallationDate: Installed on 2023-08-22 (24 days ago)
+ InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
+ IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
+ MachineType: Dell Inc. PowerEdge R7515
+ Package: linux (not installed)
+ PciMultimedia:
+  
+ ProcFB: 0 mgag200drmfb
+ ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=cfb5f171-77e6-4fcd-947b-52901f51b26a ro
+ ProcVersionSignature: Ubuntu 5.15.0-83.92-generic 5.15.116
+ RelatedPackageVersions:
+  linux-restricted-modules-5.15.0-83-generic N/A
+  linux-backports-modules-5.15.0-83-generic  N/A
+  linux-firmware 20220329.git681281e4-0ubuntu3.18
+ RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
+ Tags:  jammy uec-images
+ Uname: Linux 5.15.0-83-generic x86_64
+ UpgradeStatus: No upgrade log present (probably fresh install)
+ UserGroups: N/A
+ _MarkForUpload: True
+ dmi.bios.date: 07/27/2023
+ dmi.bios.release: 2.12
+ dmi.bios.vendor: Dell Inc.
+ dmi.bios.version: 2.12.4
+ dmi.board.name: 0J91V2
+ dmi.board.vendor: Dell Inc.
+ dmi.board.version: A01
+ dmi.chassis.type: 23
+ dmi.chassis.vendor: Dell Inc.
+ dmi.modalias: dmi:bvnDellInc.:bvr2.12.4:bd07/27/2023:br2.12:svnDellInc.:pnPowerEdgeR7515:pvr:rvnDellInc.:rn0J91V2:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=08FD;ModelName=PowerEdgeR7515:
+ dmi.product.family: PowerEdge
+ dmi.product.name: PowerEdge R7515
+ dmi.product.sku: SKU=08FD;ModelName=PowerEdge R7515
+ dmi.sys.vendor: Dell Inc.

** Attachment added: "CurrentDmesg.txt"
   https://bugs.launchpad.net/bugs/2036239/+attachment/5701312/+files/CurrentDmesg.txt
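
For anyone triaging a dmesg capture like the one attached, here is a minimal
Python sketch (not part of the original report) that pulls the watchdog events
out of a log. The regular expression is an assumption based only on the
message format quoted in this bug, and the function names are hypothetical
helpers made up for illustration. The sample lines are the two watchdog
messages quoted in this thread plus one unrelated line.

```python
import re
from collections import Counter

# Assumed message shape, taken from the lines quoted in this bug:
#   [33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
# The timestamp prefix is optional so that quoted or "dmesg -t" style lines match too.
WATCHDOG_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s+)?"
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_events(lines):
    """Return one dict per NETDEV WATCHDOG message found in an iterable of log lines."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match is None:
            continue
        ts = match.group("ts")
        events.append({
            "timestamp": float(ts) if ts is not None else None,
            "iface": match.group("iface"),
            "driver": match.group("driver"),
            "queue": int(match.group("queue")),
        })
    return events


def count_by_interface(events):
    """Count watchdog events per interface, to see which bond leg is affected."""
    return Counter(event["iface"] for event in events)


# Sample input: lines copied from this report.
SAMPLE = [
    "[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out",
    "[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230",
    "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out",
]

events = find_watchdog_events(SAMPLE)
per_iface = count_by_interface(events)
```

To run it against a real capture, pass an open file object instead of SAMPLE,
for example `find_watchdog_events(open("CurrentDmesg.txt", errors="replace"))`.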


Title:
  Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

Status in linux package in Ubuntu:
  Confirmed


[Kernel-packages] [Bug 2036239] Re: Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out

2023-09-15 Thread Andre Ruiz
This is the log from the HWE kernel:

[33219.508873] ------------[ cut here ]------------
[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
[33219.508940] Modules linked in: sch_ingress nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi kvm_amd video ledtrig_audio nls_iso8859_1 irdma sparse_keymap kvm i40e irqbypass dell_smbios dcdbas ib_uverbs rapl dell_wmi_descriptor wmi_bmof ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops
[33219.509051]  reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect sysimgblt crc32_pclmul bcache polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ahci xhci_pci cryptd ice tg3 libahci drm megaraid_sas i2c_piix4 xhci_pci_renesas nvme_common wmi
[33219.509114] CPU: 48 PID: 0 Comm: swapper/48 Tainted: P   O   6.2.0-32-generic #32~22.04.1-Ubuntu
[33219.509116] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230
[33219.509122] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 66 83 78 01 01 e8 56 00 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 08 4f e4 b7 48 89 c2 e8 61 df 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[33219.509123] RSP: 0018:b42719fd0e70 EFLAGS: 00010246
[33219.509125] RAX:  RBX: 9bd91b3e74c8 RCX: 
[33219.509126] RDX:  RSI:  RDI: 
[33219.509127] RBP: b42719fd0e98 R08:  R09: 
[33219.509128] R10:  R11:  R12: 9bd91b3e7000
[33219.509129] R13: 9bd91b3e741c R14: 0023 R15: 
[33219.509130] FS:  () GS:9b573de0() knlGS:
[33219.509132] CS:  0010 DS:  ES:  CR0: 80050033
[33219.509133] CR2: 55fd64034000 CR3: 010273ae2004 CR4: 00770ee0
[33219.509135] PKRU: 5554
[33219.509135] Call Trace:
[33219.509137]  <IRQ>
[33219.509140]  ? show_regs+0x72/0x90
[33219.509145]  ? dev_watchdog+0x21f/0x230
[33219.509147]  ? __warn+0x8d/0x160
[33219.509151]  ? dev_watchdog+0x21f/0x230
[33219.509154]  ? report_bug+0x1bb/0x1d0
[33219.509158]  ? handle_bug+0x46/0x90
[33219.509162]  ? exc_invalid_op+0x19/0x80
[33219.509165]  ? asm_exc_invalid_op+0x1b/0x20
[33219.509171]  ? dev_watchdog+0x21f/0x230
[33219.509174]  ? __pfx_dev_watchdog+0x10/0x10
[33219.509176]  call_timer_fn+0x2c/0x160
[33219.509180]  ? __pfx_dev_watchdog+0x10/0x10
[33219.509182]  __run_timers.part.0+0x1fb/0x2b0
[33219.509185]  ? ktime_get+0x46/0xc0
[33219.509187]  ? __pfx_tick_sched_timer+0x10/0x10
[33219.509191]  ? native_apic_msr_write+0x46/0x70
[33219.509194]  ? lapic_next_event+0x20/0x30
[33219.509197]  ? clockevents_program_event+0xb5/0x140
[33219.509200]  run_timer_softirq+0x2a/0x60
[33219.509202]  __do_softirq+0xdd/0x330
[33219.509205]  ? hrtimer_interrupt+0x12b/0x250
[33219.509208]  __irq_exit_rcu+0xa2/0xd0
[33219.509210]  irq_exit_rcu+0xe/0x20
[33219.509212]  sysvec_apic_timer_interrupt+0x96/0xb0
[33219.509215]  </IRQ>
[33219.509216]  <TASK>
[33219.509216]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[33219.509219] RIP: 0010:mwait_idle+0x55/0x90
[33219.509222] Code: 31 d2 48 89 d1 65 48 8b 04 25 40 18 03 00 0f 01 c8 48 8b 00 a8 08 75 14 eb 07 0f 00 2d 24 d2 35 00 31 c0 48 89 c1 fb 0f 01 c9  06 fb 0f 1f 44 00 00 65 48 8b 04 25 40 18 03 00 f0 80 60 02 df
[33219.509224] RSP: 0018:b42700587e80 EFLAGS: 0246
[33219.509225] RAX:  RBX: 9ad9ccd999c0 RCX: 
[33219.509226] RDX:  RSI:  RDI: 
[33219.509227] RBP: b42700587e80 R08:  R09: 
[33219.509229] R10:  R11:  R12: 
[33219.509230] R13:  R14:  R15: