** Description changed:

- [impact]
+ [Impact]
+ The i40e driver sometimes triggers a Malicious Driver Detection (MDD) event 
in the firmware. The firmware responds by resetting the NIC, which interrupts 
the network connection. The reset causes at least a temporary interruption in 
network traffic and can cause further problems, e.g. if the interface is in a 
bond.
  
- The i40e driver sometimes causes a "malicious device" event that the
- firmware detects, which causes the firmware to reset the nic, causing an
- interruption in the network connection - which can cause further
- problems, e.g. if the interface is in a bond; the reset will at least
- cause a temporary interruption in network traffic.
+ [Fix]
+ MDD events issued for the PF are usually the result of a misconfigured TX 
descriptor, not of "bad" actions in the VFs. We don't need to reset the whole 
NIC in that case; the TX hang checks should handle those events if necessary.
  
- [fix]
- 
- The fix for this is currently unknown.  As the "MDD event" is generated
- by the i40e firmware, and is completely undocumented, there is no way to
- tell what the i40e driver did to cause the MDD event.
- 
- [test case]
- 
- the bug is unfortunately very difficult to reproduce, but as shown in
- this (and previous) bug comments, some users of the i40e have traffic
- that can consistently reproduce the problem (although usually on the
- order of days, or longer, to reproduce). Reproducing is easily detected,
- as the nw traffic will be interrupted and the system logs will contain a
- message like:
- 
+ [Test Case]
+ The bug is unfortunately difficult to reproduce, as there is no detailed 
documentation on how the i40e firmware detects and raises MDD events. We have 
seen reports of it on Xenial and Bionic, for workloads stressing i40e bonds in 
LACP mode.
+ A successful reproduction is easy to detect: network traffic is interrupted 
and the system logs contain a message like:
  i40e 0000:02:00.1: TX driver issue detected, PF reset issued
  
- [regression potential]
+ [Regression Potential]
+ Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking.
  
- unknown since the specific fix is unknown.
- 
+ ==
  [original description]
  
  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.
  
  See bug 1713553 and bug 1723127 for more details.
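
For anyone trying to confirm a reproduction, the log message quoted in the test case can be searched for mechanically. The following is only a minimal sketch: the function name `find_mdd_resets` is hypothetical, and the PCI address pattern is an assumption generalised from the single example line in this report.

```python
import re

# Matches the log line quoted in the test case. The PCI address layout
# (domain:bus:device.function) is an assumption based on the one example
# "0000:02:00.1" given in this report.
MDD_RESET_RE = re.compile(
    r"i40e (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"TX driver issue detected, PF reset issued"
)


def find_mdd_resets(lines):
    """Return the PCI addresses of i40e functions that logged a PF reset."""
    hits = []
    for line in lines:
        match = MDD_RESET_RE.search(line)
        if match:
            hits.append(match.group("pci"))
    return hits


# Example with an in-memory sample; in practice the lines would come from
# the kernel log (for instance the output of dmesg).
sample = [
    "i40e 0000:02:00.1: TX driver issue detected, PF reset issued",
    "bond0: link status definitely up for interface eth1",
]
print(find_mdd_resets(sample))
```

A kernel carrying the proposed fix would presumably no longer print the "PF reset issued" part for PF-originated MDD events, so an empty result on a patched system does not by itself show that no MDD event occurred.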

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1772675

Title:
  Intel i40e PF reset due to incorrect MDD detection
  (continues...again...)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
