[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-04-12 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 4.4.0-208.240

---
linux (4.4.0-208.240) xenial; urgency=medium

  * xenial/linux: 4.4.0-208.240 -proposed tracker (LP: #1922069)

  * linux ADT test failure with linux/4.4.0-207.239 -
    ubuntu_qrt_kernel_security.test-kernel-security.py (LP: #1922200) //
    CVE-2018-5953 // CVE-2018-5995 // CVE-2018-7754
- SAUCE: Revert "printk: hash addresses printed with %p"

  * lxd 2.0.11-0ubuntu1~16.04.4 ADT test failure with linux 4.4.0-207.239
    (LP: #1921969)
- SAUCE: Fix fuse regression in 4.4.0-207.239

linux (4.4.0-207.239) xenial; urgency=medium

  * xenial/linux: 4.4.0-207.239 -proposed tracker (LP: #1919558)

  * Xenial update: v4.4.262 upstream stable release (LP: #1920221)
- uapi: nfnetlink_cthelper.h: fix userspace compilation error
- ath9k: fix transmitting to stations in dynamic SMPS mode
- net: Fix gro aggregation for udp encaps with zero csum
- can: skb: can_skb_set_owner(): fix ref counting if socket was closed before
  setting skb ownership
- can: flexcan: assert FRZ bit in flexcan_chip_freeze()
- can: flexcan: enable RX FIFO after FRZ/HALT valid
- netfilter: x_tables: gpf inside xt_find_revision()
- cifs: return proper error code in statfs(2)
- floppy: fix lock_fdc() signal handling
- Revert "mm, slub: consider rest of partial list if acquire_slab() fails"
- futex: Change locking rules
- futex: Cure exit race
- futex: fix dead code in attach_to_pi_owner()
- net/mlx4_en: update moderation when config reset
- net: lapbether: Remove netif_start_queue / netif_stop_queue
- net: davicom: Fix regulator not turned off on failed probe
- net: davicom: Fix regulator not turned off on driver removal
- media: usbtv: Fix deadlock on suspend
- mmc: mxs-mmc: Fix a resource leak in an error handling path in
  'mxs_mmc_probe()'
- mmc: mediatek: fix race condition between msdc_request_timeout and irq
- powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
- PCI: xgene-msi: Fix race in installing chained irq handler
- s390/smp: __smp_rescan_cpus() - move cpumask away from stack
- scsi: libiscsi: Fix iscsi_prep_scsi_cmd_pdu() error handling
- ALSA: hda/hdmi: Cancel pending works before suspend
- ALSA: hda: Avoid spurious unsol event handling during S3/S4
- ALSA: usb-audio: Fix "cannot get freq eq" errors on Dell AE515 sound bar
- s390/dasd: fix hanging DASD driver unbind
- mmc: core: Fix partition switch time for eMMC
- scripts/recordmcount.{c,pl}: support -ffunction-sections .text.* section
  names
- Goodix Fingerprint device is not a modem
- usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio
  slot
- usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
- xhci: Improve detection of device initiated wake signal.
- USB: serial: io_edgeport: fix memory leak in edge_startup
- USB: serial: ch341: add new Product ID
- USB: serial: cp210x: add ID for Acuity Brands nLight Air Adapter
- USB: serial: cp210x: add some more GE USB IDs
- usbip: fix stub_dev to check for stream socket
- usbip: fix vhci_hcd to check for stream socket
- usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
- staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
- staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
- staging: rtl8712: unterminated string leads to read overflow
- staging: rtl8188eu: fix potential memory corruption in
  rtw_check_beacon_data()
- staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
- staging: rtl8192e: Fix possible buffer overflow in _rtl92e_wx_set_scan
- staging: comedi: addi_apci_1032: Fix endian problem for COS sample
- staging: comedi: addi_apci_1500: Fix endian problem for command sample
- staging: comedi: adv_pci1710: Fix endian problem for AI command data
- staging: comedi: das6402: Fix endian problem for AI command data
- staging: comedi: das800: Fix endian problem for AI command data
- staging: comedi: dmm32at: Fix endian problem for AI command data
- staging: comedi: me4000: Fix endian problem for AI command data
- staging: comedi: pcl711: Fix endian problem for AI command data
- staging: comedi: pcl818: Fix endian problem for AI command data
- NFSv4.2: fix return value of _nfs4_get_security_label()
- block: rsxx: fix error return code of rsxx_pci_probe()
- alpha: add $(src)/ rather than $(obj)/ to make source file path
- alpha: merge build rules of division routines
- alpha: make short build log available for division routines
- alpha: Package string routines together
- alpha: move exports to actual definitions
- alpha: get rid of tail-zeroing in __copy_user()
- alpha: switch __copy_user() and __do_clean_user() to normal calling
  conventions
- powerpc/64s: 

[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-04-12 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 4.15.0-141.145

---
linux (4.15.0-141.145) bionic; urgency=medium

  * bionic/linux: 4.15.0-141.145 -proposed tracker (LP: #1919536)

  * binary assembly failures with CONFIG_MODVERSIONS present (LP: #1919315)
- [Packaging] quiet (nomially) benign errors in BUILD script

  * selftests: bpf verifier fails after sanitize_ptr_alu fixes (LP: #1920995)
- bpf: Simplify alu_limit masking for pointer arithmetic
- bpf: Add sanity check for upper ptr_limit
- bpf, selftests: Fix up some test_verifier cases for unprivileged

  * Packaging resync (LP: #1786013)
- update dkms package versions

  * CVE-2018-13095
- xfs: More robust inode extent count validation

  * i40e PF reset due to incorrect MDD event (LP: #1772675)
- i40e: change behavior on PF in response to MDD event

  * Bionic update: upstream stable patchset 2021-03-09 (LP: #1918330)
- ACPI: sysfs: Prefer "compatible" modalias
- ARM: dts: imx6qdl-gw52xx: fix duplicate regulator naming
- wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
- net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
- drivers: soc: atmel: Avoid calling at91_soc_init on non AT91 SoCs
- drivers: soc: atmel: add null entry at the end of at91_soc_allowed_list[]
- KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in
  intel_arch_events[]
- KVM: x86: get smi pending status correctly
- xen: Fix XenStore initialisation for XS_LOCAL
- leds: trigger: fix potential deadlock with libata
- mt7601u: fix kernel crash unplugging the device
- mt7601u: fix rx buffer refcounting
- xen-blkfront: allow discard-* nodes to be optional
- ARM: imx: build suspend-imx6.S with arm instruction set
- netfilter: nft_dynset: add timeout extension to template
- xfrm: Fix oops in xfrm_replay_advance_bmp
- RDMA/cxgb4: Fix the reported max_recv_sge value
- iwlwifi: pcie: use jiffies for memory read spin time limit
- iwlwifi: pcie: reschedule in long-running memory reads
- mac80211: pause TX while changing interface type
- can: dev: prevent potential information leak in can_fill_info()
- x86/entry/64/compat: Preserve r8-r11 in int $0x80
- x86/entry/64/compat: Fix "x86/entry/64/compat: Preserve r8-r11 in int
  $0x80"
- iommu/vt-d: Gracefully handle DMAR units with no supported address widths
- iommu/vt-d: Don't dereference iommu_device if IOMMU_API is not built
- NFC: fix resource leak when target index is invalid
- NFC: fix possible resource leak
- team: protect features update by RCU to avoid deadlock
- tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
- kernel: kexec: remove the lock operation of system_transition_mutex
- PM: hibernate: flush swap writer after marking
- pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
- net/mlx5: Fix memory leak on flow table creation error flow
- rxrpc: Fix memory leak in rxrpc_lookup_local
- net: dsa: bcm_sf2: put device node before return
- ibmvnic: Ensure that CRQ entry read are correctly ordered
- ACPI: thermal: Do not call acpi_thermal_check() directly
- net_sched: gen_estimator: support large ewma log
- phy: cpcap-usb: Fix warning for missing regulator_disable
- x86: __always_inline __{rd,wr}msr()
- scsi: scsi_transport_srp: Don't block target in failfast state
- scsi: libfc: Avoid invoking response handler twice if ep is already
  completed
- mac80211: fix fast-rx encryption check
- scsi: ibmvfc: Set default timeout to avoid crash during migration
- objtool: Don't fail on missing symbol table
- kthread: Extract KTHREAD_IS_PER_CPU
- workqueue: Restrict affinity change to rescuer
- USB: serial: cp210x: add pid/vid for WSDA-200-USB
- USB: serial: cp210x: add new VID/PID for supporting Teraoka AD2000
- USB: serial: option: Adding support for Cinterion MV31
- arm64: dts: ls1046a: fix dcfg address range
- net: lapb: Copy the skb before sending a packet
- elfcore: fix building with clang
- USB: gadget: legacy: fix an error code in eth_bind()
- USB: usblp: don't call usb_set_interface if there's a single alt
- usb: dwc2: Fix endpoint direction check in ep_from_windex
- ovl: fix dentry leak in ovl_get_redirect
- mac80211: fix station rate table updates on assoc
- kretprobe: Avoid re-registration of the same kretprobe earlier
- xhci: fix bounce buffer usage for non-sg list case
- cifs: report error instead of invalid when revalidating a dentry fails
- smb3: Fix out-of-bounds bug in SMB2_negotiate()
- mmc: core: Limit retries when analyse of SDIO tuples fails
- nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs
- ARM: footbridge: fix dc21285 PCI configuration accessors
- mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
- mm: hugetlb: fix 

[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-30 Thread Heitor Alves de Siqueira
Verified for Bionic following the same test procedure from comment #13, running 
on the current -proposed kernel:
ubuntu@snorlax:~$ uname -rv
4.15.0-141-generic #145-Ubuntu SMP Wed Mar 24 18:08:07 UTC 2021
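
When repeating this verification, it helps to confirm that the running kernel
really is the build named in the package version. The sketch below assumes the
usual Ubuntu convention that `uname -rv` output of the form
"4.15.0-141-generic #145-Ubuntu ..." corresponds to package version
4.15.0-141.145; the helper names are hypothetical and not part of any Ubuntu
tooling.

```python
import re

def parse_uname_rv(text):
    """Split `uname -rv` output into (base version, ABI number, upload number)."""
    m = re.match(r"(\d+\.\d+\.\d+)-(\d+)-\S+\s+#(\d+)", text)
    if m is None:
        raise ValueError(f"unrecognised uname output: {text!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))

def matches_package(text, package_version):
    """True if the uname output corresponds to a version like '4.15.0-141.145'."""
    base, abi, upload = parse_uname_rv(text)
    return f"{base}-{abi}.{upload}" == package_version

# Output quoted in the verification comment above.
bionic = "4.15.0-141-generic #145-Ubuntu SMP Wed Mar 24 18:08:07 UTC 2021"
print(matches_package(bionic, "4.15.0-141.145"))
```

The same check applies to the Xenial kernel by passing its own `uname -rv`
output and package version.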

I had to adjust the tx_desc_addr probes, subtracting 0x4 from both offsets and
changing the relevant registers (r15 for i40e, r11 for i40evf). Other probes
didn't need any changes.
The test results were similar to those on Xenial: the netdev watchdog works as
expected, and no major issues were encountered with a smoke test.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1772675

Title:
  i40e PF reset due to incorrect MDD event

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix

Bug description:
  [Impact]
  The i40e driver sometimes triggers a "malicious driver detection" (MDD) event
in the firmware, which then resets the NIC and interrupts the network
connection. The reset causes at least a temporary interruption in network
traffic, and can cause further problems, e.g. if the interface is part of a
bond.

  [Fix]
  MDD events issued for the PF are usually the result of a misconfigured TX
descriptor and not of "bad" actions in the VFs. We don't need to reset the
whole NIC in that case; the TX hang checks should handle those cases if
necessary.

  [Test Procedure]
  The bug is unfortunately difficult to reproduce, as there's no detailed
documentation on how the i40e firmware detects and raises MDDs. We have seen
reports of this happening in Xenial and Bionic, for workloads stressing i40e
bonds in LACP mode.
  A successful reproduction is easy to detect, as the network traffic will be
interrupted and the system logs will contain a message like:
  i40e :02:00.1: TX driver issue detected, PF reset issued
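
  Checking a captured log for that message can be automated. The following is
a minimal sketch, assuming the log has already been read into a list of lines;
the function name and the sample lines are made up for illustration, and the
PCI address pattern is deliberately loose because its exact form differs
between systems.

```python
import re

# Matches the driver message quoted above; the device address is captured
# as whatever non-space text precedes the final colon.
MDD_RESET_RE = re.compile(
    r"i40e\s+(?P<pci>\S*):\s+TX driver issue detected, PF reset issued"
)

def find_pf_resets(log_lines):
    """Return the device address from every PF reset message in log_lines."""
    hits = []
    for line in log_lines:
        m = MDD_RESET_RE.search(line)
        if m:
            hits.append(m.group("pci"))
    return hits

# Hypothetical sample: one reset message and one unrelated i40e line.
sample = [
    "kernel: i40e :02:00.1: TX driver issue detected, PF reset issued",
    "kernel: i40e 0000:02:00.0: NIC Link is Up",
]
print(find_pf_resets(sample))
```

  An empty result on a kernel with the fix, while MDD events are being raised,
is what the verification expects to see.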

  An alternative test procedure makes use of the kprobes attached to the LP 
bug. The test setup is as follows:
  - Create 2 VFs on primary NIC
  - Passthrough VF 1 to a Bionic VM
  - Start iperf3 client on VM, going through i40evf interface
  - Start another iperf3 client on host, going through i40e interface
  Both iperf3 clients should be using an external server located on a separate 
host. By loading the kprobe module while iperf3 is running, we should be able 
to raise MDDs more consistently. MDD behaviour can change according to firmware 
version, so we may need to try with different sets of probes. The one with the 
most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the 
cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified 
of new data.
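
  To make the last point concrete, here is a toy user-space model of what such
a probe does to the ring. It is not the kprobe module attached to the bug: it
only assumes that a TX data descriptor is 16 bytes, made of a little-endian
64-bit buffer_addr followed by a little-endian 64-bit cmd_type_offset_bsz, and
the garbage value written is an arbitrary placeholder.

```python
import struct

# Assumed layout: two little-endian 64-bit words per descriptor,
# buffer_addr first, then cmd_type_offset_bsz.
TX_DESC = struct.Struct("<QQ")

def corrupt_cmd_type_offset_bsz(ring, last_index, garbage=0xFFFFFFFFFFFFFFFF):
    """Overwrite cmd_type_offset_bsz of one descriptor, keeping buffer_addr."""
    offset = last_index * TX_DESC.size
    buffer_addr, _ = TX_DESC.unpack_from(ring, offset)
    TX_DESC.pack_into(ring, offset, buffer_addr, garbage)

# Build a hypothetical 4-entry ring with distinct buffer addresses.
ring = bytearray(TX_DESC.size * 4)
for i in range(4):
    TX_DESC.pack_into(ring, i * TX_DESC.size, 0x1000 * (i + 1), 0x10)

# Corrupt the last descriptor, as the probe does before the NIC is notified.
corrupt_cmd_type_offset_bsz(ring, last_index=3)
print(TX_DESC.unpack_from(ring, 3 * TX_DESC.size))
```

  Only the second word of the last descriptor changes; the buffer address and
all earlier descriptors are left intact, which is why the hardware sees a
malformed descriptor rather than a bad DMA address.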

  [Regression Potential]
  Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected, and that necessary PF resets will be issued by the netdev watchdog in 
case of any hung queues.

  ==
  [original description]

  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.

  See bug 1713553 and bug 1723127 for more details.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-30 Thread Heitor Alves de Siqueira
Verified for Xenial following the same test procedure from comment #13, running 
on the current -proposed kernel:
ubuntu@snorlax:~/kprobe$ uname -rv
4.4.0-207-generic #239-Ubuntu SMP Thu Mar 25 02:59:26 UTC 2021

The offsets don't seem to have changed on any of the probes, so I've
used the same set that's already uploaded to the bug. The netdev
watchdog kicks in if queues hang, and general test results look good.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial



[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-25 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'.

If verification is not done within 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!


** Tags added: verification-needed-bionic

** Tags added: verification-needed-xenial



[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-25 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-xenial' to 'verification-done-xenial'. If the
problem still exists, change the tag 'verification-needed-xenial' to
'verification-failed-xenial'.

If verification is not done within 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Thank you!



[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-10 Thread Kelsey Skunberg
** Changed in: linux (Ubuntu Xenial)
   Status: In Progress => Fix Committed

** Changed in: linux (Ubuntu Bionic)
   Status: In Progress => Fix Committed



[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-10 Thread Heitor Alves de Siqueira
** Description changed:

  [Impact]
  The i40e driver sometimes causes a "malicious device" event that the firmware 
detects, which causes the firmware to reset the NIC, causing an interruption in 
the network connection - which can cause further problems, e.g. if the 
interface is in a bond; the reset will at least cause a temporary interruption 
in network traffic.
  
  [Fix]
  In the case of MDD events issued for the PF, they are usually the result of a 
misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't 
need to issue a reset to the whole NIC, TX hang checks should handle those if 
necessary.
  
  [Test Procedure]
  The bug is unfortunately difficult to reproduce, as there's no detailed 
documentation on how the i40e firmware detects and raises MDDs. We have seen 
reports of this happening in Xenial and Bionic, for workloads stressing i40e 
bonds in LACP mode.
  Reproducing is easily detected, as the network traffic will be interrupted 
and the system logs will contain a message like:
  i40e :02:00.1: TX driver issue detected, PF reset issued
  
  An alternative test procedure makes use of the kprobes attached to the LP 
bug. The test setup is as follows:
  - Create 2 VFs on primary NIC
  - Passthrough VF 1 to a Bionic VM
  - Start iperf3 client on VM, going through i40evf interface
  - Start another iperf3 client on host, going through i40e interface
  Both iperf3 clients should be using an external server located on a separate 
host. By loading the kprobe module while iperf3 is running, we should be able 
to raise MDDs more consistently. MDD behaviour can change according to firmware 
version, so we may need to try with different sets of probes. The one with the 
most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the 
cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified 
of new data.
  
  [Regression Potential]
- Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected.
+ Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected, and that necessary PF resets will be issued by the netdev watchdog in 
case of any hung queues.
  
  ==
  [original description]
  
  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.
  
  See bug 1713553 and bug 1723127 for more details.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1772675

Title:
  i40e PF reset due to incorrect MDD event

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  Won't Fix


[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-10 Thread Heitor Alves de Siqueira
I'm attaching kprobes for the Xenial kernels. These are based on the 
4.4.0-203/204 versions.
I followed the same test setup as for Bionic, as described in the previous 
comment, and saw similar results. The netdev watchdog seems to take good 
care of any hung queues, so in the end PF resets will still be issued when 
necessary.
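
When checking test runs, it can help to tell the two recovery paths apart in the kernel log: the immediate PF reset issued by the unpatched driver, and the netdev watchdog timeout that precedes a reset on the patched one. The following is a minimal sketch, not part of the attached probes. Both patterns are copied from the log lines quoted in this bug, and the function name and counter keys are made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted in this bug report.
PF_RESET = re.compile(r"i40e .*TX driver issue detected, PF reset issued")
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(i40e\): transmit queue (\d+) timed out"
)


def classify_i40e_events(lines):
    """Count MDD-triggered PF resets and watchdog TX timeouts in log lines.

    `lines` is any iterable of strings, e.g. the output of `dmesg` split
    into lines. The key names are hypothetical, chosen for this sketch.
    """
    counts = Counter()
    for line in lines:
        if PF_RESET.search(line):
            counts["mdd_pf_reset"] += 1
        elif WATCHDOG.search(line):
            counts["watchdog_timeout"] += 1
    return counts
```

On a patched kernel one would expect "mdd_pf_reset" to stay at zero while "watchdog_timeout" may be non-zero after a PF probe has fired. This only inspects log text and says nothing about the state of the NIC itself.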

** Description changed:

  [Impact]
  The i40e driver sometimes causes a "malicious device" event that the firmware 
detects, which causes the firmware to reset the NIC, causing an interruption in 
the network connection - which can cause further problems, e.g. if the 
interface is in a bond; the reset will at least cause a temporary interruption 
in network traffic.
  
  [Fix]
  In the case of MDD events issued for the PF, they are usually the result of a 
misconfigured TX descriptor and not due to "bad" actions in the VFs. We don't 
need to issue a reset to the whole NIC, TX hang checks should handle those if 
necessary.
  
- [Test Case]
+ [Test Procedure]
  The bug is unfortunately difficult to reproduce, as there's no detailed 
documentation on how the i40e firmware detects and raises MDDs. We have seen 
reports of this happening in Xenial and Bionic, for workloads stressing i40e 
bonds in LACP mode.
  A successful reproduction is easy to detect, as the network traffic will be 
interrupted and the system logs will contain a message like:
  i40e :02:00.1: TX driver issue detected, PF reset issued
+ 
+ An alternative test procedure makes use of the kprobes attached to the LP 
bug. The test setup is as follows:
+ - Create 2 VFs on primary NIC
+ - Passthrough VF 1 to a Bionic VM
+ - Start iperf3 client on VM, going through i40evf interface
+ - Start another iperf3 client on host, going through i40e interface
+ Both iperf3 clients should be using an external server located on a separate 
host. By loading the kprobe module while iperf3 is running, we should be able 
to raise MDDs more consistently. MDD behaviour can change according to firmware 
version, so we may need to try with different sets of probes. The one with the 
most consistent results seems to be 'corrupt_tx_desc_addr', which corrupts the 
cmd_type_offset_bsz field of the last TX descriptor before the NIC is notified 
of new data.
  
  [Regression Potential]
  Since we're removing resets for the NIC, regressions could show up as issues 
in connectivity after the MDD events are raised. If the firmware expects the 
whole NIC to reset, we could see TX/RX hangs and general unresponsiveness in 
networking. The potential for this should however be fairly low, as this patch 
has been present since kernel 5.2 and hasn't seen any fixes or regressions 
upstream. Basic smoke tests also showed that the driver continues working as 
expected.
  
  ==
  [original description]
  
  This is a continuation from bug 1713553 and then bug 1723127; a patch
  was added in the first bug and then the second bug, to attempt to fix
  this, and it may have helped reduce the issue but appears not to have
  fixed it, based on more reports.
  
  See bug 1713553 and bug 1723127 for more details.

** Attachment added: "probe_tx_xenial.c"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772675/+attachment/5475473/+files/probe_tx_xenial.c


[Kernel-packages] [Bug 1772675] Re: i40e PF reset due to incorrect MDD event

2021-03-09 Thread Heitor Alves de Siqueira
Given that we could be changing reset behavior that might be expected
by the firmware, I wrote a quick set of kprobes to force the firmware
to raise MDD events and test out the patched kernel from the PPA.

I tried to force faulty TX descriptors according to "Table 7-138. Tx
Descriptor Validity Checks" in the XL710 Datasheet, under section
"7.6.2.2.1 Interrupt on Misbehavior of VM (Malicious Driver Detection)".
This document is publicly available at Intel's Technical Library site
for this NIC.

The test setup is as follows:
- Create 2 VFs on primary NIC
- Passthrough VF 1 to a Bionic VM
- Start iperf3 client on VM, going through i40evf interface
- Start another iperf3 client on host, going through i40e interface

The iperf3 servers in my testing were running on a separate host, so I
only had clients using the i40e NIC. This was primarily to verify what
the networking and connectivity impact would be if we ran into any MDDs.

After both iperf3 clients were running, I loaded the kprobe module
matching the specific TX check I wanted to validate. Raising MDDs on the VF
turned out to be pretty trivial, and most of the i40e probes also work
on i40evf. MDDs on the PF were a bit more tricky to get, but I had good
results with corrupting the final TX descriptor's cmd_type_offset_bsz
field. As this happens right before the driver notifies the NIC about
the new data, it should force the firmware to raise the MDD event, as
opposed to us "manually" triggering it from the driver. This has the
benefit of keeping things consistent from the firmware's point of view,
as in the end it is the one responsible for detecting and notifying the
kernel about those events.
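
To illustrate what "corrupting cmd_type_offset_bsz" means at the bit level, here is a small userspace sketch. It is not the attached kprobe code. The field layout below is an assumption based on the I40E_TXD_QW1_* definitions in the upstream driver and should be verified against i40e_type.h in the kernel under test. The helper names and the example values are hypothetical.

```python
# ASSUMED layout of the 64-bit cmd_type_offset_bsz qword as (shift, width).
# Verify against the I40E_TXD_QW1_* definitions before relying on it.
FIELDS = {
    "dtype": (0, 4),
    "cmd": (4, 12),
    "offset": (16, 18),
    "buf_sz": (34, 14),
    "l2tag1": (48, 16),
}


def pack_qw1(**values):
    """Pack named field values into one 64-bit integer."""
    qw = 0
    for name, value in values.items():
        shift, width = FIELDS[name]
        if value < 0 or value >> width:
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        qw |= value << shift
    return qw


def unpack_qw1(qw):
    """Split a 64-bit qword back into its named fields."""
    return {
        name: (qw >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


def corrupt_field(qw, name, value):
    """Overwrite a single field and leave all other bits untouched."""
    shift, width = FIELDS[name]
    mask = ((1 << width) - 1) << shift
    return (qw & ~mask) | ((value << shift) & mask)


# Example values only: a data descriptor with a 1500-byte buffer,
# then the same descriptor with the buffer size forced to zero.
good = pack_qw1(dtype=0, cmd=0x3, offset=0, buf_sz=1500, l2tag1=0)
bad = corrupt_field(good, "buf_sz", 0)
```

Which corrupted values actually trip the firmware's validity checks depends on the datasheet table referenced above and on the firmware version; the sketch only shows how a single field can be altered while the rest of the descriptor stays intact.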

The primary point with this test was to verify whether we could leave
the NIC in an inconsistent state, by avoiding or delaying the PF reset.
The results were promising, and should hopefully give some more data on
the value of the upstream patch.

When raising MDDs on the VF, the firmware correctly slaps the
appropriate queues and schedules any resets as required. This is the
same behavior as before. With the test kernel however, we don't issue
any resets to the PF, so the iperf3 tests continue running uninterrupted
as desired.

When raising MDDs on the PF, we no longer issue any resets and,
depending on which probe was used, connectivity will stop momentarily.
The netdev watchdog kicks in shortly afterwards and issues a PF reset
as appropriate, and network connectivity resumes. This confirms that,
even with the upstream patch, any hung queues that don't reset
immediately will recover afterwards, as the queue watchdogs will take
care of those. This is consistent with the upstream behavior, and the
kernel logs look similar to the one below:

[  573.279608] NETDEV WATCHDOG: ens1f1 (i40e): transmit queue 1 timed out
[  573.279652] WARNING: CPU: 14 PID: 0 at 
/build/linux-lqvoqZ/linux-4.15.0/net/sched/sch_generic.c:323 
dev_watchdog+0x221/0x230
[  573.279659] Modules linked in: vhost_net vhost tap vfio_pci vfio_virqfd 
vfio_iommu_type1 vfio i40evf xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
ebtable_filter devlink ebtables nls_iso8859_1 intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp intel_cstate 
intel_rapl_perf lpc_ich hpilo ipmi_si ipmi_devintf ipmi_msghandler shpchp 
ioatdma acpi_power_meter mac_hid sch_fq_codel kvm_intel kvm irqbypass 
iptable_filter ip6table_filter ip6_tables br_netfilter bridge stp llc 
arp_tables ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 
raid456 async_raid6_recov
[  573.279726]  async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c 
raid1 raid0 multipath linear ses enclosure crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc mgag200 i2c_algo_bit ttm aesni_intel aes_x86_64 
crypto_simd glue_helper cryptd drm_kms_helper syscopyarea ixgbe sysfillrect 
sysimgblt fb_sys_fops dca i40e drm tg3 ptp nvme hpsa pps_core nvme_core mdio 
scsi_transport_sas wmi [last unloaded: probe_tx_desc]
[  573.279756] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G   OE
4.15.0-137-generic #141+TEST298651v20210225b1-Ubuntu
[  573.279757] Hardware name: HP ProLiant DL360 Gen9, BIOS P89 05/06/2015
[  573.279763] RIP: 0010:dev_watchdog+0x221/0x230
[  573.279764] RSP: 0018:8f28bf183e58 EFLAGS: 00010286
[  573.279766] RAX:  RBX: 0001 RCX: 083f
[  573.279767] RDX:  RSI: 00f6 RDI: 083f
[  573.279769] RBP: 8f28bf183e88 R08: 0694 R09: 0004
[  573.279770] R10: 8f28bf183ee0 R11: 0001 R12: 0040
[  573.279772] R13: 8f2827c69000 R14: 8f2827c69478 R15: 8f2827fa4f40
[  573.279774] FS:  () GS:8f28bf18()