[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-12-15 Thread Frank Heimes
This bug was not opened against linux-xilinx-zynqmp.
So I'm updating the verification tag just to unblock the further process.

** Tags removed: verification-needed-focal
** Tags added: verification-done-focal

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  Fix Released
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  Fix Released

Bug description:
  SRU Justification:
  --

  [Impact]

   * If the mlx5 driver is reloading while the recovery flow is happening,
     and if it receives new commands before the command interface is up
     again, this can lead to a NULL pointer dereference when the driver
     tries to access uninitialized command structures.

   * So it's required to avoid processing commands before the command
     interface is up again.

   * This is accomplished by a new cmdif state that helps to avoid
     processing commands while cmdif is not ready.

  [Fix]

   * backport of f7936ddd35d8 (f7936ddd35d8b849daf0372770c7c9dbe7910fca)
     "net/mlx5: Avoid processing commands before cmdif is ready"

  [Test Plan]

   * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
     is needed that has Mellanox cards (RoCE Express 2.1) assigned,
     configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).

   * Now trigger a recovery (presumably this can be done at the Support
     Element) and reload the driver at the same time.

   * Make sure the module/driver mlx5 is loaded and in use
     (otherwise it can't be removed/unloaded).

   * Now remove/unload the module with:
     sudo modprobe -r mlx5
     and (re-)load it again with:
     sudo modprobe mlx5

   * Due to the lack of RoCE Express 2.1 hardware,
     IBM needs to do the verification.
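
  The unload/reload step above can be sketched as follows. This is only a
  minimal sketch: the function names and the MODULE/MODULES_FILE variables
  are hypothetical, and it assumes the module is called mlx5_core (the name
  that shows up in the kernel log lines of this report), whereas the test
  plan writes "mlx5". Unloading may still fail if another module depends
  on it.

```shell
#!/usr/bin/env bash
# MODULE and MODULES_FILE are assumptions for illustration only; the test
# plan says "mlx5", the kernel log lines in this report say "mlx5_core".
MODULE=${MODULE:-mlx5_core}
MODULES_FILE=${MODULES_FILE:-/proc/modules}

# Succeed if the named module appears in the first column of /proc/modules.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"
}

# Unload and reload the module, refusing to start if it is not loaded.
reload_module() {
    if ! module_loaded "$MODULE"; then
        echo "$MODULE is not loaded; nothing to reload" >&2
        return 1
    fi
    sudo modprobe -r "$MODULE" && sudo modprobe "$MODULE"
}
```

  Calling reload_module while a recovery is in progress would then cover
  the "reload the driver at the same time" part of the plan.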

  [Where problems could occur]

   * If there is an issue with 'cmdif', it might not have the correct
     interface state, which:
     - either might mean that commands are not properly blocked,
       leaving the situation similar to before
     - or might mean that commands always get blocked,
       which renders the hardware useless
     - or might block commands in the wrong situation,
       which will cause unexpected issues and broken behavior.

   * Since the patch was accepted upstream with v5.7-rc7, it's
     not new to the kernel, was already part of groovy (and above),
     and is therefore already in use by newer Ubuntu releases.

  [Other Info]

   * Because the patch has been upstream since v5.7-rc7,
     it's already included in jammy and kinetic.

   * Since the upstream patch includes the line:
     Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
     Connect-IB adapters") it looks to me as if marking the patch
     for upstream stable updates was forgotten.

   * Such SRUs for focal's 5.4 will automatically land in bionic's
     hwe-5.4, too. But since this was specifically requested for
     bionic's hwe-5.4, I wanted to mention this here.
  __

  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of
  mlx5e_ethtool_get_link_ksettings() when this is called through sysfs
  (going through ethtool might have different checks).
  I managed to find a reliable way to reproduce the issue that I believe
  isn't tied to IBM Z at all.

  The procedure to reproduce is as follows. I created a script to read
  the sysfs attributes for the link's speed and duplex mode in a loop:

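  #!/usr/bin/env bash

  # Require the network interface name as the first argument.
  if [ $# -lt 1 ]; then
      echo "Usage: $0 INTERFACE"
      exit 1
  fi

  # Read the link's duplex and speed attributes via sysfs in an endless loop.
  while true; do
      cat "/sys/class/net/$1/duplex" > /dev/null
      cat "/sys/class/net/$1/speed" > /dev/null
  done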

  Executed with:

  # ./script.sh enP10p0s0

  I ran this in one bash session and then in another one I triggered a PCI
  reset with the following command, inserting the PCI address of the NIC
  between "devices/" and "/reset":

  echo 1 > /sys/bus/pci/devices//reset
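
  If the PCI address is not at hand, it can usually be derived from the
  interface name, because /sys/class/net/<interface>/device is a symlink
  to the PCI device directory. A minimal sketch, assuming a hypothetical
  helper name and a SYSFS_ROOT override that exists only so the lookup can
  be tried against a test tree:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bug report): print the PCI address
# of the device behind a network interface.
# SYSFS_ROOT is an assumption for testing; it defaults to the real /sys.
pci_addr_of_iface() {
    dev_link="${SYSFS_ROOT:-/sys}/class/net/$1/device"
    # The "device" entry is a symlink to the PCI device directory.
    [ -e "$dev_link" ] || return 1
    basename "$(readlink -f "$dev_link")"
}
```

  For the interface from the log below, `pci_addr_of_iface enP16p0s0`
  would be expected to print something like 0010:00:00.0, which is the
  value to insert into the reset path.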

  At first I got a lot of the following messages:

   mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5

  And then, as the mlx5 driver's recovery kicks in, the oops below:

  [  659.103947] mlx5_core 0010:00:00.0: wait vital counter value 0x7b399f after 1 iterations
  [  659.103947] mlx5_core 0010:00:00.0: mlx5_pci_resume was called
  [  659.103966] mlx5_core 0010:00:00.0: firmware version: 14.32.1010
  [  659.104169] Unable to handle kernel pointer dereference in virtual kernel address space
  [  659.104171] Failing address:  TEID: 0483
  [  659.104172] Fault in home space mode while using kernel ASCE.
  [  659.104173] AS:3d29c007 R3:fffd0007 S:fffd5800 P:003d
  [  659.104200] Oops: 0004 ilc:2 [#1] SMP
  [  659.104202] Modules linked in: s390_trng ism smc pnet chsc_sch eadm_sch vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio
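
  When verifying a kernel, the two signatures quoted above can be told
  apart mechanically. A minimal sketch, assuming a hypothetical helper
  that reads kernel log text on stdin; the result labels are made up for
  illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bug report): classify kernel log
# text read from stdin by the two message signatures quoted in this report.
classify_mlx5_log() {
    log=$(cat)
    case $log in
        *'Oops:'*)                  echo oops ;;
        *'query port ptys failed'*) echo query-failures-only ;;
        *)                          echo clean ;;
    esac
}
```

  It could be fed with `sudo dmesg | classify_mlx5_log` after running the
  reproducer. Failed queries alone just mean the device was unavailable
  during the reset, while "oops" means the crash was hit.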
  

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-12-15 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the
linux-xilinx-zynqmp/5.4.0-1020.24 kernel in -proposed solves the problem.
Please test the kernel and update this bug with the results. If the
problem is solved, change the tag 'verification-needed-focal' to
'verification-done-focal'. If the problem still exists, change the tag
'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: kernel-spammed-focal-linux-xilinx-zynqmp verification-needed-focal


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-10-10 Thread bugproxy
** Tags removed: targetmilestone-inin---
** Tags added: targetmilestone-inin2004


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-10-10 Thread Frank Heimes
** Changed in: ubuntu-z-systems
   Status: Fix Committed => Fix Released


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-10-10 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 5.4.0-128.144

---
linux (5.4.0-128.144) focal; urgency=medium

  * focal/linux: 5.4.0-128.144 -proposed tracker (LP: #1990152)

  * CVE-2022-3176
    - io_uring: disable polling pollfree files

  * ip/nexthop: fix default address selection for connected nexthop
    (LP: #1988809)
    - selftests/net: test nexthop without gw

  * ip/nexthop: fix default address selection for connected nexthop
    (LP: #1988809) // icmp_redirect.sh in ubuntu_kernel_selftests failed on
    Jammy 5.15.0-49.55 (LP: #1990124)
    - ip: fix triggering of 'icmp redirect'

linux (5.4.0-127.143) focal; urgency=medium

  * focal/linux: 5.4.0-127.143 -proposed tracker (LP: #1989892)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.09.19)

  * [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during
    recovery (LP: #1987287)
    - net/mlx5: Avoid processing commands before cmdif is ready

  * Focal update: v5.4.210 upstream stable release (LP: #1989230)
    - thermal: Fix NULL pointer dereferences in of_thermal_ functions
    - ACPI: video: Force backlight native for some TongFang devices
    - ACPI: video: Shortening quirk list by identifying Clevo by board_name only
    - ACPI: APEI: Better fix to avoid spamming the console with old error logs
    - bpf: Verifer, adjust_scalar_min_max_vals to always call
      update_reg_bounds()
    - selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads
    - bpf: Test_verifier, #70 error message updates for 32-bit right shift
    - KVM: Don't null dereference ops->destroy
    - selftests: KVM: Handle compiler optimizations in ucall
    - media: v4l2-mem2mem: Apply DST_QUEUE_OFF_BASE on MMAP buffers across
      ioctls
    - macintosh/adb: fix oob read in do_adb_query() function
    - x86/speculation: Add RSB VM Exit protections
    - x86/speculation: Add LFENCE to RSB fill sequence
    - Linux 5.4.210

  * Focal update: v5.4.209 upstream stable release (LP: #1989228)
    - Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
    - ntfs: fix use-after-free in ntfs_ucsncmp()
    - s390/archrandom: prevent CPACF trng invocations in interrupt context
    - tcp: Fix data-races around sysctl_tcp_dsack.
    - tcp: Fix a data-race around sysctl_tcp_app_win.
    - tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
    - tcp: Fix a data-race around sysctl_tcp_frto.
    - tcp: Fix a data-race around sysctl_tcp_nometrics_save.
    - ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
    - ice: do not setup vlan for loopback VSI
    - scsi: ufs: host: Hold reference returned by of_parse_phandle()
    - tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
    - tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit.
    - net: ping6: Fix memleak in ipv6_renew_options().
    - ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
    - igmp: Fix data-races around sysctl_igmp_qrv.
    - net: sungem_phy: Add of_node_put() for reference returned by
      of_get_parent()
    - tcp: Fix a data-race around sysctl_tcp_min_tso_segs.
    - tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
    - tcp: Fix a data-race around sysctl_tcp_autocorking.
    - tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
    - Documentation: fix sctp_wmem in ip-sysctl.rst
    - tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
    - tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
    - i40e: Fix interface init with MSI interrupts (no MSI-X)
    - sctp: fix sleep in atomic context bug in timer handlers
    - virtio-net: fix the race between refill work and close
    - perf symbol: Correct address for bss symbols
    - sfc: disable softirqs for ptp TX
    - sctp: leave the err path free in sctp_stream_init to sctp_stream_free
    - ARM: crypto: comment out gcc warning that breaks clang builds
    - mt7601u: add USB device ID for some versions of XiaoDu WiFi Dongle.
    - scsi: core: Fix race between handling STS_RESOURCE and completion
    - Linux 5.4.209

  * Focal update: v5.4.208 upstream stable release (LP: #1988225)
    - pinctrl: stm32: fix optional IRQ support to gpios
    - riscv: add as-options for modules with assembly compontents
    - mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication
    - lockdown: Fix kexec lockdown bypass with ima policy
    - xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
    - PCI: hv: Fix multi-MSI to allow more than one MSI vector
    - PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
    - PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
    - PCI: hv: Fix interrupt mapping for multi-MSI
    - serial: mvebu-uart: correctly report configured baudrate value
    - xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in
      xfrm_bundle_lookup()
    - power/reset: arm-versatile: Fix refcount leak in versatile_reboot_probe
    - pinctrl: ralink: Check for null return of devm_kcalloc
    -

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-26 Thread Frank Heimes
Thanks for testing!
The further process is already in place and is described at a high level
(just dates) here:
https://kernel.ubuntu.com/
So this fix is now included in the kernel SRU cycle "2022.09.19".
This means the updated kernel that includes the fix will soon (ideally
today) land in the '-proposed' pocket of the archive (and will be
installable by everyone who has '-proposed' enabled).
And if no significant issues arise during the "Bug verification &
Regression testing" phase, it will be released on '10-Oct'.
That is the current plan (however, highly severe and critical fixes and
CVEs have the power to overrule this planned schedule, but fortunately this
does not happen very often).


Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  Fix Committed


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-22 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the linux/5.4.0-128.144 kernel in
-proposed solves the problem. Please test the kernel and update this bug
with the results. If the problem is solved, change the tag
'verification-needed-focal' to 'verification-done-focal'. If the problem
still exists, change the tag 'verification-needed-focal' to
'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!
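
Before changing the tag, it can help to confirm that the machine is
actually running the kernel under test. A minimal sketch, assuming a
hypothetical helper name and relying on GNU `sort -V` for version
ordering; a flavour suffix such as "-generic" is stripped before comparing:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bot message): succeed if kernel
# release $1 (as printed by `uname -r`) is the same as or newer than the
# ABI version $2, for example 5.4.0-128.
kernel_at_least() {
    have=${1%%-[a-z]*}   # drop a flavour suffix such as "-generic"
    want=$2
    # With GNU version sort, the older of the two versions comes first.
    oldest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$oldest" = "$want" ]
}
```

For the kernel named above this could be used as
`kernel_at_least "$(uname -r)" 5.4.0-128 && echo "kernel under test is running"`.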


Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  Fix Committed

Bug description:
  SRU Justification:
  --

  [Impact]

   * If the mlx5 driver is reloading while the recovery flow is happening,
     and if it receives new commands before the command interface is up
     again, this can lead to null pointer that tries to access non-
     initialized command structures.

   * So it's required to avoid processing commands before the command
     interface is up again.

   * This is accomplished by a new cmdif state that helps to avoid
     processing commands while cmdif is not ready.

  [Fix]

   * backport of f7936ddd35d8 f7936ddd35d8b849daf0372770c7c9dbe7910fca
  "net/mlx5: Avoid processing commands before cmdif is ready"

  [Test Plan]

   * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
     is needed that has Mellanox cards (RoCE Express 2.1) assigned,
     configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).

   * Now trigger a recovery (guess that can be done at the Support Element)
     and reload the driver at the same time.

   * Make sure the module/driver mlx5 is loaded and in use
     (otherwise it can't be removed/unloaded).

   * Now remove/unload the module with:
     sudo modprobe -r mlx5
     and (re-)load it again with:
     sudo modprobe mlx5

   * Due to the lack of RoCE Express 2.1 hardware,
     IBM needs to do the verification.
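
     The "loaded" check from the test plan could be scripted roughly as
     below. This is a sketch under assumptions: module_loaded is a
     hypothetical helper, and the module name is passed in as a parameter
     because the kernel messages in this report are prefixed with
     mlx5_core, so the name shown on a given system may differ from the
     plain "mlx5" used above.

```shell
#!/usr/bin/env bash
# Sketch only: module_loaded is a hypothetical helper, not part of the test plan.

# Succeed if the named module is listed in a /proc/modules-style file.
# The optional second argument exists only so the check can be tried
# against a saved copy of /proc/modules.
module_loaded() {
    local name=$1 file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Example (needs root, to be run by hand on the test system), with MOD
# set to the module name that applies there:
#   module_loaded "$MOD" && sudo modprobe -r "$MOD" && sudo modprobe "$MOD"
```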

  [Where problems could occur]

   * In case there is an issue with 'cmdif', it might not have the correct
     interface state, which:
     - either might mean that commands are not properly blocked,
       so the situation is similar to before
     - or the commands may always get blocked,
       which renders the hardware useless
     - or commands might get blocked in the wrong situation,
       which will cause unexpected issues and broken behavior.

   * Since the patch was accepted upstream with v5.7-rc7, it's
     not new to the kernel, was already part of groovy (and above)
     and is therefore already in use by newer Ubuntu releases.

  [Other Info]

   * Since the patch has been upstream since v5.7-rc7,
     it's already included in jammy and kinetic.

   * Since the upstream patch includes the line:
     Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
     Connect-IB adapters") it looks to me as if marking the patch
     for upstream stable updates was forgotten.

   * Such SRUs for focal's 5.4 will automatically land in bionic's
     hwe-5.4, too. But since this was especially requested for
     bionic's hwe-5.4, I wanted to mention this here.
  __

  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of
  mlx5e_ethtool_get_link_ksettings() when this is called through sysfs
  (going through ethtool might have different checks).
  I managed to find a reliable way to reproduce the issue that I believe
  isn't tied to IBM Z at all.

  The procedure to reproduce is as follows. I created a script to read
  the sysfs attributes for the link's speed and duplex mode in a loop:

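  #!/usr/bin/env bash

  # Require the network interface name as the first argument.
  if [ $# -lt 1 ]; then
      echo "Usage: $0 "
      exit 1
  fi

  # Read the link attributes through sysfs in an endless loop.
  while true; do
      cat "/sys/class/net/$1/duplex" > /dev/null
      cat "/sys/class/net/$1/speed" > /dev/null
  done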

  Executed with:

  # ./script.sh enP10p0s0

  I ran this in one bash session and then in another one I triggered a PCI
  reset with the following command, where one needs to replace  with the
  PCI address of the NIC:

  echo 1 > /sys/bus/pci/devices//reset
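
  As an alternative to typing the PCI address by hand, it can be looked
  up from the interface name, since /sys/class/net/IFACE/device is a
  symlink to the PCI device directory. The following is only a sketch:
  pci_addr_of and its optional sysfs-root argument are hypothetical names
  introduced here, not something from the original report.

```shell
#!/usr/bin/env bash
# Sketch only: pci_addr_of and its optional sysfs-root argument are hypothetical.

# Print the PCI address behind a network interface by resolving the
# "device" symlink under <sysfs>/class/net/<iface>. Fails if the
# interface does not exist or has no backing device.
pci_addr_of() {
    local iface=$1 sysfs=${2:-/sys}
    local link="$sysfs/class/net/$iface/device"
    local dev
    [ -e "$link" ] || return 1
    dev=$(readlink -f "$link") || return 1
    basename "$dev"
}

# Example (needs root, to be run by hand while the reader loop is active):
#   addr=$(pci_addr_of enP10p0s0) && echo 1 > "/sys/bus/pci/devices/$addr/reset"
```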

  Then first I got a lot of the following messages:

   mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5

  And then as the mlx5 driver's recovery kicks in the oops as below:

  [  659.103947] mlx5_core 0010:00:00.0: wait vital counter value 0x7b399f after 1 iterations
  [  659.103947] mlx5_core 0010:00:00.0: mlx5_pci_resume was called
  [  659.103966] mlx5_core 

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-14 Thread Frank Heimes
** Changed in: ubuntu-z-systems
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  Fix Committed


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-14 Thread Stefan Bader
** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
   Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  Fix Committed


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-05 Thread Frank Heimes
Many thx, Niklas!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  In Progress


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-01 Thread Frank Heimes
Kernel SRU request submitted for focal:
https://lists.ubuntu.com/archives/kernel-team/2022-September/thread.html#132957
changing status to 'In Progress'.

** Changed in: linux (Ubuntu Focal)
   Status: New => In Progress

** Changed in: ubuntu-z-systems
   Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
 Assignee: Skipper Bug Screeners (skipper-screen-team) => Canonical Kernel 
Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  In Progress
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  In Progress


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-09-01 Thread Frank Heimes
** Description changed:

  SRU Justification:
  --
  
  [Impact]
  
-  * If the mlx5 driver is reloading while the recovery flow is happening,
-and if it receives new commands before the command interface is up
-again, this can lead to null pointer that tries to access non-
-initialized command structures.
+  * If the mlx5 driver is reloading while the recovery flow is happening,
+    and if it receives new commands before the command interface is up
+    again, this can lead to null pointer that tries to access non-
+    initialized command structures.
  
-  * So it's required to avoid processing commands before the command
-interface is up again.
+  * So it's required to avoid processing commands before the command
+    interface is up again.
  
-  * This is accomplished by a new cmdif state that helps to avoid
-processing commands while cmdif is not ready.
+  * This is accomplished by a new cmdif state that helps to avoid
+    processing commands while cmdif is not ready.
  
  [Fix]
  
-  * backport of f7936ddd35d8 f7936ddd35d8b849daf0372770c7c9dbe7910fca
+  * backport of f7936ddd35d8 f7936ddd35d8b849daf0372770c7c9dbe7910fca
  "net/mlx5: Avoid processing commands before cmdif is ready"
  
  [Test Plan]
  
-  * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
-is needed that has Mellanox cards (RoCE Express 2.1) assigned, 
-configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).
+  * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
+    is needed that has Mellanox cards (RoCE Express 2.1) assigned,
+    configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).
  
-  * Now trigger a recovery (guess that can be done at the Support Element)
-and reload the driver at the same time.
+  * Now trigger a recovery (guess that can be done at the Support Element)
+    and reload the driver at the same time.
  
-  * Make sure the module/driver mlx5 is loaded and in use
-(otherwise it can't be removed/unloaded).
+  * Make sure the module/driver mlx5 is loaded and in use
+    (otherwise it can't be removed/unloaded).
  
-  * Now remove/unload the module with:
-sudo modprobe -r mlx5
-and (re-)load it again with:
-sudo modprobe mlx5
+  * Now remove/unload the module with:
+    sudo modprobe -r mlx5
+    and (re-)load it again with:
+    sudo modprobe mlx5
  
-  * Due to the lack of RoCE Express 2.1 hardware,
-IBM needs to do the verification.
+  * Due to the lack of RoCE Express 2.1 hardware,
+    IBM needs to do the verification.
  
  [Where problems could occur]
  
-  * In case there is an issue with 'cmdif' it might not have the correct
-interface state, which:
-- either might lead to the fact that commands are not properly blocked
-  and the situation is similar like before
-- or the commands may get always blocked,
-  which render the hardware useless
-- or might block in wrong situation,
-  which will cause unexpected issues and broken behavior.
+  * In case there is an issue with 'cmdif' it might not have the correct
+    interface state, which:
+    - either might lead to the fact that commands are not properly blocked
+  and the situation is similar like before
+    - or the commands may get always blocked,
+  which render the hardware useless
+    - or might block in wrong situation,
+  which will cause unexpected issues and broken behavior.
  
-  * Since the patch got upstream accepted with v5.7-rc7 it's
-not new to the kernel, was already part of groovy (and above)
-and is therefor already in use by newer Ubuntu releases.
+  * Since the patch got upstream accepted with v5.7-rc7 it's
+    not new to the kernel, was already part of groovy (and above)
+    and is therefor already in use by newer Ubuntu releases.
  
  [Other Info]
-  
-  * Since the patch is upstream since v5.7-rc7,
-it's already included in jammy and kinetic.
  
-  * Since the upstream patch incl. the line:
-Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
-Connect-IB adapters") it looks to me that it was forgotten
-to mark the patch for upstream stable updates.
+  * Since the patch is upstream since v5.7-rc7,
+    it's already included in jammy and kinetic.
+ 
+  * Since the upstream patch incl. the line:
+    Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
+    Connect-IB adapters") it looks to me that it was forgotten
+    to mark the patch for upstream stable updates.
+ 
+  * Such SRUs for focal's 5.4 will automatically land in bionic's
+hwe-5.4, too. But since this was especially requested for
+bionic's hwe-5.4, I wanted to mention this here.
  __
  
  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of 
mlx5_ethtool_get_link_ksettings()
  when this is called through the sysfs (going through ethtool might have 
different checks).
  I 

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-08-31 Thread Frank Heimes
** Description changed:

+ SRU Justification:
+ --
+ 
+ [Impact]
+ 
+  * If the mlx5 driver is reloading while the recovery flow is happening,
+and if it receives new commands before the command interface is up
+again, this can lead to null pointer that tries to access non-
+initialized command structures.
+ 
+  * So it's required to avoid processing commands before the command
+interface is up again.
+ 
+  * This is accomplished by a new cmdif state that helps to avoid
+processing commands while cmdif is not ready.
+ 
+ [Fix]
+ 
+  * backport of f7936ddd35d8 f7936ddd35d8b849daf0372770c7c9dbe7910fca
+ "net/mlx5: Avoid processing commands before cmdif is ready"
+ 
+ [Test Plan]
+ 
+  * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
+is needed that has Mellanox cards (RoCE Express 2.1) assigned, 
+configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).
+ 
+  * Now trigger a recovery (guess that can be done at the Support Element)
+and reload the driver at the same time.
+ 
+  * Make sure the module/driver mlx5 is loaded and in use
+(otherwise it can't be removed/unloaded).
+ 
+  * Now remove/unload the module with:
+sudo modprobe -r mlx5
+and (re-)load it again with:
+sudo modprobe mlx5
+ 
+  * Due to the lack of RoCE Express 2.1 hardware,
+IBM needs to do the verification.
+ 
+ [Where problems could occur]
+ 
+  * In case there is an issue with 'cmdif' it might not have the correct
+interface state, which:
+- either might lead to the fact that commands are not properly blocked
+  and the situation is similar like before
+- or the commands may get always blocked,
+  which render the hardware useless
+- or might block in wrong situation,
+  which will cause unexpected issues and broken behavior.
+ 
+  * Since the patch got upstream accepted with v5.7-rc7 it's
+not new to the kernel, was already part of groovy (and above)
+and is therefor already in use by newer Ubuntu releases.
+ 
+ [Other Info]
+  
+  * Since the patch is upstream since v5.7-rc7,
+it's already included in jammy and kinetic.
+ 
+  * Since the upstream patch incl. the line:
+Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
+Connect-IB adapters") it looks to me that it was forgotten
+to mark the patch for upstream stable updates.
+ __
+ 
  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of 
mlx5_ethtool_get_link_ksettings()
  when this is called through the sysfs (going through ethtool might have 
different checks).
  I managed to find a reliable way to reproduce the issue that I believe isn't 
tied to IBM Z at all.
  
  The procedure to reproduce is as follows. I created a script to read
  the sysfs attributes for the link's speed and duplex mode in a loop:
  
  #!/usr/bin/env bash
  
  if [ $# -lt 1 ]; then
- echo "Usage: $0 "
- exit 1
+ echo "Usage: $0 "
+ exit 1
  fi
  
  while true; do
- cat /sys/class/net/$1/duplex > /dev/null
- cat /sys/class/net/$1/speed > /dev/null
+ cat /sys/class/net/$1/duplex > /dev/null
+ cat /sys/class/net/$1/speed > /dev/null
  done
  
  Executed with:
  
  # ./script.sh enP10p0s0
  
  I ran this in one bash session and then in another one I triggered a PCI reset with
  the following command where one needs to replace  with the PCI address of the NIC:
  
  echo 1 > /sys/bus/pci/devices//reset
  
  Then first I got a lot of the following messages:
  
-  mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings:
+  mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings:
  query port ptys failed: -5
  
  And then as the mlx5 driver's recovery kicks in the oops as below:
  
  [  659.103947] mlx5_core 0010:00:00.0: wait vital counter value 0x7b399f after 1 iterations
  [  659.103947] mlx5_core 0010:00:00.0: mlx5_pci_resume was called
  [  659.103966] mlx5_core 0010:00:00.0: firmware version: 14.32.1010
  [  659.104169] Unable to handle kernel pointer dereference in virtual kernel address space
  [  659.104171] Failing address:  TEID: 0483
  [  659.104172] Fault in home space mode while using kernel ASCE.
  [  659.104173] AS:3d29c007 R3:fffd0007 S:fffd5800 P:003d
  [  659.104200] Oops: 0004 ilc:2 [#1] SMP
  [  659.104202] Modules linked in: s390_trng ism smc pnet chsc_sch eadm_sch 
vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel drm 
drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress 
zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid1 raid0 linear mlx5_ib dm_service_time pkey 
zcrypt crc32_vx_s390 ib_uverbs ghash_s390 ib_core qeth_l2 prng aes_s390 
des_s390 nvme libdes sha3_512_s390 sha3_256_s390 sha512_s390 

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-08-31 Thread Frank Heimes
Test kernels are now available in the following PPA:
https://launchpad.net/~fheimes/+archive/ubuntu/lp1987287
(20.04/focal 5.4 as well as 18.04.5/bionic hwe-5.4)
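
A minimal sketch of how such a PPA could be enabled, assuming the usual
"ppa:owner/name" shorthand that corresponds to the URL above. The helper
name ppa_shorthand is hypothetical, and which kernel packages to install
from the PPA is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: turn a Launchpad PPA URL into the "ppa:owner/name" shorthand.
# ppa_shorthand is a hypothetical helper, not part of any Ubuntu tool.
ppa_shorthand() {
    local url=$1 short
    short=$(printf '%s\n' "$url" |
        sed -n 's#^https://launchpad\.net/~\([^/]*\)/+archive/ubuntu/\([^/]*\)/*$#ppa:\1/\2#p')
    # Fail if the URL did not have the expected shape.
    [ -n "$short" ] || return 1
    printf '%s\n' "$short"
}

# Possible usage (needs root and network access, so left commented out):
#   sudo add-apt-repository "$(ppa_shorthand https://launchpad.net/~fheimes/+archive/ubuntu/lp1987287)"
#   sudo apt update
```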

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1987287

Title:
  [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes
  during recovery

Status in Ubuntu on IBM z Systems:
  New
Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Focal:
  New

Bug description:
  We recently got a bug report for systems running Ubuntu 20.04 that were
  crashing with backtraces pointing at the mlx5 driver's handling of
  mlx5e_ethtool_get_link_ksettings() when this is called through sysfs
  (going through ethtool might have different checks).
  I managed to find a reliable way to reproduce the issue that I believe
  isn't tied to IBM Z at all.

  The procedure to reproduce is as follows. I created a script to read
  the sysfs attributes for the link's speed and duplex mode in a loop:

  #!/usr/bin/env bash

  if [ $# -lt 1 ]; then
      echo "Usage: $0 "
      exit 1
  fi

  # Read both link attributes in a tight loop so that reads race with the recovery.
  while true; do
      cat "/sys/class/net/$1/duplex" > /dev/null
      cat "/sys/class/net/$1/speed" > /dev/null
  done

  Executed with:

  # ./script.sh enP10p0s0

  I ran this in one bash session and then in another one I triggered a PCI
  reset with the following command, where one needs to replace  with the
  PCI address of the NIC:

  echo 1 > /sys/bus/pci/devices//reset
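
  If the PCI address is not at hand, it can usually be read from sysfs.
  The following is only a sketch, assuming the interface is backed by a
  PCI function; pci_addr_of and SYSFS_ROOT are hypothetical names that are
  not part of the original report.

```shell
#!/usr/bin/env bash
# Sketch: look up the PCI address behind a network interface via sysfs.
# pci_addr_of and SYSFS_ROOT are hypothetical names used for illustration;
# SYSFS_ROOT only exists so the lookup can be tried against a fake tree.
pci_addr_of() {
    local iface=$1
    local dev="${SYSFS_ROOT:-/sys}/class/net/$iface/device"
    # Virtual interfaces (lo, bridges, ...) have no "device" link.
    [ -e "$dev" ] || return 1
    # The link resolves to the device directory, which is named after the address.
    basename "$(readlink -f "$dev")"
}

# Possible usage as root, with the interface name from the example above:
#   addr=$(pci_addr_of enP10p0s0) && echo 1 > "/sys/bus/pci/devices/$addr/reset"
```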

  Then first I got a lot of the following messages:

  mlx5_core 0010:00:00.0 enP16p0s0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5

  And then, as the mlx5 driver's recovery kicks in, the oops below:

  [  659.103947] mlx5_core 0010:00:00.0: wait vital counter value 0x7b399f after 1 iterations
  [  659.103947] mlx5_core 0010:00:00.0: mlx5_pci_resume was called
  [  659.103966] mlx5_core 0010:00:00.0: firmware version: 14.32.1010
  [  659.104169] Unable to handle kernel pointer dereference in virtual kernel address space
  [  659.104171] Failing address:  TEID: 0483
  [  659.104172] Fault in home space mode while using kernel ASCE.
  [  659.104173] AS:3d29c007 R3:fffd0007 S:fffd5800 P:003d
  [  659.104200] Oops: 0004 ilc:2 [#1] SMP
  [  659.104202] Modules linked in: s390_trng ism smc pnet chsc_sch eadm_sch
      vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel drm
      drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress
      zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
      async_tx xor raid6_pq libcrc32c raid1 raid0 linear mlx5_ib dm_service_time pkey
      zcrypt crc32_vx_s390 ib_uverbs ghash_s390 ib_core qeth_l2 prng aes_s390
      des_s390 nvme libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390
      sha1_s390 sha_common mlx5_core tls mlxfw ptp nvme_core pps_core dasd_eckd_mod
      dasd_mod zfcp scsi_transport_fc qeth qdio ccwgroup scsi_dh_emc scsi_dh_rdac
      scsi_dh_alua dm_multipath
  [  659.104232] CPU: 6 PID: 438216 Comm: cat Not tainted 5.4.0-124-generic #140-Ubuntu
  [  659.104233] Hardware name: IBM 3931 XYZ  (LPAR)
  [  659.104234] Krnl PSW : 0404c0018000 3bfa661e (__queue_work+0xfe/0x520)
  [  659.104241]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
  [  659.104242] Krnl GPRS: 3c291570  0007 7fff
  [  659.104243]            e2fe46e0 000fffe0 0006 3d039588
  [  659.104244]             e2fe46e0 bfb3e000
  [  659.104245]            e194c400 03e007d6fb78 3bfa6602 03e007d6f860
  [  659.104251] Krnl Code: 3bfa6612: a77400e5      brc   7,3bfa67dc
                            3bfa6616: 582003ac      l     %r2,940
                           #3bfa661a: a718          lhi   %r1,0
                           >3bfa661e: ba129000      cs    %r1,%r2,0(%r9)
                            3bfa6622: a77401a7      brc   7,3bfa6970
                            3bfa6626: e310b0180012  lt    %r1,24(%r11)
                            3bfa662c: a78400ff      brc   8,3bfa682a
                            3bfa6630: c004004b      brcl  0,3bfa66c6
  [  659.104261] Call Trace:
  [  659.104263] ([<>] 0x0)
  [  659.104265]  [<3bfa6aa2>] queue_work_on+0x62/0x70
  [  659.104329]  [<03ff80a2920a>] cmd_exec+0x4ea/0x840 [mlx5_core]
  [  659.104349]  [<03ff80a29680>] mlx5_cmd_exec+0x40/0x70 [mlx5_core]
  [  659.104369]  [<03ff80a334a8>] mlx5_core_access_reg+0x108/0x150 [mlx5_core]
  [  659.104387]  [<03ff80a3354e>] mlx5_query_port_ptys+0x5e/0x70 [mlx5_core]
  [ 

[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-08-30 Thread Frank Heimes
Hello 'nafeiy', we indeed wanted to pick this up in the next few days.
Just note that this is a development bug, opened in the Launchpad bug
tracking system, which is mainly used for development.

We will usually produce a patched 20.04/focal/5.4 test kernel that we make
available via a so-called PPA (a special repository).
It will then take some steps in our SRU cycle until this patched kernel gets
released.
These are the upcoming kernel Stable Release Update (SRU) cycles:
https://kernel.ubuntu.com/

This 5.4 kernel will then be migrated by our kernel team to
18.04/bionic/hwe-5.4.
But in this case we may also create a dedicated hwe-5.4 test kernel for bionic
for you, to save you some time.

The process for hotfixes is different.


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-08-22 Thread Frank Heimes
Yes, once it is included in the focal 5.4 (GA) kernel, it will automatically
find its way into the bionic hwe-5.4 kernel as well.


[Kernel-packages] [Bug 1987287] Re: [UBUNTU 20.04] mlx5 driver crashes on accessing device attributes during recovery

2022-08-22 Thread Frank Heimes
The patch mentioned has been upstream since kernel v5.7-rc7,
hence it's only needed for focal (it's already included in jammy and kinetic).
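
As a rough illustration of that reasoning, a kernel release string can be
compared against 5.7. This is only a sketch: version_ge is a hypothetical
helper, it relies on GNU sort's -V option, and it ignores both release
candidates and distro backports (a patched focal 5.4 kernel would still
report "older than 5.7").

```shell
#!/usr/bin/env bash
# Sketch: compare a kernel release string against a minimum upstream version.
# version_ge is a hypothetical helper; it relies on GNU sort's -V option.
version_ge() {
    # True when $1 sorts at or after $2 in version order.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage on the running kernel:
#   if version_ge "$(uname -r)" 5.7; then echo "base version already has the fix"; fi
```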

** Also affects: ubuntu-z-systems
   Importance: Undecided
   Status: New

** Changed in: ubuntu-z-systems
 Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Changed in: ubuntu-z-systems
   Importance: Undecided => High

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu)
   Status: New => Invalid

** Changed in: linux (Ubuntu Focal)
 Assignee: (unassigned) => Skipper Bug Screeners (skipper-screen-team)

** Changed in: linux (Ubuntu)
 Assignee: Skipper Bug Screeners (skipper-screen-team) => (unassigned)
