Public bug reported:

While testing on AC922 running Genesis kernel: 4.15.0-1016-ibm-gt
#18fix2, we hit EEH errors on the Mellanox CX5 port.

This setup is using two Mellanox 2-port CX5 adapters.

0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]

root@pok1-rz1-sr1-rk001-s16:~# mlxfwmanager
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX5
  Part Number:      MCX516A-CDA_Ax
  Description:      ConnectX-5 Ex EN network interface card; 100GbE dual-port 
QSFP28;
                          PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000013
  PCI Device Name:  /dev/mst/mt4121_pciconf1
  Base GUID:        98039b0300998bf2
  Base MAC:         98039b998bf2
  Versions:         Current        Available
     FW             16.24.1000     N/A
     PXE            3.5.0603       N/A
     UEFI           14.17.0011     N/A

  Status:           No matching image found

Device #2:
----------

  Device Type:      ConnectX5
  Part Number:      MCX516A-CDA_Ax
  Description:      ConnectX-5 Ex EN network interface card; 100GbE dual-port 
QSFP28;
                          PCIe4.0 x16; tall bracket; ROHS R6
  PSID:             MT_0000000013
  PCI Device Name:  /dev/mst/mt4121_pciconf0
  Base GUID:        98039b0300686bca
  Base MAC:         98039b686bca
  Versions:         Current        Available
     FW             16.24.1000     N/A
     PXE            3.5.0603       N/A
     UEFI           14.17.0011     N/A

  Status:           No matching image found

Here's a snippet of the data from syslog when this issue happened.

Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.328274] vport 55 error -5 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329572] EEH: PHB#0 
failure detected, location: N/A
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329616] CPU: 80 PID: 
93816 Comm: python Tainted: G           OE    4.15.0-1016-ibm-gt #18fix2
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329617] Call Trace:
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329624] 
[c0002037deacf978] [c000000000ce8abc] dump_stack+0xb0/0xf4 (unreliable)
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329628] 
[c0002037deacf9b8] [c00000000003aa44] eeh_dev_check_failure+0x234/0x5b0
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329630] 
[c0002037deacfa58] [c0000000000aec68] pnv_pci_read_config+0x128/0x160
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329633] 
[c0002037deacfa98] [c00000000076222c] pci_bus_read_config_dword+0x9c/0xf0
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329637] 
[c0002037deacfaf8] [c00800001cb80350] _vendor_specific_sem+0xf8/0x1f0 
[mst_pciconf]
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329639] 
[c0002037deacfb48] [c00800001cb819a8] _send_pci_cmd_int+0x50/0x790 [mst_pciconf]
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329640] 
[c0002037deacfb98] [c00800001cb81454] ioctl+0x90c/0x970 [mst_pciconf]
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329642] 
[c0002037deacfd18] [c00800001cb814ec] compat_ioctl+0x34/0x50 [mst_pciconf]
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329645] 
[c0002037deacfd38] [c0000000003f6524] do_vfs_ioctl+0xd4/0xa00
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329646] 
[c0002037deacfdd8] [c0000000003f6f14] SyS_ioctl+0xc4/0x130
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.329649] 
[c0002037deacfe28] [c00000000000b284] system_call+0x58/0x6c
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330771] EEH: Detected 
error on PHB#0
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330775] EEH: This PCI 
device has failed 1 times in the last hour
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330776] EEH: Notify 
device drivers to shutdown
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330782] mlx5_core 
0000:01:00.0: mlx5_pci_err_detected was called
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330784] mlx5_core 
0000:01:00.0: mlx5_enter_error_state:121:(pid 780): start
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330880] vport 41 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330885] vport 37 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330891] vport 57 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330891] vport 34 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330898] vport 61 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330900] vport 46 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330903] vport 36 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330904] vport 31 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330908] vport 23 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330909] vport 42 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330911] vport 51 error -1 
reading stats
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330925] 
mlx5_0:mlx5_ib_event:2998:(pid 780): warning: event on port 0
Apr 29 20:57:12 pok1-rz1-sr1-rk001-s16 kernel: [ 8360.330928] mlx5_core 
0000:01:00.0: mlx5_enter_error_state:128:(pid 780): end

I've been working with Carol Soto for the last few days to try to recreate
this issue, but I haven't seen it since. Carol provided a debug kernel,
which is currently installed on this machine:

192.168.5.33 
(
Needs to be accessed via the gctest2 jumphost. Once you login to gctest2, use 
the cloud.key to login:
# ssh -i ~/.ssh/cloud.key [email protected]
)


The steps that led to this issue:

1. Boot up the machine with 4.15.0-1016-ibm-gt #18fix2

2. Set up the Genesis Fabric services (fabcon, iobricks, mlx-config and others) 
to enable eswitch mode and configure 64 VFs per PF on one NIC:
0000:01:00.0
0000:01:00.1

3. Ran some functional tests that involve using SR-IOV VFs and testing
connectivity between two machines. I don't know exactly which test caused
this; I've since run the test suite several times and haven't seen the
issue again.
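For reference, step 2's VF configuration corresponds to the standard sysfs SR-IOV interface (the Genesis services handle this internally in our setup; the `sriov_path` helper below is illustrative, not part of those tools):

```shell
# Sketch: enable 64 VFs on each PF via the standard sysfs SR-IOV interface.
# fabcon/iobricks/mlx-config do this for us; sriov_path is our own helper.
sriov_path() {
    printf '/sys/bus/pci/devices/%s/sriov_numvfs' "$1"
}

# Print the commands rather than running them (writing sriov_numvfs
# requires root and the device present).
for pf in 0000:01:00.0 0000:01:00.1; do
    echo "echo 64 > $(sriov_path "$pf")"
done
```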


The system is currently running with Carol's debug kernel:

Linux pok1-rz1-sr1-rk001-s16 4.15.0-1016.18-eeh-debug-ibm-gt #debug SMP
Wed May 1 10:44:08 CDT 2019 ppc64le ppc64le ppc64le GNU/Linux

 
---uname output---
 Linux pok1-rz1-sr1-rk001-s16 4.15.0-1016.18-eeh-debug-ibm-gt #debug SMP Wed 
May 1 10:44:08 CDT 2019 ppc64le ppc64le ppc64le GNU/Linux
 
---Patches Installed---
4.15.0-1016-ibm-gt #18fix2
https://patchwork.ozlabs.org/patch/1043793/
    [kernel] powerpc/powernv/sriov: Register IOMMU groups for VFs

 
Machine Type = AC922 
 
---Debugger---
A debugger is not configured
 
Stack trace output:
 no
 
Oops output:
 no
 
System Dump Info:
  The system is not configured to capture a system dump.
 

== Comment: #1 - Carol L. Soto <[email protected]> - 2019-05-03 13:45:56 ==
The issue here is that if you install more than one dual-port Mellanox
Ethernet card that supports LAG, the LAG port mapping can be wrong. If you
have only one Mellanox Ethernet card, the issue will not happen.

The EEH was likely caused by the wrong LAG mapping generating link issues
at the PCIe layer, which is why we got the EEH and did not recover from it.

To see the issue, if you have these cards:
0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family 
[ConnectX-5 Ex]


just enable multipath on the first card, and you will see in dmesg that the 
mapping is not correct:
Apr 29 19:08:37 pok1-rz1-sr1-rk001-s16 kernel: [ 1845.155876] mlx5_core 
0000:01:00.0: Activate multipath
Apr 29 19:08:37 pok1-rz1-sr1-rk001-s16 kernel: [ 1845.155880] mlx5_core 
0030:01:00.0: lag map port 1:2 port 2:2   <---- this shows the second card, 
which did not even have an IP address assigned


If you have only one card, it is fine, and this is the output you see:

Apr 30 17:37:04 ubuntu kernel: [ 1448.868885] mlx5_core 0000:01:00.0: Activate 
multipath
Apr 30 17:37:04 ubuntu kernel: [ 1448.868889] mlx5_core 0000:01:00.0: lag map 
port 1:1 port 2:2
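The good and bad outputs can be told apart mechanically. The helper below is a sketch of ours (not part of any Mellanox tooling), assuming the dmesg line format shown above:

```shell
# Sketch: flag a bad lag map line. On this setup a correct line for
# 0000:01:00.0 reads "lag map port 1:1 port 2:2"; the broken case maps
# both ports to port 2 (and is printed by the wrong card entirely).
# lag_map_ok is our illustrative helper, not an mlx5 utility.
lag_map_ok() {
    case "$1" in
        *"lag map port 1:1 port 2:2"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After activating multipath, something like `dmesg | grep 'lag map'` can feed each line to the helper.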


This upstream commit fixes the case where you have two cards and exercise 
the LAG/multipath code:

commit df7ddb2396cd162e64aaff9401be05e31e438961
Author: Daniel Jurgens <[email protected]>
Date:   Mon Aug 27 09:09:46 2018 -0500

    net/mlx5: Consider PCI domain in search for next dev


Without this commit, the tester was able to hit the EEH in less than an 
hour; with this commit, the tester has not seen the EEH again. I will 
attach dmesg output from a kernel with this patch.
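The gist of that commit, per its title, is that the search for a bonding peer must take the PCI domain into account (the leading segment of the address, 0000 vs 0030 here), not just the bus/slot/function. A rough illustration in shell (our helper names, not the kernel code):

```shell
# Sketch: two functions are plausible LAG peers only if they share the
# PCI domain as well as the bus/slot. Without the fix, the domain was
# not considered, so 0000:01:00.0 could be paired with 0030:01:00.x.
# pci_domain and same_domain are illustrative helpers only.
pci_domain() {
    printf '%s' "${1%%:*}"   # strip everything after the first ':'
}

same_domain() {
    [ "$(pci_domain "$1")" = "$(pci_domain "$2")" ]
}
```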

== Comment: #2 - Carol L. Soto <[email protected]> - 2019-05-03 13:48:42 ==
In the dmesg we can see this:
[ 1348.269915] mlx5_core 0000:01:00.0: Activate multipath
[ 1348.269919] mlx5_core 0000:01:00.0: lag map port 1:1 port 2:2

** Affects: kernel-package (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-177400 severity-medium 
targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-177400 severity-medium
targetmilestone-inin---

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1829028

Title:
  Ubuntu 18.04 - EEH error on Mellanox CX5

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/kernel-package/+bug/1829028/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
