** Description changed:

+ [IMPACT]
+ MCE test renders the system unresponsive on P9 open power hardware (Witherspoon)
+ 
+ [TEST]
+ A test kernel is available in ppa:ubuntu-power-triage/lp1848127. Please see 
the [OTHER] section for test details and comment #7 for results with the PPA 
kernel. 
+ 
+ [FIX]
+ IBM has identified the following patch that fixes this issue:
+ commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee
+ Author: Balbir Singh <[email protected]>
+ Date:   Tue Aug 20 13:43:47 2019 +0530
+ 
+     powerpc/mce: Fix MCE handling for huge pages
+ 
+ [REGRESSION POTENTIAL]
+ The patch applies only to the powerpc architecture and is limited in scope to MCE handling for huge pages. It does not touch any generic code, so any regression is limited to powerpc MCE handling.
+ 
+ [OTHER]
  == Comment: #0 - PAVAMAN SUBRAMANIYAM <[email protected]> - 2019-05-07 
23:31:20 ==
  Install the latest OP930 firmware images, built from the upstream op-build git tree, on P9 Open Power hardware.
  
  root@witherspoon:~# cat /etc/os-release
  ID="openbmc-phosphor"
  NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)"
  VERSION="ibm-v2.3"
  VERSION_ID="ibm-v2.3-476-g2d622cb-r32-0-g9973ab0"
  PRETTY_NAME="Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.3"
  BUILD_ID="ibm-v2.3-476-g2d622cb-r32"
  root@witherspoon:~# cat /var/lib/phosphor-software-manager/pnor/ro/VERSION
-       open-power-witherspoon-v2.3-rc2-58-g59fd0743
-         buildroot-2019.02.2-17-g93b841d204
-         skiboot-v6.3-rc2
-         hostboot-19a436e
-         occ-58e422d
-         linux-5.0.9-openpower1-p3a4d5a4
-         petitboot-v1.10.3
-         machine-xml-a6f4df3
-         hostboot-binaries-hw043019a.940
-         capp-ucode-p9-dd2-v4
-         sbe-249671d
-         hcode-hw040319a.940
+  open-power-witherspoon-v2.3-rc2-58-g59fd0743
+         buildroot-2019.02.2-17-g93b841d204
+         skiboot-v6.3-rc2
+         hostboot-19a436e
+         occ-58e422d
+         linux-5.0.9-openpower1-p3a4d5a4
+         petitboot-v1.10.3
+         machine-xml-a6f4df3
+         hostboot-binaries-hw043019a.940
+         capp-ucode-p9-dd2-v4
+         sbe-249671d
+         hcode-hw040319a.940
  
  Then enable sw xstop manually using the command below:
  
  root@ltc-wspoon11:~# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
  root@ltc-wspoon11:~# nvram -p ibm,skiboot --print-config
  "ibm,skiboot" Partition
  --------------------------
  experimental-fast-reset=1
  snarf-mode=noooooo
  opal-sw-xstop=enable
  
  Then, from the Linux host, inject the MCE UE error on the machine as
  follows:
  
  root@ltc-wspoon11:~# ./probe_cpus.sh -L
  CHIP ID: 0 CORE ID: 0 THREADS: 4 CPUs:  0 1 2 3
  CHIP ID: 0 CORE ID: 1 THREADS: 4 CPUs:  4 5 6 7
  CHIP ID: 0 CORE ID: 2 THREADS: 4 CPUs:  8 9 10 11
  CHIP ID: 0 CORE ID: 3 THREADS: 4 CPUs:  12 13 14 15
  CHIP ID: 0 CORE ID: 6 THREADS: 4 CPUs:  16 17 18 19
  CHIP ID: 0 CORE ID: 7 THREADS: 4 CPUs:  20 21 22 23
  CHIP ID: 0 CORE ID: 8 THREADS: 4 CPUs:  24 25 26 27
  CHIP ID: 0 CORE ID: 9 THREADS: 4 CPUs:  28 29 30 31
  CHIP ID: 0 CORE ID: 10 THREADS: 4 CPUs:  32 33 34 35
  CHIP ID: 0 CORE ID: 11 THREADS: 4 CPUs:  36 37 38 39
  CHIP ID: 0 CORE ID: 12 THREADS: 4 CPUs:  40 41 42 43
  CHIP ID: 0 CORE ID: 13 THREADS: 4 CPUs:  44 45 46 47
  CHIP ID: 0 CORE ID: 16 THREADS: 4 CPUs:  48 49 50 51
  CHIP ID: 0 CORE ID: 17 THREADS: 4 CPUs:  52 53 54 55
  CHIP ID: 0 CORE ID: 18 THREADS: 4 CPUs:  56 57 58 59
  CHIP ID: 0 CORE ID: 19 THREADS: 4 CPUs:  60 61 62 63
  CHIP ID: 0 CORE ID: 20 THREADS: 4 CPUs:  64 65 66 67
  CHIP ID: 0 CORE ID: 21 THREADS: 4 CPUs:  68 69 70 71
  CHIP ID: 8 CORE ID: 6 THREADS: 4 CPUs:  72 73 74 75
  CHIP ID: 8 CORE ID: 7 THREADS: 4 CPUs:  76 77 78 79
  CHIP ID: 8 CORE ID: 8 THREADS: 4 CPUs:  80 81 82 83
  CHIP ID: 8 CORE ID: 9 THREADS: 4 CPUs:  84 85 86 87
  CHIP ID: 8 CORE ID: 10 THREADS: 4 CPUs:  88 89 90 91
  CHIP ID: 8 CORE ID: 11 THREADS: 4 CPUs:  92 93 94 95
  CHIP ID: 8 CORE ID: 12 THREADS: 4 CPUs:  96 97 98 99
  CHIP ID: 8 CORE ID: 13 THREADS: 4 CPUs:  100 101 102 103
  CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs:  104 105 106 107
  CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs:  108 109 110 111
  CHIP ID: 8 CORE ID: 16 THREADS: 4 CPUs:  112 113 114 115
  CHIP ID: 8 CORE ID: 17 THREADS: 4 CPUs:  116 117 118 119
  CHIP ID: 8 CORE ID: 18 THREADS: 4 CPUs:  120 121 122 123
  CHIP ID: 8 CORE ID: 19 THREADS: 4 CPUs:  124 125 126 127
  CHIP ID: 8 CORE ID: 20 THREADS: 4 CPUs:  128 129 130 131
  CHIP ID: 8 CORE ID: 21 THREADS: 4 CPUs:  132 133 134 135
  CHIP ID: 8 CORE ID: 22 THREADS: 4 CPUs:  136 137 138 139
  CHIP ID: 8 CORE ID: 23 THREADS: 4 CPUs:  140 141 142 143
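The listing above can be turned into a lookup table when choosing which CPUs belong to the core targeted by the injection. The following is only a sketch: the line format is assumed from the sample output, since probe_cpus.sh itself is not shown in this report, and the names `LINE_RE` and `parse_topology` are hypothetical.

```python
import re

# Line format assumed from the listing above; probe_cpus.sh itself is not shown.
LINE_RE = re.compile(
    r"CHIP ID:\s*(\d+)\s+CORE ID:\s*(\d+)\s+THREADS:\s*(\d+)\s+CPUs:\s*([\d ]+)"
)


def parse_topology(text):
    """Return {(chip_id, core_id): [cpu, ...]} for every matching line."""
    topology = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            chip, core = int(match.group(1)), int(match.group(2))
            topology[(chip, core)] = [int(cpu) for cpu in match.group(4).split()]
    return topology


sample = """\
CHIP ID: 8 CORE ID: 14 THREADS: 4 CPUs:  104 105 106 107
CHIP ID: 8 CORE ID: 15 THREADS: 4 CPUs:  108 109 110 111
"""
topology = parse_topology(sample)
print(topology[(8, 15)])  # CPUs backing chip 8, core 15
```

With the two sample lines, the lookup for chip 8, core 15 yields CPUs 108 through 111, matching the listing.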
  
  -----------------------------
  p[0]
-    eq[0,1,2,3,4,5]
-    ex[0,1,3,4,5,6,8,9,10]
-     c[0,1,2,3,6,7,8,9,10,11,12,13,16,17,18,19,20,21]
+    eq[0,1,2,3,4,5]
+    ex[0,1,3,4,5,6,8,9,10]
+     c[0,1,2,3,6,7,8,9,10,11,12,13,16,17,18,19,20,21]
  p[8]
-    eq[1,2,3,4,5]
-    ex[3,4,5,6,7,8,9,10,11]
-     c[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
+    eq[1,2,3,4,5]
+    ex[3,4,5,6,7,8,9,10,11]
+     c[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
  -----------------------------
  
  ----------Processor Layout-------------------
  p[0]
-         +---EQ00----+   +---EQ02----+   +---EQ04----+
-         |EX-0    C0 |   |EX-4    C8 |   |EX-8    C16|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-0    C1 |   |EX-4    C9 |   |EX-8    C17|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-1    C2 |   |EX-5    C10|   |EX-9    C18|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-1    C3 |   |EX-5    C11|   |EX-9    C19|
-         +-----------+   +-----------+   +-----------+
- 
-         +---EQ01----+   +---EQ03----+   +---EQ05----+
-         |           |   |EX-6    C12|   |EX-10   C20|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |           |   |EX-6    C13|   |EX-10   C21|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-3    C6 |   |           |   |           |
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-3    C7 |   |           |   |           |
-         +-----------+   +-----------+   +-----------+
- 
+         +---EQ00----+   +---EQ02----+   +---EQ04----+
+         |EX-0    C0 |   |EX-4    C8 |   |EX-8    C16|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-0    C1 |   |EX-4    C9 |   |EX-8    C17|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-1    C2 |   |EX-5    C10|   |EX-9    C18|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-1    C3 |   |EX-5    C11|   |EX-9    C19|
+         +-----------+   +-----------+   +-----------+
+ 
+         +---EQ01----+   +---EQ03----+   +---EQ05----+
+         |           |   |EX-6    C12|   |EX-10   C20|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |           |   |EX-6    C13|   |EX-10   C21|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-3    C6 |   |           |   |           |
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-3    C7 |   |           |   |           |
+         +-----------+   +-----------+   +-----------+
  
  p[8]
-         +---EQ00----+   +---EQ02----+   +---EQ04----+
-         |           |   |EX-4    C8 |   |EX-8    C16|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |           |   |EX-4    C9 |   |EX-8    C17|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |           |   |EX-5    C10|   |EX-9    C18|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |           |   |EX-5    C11|   |EX-9    C19|
-         +-----------+   +-----------+   +-----------+
- 
-         +---EQ01----+   +---EQ03----+   +---EQ05----+
-         |           |   |EX-6    C12|   |EX-10   C20|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |           |   |EX-6    C13|   |EX-10   C21|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-3    C6 |   |EX-7    C14|   |EX-11   C22|
-         + - - - - - +   + - - - - - +   + - - - - - +
-         |EX-3    C7 |   |EX-7    C15|   |EX-11   C23|
-         +-----------+   +-----------+   +-----------+
+         +---EQ00----+   +---EQ02----+   +---EQ04----+
+         |           |   |EX-4    C8 |   |EX-8    C16|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |           |   |EX-4    C9 |   |EX-8    C17|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |           |   |EX-5    C10|   |EX-9    C18|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |           |   |EX-5    C11|   |EX-9    C19|
+         +-----------+   +-----------+   +-----------+
+ 
+         +---EQ01----+   +---EQ03----+   +---EQ05----+
+         |           |   |EX-6    C12|   |EX-10   C20|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |           |   |EX-6    C13|   |EX-10   C21|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-3    C6 |   |EX-7    C14|   |EX-11   C22|
+         + - - - - - +   + - - - - - +   + - - - - - +
+         |EX-3    C7 |   |EX-7    C15|   |EX-11   C23|
+         +-----------+   +-----------+   +-----------+
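The layout diagrams suggest a simple index relationship: two cores per EX and four cores per EQ. A minimal sketch of that mapping follows; it is inferred only from the diagrams in this report, and `core_to_units` is a hypothetical helper name, not part of any tool used here.

```python
def core_to_units(core_id):
    """Map a core number to its (EQ, EX) indices, per the layout diagram above."""
    # Assumption drawn from the diagram: 2 cores per EX, 4 cores per EQ.
    return core_id // 4, core_id // 2


for core in (0, 7, 15, 23):
    eq, ex = core_to_units(core)
    print(f"C{core}: EQ{eq:02d} EX-{ex}")
```

For core 15 this gives EQ 3 and EX 7, which agrees with the EQ[ 3] and EX[ 7] lines printed by scom_addr_p9.sh later in this report.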
  
  root@ltc-wspoon11:~# ./statedisable.sh
  ./statedisable.sh: line 10: /sys/devices/system/cpu/cpu*/cpuidle/state7/disable: No such file or directory
  ./statedisable.sh: line 11: /sys/devices/system/cpu/cpu*/cpuidle/state8/disable: No such file or directory
  
  root@ltc-wspoon11:~# cpupower idle-info
  CPUidle driver: powernv_idle
  CPUidle governor: menu
  analyzing CPU 0:
  
  Number of idle states: 7
  Available idle states: snooze stop0_lite stop0 stop1 stop2 stop4 stop5
  snooze (DISABLED) :
  Flags/Description: snooze
  Latency: 0
  Usage: 81861
  Duration: 29748269
  stop0_lite (DISABLED) :
  Flags/Description: stop0_lite
  Latency: 1
  Usage: 70
  Duration: 1982345
  stop0 (DISABLED) :
  Flags/Description: stop0
  Latency: 2
  Usage: 274
  Duration: 125896
  stop1 (DISABLED) :
  Flags/Description: stop1
  Latency: 5
  Usage: 36
  Duration: 4922
  stop2 (DISABLED) :
  Flags/Description: stop2
  Latency: 10
  Usage: 3745
  Duration: 88300041
  stop4 (DISABLED) :
  Flags/Description: stop4
  Latency: 100
  Usage: 65
  Duration: 1048951
  stop5 (DISABLED) :
  Flags/Description: stop5
  Latency: 200
  Usage: 30377
  Duration: 61977191643
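The "No such file or directory" messages from statedisable.sh are consistent with hard-coded state numbers (state7, state8) on a machine that exposes only seven idle states. The contents of statedisable.sh are not shown, so the following is an assumption-laden sketch of an alternative that writes to whichever cpuidle `disable` files actually exist. The function name `disable_idle_states` is hypothetical, and the demo runs against a temporary directory rather than the real sysfs tree.

```python
import glob
import os
import tempfile


def disable_idle_states(sysfs_root="/sys/devices/system/cpu"):
    """Write 1 to every existing cpuidle stateN/disable file; return the paths written."""
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cpuidle", "state[0-9]*", "disable"
    )
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write("1")
        written.append(path)
    return written


# Demo against a throwaway directory tree (hypothetical layout mirroring sysfs),
# so nothing on the real system is touched.
with tempfile.TemporaryDirectory() as root:
    for state in range(7):  # seven states, as in the cpupower output above
        state_dir = os.path.join(root, "cpu0", "cpuidle", f"state{state}")
        os.makedirs(state_dir)
        with open(os.path.join(state_dir, "disable"), "w") as handle:
            handle.write("0")
    print(len(disable_idle_states(root)))
```

On a real system the function would be called with its default argument and would need root privileges.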
  
  root@ltc-wspoon11:~# ./run_workload.sh
  
  root@ltc-wspoon11:~# ./scom_addr_p9.sh 0x1001080c 15
  EQ[ 3]: 0x1301080c
  EX[ 7]: 0x13010c0c
-  C[15]: 0x3f01080c
+  C[15]: 0x3f01080c
  root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/getscom -c 0x8 0x13010c0c
  0000000000000000
  root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
  0c00000000000000
  root@ltc-wspoon11:~# ./skiboot/external/xscom-utils/putscom -c 0x8 0x13010c0c 0c00000000000000
  0c00000000000000
  
  After the machine check error is injected, the host Linux stops
  responding to ping and console access to the machine is lost.
  
  However, the OpenBMC shell and GUI still show the host in the Running
  state.
  
  == Comment: #1 - PAVAMAN SUBRAMANIYAM <[email protected]> - 2019-05-07 
23:33:31 ==
  The machine is installed with the Ubuntu 18.04 Linux OS.
  
  root@ltc-wspoon11:~# uname -a
  Linux ltc-wspoon11 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
  root@ltc-wspoon11:~# cat /etc/os-release
  NAME="Ubuntu"
  VERSION="18.04.2 LTS (Bionic Beaver)"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 18.04.2 LTS"
  VERSION_ID="18.04"
  HOME_URL="https://www.ubuntu.com/"
  SUPPORT_URL="https://help.ubuntu.com/"
  BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
  PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
  VERSION_CODENAME=bionic
  UBUNTU_CODENAME=bionic
  root@ltc-wspoon11:~# cat /proc/cpuinfo | tail
  cpu             : POWER9, altivec supported
  clock           : 2300.000000MHz
  revision        : 2.3 (pvr 004e 1203)
  
  timebase        : 512000000
  platform        : PowerNV
  model           : 8335-GTH
  machine         : PowerNV 8335-GTH
  firmware        : OPAL
  MMU             : Radix
  
  root@ltc-wspoon11:~# lsmcode
  Version of System Firmware :
-  Product Name          : OpenPOWER Firmware
-  Product Version       : witherspoon-v2.3-rc2-58-g59fd0743
-  Product Extra         :        skiboot-v6.3-rc2
-  Product Extra         :        bmc-firmware-version-2.03
-  Product Extra         :        occ-58e422d
-  Product Extra         :        hostboot-19a436e
-  Product Extra         :        buildroot-2019.02.2-17-g93b841d204
-  Product Extra         :        capp-ucode-p9-dd2-v4
-  Product Extra         :        machine-xml-a6f4df3
-  Product Extra         :        hostboot-binaries-hw043019a.940
-  Product Extra         :        sbe-249671d
-  Product Extra         :        hcode-hw040319a.940
-  Product Extra         :        petitboot-v1.10.3
-  Product Extra         :        linux-5.0.9-openpower1-p3a4d5a4
+  Product Name          : OpenPOWER Firmware
+  Product Version       : witherspoon-v2.3-rc2-58-g59fd0743
+  Product Extra         :        skiboot-v6.3-rc2
+  Product Extra         :        bmc-firmware-version-2.03
+  Product Extra         :        occ-58e422d
+  Product Extra         :        hostboot-19a436e
+  Product Extra         :        buildroot-2019.02.2-17-g93b841d204
+  Product Extra         :        capp-ucode-p9-dd2-v4
+  Product Extra         :        machine-xml-a6f4df3
+  Product Extra         :        hostboot-binaries-hw043019a.940
+  Product Extra         :        sbe-249671d
+  Product Extra         :        hcode-hw040319a.940
+  Product Extra         :        petitboot-v1.10.3
+  Product Extra         :        linux-5.0.9-openpower1-p3a4d5a4
  
  == Comment: #3 - PAVAMAN SUBRAMANIYAM <[email protected]> - 2019-05-07
  23:42:35 ==
  
- 
- I quickly tested MCE on op930 build ( IBM-witherspoon-ibm-OP9-v2.2-3.5) with  
4.15.0-47-generic and found no hang. But on further investigation I see that 
the hang issue is seen from kernel version  4.15.0-48-generic and above.  Looks 
like changes that gone in  4.15.0-48-generic version causing the hang issue. 
Still investigating....
+ I quickly tested MCE on the op930 build (IBM-witherspoon-ibm-OP9-v2.2-3.5)
+ with 4.15.0-47-generic and found no hang. On further investigation,
+ the hang is seen from kernel version 4.15.0-48-generic onward, so
+ changes that went into 4.15.0-48-generic appear to cause it.
+ Still investigating.
  
  == Comment: #9 - Application Cdeadmin <[email protected]> - 2019-05-22 
06:45:07 ==
  ==== State: Working by: jayeshp on 22 May 2019 06:37:27 ====
  
  Any update?
  
  == Comment: #11 - MAHESH J. SALGAONKAR <[email protected]> - 
2019-09-19 04:44:01 ==
  The hang issue should go away with the patch below.
  
  commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee
  Author: Balbir Singh <[email protected]>
  Date:   Tue Aug 20 13:43:47 2019 +0530
  
-     powerpc/mce: Fix MCE handling for huge pages
-     
-     The current code would fail on huge pages addresses, since the shift would
-     be incorrect. Use the correct page shift value returned by
-     __find_linux_pte() to get the correct physical address. The code is more
-     generic and can handle both regular and compound pages.
-     
-     Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
-     Signed-off-by: Balbir Singh <[email protected]>
-     [[email protected]: Fixup pseries_do_memory_failure()]
-     Signed-off-by: Reza Arbab <[email protected]>
-     Tested-by: Mahesh Salgaonkar <[email protected]>
-     Signed-off-by: Santosh Sivaraj <[email protected]>
-     Cc: [email protected] # v4.15+
-     Signed-off-by: Michael Ellerman <[email protected]>
-     Link: https://lore.kernel.org/r/[email protected]
+     powerpc/mce: Fix MCE handling for huge pages
+ 
+     The current code would fail on huge pages addresses, since the shift would
+     be incorrect. Use the correct page shift value returned by
+     __find_linux_pte() to get the correct physical address. The code is more
+     generic and can handle both regular and compound pages.
+ 
+     Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
+     Signed-off-by: Balbir Singh <[email protected]>
+     [[email protected]: Fixup pseries_do_memory_failure()]
+     Signed-off-by: Reza Arbab <[email protected]>
+     Tested-by: Mahesh Salgaonkar <[email protected]>
+     Signed-off-by: Santosh Sivaraj <[email protected]>
+     Cc: [email protected] # v4.15+
+     Signed-off-by: Michael Ellerman <[email protected]>
+     Link: https://lore.kernel.org/r/[email protected]
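The commit message says the page shift matters when turning a faulting address into a physical page. A rough userspace illustration of that arithmetic follows. It is not the kernel code and does not model __find_linux_pte(); the 64K base page size, the 2M huge page size, the PFN value and the function names are all assumptions chosen for the example.

```python
PAGE_SHIFT = 16  # assumed 64K base pages


def pfn_ignoring_shift(mapping_base_pfn, addr):
    """Treat every mapping as one base page: always returns the first PFN."""
    return mapping_base_pfn


def pfn_with_shift(mapping_base_pfn, addr, shift):
    """PFN of the base page containing addr in a mapping of size 1 << shift."""
    if shift <= PAGE_SHIFT:
        return mapping_base_pfn
    offset_in_mapping = addr & ((1 << shift) - 1)
    return mapping_base_pfn + (offset_in_mapping >> PAGE_SHIFT)


HUGE_SHIFT = 21           # hypothetical 2M huge page
base_pfn = 0x1000         # hypothetical PFN of the huge page's first base page
addr = 0xA54321           # hypothetical faulting address inside that huge page

print(hex(pfn_ignoring_shift(base_pfn, addr)))
print(hex(pfn_with_shift(base_pfn, addr, HUGE_SHIFT)))
```

With these values the first call reports PFN 0x1000 while the shift-aware call reports 0x1005, so a handler that ignores the shift would act on the wrong base page for any address past the first 64K of the huge page.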

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1848127

Title:
  [LTCTest][OPAL][OP930] Machine hangs after injecting the Machine Check
  Error

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1848127/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
