[Kernel-packages] [Bug 1422481] Comment bridged from LTC Bugzilla

2015-03-12 Thread bugproxy
--- Comment From bren...@br.ibm.com 2015-03-12 13:31 EDT---
Closing this bug per the previous comment.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1422481

Title:
  mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)

Status in linux package in Ubuntu:
  Fix Released

Bug description:
  ---Problem Description---
  EEH recovery does not work with the mlx4 driver. After the driver recovers,
  it hits another EEH.

  ---uname output---
  Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  ---Additional Hardware Info---
  Requires a Mellanox adapter such as a ConnectX-3.

  Machine Type = P8 

  ---Steps to Reproduce---
   Inject an EEH error into the mlx4 device.
   
  Stack trace output:
   Output from EEH recovery, followed by the oops it then hits:
  [  188.747571] EEH: Collect temporary log
  [  188.748330] EEH: of node=/pci@8002007/ethernet@3
  [  188.748339] EEH: PCI device/vendor: 100715b3
  [  188.748361] EEH: PCI cmd/status register: 00100146
  [  188.748362] EEH: PCI-E capabilities and status follow:
  [  188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
  [  188.748537] EEH: PCI-E 10: 1083   
  [  188.748539] EEH: PCI-E 20: 
  [  188.748540] EEH: PCI-E AER capability register set follows:
  [  188.748625] EEH: PCI-E AER 00: 00020001   00062010
  [  188.748704] EEH: PCI-E AER 10: 2000 2000 01e0 
  [  188.748783] EEH: PCI-E AER 20:    
  [  188.748805] EEH: PCI-E AER 30:  
  [  188.748813] EEH: Reset without hotplug activity
  [  193.833245] EEH: Notify device drivers the completion of reset
  [  193.833257] mlx4_core: Initializing 0001:00:03.0
  [  193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b000-0x170b00f]
  [  193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
  [  193.833395] EEH: Not recovered
  [  193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
  Please try reseating or replacing it
  [  193.834531] EEH: of node=/pci@8002007/ethernet@3
  [  193.834547] EEH: PCI device/vendor: 100715b3
  [  193.834580] EEH: PCI cmd/status register: 00100142
  [  193.834582] EEH: PCI-E capabilities and status follow:
  [  193.834728] EEH: PCI-E 00: 00020010 10008e02 200e 0843f483
  [  193.834846] EEH: PCI-E 10: 1083   
  [  193.834849] EEH: PCI-E 20: 
  [  193.834850] EEH: PCI-E AER capability register set follows:
  [  193.834981] EEH: PCI-E AER 00: 00020001   00062010
  [  193.835101] EEH: PCI-E AER 10: 2000 2000 01e0 
  [  193.835219] EEH: PCI-E AER 20:    
  [  193.835252] EEH: PCI-E AER 30:  
  [  193.835289] Unable to handle kernel paging request for data at address 0x0388
  [  193.835356] Faulting instruction address: 0xd1f3231c
  [  193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
  [  193.835460] SMP NR_CPUS=2048 NUMA pSeries
  [  193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc rtc_generic mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core
  [  193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
  [  193.835942] task: c003f72ca880 ti: c003f707c000 task.ti: c003f707c000
  [  193.836009] NIP: d1f3231c LR: d1f32790 CTR: d1f32760
  [  193.836076] REGS: c003f707f790 TRAP: 0300   Not tainted  (3.18.0-12-generic)
  [  193.836141] MSR: 80019033   CR: 4448  XER: 2000
  [  193.836302] CFAR: c00a7be0 DAR: 0388 DSISR: 4000 SOFTE: 1
  GPR00: d1f32790 c003f707fa10 d1f66310 c003fe0ad000
  GPR04: 0003   c003fd00
  GPR08: 0001 d1f32760 fffa 00011001
  GPR12: d1f32760 cfb83600 c00d9118 c003f90e56c0
  GPR16:    
  GPR20:    c0c4ab90
  GPR24: c0c4ab68 00100100 c003fe068580 c003fe068580
  GPR28: c003fe0ad000 c003fe0685e0 d1f5da50 
  [  193.837205] NIP [d1f3231c] mlx4_unload_one+0x3c/0x480 [mlx4_core]
  [  193.837269] LR [d1f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
  [  193.837336] Call Trace:
  [  193.837361] [c003f707fa10] [c003fe068580] 0xc003fe068580 (unreliable)
  [  193.837447] [c003f707faa0] [d1f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_co

[Kernel-packages] [Bug 1422481] Comment bridged from LTC Bugzilla

2015-03-11 Thread bugproxy
--- Comment From cls...@us.ibm.com 2015-03-11 19:15 EDT---
This looks fixed with 3.19.0-8-generic #8-Ubuntu; the adapter was able to
recover from EEH.

[ 2694.622586] EEH: Notify device drivers to shutdown
[ 2694.622587] mlx4_core 0004:01:00.0: device was reset successfully
[ 2694.622589] mlx4_core 0004:01:00.0: mlx4_pci_err_detected was called
[ 2694.622594] mlx4_en 0004:01:00.0: Internal error detected, restarting device
[ 2694.622786] mlx4_en: eth14: Close port called
[ 2694.846830] mlx4_en 0004:01:00.0: removed PHC
[ 2694.874036] EEH: Collect temporary log
[ 2694.879101] EEH: of node=/pciex@3fffe4200/pci@0/ethernet@0
[ 2694.879465] EEH: PCI device/vendor: 100715b3
[ 2694.879478] EEH: PCI cmd/status register: 00100142
[ 2694.879479] EEH: PCI-E capabilities and status follow:
[ 2694.879544] EEH: PCI-E 00: 00020010 10008e02 0020204e 0843f483
[ 2694.879597] EEH: PCI-E 10: 10830040   
[ 2694.879598] EEH: PCI-E 20: 
[ 2694.879599] EEH: PCI-E AER capability register set follows:
[ 2694.879666] EEH: PCI-E AER 00: 18c20001   00062010
[ 2694.879719] EEH: PCI-E AER 10:  2000 01e0 
[ 2694.879772] EEH: PCI-E AER 20:    
[ 2694.879785] EEH: PCI-E AER 30:  
[ 2694.879787] PHB3 PHB#4 Diag-data (Version: 1)
[ 2694.879789] brdgCtl: 0002
[ 2694.879790] UtlSts:  0020  
[ 2694.879791] RootSts: 0040 0040 f0830048 00100147 
[ 2694.879792] PhbSts:  001c 001c
[ 2694.879793] Lem: 0010 42498e327f502eae 
[ 2694.879795] InAErr:  8000 8000 04020080 

[ 2694.879796] PE[  1] A/B: 8480002b 8000
[ 2694.879797] PE[  2] A/B: 8000 8000
[ 2694.879798] PE[  3] A/B: 8000 8000
[ 2694.879799] PE[  4] A/B: 8000 8000
[ 2694.879800] PE[  5] A/B: 8000 8000
[ 2694.879801] EEH: Reset without hotplug activity
[ 2698.898176] EEH: Notify device drivers the completion of reset
[ 2698.898181] mlx4_core 0004:01:00.0: mlx4_pci_slot_reset was called
[ 2698.898218] mlx4_core 0004:01:00.0: enabling device (0140 -> 0142)
[ 2705.396286] mlx4_core 0004:01:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 2705.396288] mlx4_core 0004:01:00.0: PCIe link width is x8, device supports x8
[ 2706.143789] mlx4_en 0004:01:00.0: registered PHC clock
[ 2706.143864] mlx4_en 0004:01:00.0: Activating port:1
[ 2706.159496] mlx4_en: eth11: Using 256 TX rings
[ 2706.159504] mlx4_en: eth11: Using 8 RX rings
[ 2706.159506] mlx4_en: eth11:   frag:0 - size:1518 prefix:0 stride:1536
[ 2706.159722] mlx4_en: eth11: Initializing port
[ 2706.160022] mlx4_en 0004:01:00.0: Activating port:2
[ 2706.165214] mlx4_core 0004:01:00.0 eth14: renamed from eth11
[ 2706.188419] mlx4_en: eth11: Using 256 TX rings
[ 2706.188427] mlx4_en: eth11: Using 8 RX rings
[ 2706.188430] mlx4_en: eth11:   frag:0 - size:1518 prefix:0 stride:1536
[ 2706.188660] mlx4_en: eth11: Initializing port
[ 2706.197316] EEH: Notify device driver to resume
[ 2706.525987] mlx4_core 0004:01:00.0 eth16: renamed from eth11
[ 2707.487156] mlx4_en: eth14: Link Up
[ 2707.542052] mlx4_en: eth16: Link Up

thanks.
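
For anyone comparing the two logs in this thread, a minimal sketch of a checker that tells a failed EEH recovery from a successful one. It only looks for message strings that appear in the logs quoted here ("EEH: Not recovered", "EEH: Unable to recover from failure", "EEH: Notify device driver to resume"). The function names classify_eeh_log and eeh_span_seconds are hypothetical helpers, not part of any kernel or Ubuntu tool, and other kernel versions may word these messages differently.

```python
import re

# Marker strings copied from the kernel logs quoted in this thread.
# Assumption: other kernel versions may phrase these messages differently.
FAILURE_MARKERS = ("EEH: Not recovered", "EEH: Unable to recover from failure")
RESUME_MARKER = "EEH: Notify device driver to resume"

# Matches a dmesg timestamp such as "[ 2694.622586]" at the start of a line.
TIMESTAMP = re.compile(r"^\s*\[\s*(\d+\.\d+)\]")


def classify_eeh_log(text):
    """Return 'failed', 'recovered', or 'unknown' for a dmesg excerpt."""
    if any(marker in text for marker in FAILURE_MARKERS):
        return "failed"
    if RESUME_MARKER in text:
        return "recovered"
    return "unknown"


def eeh_span_seconds(text):
    """Seconds between the first and last timestamped 'EEH:' line, or None."""
    stamps = []
    for line in text.splitlines():
        match = TIMESTAMP.match(line)
        if match and "EEH:" in line:
            stamps.append(float(match.group(1)))
    if len(stamps) < 2:
        return None
    return stamps[-1] - stamps[0]
```

Feeding it the saved output of dmesg (for example the text of a file written with "dmesg > eeh.log") gives a quick pass/fail answer plus the time spent between the first and last EEH message.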
