Re: [PATCH v5 00/21] EEH reorganization

2012-04-16 Thread Gavin Shan
 I just hit this on mainline from today (3.4.0-rc2-00065-gf549e08).
 Haven't had a chance to narrow it down yet.

Thanks for the information. I'll try to reproduce the issue on
Firebird-L today. By the way, it seems that mstmread is a
user-level application that was accessing the config space when the
problem happened?



Looking closer, it was caused by an EEH error at boot. It looks like
the Mellanox infiniband card gets an error when probed by their
firmware tool (mstmread), but only if the kernel driver is not loaded.
I see this EEH error back on 3.0, so it's not new.

The question now is why we oops in the EEH code on mainline.


It seems the crash was caused by something like WARN_ON(). I checked
the function the backtrace points to (eeh_dn_check_failure) and I
didn't find any place that calls WARN_ON(). Maybe I missed something
here.

Anyway, I'll first try to reproduce it on a Firebird-L machine
and then narrow it down.

Anton


Thanks,
Gavin

[ cut here ]
WARNING: at arch/powerpc/platforms/pseries/eeh.c:492
Modules linked in:
NIP: c0056cc4 LR: c0056cc0 CTR: c051dd60
REGS: c01f3953f6a0 TRAP: 0700   Not tainted  (3.4.0-rc2-00065-gf549e08-dirty)
MSR: 80029032 SF,EE,ME,IR,DR,RI  CR: 28004482  XER: 000f
SOFTE: 0
CFAR: c074ea30
TASK = c01f39685040[19058] 'mstmread' THREAD: c01f3953c000 CPU: 38
GPR00: c0056cc0 c01f3953f920 c0bd3a28 0021 
GPR04:   000323f7  
GPR08: 6365203c c0b10a20 0002 c0a74cc0 
GPR12: 24004422 ceda8500 3a58582e 583a5858 
GPR16: 2f585858 69636573 2f646576 10003b48 
GPR20: 0fffc7a3d17c 0058 0004 c01f3953fb90 
GPR24:   c0c77088 c03e6fffeee8 
GPR28: c0d82680  c0c770d0  
NIP [c0056cc4] .eeh_dn_check_failure+0x304/0x320
LR [c0056cc0] .eeh_dn_check_failure+0x300/0x320
Call Trace:
[c01f3953f920] [c0056cc0] .eeh_dn_check_failure+0x300/0x320 (unreliable)
[c01f3953f9d0] [c002717c] .rtas_read_config+0x13c/0x1b0
[c01f3953fa70] [c03d543c] .pci_user_read_config_dword+0xcc/0x150
[c01f3953fb20] [c03e19d8] .pci_read_config+0xe8/0x2a0
[c01f3953fc00] [c022d330] .read+0x130/0x210
[c01f3953fce0] [c01a723c] .vfs_read+0xec/0x1e0
[c01f3953fd80] [c01a73ec] .SyS_pread64+0xbc/0xd0
[c01f3953fe30] [c0009780] syscall_exit+0x0/0x7c
Instruction dump:
7f83e378 48001909 6000 2fbf 419e002c e89f00d8 2fa4 409e0008 
e89f0098 e8629fb8 486f7d39 6000 0fe0 3b21 4bfffdb4 e8829fa8 
---[ end trace a6e6d788c9869e00 ]---
EEH: Detected PCI bus error on device 0006:01:00.0
EEH: This PCI device has failed 1 times in the last hour:
EEH: Bus location=U78AB.001.WZSGRFL-P1-C4-T1 driver= pci addr=0006:01:00.0
EEH: Device location=U78AB.001.WZSGRFL-P1-C4-T1 driver= pci addr=0006:01:00.0
EEH: of node=/pci@8002203/pci1014,415@0
EEH: PCI device/vendor: 673c15b3
EEH: PCI cmd/status register: 00100140


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v5 00/21] EEH reorganization

2012-04-16 Thread Anton Blanchard
Hi,

 Thanks for the information. I'll try to reproduce the issue on
 Firebird-L today. By the way, it seems that mstmread is a
 user-level application that was accessing the config space when the
 problem happened?

The EEH error is caused by the Mellanox firmware tools.

 It seems the crash was caused by something like WARN_ON(). I checked
 the function the backtrace points to (eeh_dn_check_failure) and I
 didn't find any place that calls WARN_ON(). Maybe I missed
 something here.

No. I replaced that backtrace in eeh_dn_check_failure with a WARN_ON()
because the backtrace doesn't give us enough info. I'm submitting a
patch for that today.

Bottom line is mstmread has been causing an EEH error since at least
3.0, but in 3.4 we now oops instead of recovering. The signs all point
to the EEH rework in 3.4.

Anton


Re: [PATCH v5 00/21] EEH reorganization

2012-04-16 Thread Benjamin Herrenschmidt
On Tue, 2012-04-17 at 11:37 +1000, Anton Blanchard wrote:
 
 No. I replaced that backtrace in eeh_dn_check_failure with a WARN_ON()
 because the backtrace doesn't give us enough info. I'm submitting a
 patch for that today.
 
 Bottom line is mstmread has been causing an EEH error since at least
 3.0, but in 3.4 we now oops instead of recovering. The signs all point
 to the EEH rework in 3.4.

More precisely, the original oops reported by Anton decodes as such:

Oops: Kernel access of bad area, sig: 11 [#1]

This is typically a bad memory access.

SMP NR_CPUS=1024 NUMA pSeries
Modules linked in:
NIP: c0055af8 LR: c0033204 CTR: 
REGS: c01f42fb7990 TRAP: 0300   Tainted: GW (3.4.0-rc2-00065-gf549e08-dirty)

TRAP: 0300 means that it's the result of a data access interrupt, i.e. a
load or store to a bad address.

MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 24008084  XER: 
SOFTE: 1
CFAR: 49b8
DAR: 0070, DSISR: 4000

Here the DAR tells us what address was accessed. 0x70 is a strong indication
that this was an access to a NULL pointer (at offset 0x70 from that pointer).

It -might- be something else (such as a NULL passed to a list head or such)
but the idea that there's a NULL floating around is a good hint.
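
The rule of thumb above (a data access trap with a very small DAR suggests a NULL base pointer) can be written down as a small check. This is only an illustrative sketch: the function name and the 4 KiB cutoff are assumptions made for the example and are not taken from the kernel. The two trap numbers are the ones that appear in the reports in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PowerPC exception vectors seen in the two reports in this thread. */
#define TRAP_DATA_ACCESS 0x300u /* load or store to a bad address */
#define TRAP_PROGRAM     0x700u /* program check, e.g. the trap behind WARN_ON() */

/* Assumption: anything below 4 KiB counts as "NULL plus a small offset". */
#define NULL_GUARD_BYTES 4096u

/* The DAR from the oops (0x70) falls inside the assumed guard window. */
static_assert(0x70 < NULL_GUARD_BYTES, "0x70 is treated as a NULL-relative address");

/* Heuristic only: true when the fault looks like a member access through NULL. */
static bool looks_like_null_deref(uint32_t trap, uint64_t dar)
{
    return trap == TRAP_DATA_ACCESS && dar < NULL_GUARD_BYTES;
}
```

With the values from the oops (TRAP 0300, DAR 0x70) the check is true. For the WARNING report (TRAP 0700) it is false, which matches the fact that that trace is a warning rather than a bad access.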

TASK = c01f6c7dfc40[19010] 'eehd' THREAD: c01f42fb4000 CPU: 6
GPR00: 0001 c01f42fb7c10 c0bd3a28 c01f80ab0800 
GPR04: c01f7c57d418 0380 c01f7c57e070 c0ed5360 
GPR08:  c0c77088  0001 
GPR12: 44008088 ceda1500 019ffa78 00a7 
GPR16: 00bb c0a9f754 c0963230 005e 
GPR20: 01b37e80 00bb  c0b0ad90 
GPR24:  c0b10588 0001 c01f80ab0800 
GPR28:  c01f80ab0828  c01f7ee1 
NIP [c0055af8] .eeh_add_device_tree_late+0x58/0xf0

This is the function where it happened (eeh_add_device_tree_late)

LR [c0033204] .pcibios_finish_adding_to_bus+0x34/0x50
Call Trace:
[c01f42fb7c10] [fdff] 0xfdff (unreliable)
[c01f42fb7ca0] [c0033204] .pcibios_finish_adding_to_bus+0x34/0x50
[c01f42fb7d20] [c0059a5c] .pcibios_add_pci_devices+0x7c/0x190
[c01f42fb7db0] [c0057a6c] .eeh_reset_device+0xfc/0x1a0
[c01f42fb7e50] [c0057e18] .handle_eeh_events+0x308/0x480
[c01f42fb7f00] [c00584dc] .eeh_event_handler+0x13c/0x1d0
[c01f42fb7f90] [c002099c] .kernel_thread+0x54/0x70

And here is the backtrace. You can see that you got an EEH event, which
triggered an EEH reset, which in turn called pcibios_add_pci_devices() and so on.

Instruction dump:
48a8 6000 ebff 7fbfe800 419e0098 2fbf 419e005c e9229eb0 
80090008 2f80 419e004c ebdf01d0 e81e0070 7fbf 3160 7d2b0110 

Cheers,
Ben.




Re: [PATCH v5 00/21] EEH reorganization

2012-04-16 Thread Gavin Shan

Ben, thanks a lot for decoding the backtrace, which helped narrow down
the root cause. Thanks also for showing how to read the backtrace and
the register dump printed by an oops ;-)

I have finally reproduced the issue on a Firebird-L machine. I disabled
the config options for the Emulex ethernet driver in .config so that the
driver was not loaded, then injected a config space data parity error
targeting the Emulex ethernet MAC, and got the backtrace below. The
problem comes from the following piece of code. The EEH device should be
retrieved from the OF node instead of the PCI device, because at that
point the PCI device does not yet refer to its EEH device. I'll send a
patch for it soon, even though it only needs a one-line change ;-)

(gdb) p &(((struct eeh_dev *)0)->pdev)
$1 = (struct pci_dev **) 0x70
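
The same arithmetic can be checked outside gdb with offsetof. The sketch below uses a made-up structure whose only purpose is to place a pdev member at offset 0x70. It is not the kernel's struct eeh_dev, and the padding array is an assumption standing in for the real members that come before pdev.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_dev; /* opaque here; only a pointer to it is stored */

/* Hypothetical layout: 0x70 bytes of padding stand in for the real
 * members that precede pdev. This is NOT the kernel's struct eeh_dev. */
struct eeh_dev_model {
    unsigned char earlier_members[0x70];
    struct pci_dev *pdev;
};

static_assert(offsetof(struct eeh_dev_model, pdev) == 0x70,
              "pdev sits at offset 0x70 in this model");

/* Address that a load of base->pdev would touch, computed without
 * dereferencing anything. */
static uintptr_t pdev_access_address(uintptr_t base)
{
    return base + offsetof(struct eeh_dev_model, pdev);
}
```

Passing a base of 0 yields 0x70, the same value as the DAR in the oops, which is why a NULL eeh_dev pointer is the natural suspect.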

static void eeh_add_device_late(struct pci_dev *dev)
{
	struct device_node *dn;
	struct eeh_dev *edev;

	if (!dev || !eeh_subsystem_enabled)
		return;
	dn = pci_device_to_OF_node(dev);
	edev = pci_dev_to_eeh_dev(dev);		/* edev should be NULL */
	if (edev->pdev == dev) {		/* data access fault here */
		pr_debug("EEH: Already referenced !\n");
		return;
	}
	WARN_ON(edev->pdev);
	:
	:
}
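
To make the direction of the fix concrete, here is a small user-space model of the lookup, with the EEH device reached through the OF node and a NULL check before it is used. It is a sketch under stated assumptions, not the patch itself: all of the *_model types and the function name are simplified stand-ins invented for this example, and the real kernel structures and helpers differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the links matter here. */
struct pci_dev_model;
struct eeh_dev_model { struct pci_dev_model *pdev; };
struct of_node_model { struct eeh_dev_model *edev; };
struct pci_dev_model {
    struct of_node_model *dn;
    struct eeh_dev_model *edev; /* still NULL when the device is re-added */
};

/* Returns true if the PCI device ends up bound to an EEH device. */
static bool add_device_late_model(struct pci_dev_model *dev)
{
    struct eeh_dev_model *edev;

    if (!dev || !dev->dn)
        return false;

    /* Look the EEH device up through the OF node, not through dev->edev. */
    edev = dev->dn->edev;
    if (!edev)
        return false; /* nothing to bind, and nothing to dereference */

    if (edev->pdev == dev)
        return true; /* already referenced */

    /* Bind both directions so later lookups from the PCI device work. */
    edev->pdev = dev;
    dev->edev = edev;
    return true;
}
```

In this model a freshly re-added device, whose own edev pointer is still NULL, is bound without any access through a NULL pointer, and a second call on the same device takes the "already referenced" path.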

[  176.972046] Unable to handle kernel paging request for data at address 0x0070
[  176.972054] Faulting instruction address: 0xc0055ecc
[  176.972064] Oops: Kernel access of bad area, sig: 11 [#1]
[  176.972070] SMP NR_CPUS=1024 NUMA pSeries
[  176.972078] Modules linked in:
[  176.972086] NIP: c0055ecc LR: c0055ec8 CTR: c005babc
[  176.972102] REGS: c00f4d913970 TRAP: 0300   Not tainted  (3.4.0-rc2+)
[  176.972109] MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 2884  XER: 0009
[  176.972129] SOFTE: 1
[  176.972133] CFAR: c0005080
[  176.972138] DAR: 0070, DSISR: 4000
[  176.972146] TASK = c00f4d8c3600[1038] 'eehd' THREAD: c00f4d91 CPU: 24
[  176.972155] GPR00: c0055ec8 c00f4d913bf0 c147ed90 001e 
[  176.972170] GPR04:     
[  176.972183] GPR08: 4f4e450d c0c44208 00036710 00ec 
[  176.972197] GPR12: 2882 cff25400  0106c9c8 
[  176.972212] GPR16: 0228 02e5acf0 01aff9a4 0060 
[  176.972227] GPR20:    c1345c78 
[  176.972241] GPR24: c1345c70   c0851ac0 
[  176.972256] GPR28: c0a95ad3 c00f529f2c28 c00f529f2c00 c00f4d88 
[  176.972276] NIP [c0055ecc] .eeh_add_device_tree_late+0x17c/0x2c4
[  176.972286] LR [c0055ec8] .eeh_add_device_tree_late+0x178/0x2c4
[  176.972294] Call Trace:
[  176.972300] [c00f4d913bf0] [c0055ec8] .eeh_add_device_tree_late+0x178/0x2c4 (unreliable)
[  176.972316] [c00f4d913ca0] [c0036bc8] .pcibios_finish_adding_to_bus+0x74/0x90
[  176.972328] [c00f4d913d20] [c0059b50] .pcibios_add_pci_devices+0x12c/0x150
[  176.972339] [c00f4d913db0] [c0057c60] .eeh_reset_device+0x10c/0x140
[  176.972350] [c00f4d913e50] [c0057ee4] .handle_eeh_events+0x250/0x42c
[  176.972361] [c00f4d913f10] [c0058560] .eeh_event_handler+0xe4/0x178
[  176.972372] [c00f4d913f90] [c0021550] .kernel_thread+0x54/0x70
[  176.972380] Instruction dump:
[  176.972384] eb82a1f0 7f83e378 487dd2e9 6000 e862a1f8 7f64db78 487dd2d9 6000 
[  176.972400] eb5f02c0 7f83e378 487dd2c9 6000 e81a0070 7fa0f800 40de0028 e862a188 

Thanks,
Gavin



Re: [PATCH v5 00/21] EEH reorganization

2012-04-12 Thread Anton Blanchard

Hi Gavin,

 This series of patches is going to reorganize EEH so that it could
 support multiple platforms in future. The requirements were raised
 from the aspects.

I just hit this on mainline from today (3.4.0-rc2-00065-gf549e08).
Haven't had a chance to narrow it down yet.

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA pSeries
Modules linked in:
NIP: c0055af8 LR: c0033204 CTR: 
REGS: c01f42fb7990 TRAP: 0300   Tainted: GW (3.4.0-rc2-00065-gf549e08-dirty)
MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 24008084  XER: 
SOFTE: 1
CFAR: 49b8
DAR: 0070, DSISR: 4000
TASK = c01f6c7dfc40[19010] 'eehd' THREAD: c01f42fb4000 CPU: 6
GPR00: 0001 c01f42fb7c10 c0bd3a28 c01f80ab0800 
GPR04: c01f7c57d418 0380 c01f7c57e070 c0ed5360 
GPR08:  c0c77088  0001 
GPR12: 44008088 ceda1500 019ffa78 00a7 
GPR16: 00bb c0a9f754 c0963230 005e 
GPR20: 01b37e80 00bb  c0b0ad90 
GPR24:  c0b10588 0001 c01f80ab0800 
GPR28:  c01f80ab0828  c01f7ee1 
NIP [c0055af8] .eeh_add_device_tree_late+0x58/0xf0
LR [c0033204] .pcibios_finish_adding_to_bus+0x34/0x50
Call Trace:
[c01f42fb7c10] [fdff] 0xfdff (unreliable)
[c01f42fb7ca0] [c0033204] .pcibios_finish_adding_to_bus+0x34/0x50
[c01f42fb7d20] [c0059a5c] .pcibios_add_pci_devices+0x7c/0x190
[c01f42fb7db0] [c0057a6c] .eeh_reset_device+0xfc/0x1a0
[c01f42fb7e50] [c0057e18] .handle_eeh_events+0x308/0x480
[c01f42fb7f00] [c00584dc] .eeh_event_handler+0x13c/0x1d0
[c01f42fb7f90] [c002099c] .kernel_thread+0x54/0x70
Instruction dump:
48a8 6000 ebff 7fbfe800 419e0098 2fbf 419e005c e9229eb0 
80090008 2f80 419e004c ebdf01d0 e81e0070 7fbf 3160 7d2b0110 

Anton


Re: [PATCH v5 00/21] EEH reorganization

2012-04-12 Thread Anton Blanchard

Hi,

 I just hit this on mainline from today (3.4.0-rc2-00065-gf549e08).
 Haven't had a chance to narrow it down yet.

Looking closer, it was caused by an EEH error at boot. It looks like
the Mellanox infiniband card gets an error when probed by their
firmware tool (mstmread), but only if the kernel driver is not loaded.
I see this EEH error back on 3.0, so it's not new.

The question now is why we oops in the EEH code on mainline.

Anton

[ cut here ]
WARNING: at arch/powerpc/platforms/pseries/eeh.c:492
Modules linked in:
NIP: c0056cc4 LR: c0056cc0 CTR: c051dd60
REGS: c01f3953f6a0 TRAP: 0700   Not tainted  (3.4.0-rc2-00065-gf549e08-dirty)
MSR: 80029032 SF,EE,ME,IR,DR,RI  CR: 28004482  XER: 000f
SOFTE: 0
CFAR: c074ea30
TASK = c01f39685040[19058] 'mstmread' THREAD: c01f3953c000 CPU: 38
GPR00: c0056cc0 c01f3953f920 c0bd3a28 0021 
GPR04:   000323f7  
GPR08: 6365203c c0b10a20 0002 c0a74cc0 
GPR12: 24004422 ceda8500 3a58582e 583a5858 
GPR16: 2f585858 69636573 2f646576 10003b48 
GPR20: 0fffc7a3d17c 0058 0004 c01f3953fb90 
GPR24:   c0c77088 c03e6fffeee8 
GPR28: c0d82680  c0c770d0  
NIP [c0056cc4] .eeh_dn_check_failure+0x304/0x320
LR [c0056cc0] .eeh_dn_check_failure+0x300/0x320
Call Trace:
[c01f3953f920] [c0056cc0] .eeh_dn_check_failure+0x300/0x320 (unreliable)
[c01f3953f9d0] [c002717c] .rtas_read_config+0x13c/0x1b0
[c01f3953fa70] [c03d543c] .pci_user_read_config_dword+0xcc/0x150
[c01f3953fb20] [c03e19d8] .pci_read_config+0xe8/0x2a0
[c01f3953fc00] [c022d330] .read+0x130/0x210
[c01f3953fce0] [c01a723c] .vfs_read+0xec/0x1e0
[c01f3953fd80] [c01a73ec] .SyS_pread64+0xbc/0xd0
[c01f3953fe30] [c0009780] syscall_exit+0x0/0x7c
Instruction dump:
7f83e378 48001909 6000 2fbf 419e002c e89f00d8 2fa4 409e0008 
e89f0098 e8629fb8 486f7d39 6000 0fe0 3b21 4bfffdb4 e8829fa8 
---[ end trace a6e6d788c9869e00 ]---
EEH: Detected PCI bus error on device 0006:01:00.0
EEH: This PCI device has failed 1 times in the last hour:
EEH: Bus location=U78AB.001.WZSGRFL-P1-C4-T1 driver= pci addr=0006:01:00.0
EEH: Device location=U78AB.001.WZSGRFL-P1-C4-T1 driver= pci addr=0006:01:00.0
EEH: of node=/pci@8002203/pci1014,415@0
EEH: PCI device/vendor: 673c15b3
EEH: PCI cmd/status register: 00100140



Re: [PATCH v5 00/21] EEH reorganization

2012-02-28 Thread Gavin Shan
Hi Ben,

Could you please take a look at this when you have time?

Thanks,
Gavin

 This series of patches reorganizes EEH so that it can support multiple
 platforms in the future. The requirements come from the following aspects.
 
   * The original EEH implementation only supports the pSeries platform, which
     is regarded as a guest system. The powernv platform is coming, and EEH
     needs to be supported on powernv as well.
   * Different platforms might run on different firmware. Furthermore, the
     firmware may supply different EEH interfaces to the kernel. Therefore,
     the current EEH implementation needs the necessary abstraction.
 
 In order to accommodate the requirements, the series of patches reorganizes
 the current EEH implementation.
 
   * The original implementation is not clean enough. The necessary cleanup
     is done in some of the patches.
   * struct eeh_ops has been introduced so that the EEH core components and
     the platform-dependent implementation can be split up. That makes it
     possible for EEH to be supported on multiple platforms.
   * struct eeh_dev has been introduced to replace struct pci_dn so that the
     EEH module works as independently as possible.
   * EEH global statistics will be maintained in a collective fashion.
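
The split between the EEH core and the platform code described above follows the usual ops-table pattern. The following user-space sketch shows the idea only. The names name and set_option come from the changelog in this cover letter, while the argument list, the registration helper and every *_model identifier are assumptions invented for the example and do not reflect the actual kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down ops table. Only `name` and `set_option` are named
 * in the changelog; the argument list is invented for illustration. */
struct eeh_ops_model {
    char *name;
    int (*set_option)(int pe_addr, int option);
};

static struct eeh_ops_model *registered_ops;

/* Core side: accept exactly one platform backend. */
static int eeh_ops_register_model(struct eeh_ops_model *ops)
{
    if (!ops || !ops->name || !ops->set_option || registered_ops)
        return -1;
    registered_ops = ops;
    return 0;
}

/* Core side: platform-independent entry point that defers to the backend. */
static int eeh_set_option_model(int pe_addr, int option)
{
    return registered_ops ? registered_ops->set_option(pe_addr, option) : -1;
}

/* Platform side: a stand-in "pseries" backend that just echoes the option. */
static int pseries_set_option_model(int pe_addr, int option)
{
    (void)pe_addr;
    return option;
}

static struct eeh_ops_model pseries_ops_model = {
    .name = "pseries",
    .set_option = pseries_set_option_model,
};
```

The core code only ever calls through the registered table, so a second platform would supply its own table and leave the core untouched.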
 
 v1 -> v2:
 
   * Where possible, add the eeh_ prefix to function names.
   * The format of leading function comments is not changed, in order not to
     break automatic generation of kernel documentation (e.g. by make pdfdocs).
   * The names of local variables are not changed unless there is an explicit
     reason.
   * Represent the PE's state as a bitmap.
   * Some function names have been adjusted so that they are shorter and more
     meaningful.
   * The platform operation name has been changed to pseries.
   * Merge the cleanup patches where possible.
   * Line lengths are kept appropriately short where possible.
   * Fix up alignment and spacing issues.
 
 v2 -> v3:
   * Split the cleanup patch into two: one for comment cleanup and another for
     renaming functions.
   * Use pr_warning/pr_info/pr_debug instead of printk() calls.
   * Function names are adjusted a little so that they are more meaningful,
     according to comments from Michael/Ben.
   * Useful comments have been kept according to Michael's comments.
   * struct eeh_ops::set_eeh has been changed to eeh_ops::set_option.
   * struct eeh_ops::name has been changed to char *.
   * Remove the file name from the source file.
   * The Copyright (C) format has been changed since using (C) isn't encouraged.
   * The header files included in the source file have been sorted
     alphabetically.
   * eeh_platform_init() has been replaced by eeh_pseries_init() to avoid
     duplicate functions when the kernel supports multiple platforms.
   * F/W has been changed to Firmware.
   * The maximal wait time to retrieve the PE's state is now defined by a macro.
   * It also includes changes according to the minor comments from Michael.
 
 v3 -> v4:
   * Fix some typos in the commit messages.
   * Reduce code nesting according to Ram's suggestions.
   * Additional pr_warning on failure to configure bridges.
 
 v4 -> v5:
   * The OF node and PCI device now refer to the corresponding EEH device
     through struct eeh_dev * instead of the original void *.
   * The conversions between OF node, PCI device and EEH device are now
     inline functions instead of the original macros.
   * struct eeh_stats has been moved from eeh.h to eeh.c. Besides, the
     individual members of the struct have been changed to the fixed type
     unsigned int.
 
 
 The series of patches (v5) has been verified on a Firebird-L machine. In order
 to carry out the test, you have to install IBM Power Tools from the IBM
 internal yum source. The following command is used to force an EEH check on
 the ethernet interface, from which EEH and the device driver eventually
 recover successfully. You could keep pinging the blade before issuing the
 command to force EEH. You should see that the network interface can't be
 reached for a moment and that everything recovers a couple of seconds after
 the forced EEH error. At the same time, you should see the EEH error log on
 the system console.
 
   * errinjct eeh -v -f 0 -p U78AE.001.WZS00M9-P1-C18-L1-T2 -a 0x0 -m 0x0
 
 -
 
 arch/powerpc/include/asm/device.h       |    3 +
 arch/powerpc/include/asm/eeh.h          |  134 +++-
 arch/powerpc/include/asm/eeh_event.h    |   33 +-
 arch/powerpc/include/asm/ppc-pci.h      |   89 +--
 arch/powerpc/kernel/of_platform.c       |    3 +
 arch/powerpc/kernel/rtas_pci.c          |    3 +
 arch/powerpc/platforms/pseries/Makefile  |