Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Benjamin Herrenschmidt

  Well, at the time of the sample, the other CPU indeed -seems- to be in
  an IRQ disabled section yes. 
 
 This is not really a sample. The hardirqs enable/disable is actually tracked
 using the TRACE_{EN,DIS}ABLE_INTS macros.

That's what I meant. I.e. the hardirq state was updated by the stuck CPU
but sampled by the non-stuck one, so the non-stuck one could have
sampled a transient value where it happened to have hard IRQs
disabled...

 For the decrementer, the interrupt code is generated by the
 STD_EXCEPTION_COMMON_LITE() macro.

Yeah, I know that :-)

 Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
 use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
 interrupts, without enabling them again...

Well, they aren't supposed to enable IRQs if they were disabled...

Ben.

 With kind regards,
 
 Geert Uytterhoeven
 Software Architect
 
 Sony Techsoft Centre Europe
 The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium
 
 Phone:+32 (0)2 700 8453
 Fax:  +32 (0)2 700 8622
 E-mail:   [EMAIL PROTECTED]
 Internet: http://www.sony-europe.com/
 
 A division of Sony Europe (Belgium) N.V.
 VAT BE 0413.825.160 · RPR Brussels
 Fortis · BIC GEBABEBB · IBAN BE41293037680010

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Geert Uytterhoeven
On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
 On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote:
  On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
     Well, at the time of the sample, the other CPU indeed -seems- to be in
     an IRQ disabled section yes.

    This is not really a sample. The hardirqs enable/disable is actually tracked
    using the TRACE_{EN,DIS}ABLE_INTS macros.
   
   That's what I meant. I.e. the hardirq state was updated by the stuck CPU
   but sampled by the non-stuck one, so the non-stuck one could have
   sampled a transient value where it happened to have hard IRQs
   disabled...
  
  These states are per_cpu.
 
 I know, but that doesn't prevent another CPU from peeking at them :-)
 The question is, was the message printed by the CPU that locked up or by
 the other one that detected the lockup ?

It's printed by the spinlock debug code, i.e. by the CPU that wants to take the
spinlock (in this case the spinlock for the BKL).
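To make that point concrete, here is a minimal user-space sketch of a bounded try-lock loop that reports a lockup. It is not the kernel's __spin_lock_debug(); every name in it (model_lock, model_spin_lock_debug, and so on) is hypothetical, and giving up after the report is a simplification. What it is meant to show is only that any state included in the report, such as an irqs_disabled() value, describes the CPU that is waiting for the lock, not the one holding it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
struct model_lock {
    atomic_flag held;
};

static bool model_trylock(struct model_lock *lock)
{
    /* test_and_set returns the previous value: false means we took the lock. */
    return !atomic_flag_test_and_set(&lock->held);
}

static void model_unlock(struct model_lock *lock)
{
    atomic_flag_clear(&lock->held);
}

/*
 * Try the lock up to max_tries times. On failure, print a report that
 * describes the calling (waiting) CPU only, then give up.
 * Returns true if the lock was taken.
 */
static bool model_spin_lock_debug(struct model_lock *lock, long max_tries,
                                  int waiting_cpu, bool waiter_irqs_disabled)
{
    assert(max_tries > 0);
    for (long i = 0; i < max_tries; i++) {
        if (model_trylock(lock))
            return true;
    }
    /* Both values below belong to the waiter, not to the lock holder. */
    fprintf(stderr, "model: spinlock lockup on CPU#%d, irqs_disabled=%d\n",
            waiting_cpu, (int)waiter_irqs_disabled);
    return false;
}
```

In this model a second call on an already held lock fails and prints the waiter's CPU number and interrupt state; nothing in the report says anything about the holder.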

  They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
  So this makes the actual state recording useless...
 
 Well, they record that when they disable it. They don't enable it. Can
 you find a spot where the IRQ is enabled and it's not recorded or a case
 where it's not disabled and recorded as disabled ?

I guess it's auto-enabled when the decrementer interrupt handler exits?
So shouldn't there be a `bl trace_hardirqs_on' somewhere?

With kind regards,

Geert Uytterhoeven

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Benjamin Herrenschmidt
On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote:
 On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
  On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
   which points again to smp_call_function_single...
  
  Yup, it doesn't bring more information. At this stage, your 'other' CPU
  is stuck with interrupts disabled. Hard to tell what's happening without
  some HW assist. Do you have ways to trigger a non-maskable interrupt
  such as a 0x100 ? That would allow to catch the other guy in xmon and
  see what it was doing...
 
 Interrupts are not disabled on the other CPU thread, at least not according to
 the irqs_disabled() check I added to the printing of the `spinlock lockup'
 message in __spin_lock_debug().
 
 As the log also said
 
 | hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
 | hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180
 
 I started blinking the LEDs on decrementer interrupts, which do arrive on both
 CPU threads.

Hrm, ok I thought the log shows the decrementer interrupt of the thread
that's still working. If you are confident they are both taking
interrupts, then there's indeed something to track down.

 However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
 as they do indicate interrupts are off...

Well, at the time of the sample, the other CPU indeed -seems- to be in
an IRQ disabled section yes. 

Cheers,
Ben.




Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Geert Uytterhoeven
On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
 On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
  which points again to smp_call_function_single...
 
 Yup, it doesn't bring more information. At this stage, your 'other' CPU
 is stuck with interrupts disabled. Hard to tell what's happening without
 some HW assist. Do you have ways to trigger a non-maskable interrupt
 such as a 0x100 ? That would allow to catch the other guy in xmon and
 see what it was doing...

Interrupts are not disabled on the other CPU thread, at least not according to
the irqs_disabled() check I added to the printing of the `spinlock lockup'
message in __spin_lock_debug().

As the log also said

| hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
| hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180

I started blinking the LEDs on decrementer interrupts, which do arrive on both
CPU threads.

However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
as they do indicate interrupts are off...

 It could be something in the ps3vram driver causing the kernel to
 lock up. Now the question is whether the kernel is stuffed with
 something like a deadlock with interrupts off, or is it a HW problem
 causing a CPU to lock up on an access to the vram ?

It's not related to the ps3vram driver, as it happens without, too.

With kind regards,

Geert Uytterhoeven

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Geert Uytterhoeven
On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
   Well, at the time of the sample, the other CPU indeed -seems- to be in
   an IRQ disabled section yes. 
  
  This is not really a sample. The hardirqs enable/disable is actually tracked
  using the TRACE_{EN,DIS}ABLE_INTS macros.
 
 That's what I meant. I.e. the hardirq state was updated by the stuck CPU
 but sampled by the non-stuck one, so the non-stuck one could have
 sampled a transient value where it happened to have hard IRQs
 disabled...

These states are per_cpu.

  Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
  use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
  interrupts, without enabling them again...
 
 Well, they aren't supposed to enable IRQs if they were disabled...

They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
So this makes the actual state recording useless...

With kind regards,

Geert Uytterhoeven

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Geert Uytterhoeven
On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
 On Wed, 2008-10-15 at 11:25 +0200, Geert Uytterhoeven wrote:
  On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
   On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:
    which points again to smp_call_function_single...
   
   Yup, it doesn't bring more information. At this stage, your 'other' CPU
   is stuck with interrupts disabled. Hard to tell what's happening without
   some HW assist. Do you have ways to trigger a non-maskable interrupt
   such as a 0x100 ? That would allow to catch the other guy in xmon and
   see what it was doing...
  
  Interrupts are not disabled on the other CPU thread, at least not according to
  the irqs_disabled() check I added to the printing of the `spinlock lockup'
  message in __spin_lock_debug().
  
  As the log also said
  
  | hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
  | hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180
  
  I started blinking the LEDs on decrementer interrupts, which do arrive on both
  CPU threads.
 
 Hrm, ok I thought the log shows the decrementer interrupt of the thread
 that's still working. If you are confident they are both taking
 interrupts, then there's indeed something to track down.
 
  However, I'm a bit puzzled by these `hardirqs last enabled/disabled' messages,
  as they do indicate interrupts are off...
 
 Well, at the time of the sample, the other CPU indeed -seems- to be in
 an IRQ disabled section yes. 

This is not really a sample. The hardirqs enable/disable is actually tracked
using the TRACE_{EN,DIS}ABLE_INTS macros.

For the decrementer, the interrupt code is generated by the
STD_EXCEPTION_COMMON_LITE() macro.

Aha, none of the PPC interrupt handlers actually use TRACE_ENABLE_INTS (they do
use TRACE_DISABLE_INTS). So that's why it thinks decrementer_common disabled
interrupts, without enabling them again...

With kind regards,

Geert Uytterhoeven

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Benjamin Herrenschmidt
On Wed, 2008-10-15 at 13:46 +0200, Geert Uytterhoeven wrote:
 On Wed, 15 Oct 2008, Benjamin Herrenschmidt wrote:
    Well, at the time of the sample, the other CPU indeed -seems- to be in
    an IRQ disabled section yes.
   
   This is not really a sample. The hardirqs enable/disable is actually tracked
   using the TRACE_{EN,DIS}ABLE_INTS macros.
  
  That's what I meant. I.e. the hardirq state was updated by the stuck CPU
  but sampled by the non-stuck one, so the non-stuck one could have
  sampled a transient value where it happened to have hard IRQs
  disabled...
 
 These states are per_cpu.

I know, but that doesn't prevent another CPU from peeking at them :-)
The question is, was the message printed by the CPU that locked up or by
the other one that detected the lockup ?

 They do call TRACE_DISABLE_INTS, which records the interrupt being disabled.
 So this makes the actual state recording useless...

Well, they record that when they disable it. They don't enable it. Can
you find a spot where the IRQ is enabled and it's not recorded or a case
where it's not disabled and recorded as disabled ?

Cheers,
Ben.




Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-15 Thread Benjamin Herrenschmidt
On Wed, 2008-10-15 at 14:05 +0200, Geert Uytterhoeven wrote:
 I guess it's auto-enabled when the decrementer interrupt handler exits?
 So shouldn't there be a `bl trace_hardirqs_on' somewhere?

The interrupts are restored to their previous state on exit of
interrupts via the TRACE_AND_RESTORE_IRQ() macro which is called
from entry_64.S in the main restore path and in head_64.S in the
fast path and hashing faults.
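
As a rough illustration of "restored to their previous state", here is a small user-space model in C. It is only a sketch and does not reproduce TRACE_AND_RESTORE_IRQ() or the real entry/exit assembly; the struct and all model_* function names are hypothetical. It assumes that a disable or enable event is recorded only on an actual state change, and it shows two things: the exit path re-enables (and records the enable) only if interrupts were enabled before the exception, and a snapshot taken between entry and exit shows a "last disabled" event that is newer than the "last enabled" one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's traced hardirq state. */
struct irq_trace_model {
    bool enabled;                   /* modeled interrupt-enable state */
    unsigned long event_stamp;      /* counts recorded transitions */
    unsigned long last_enabled_at;  /* stamp of the last "enabled" event */
    unsigned long last_disabled_at; /* stamp of the last "disabled" event */
};

/* Record a disable event, but only on a real enabled -> disabled transition. */
static void model_trace_off(struct irq_trace_model *t)
{
    if (t->enabled) {
        t->enabled = false;
        t->last_disabled_at = ++t->event_stamp;
    }
}

/* Record an enable event, but only on a real disabled -> enabled transition. */
static void model_trace_on(struct irq_trace_model *t)
{
    if (!t->enabled) {
        t->enabled = true;
        t->last_enabled_at = ++t->event_stamp;
    }
}

/* Exception entry: remember the interrupted state, then disable and record. */
static bool model_exception_entry(struct irq_trace_model *t)
{
    bool previous = t->enabled;
    model_trace_off(t);
    return previous;
}

/* Exception exit: restore the interrupted state, whatever it was. */
static void model_exception_exit(struct irq_trace_model *t, bool previous)
{
    assert(!t->enabled); /* in this model the handler body runs with IRQs off */
    if (previous)
        model_trace_on(t);
}

/* True when the newest recorded event is a disable. */
static bool model_snapshot_shows_irqs_off(const struct irq_trace_model *t)
{
    return t->last_disabled_at > t->last_enabled_at;
}
```

In this model, a report printed from inside a handler would always show the handler's own entry as the last disable event, so such a pair of stamps by itself does not prove that an enable went unrecorded.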

Cheers,
Ben.






Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-14 Thread Geert Uytterhoeven
On Mon, 13 Oct 2008, Geert Uytterhoeven wrote:
 On Sun, 12 Oct 2008, Aaron Tokhy wrote:
  I recently built 2.6.27 with these patches on my PS3.
  
  http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
  http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch
  
  These patches enable the 'ps3vram' module, which creates a MTD node
 
  Now I am not sure if the patch is the issue.  None of the functions in
 
 No, we've seen similar things happen without ps3vram, too.
 
  BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
  Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
  ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
  ps3_lpm soundcore
  irq event stamp: 5018780
  hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
  hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180
  softirqs last  enabled at (5018778): [c0020928] .call_do_softirq+0x14/0x24
  softirqs last disabled at (5018773): [c0020928] .call_do_softirq+0x14/0x24
  NIP: c0084110 LR: c0084468 CTR: c03181d0
  REGS: c6f37280 TRAP: 0901   Not tainted  (2.6.27)
  MSR: 80008032 EE,IR,DR  CR: 42004424  XER: 
  TASK = c798[22788] 'top' THREAD: c6f34000 CPU: 0
  GPR00: 0001 c6f37500 c05543d0 c6f37570
  GPR04:  c008427c 0001 
  GPR08: 0830 0001  c0b96874
  GPR12: 80008032 c0586300
  NIP [c0084110] .csd_flag_wait+0x14/0x1c
  LR [c0084468] .smp_call_function_single+0x13c/0x164
  Call Trace:
  [c6f37500] [c0084468] .smp_call_function_single+0x13c/0x164 (unreliable)
 
 smp_call_function_single() causes an IPI to be sent to the other CPU thread.
 However, the IPI never seems to arrive at the other CPU thread, causing the
 soft lockup message to be printed on the console.
 
 If this happens when the BKL is held before sending the IPI, the system will
 deadlock when the other CPU thread tries to acquire the BKL. In that
 unfortunate case, you won't see any message on the console of a retail PS3,
 though.
 
 So far we do not know what's the exact cause of the IPI not arriving, hence
 suggestions are welcome.

I've enabled the recently introduced CONFIG_RCU_CPU_STALL_DETECTOR option and
got:

| <3>RCU detected CPU 1 stall (t=4295279718/750 jiffies)
| Call Trace:
| [c00013e5a940] [c000f314] .show_stack+0x70/0x184 (unreliable)
| [c00013e5a9f0] [c009029c] .__rcu_pending+0x9c/0x2b4
| [c00013e5aa90] [c00904ec] .rcu_pending+0x38/0x84
| [c00013e5ab10] [c005d9f0] .update_process_times+0x40/0x8c
| [c00013e5aba0] [c0076d4c] .tick_sched_timer+0x154/0x1bc
| [c00013e5ac60] [c006e630] .__run_hrtimer+0x8c/0x128
| [c00013e5ad00] [c006f60c] .hrtimer_interrupt+0x10c/0x1c8
| [c00013e5add0] [c001d2d0] .timer_interrupt+0xcc/0x124
| [c00013e5ae80] [c0003614] decrementer_common+0x114/0x180
| --- Exception: 901 at .csd_flag_wait+0x4/0x1c
| LR = .smp_call_function_single+0x13c/0x164
| [c00013e5b230] [c0082774] .smp_call_function_mask+0xe4/0x240
| [c00013e5b390] [c00566dc] .on_each_cpu+0x24/0x94
| [c00013e5b430] [c00998bc] .drain_all_pages+0x24/0x3c
| [c00013e5b4b0] [c0099ba4] .__alloc_pages_internal+0x2d0/0x464
| [c00013e5b5b0] [c00bb158] .cache_alloc_refill+0x340/0x678
| [c00013e5b680] [c00bb574] .__kmalloc+0xe4/0x170
| [c00013e5b720] [c0297e18] .__alloc_skb+0x7c/0x154
| [c00013e5b7c0] [c02923a8] .sock_alloc_send_skb+0xc4/0x2a4
| [c00013e5b8a0] [c030a464] .unix_stream_sendmsg+0x178/0x384
| [c00013e5b990] [c028e234] .sock_aio_write+0xec/0x114
| [c00013e5baa0] [c00bf2dc] .do_sync_readv_writev+0xc8/0x130
| [c00013e5bc30] [c00fefa0] .compat_do_readv_writev+0x1e0/0x33c
| [c00013e5bd90] [c00ff184] .compat_sys_writev+0x88/0xbc
| [c00013e5be30] [c00074dc] syscall_exit+0x0/0x40

which points again to smp_call_function_single...

With kind regards,

Geert Uytterhoeven

Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-14 Thread Benjamin Herrenschmidt
On Tue, 2008-10-14 at 11:32 +0200, Geert Uytterhoeven wrote:

 
 which points again to smp_call_function_single...

Yup, it doesn't bring more information. At this stage, your 'other' CPU
is stuck with interrupts disabled. Hard to tell what's happening without
some HW assist. Do you have ways to trigger a non-maskable interrupt
such as a 0x100 ? That would allow to catch the other guy in xmon and
see what it was doing...

It could be something in the ps3vram driver causing the kernel to
lock up. Now the question is whether the kernel is stuffed with
something like a deadlock with interrupts off, or is it a HW problem
causing a CPU to lock up on an access to the vram ?

I'm afraid only you guys have tools that would allow to debug that
sort of problem.

Cheers,
Ben.



Re: [PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-13 Thread Geert Uytterhoeven
On Sun, 12 Oct 2008, Aaron Tokhy wrote:
 I recently built 2.6.27 with these patches on my PS3.
 
 http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
 http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch
 
 These patches enable the 'ps3vram' module, which creates a MTD node

 Now I am not sure if the patch is the issue.  None of the functions in

No, we've seen similar things happen without ps3vram, too.

 BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
 Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
 ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
 ps3_lpm soundcore
 irq event stamp: 5018780
 hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
 hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180
 softirqs last  enabled at (5018778): [c0020928] .call_do_softirq+0x14/0x24
 softirqs last disabled at (5018773): [c0020928] .call_do_softirq+0x14/0x24
 NIP: c0084110 LR: c0084468 CTR: c03181d0
 REGS: c6f37280 TRAP: 0901   Not tainted  (2.6.27)
 MSR: 80008032 EE,IR,DR  CR: 42004424  XER: 
 TASK = c798[22788] 'top' THREAD: c6f34000 CPU: 0
 GPR00: 0001 c6f37500 c05543d0 c6f37570
 GPR04:  c008427c 0001 
 GPR08: 0830 0001  c0b96874
 GPR12: 80008032 c0586300
 NIP [c0084110] .csd_flag_wait+0x14/0x1c
 LR [c0084468] .smp_call_function_single+0x13c/0x164
 Call Trace:
 [c6f37500] [c0084468] .smp_call_function_single+0x13c/0x164 (unreliable)

smp_call_function_single() causes an IPI to be sent to the other CPU thread.
However, the IPI never seems to arrive at the other CPU thread, causing the
soft lockup message to be printed on the console.

If this happens when the BKL is held before sending the IPI, the system will
deadlock when the other CPU thread tries to acquire the BKL. In that
unfortunate case, you won't see any message on the console of a retail PS3,
though.

So far we do not know what's the exact cause of the IPI not arriving, hence
suggestions are welcome.
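
For readers who have not looked at this code path, the handshake described above can be sketched in plain user-space C. This is a single-threaded model, not the code from kernel/smp.c: the structs and model_* functions are hypothetical, CSD_FLAG_WAIT only borrows the kernel's name (the value here is illustrative), and the poll budget is an addition of the model so that it terminates. The sender posts a request with the wait flag set and polls until the target's IPI handler has run the function and cleared the flag; if the IPI is never delivered, the flag is never cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CSD_FLAG_WAIT 0x01 /* name borrowed from the kernel; value illustrative */

/* Hypothetical model of a cross-CPU call request. */
struct csd_model {
    void (*func)(void *info);
    void *info;
    unsigned int flags;
};

/* Hypothetical model of the target CPU thread. */
struct cpu_model {
    bool ipi_delivered;        /* false models an IPI that never arrives */
    struct csd_model *pending; /* request posted by the sender, if any */
};

/* Target side: runs the requested function, then releases the sender. */
static void model_ipi_handler(struct cpu_model *cpu)
{
    struct csd_model *csd = cpu->pending;

    if (csd == NULL)
        return;
    cpu->pending = NULL;
    csd->func(csd->info);
    csd->flags &= ~CSD_FLAG_WAIT;
}

/*
 * Sender side: post the request, then poll the wait flag.
 * Returns the number of polls needed, or -1 if the flag was still set
 * after max_polls (the model's stand-in for "stuck for 61s").
 */
static long model_call_function_single(struct cpu_model *target,
                                       void (*func)(void *), void *info,
                                       long max_polls)
{
    struct csd_model csd = { func, info, CSD_FLAG_WAIT };

    assert(max_polls > 0);
    target->pending = &csd;
    for (long polls = 1; polls <= max_polls; polls++) {
        if (target->ipi_delivered)
            model_ipi_handler(target);
        if (!(csd.flags & CSD_FLAG_WAIT))
            return polls;
    }
    target->pending = NULL; /* do not leave a pointer to our stack behind */
    return -1;
}

/* Example payload: increments the int that info points to. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

With ipi_delivered set, the call returns after one poll and the payload has run once; with it cleared, the payload never runs and the sender only gets out because of the model's poll budget.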

With kind regards,

Geert Uytterhoeven

[PROBLEM] Soft lockup on Linux 2.6.27, 2 patches, Cell/PPC64

2008-10-11 Thread Aaron Tokhy
Hello,

I recently built 2.6.27 with these patches on my PS3.

http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-driver.patch
http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-patches/ps3-wip/ps3vram-proc-fs.patch

These patches enable the 'ps3vram' module, which creates a MTD node
/dev/mtdblock0.  In addition to the 256 MB of XDR ram used by the
system, I can use 245 MB of the video ram as a fast swap (getting a
somewhat valuable 60 MB/s read/write speed on a random access device).
I was using the mtdblock0 as a swap space when the soft lockup occurred
while leaving the `top` program open.

Now I am not sure if the patch is the issue.  None of the functions in
that list are functions in the patch... but this is my first time at
debugging a kernel bug, some of the functions have the word 'page' so it
might be due to problems occurring while paging to that mtdblock0
device, but surely calls to the functions in that patch would appear.
How would I start debugging this?

The trace is also available in pastebin: http://pastebin.com/m2ea72e52

BUG: soft lockup - CPU#0 stuck for 61s! [top:22788]
Modules linked in: evdev hci_usb usbhid bluetooth usb_storage snd_ps3
ehci_hcd snd_pcm ohci_hcd snd_page_alloc snd_timer usbcore snd sg
ps3_lpm soundcore
irq event stamp: 5018780
hardirqs last  enabled at (5018779): [c0007c1c] restore+0x1c/0xe4
hardirqs last disabled at (5018780): [c0003600] decrementer_common+0x100/0x180
softirqs last  enabled at (5018778): [c0020928] .call_do_softirq+0x14/0x24
softirqs last disabled at (5018773): [c0020928] .call_do_softirq+0x14/0x24
NIP: c0084110 LR: c0084468 CTR: c03181d0
REGS: c6f37280 TRAP: 0901   Not tainted  (2.6.27)
MSR: 80008032 EE,IR,DR  CR: 42004424  XER: 
TASK = c798[22788] 'top' THREAD: c6f34000 CPU: 0
GPR00: 0001 c6f37500 c05543d0 c6f37570
GPR04:  c008427c 0001 
GPR08: 0830 0001  c0b96874
GPR12: 80008032 c0586300
NIP [c0084110] .csd_flag_wait+0x14/0x1c
LR [c0084468] .smp_call_function_single+0x13c/0x164
Call Trace:
[c6f37500] [c0084468] .smp_call_function_single+0x13c/0x164 (unreliable)
[c6f375c0] [c0084578] .smp_call_function_mask+0xe8/0x244
[c6f37720] [c005809c] .on_each_cpu+0x24/0x9c
[c6f377c0] [c009bde4] .drain_all_pages+0x24/0x3c
[c6f37840] [c009c0c8] .__alloc_pages_internal+0x2cc/0x464
[c6f37950] [c00c3d54] .__slab_alloc+0x1f8/0x6cc
[c6f37a10] [c00c466c] .kmem_cache_alloc+0x74/0x108
[c6f37ab0] [c00cd200] .get_empty_filp+0x98/0x1a0
[c6f37b40] [c00d9fa0] .__path_lookup_intent_open+0x40/0xd0
[c6f37bf0] [c00da294] .do_filp_open+0xc0/0x7f0
[c6f37d80] [c00c9818] .do_sys_open+0x88/0x154
[c6f37e30] [c00076dc] syscall_exit+0x0/0x40
Instruction dump:
2f88 3860fff0 409e000c f88b0008 3860 ebc1fff0 4e800020 7c0004ac
80030020 780907e1 4d820020 7c210b78 7c421378 4be8 4e800020 7c0802a6

--
-Thanks
Aaron Tokhy