Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-16 Thread Nathan Fontenot
On 07/16/2015 01:23 AM, Thomas Huth wrote:
 On 07/15/2015 09:58 PM, Nathan Fontenot wrote:
 On 07/15/2015 09:35 AM, Thomas Huth wrote:
 On 07/14/2015 11:22 PM, Benjamin Herrenschmidt wrote:
 On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?

 A proper fix would be more fancy, the get_sensor should happen in a
 kernel thread instead.

 I'm not very familiar with this stuff, but isn't the EPOW interrupt
 something that is very time-critical? Moving parts of the handler into a
 kernel thread then does not sound like a very good idea to me...

 Another question: Can it happen at all that this get-sensor call results
 in a sleep condition? Looking at commit ID
 81b73dd92b97423b8f5324a59044da478c04f4c4 (Fix might-sleep warning on
 removing cpus), which apparently fixed a similar issue for CPU
 hot-plugging, indicates that at least some of the rtas calls are never
 returning the busy code? In that case we could fix this by introducing a
 similar rtas_get_sensor_fast() function? (or simply revert 587f83e8dd50d
 which would be quite similar, I think)


 Looking at the PAPR, the get-sensor-state rtas call for the EPOW sensor
 is listed as a fast call and should not return a busy indication.
 
 Great, good to know, thanks for looking that up! So IMHO we should
 either introduce a rtas_get_sensor_fast() function or revert
 587f83e8dd50d ... any preferences? Shall I come up with a patch?

A quick look at the kernel, I only find three places that rtas_get_sensor
is called. The instance you point out here for the EPOW sensor is the
only time I find it called for a sensor that should not return a busy
indication.

Reverting commit 587f83e8dd50d would solve the issue but not fix any future
users of a fast get-sensor call. I don't have an issue with a patch for a
rtas_get_sensor_fast().

-Nathan
 
 I'm curious as to why we're getting a busy return indication when
 making this call.
 
 Looking at the code again, rtas_busy_delay() likely never slept ... it's
 likely just the might_sleep() annotation in that function that causes
 the BUG.
 
  Thomas
 
 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-16 Thread Thomas Huth
On 07/15/2015 09:58 PM, Nathan Fontenot wrote:
 On 07/15/2015 09:35 AM, Thomas Huth wrote:
 On 07/14/2015 11:22 PM, Benjamin Herrenschmidt wrote:
 On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?

 A proper fix would be more fancy, the get_sensor should happen in a
 kernel thread instead.

 I'm not very familiar with this stuff, but isn't the EPOW interrupt
 something that is very time-critical? Moving parts of the handler into a
 kernel thread then does not sound like a very good idea to me...

 Another question: Can it happen at all that this get-sensor call results
 in a sleep condition? Looking at commit ID
 81b73dd92b97423b8f5324a59044da478c04f4c4 (Fix might-sleep warning on
 removing cpus), which apparently fixed a similar issue for CPU
 hot-plugging, indicates that at least some of the rtas calls are never
 returning the busy code? In that case we could fix this by introducing a
 similar rtas_get_sensor_fast() function? (or simply revert 587f83e8dd50d
 which would be quite similar, I think)

 
 Looking at the PAPR, the get-sensor-state rtas call for the EPOW sensor
 is listed as a fast call and should not return a busy indication.

Great, good to know, thanks for looking that up! So IMHO we should
either introduce a rtas_get_sensor_fast() function or revert
587f83e8dd50d ... any preferences? Shall I come up with a patch?

 I'm curious as to why we're getting a busy return indication when
 making this call.

Looking at the code again, rtas_busy_delay() likely never slept ... it's
likely just the might_sleep() annotation in that function that causes
the BUG.

 Thomas




signature.asc
Description: OpenPGP digital signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-15 Thread Nathan Fontenot
On 07/15/2015 09:35 AM, Thomas Huth wrote:
 On 07/14/2015 11:22 PM, Benjamin Herrenschmidt wrote:
 On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?

 A proper fix would be more fancy, the get_sensor should happen in a
 kernel thread instead.
 
 I'm not very familiar with this stuff, but isn't the EPOW interrupt
 something that is very time-critical? Moving parts of the handler into a
 kernel thread then does not sound like a very good idea to me...
 
 Another question: Can it happen at all that this get-sensor call results
 in a sleep condition? Looking at commit ID
 81b73dd92b97423b8f5324a59044da478c04f4c4 (Fix might-sleep warning on
 removing cpus), which apparently fixed a similar issue for CPU
 hot-plugging, indicates that at least some of the rtas calls are never
 returning the busy code? In that case we could fix this by introducing a
 similar rtas_get_sensor_fast() function? (or simply revert 587f83e8dd50d
 which would be quite similar, I think)
 

Looking at the PAPR, the get-sensor-state rtas call for the EPOW sensor
is listed as a fast call and should not return a busy indication.

I'm curious as to why we're getting a busy return indication when
making this call.

-Nathan

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-15 Thread Thomas Huth
On 07/14/2015 11:22 PM, Benjamin Herrenschmidt wrote:
 On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?
 
 A proper fix would be more fancy, the get_sensor should happen in a
 kernel thread instead.

I'm not very familiar with this stuff, but isn't the EPOW interrupt
something that is very time-critical? Moving parts of the handler into a
kernel thread then does not sound like a very good idea to me...

Another question: Can it happen at all that this get-sensor call results
in a sleep condition? Looking at commit ID
81b73dd92b97423b8f5324a59044da478c04f4c4 (Fix might-sleep warning on
removing cpus), which apparently fixed a similar issue for CPU
hot-plugging, indicates that at least some of the rtas calls are never
returning the busy code? In that case we could fix this by introducing a
similar rtas_get_sensor_fast() function? (or simply revert 587f83e8dd50d
which would be quite similar, I think)

 Thomas




signature.asc
Description: OpenPGP digital signature
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: BUG: sleeping function called from ras_epow_interrupt context

2015-07-14 Thread Benjamin Herrenschmidt
On Tue, 2015-07-14 at 20:43 +0200, Thomas Huth wrote:
 Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
 mdelay() instead of msleep() in rtas_busy_delay()? Something more
 fancy?

A proper fix would be more fancy, the get_sensor should happen in a
kernel thread instead.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

BUG: sleeping function called from ras_epow_interrupt context

2015-07-14 Thread Thomas Huth

 Hi all!

A colleague recently ran into some kernel BUG messages that happen when
hot-plugging a virtio disk to a KVM guest on powerpc (with virsh
attach-disk), and IIRC CONFIG_DEBUG_ATOMIC_SLEEP enabled. I've tried to
re-create the problem with an up-to-date kernel (4.2.0-rc2) and the
problem still seems to be there:

The hotplug action triggers the ras_epow_interrupt() in
arch/powerpc/platforms/pseries/ras.c, which again calls
rtas_get_sensor(). That function then uses rtas_busy_delay() to wait in
case the RTAS call did not succeed immediately. But rtas_busy_delay()
uses msleep() for sleeping - which is forbidden during an atomic
interrupt context!

Following backtrace is printed out by the kernel:

[   33.920528] BUG: sleeping function called from invalid context at
/home/thuth/devel/linux-up/arch/powerpc/kernel/rtas.c:496
[   33.920590] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
[   33.920624] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #1
[   33.920657] Call Trace:
[   33.920677] [c0007ffe79b0] [c07e43f4]
.dump_stack+0x98/0xd4 (unreliable)
[   33.920729] [c0007ffe7a30] [c00dcc78]
.___might_sleep+0x128/0x170
[   33.920769] [c0007ffe7aa0] [c0029f38]
.rtas_busy_delay+0x28/0xe0
[   33.920809] [c0007ffe7b20] [c002adb4]
.rtas_get_sensor+0x74/0xe0
[   33.920850] [c0007ffe7bc0] [c007ff58]
.ras_epow_interrupt+0x48/0x450
[   33.920896] [c0007ffe7c80] [c0119d94]
.handle_irq_event_percpu+0xa4/0x310
[   33.920942] [c0007ffe7d70] [c011a05c]
.handle_irq_event+0x5c/0xa0
[   33.920982] [c0007ffe7e00] [c011e7a8]
.handle_fasteoi_irq+0xe8/0x270
[   33.921028] [c0007ffe7e90] [c01190bc]
.generic_handle_irq+0x4c/0x80
[   33.921074] [c0007ffe7f10] [c0010a48] .__do_irq+0x88/0x1f0
[   33.921115] [c0007ffe7f90] [c0022a0c] .call_do_irq+0x14/0x24
[   33.921155] [c0007e6f37e0] [c0010c3c] .do_IRQ+0x8c/0x100
[   33.921195] [c0007e6f3880] [c0002594]
hardware_interrupt_common+0x114/0x180
[   33.921243] --- interrupt: 501 at .plpar_hcall_norets+0x14/0x20
[   33.921243] LR = .check_and_cede_processor+0x24/0x40
[   33.921300] [c0007e6f3b70] []   (null)
(unreliable)
[   33.921347] [c0007e6f3be0] [c0628068]
.shared_cede_loop+0x58/0x160
[   33.921393] [c0007e6f3c70] [c06259ac]
.cpuidle_enter_state+0xbc/0x3b0
[   33.921439] [c0007e6f3d30] [c00fe32c] .call_cpuidle+0x4c/0xa0
[   33.921479] [c0007e6f3db0] [c00fe700]
.cpu_startup_entry+0x380/0x4a0
[   33.921526] [c0007e6f3ed0] [c0043110]
.start_secondary+0x320/0x350
[   33.921571] [c0007e6f3f90] [c0008b6c]
start_secondary_prolog+0x10/0x14

I think that bug might have been introduced by commit
587f83e8dd50d22bc0c62 (Use rtas_get_sensor in RAS code) since the
rtas_busy_delay() was not called before that commit, as far as I can see.

Any suggestions how to fix this? Simply revert 587f83e8dd50d? Use
mdelay() instead of msleep() in rtas_busy_delay()? Something more fancy?

 Thanks,
  Thomas
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev