Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Oleg Nesterov
On 10/08, Geert Uytterhoeven wrote:
>
> Just wondering. As this is on an SMP system, perhaps the
> read_barrier_depends() vs. smp_read_barrier_depends() matters
> here?
> http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html

Yes, thanks, I do remember about this ;)

This will come as 2/2 which also removes the unnecessary "work = NULL"
initialization.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Geert Uytterhoeven
On Sun, Oct 7, 2012 at 9:16 PM, Oleg Nesterov  wrote:
> On 10/07, Thorsten Kranzkowski wrote:
>> On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
>> > On 10/07, Oleg Nesterov wrote:
>> > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
>> > > But _it seems_ to me that do_notify_resume() is called with irqs 
>> > > disabled.
>> > > If this is true, then imho arch/alpha should be fixed.
>> > >
>> > > Before this commit task_work_run() enabled irqs, but this was the "side
>> > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.
>> >
>> > Could you please test the debugging patch below?
>>
>> Of course. With that patch applied the kernel (ac3d0da) boots again. The 
>> trace line
>> is printed about once a second, with values '2' and '4'.
>
> Thanks a lot Thorsten!
>
> So I'll probably send the patch which enables interrupts in
> task_work_run(). I guess this needs "if (irqs_disabled())"
> for lockdep.
>
> The question is, should I add the warning to remind that this
> arch needs a fix?

Just wondering. As this is on an SMP system, perhaps the
read_barrier_depends() vs. smp_read_barrier_depends() matters
here?
http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Oleg Nesterov
On 10/07, Al Viro wrote:
>
> On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:
>
> > > Um...  There's a bunch of architectures that are in the same situation.
> > > grep for do_notify_resume() and you'll see...
> >
> > And every do_notify_resume() should be changed anyway, do_signal() and
> > tracehook_notify_resume() should be re-ordered.
>
> There's a bit more to it.
> [...big snip...]

Thanks Al. I need to read your email carefully.

But what do you think we should do right now to fix this particular
bug?

So far I am going to send the patch below. This was always wrong,
even before task_works were added. key_replace_session_keyring()
was might_sleep() too but it also did lock_irq/unlock_irq, so this
was not noticed before.

Or do you think we should add local_irq_enable() into
arch/alpha/kernel/signal.c:do_notify_resume() before
tracehook_notify_resume and wait for other similar report?

Oleg.
---

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 65bd3c9..fd18bd7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -51,6 +51,10 @@ void task_work_run(void)
struct task_struct *task = current;
struct callback_head *work, *head, *next;
 
+   if (WARN_ONCE(irqs_disabled(),
+ "do_notify_resume() with irqs disabled, fix this arch."))
+   local_irq_enable();
+
for (;;) {
/*
 * work->func() can do task_work_add(), do not set

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Dialup Jon Norstog
Hello!  I'm an Alpha user - I just want to thank you all for working to keep
Linux current on this architecture.  I am still using the last working Alpha
Core release ... I hope to keep the old beast running for many more years!

Jon Norstog

www.thursdaybicycles.com


On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote
> On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:
> 
> > > Um...  There's a bunch of architectures that are in the same situation.
> > > grep for do_notify_resume() and you'll see...
> > 
> > And every do_notify_resume() should be changed anyway, do_signal() and
> > tracehook_notify_resume() should be re-ordered.
> 
> There's a bit more to it.  The thing is, we have quite a mess around
> the signal-handling loops, mixed with that regarding the signal restarts.
> On arm it's done about right by now:
>   * looping until all signals had been handled is done in C;
> none of that "loop in asm glue" nonsense anymore.
>   * prevention of double restarts is *also* there, TYVM.
>   * do_work_pending() is called with interrupts disabled.
> It may return 0, in which case we are done, interrupts are disabled
> and the caller should proceed to userland without reenabling them
> until it leaves.  Otherwise we have a syscall restart to handle and
> no userland signal handler had been invoked.  Interrupts are enabled
> and we should simply reload arguments and syscall number from pt_regs
> and proceed to syscall entry, without returning to userland.  The 
> only twist is that negative return value means ERESTART_RESTARTBLOCK 
> kind of restart, in which case we need to use __NR_restart_syscall 
> for syscall number.
> 
> Note that we do *not* go through return to userland and reentering 
> the kernel on handlerless syscall restarts.  S390 uses the same 
> model, but there it's done in assembler glue - for no good reason. 
>  Should be in straight C.
> 
> For alpha there's another twist, though - there we do _not_ save all
> registers in pt_regs; there's a fairly large chunk of callee-saved
> registers we don't need to protect from being messed by C parts of
> the kernel.  We do need to save them in sigcontext, though.  So alpha
> (and quite a few other architctures) has separate struct switch_stack
> 
> (named so since switch_to() needs to save/restore the same registers)
> . Rules:  * on fork() et.al. we save those callee-saved registers in 
> struct switch_stack, right next to pt_regs.  We do that before 
> calling the actual sys_fork() and have copy_thread() copy these guys 
> into child.  Remember that newborns are first woken up in ret_from_fork
> and as with all context switches they go through switch_to().  So these
> registers are restored by the time the sucker wakes up.
>   * on signal delivery we save those registers in struct switch_stack
> and use it, along with pt_regs it lives next to, to fill sigcontext.
>   * ptrace counts on those suckers being next to pt_regs.  That allows
> tracer to modify tracee's registers, including callee-saved ones.  
> So we
> (1) restore them from switch_stack once we are done with do_signal() 
> and
> (2) save/restore them around another place where we can get stopped for
> tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
>   * on sigreturn/rt_sigreturn we need to restore all registers.
> So we reserve switch_stack on stack, next to pt_regs and have the C 
> part of sigreturn fill those along with pt_regs.  Once we are done,
>  read those registers from switch_stack.
> 
> That's more or less it; many other architectures are doing more or less
> similar things, but not all of them put that stuff into separate structure.
> 
> E.g. another valid solution is to leave space in pt_regs, fill only 
> a subset on entry and have switch_to() save stuff in task_struct 
> instead of putting it on kernel stack.
> 
> What it means for us is that saving all that crap on stack should *not*
> be done unless we have work to do.  OTOH, in situations when we have
> more than one pending signal it's bloody dumb to save/restore around
> each do_notify_resume() call separately.  OTTH, in situation when 
> we'd run out of timeslice and had nothing arrive until we'd regained 
> CPU save/restore around schedule() is pointless at the very least. 
>  So for things like alpha I'd do this:
> 
>   interrupts disabled
>   check thread flags
>   no work to do => bugger off to userland
>   just NEED_RESCHED?
>   schedule()
>   reread thread flags
>   no work to do => bugger off to userland
>   save callee-saved registers
>   call do_work_pending
>   restore callee-saved registers
>   if do_work_pendign returned 0 => bugger off to userland
>   deal with handlerless restart
> 
> Note that the loop around do_signal() and friends is in C and is fairly
> similar to what we've got on ARM.  x86 is in intermediate situation -
> the main complication there is v86 crap.

Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Dialup Jon Norstog
Hello!  I'm an Alpha user - I just want to thank you ll for working to keep
Linux current on this architecture.  I am still using the last working Alpha
Core release ... I hope to keep the old beast running for many more years!

Jon Norstog

www.thursdaybicycles.com


On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote
> On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:
> 
> > > Um...  There's a bunch of architectures that are in the same situation.
> > > grep for do_notify_resume() and you'll see...
> > 
> > And every do_notify_resume() should be changed anyway, do_signal() and
> > tracehook_notify_resume() should be re-ordered.
> 
> There's a bit more to it.  The thing is, we have quite a mess around
> the signal-handling loops, mixed with that regarding the signal restarts.
> On arm it's done about right by now:
>   * looping until all signals had been handled is done in C;
> none of that "loop in asm glue" nonsense anymore.
>   * prevention of double restarts is *also* there, TYVM.
>   * do_work_pending() is called with interrupts disabled.
> It may return 0, in which case we are done, interrupts are disabled
> and the caller should proceed to userland without reenabling them
> until it leaves.  Otherwise we have a syscall restart to handle and
> no userland signal handler had been invoked.  Interrupts are enabled
> and we should simply reload arguments and syscall number from pt_regs
> and proceed to syscall entry, without returning to userland.  The 
> only twist is that negative return value means ERESTART_RESTARTBLOCK 
> kind of restart, in which case we need to use __NR_restart_syscall 
> for syscall number.
> 
> Note that we do *not* go through return to userland and reentering 
> the kernel on handlerless syscall restarts.  S390 uses the same 
> model, but there it's done in assembler glue - for no good reason. 
>  Should be in straight C.
> 
> For alpha there's another twist, though - there we do _not_ save all
> registers in pt_regs; there's a fairly large chunk of callee-saved
> registers we don't need to protect from being messed by C parts of
> the kernel.  We do need to save them in sigcontext, though.  So alpha
> (and quite a few other architctures) has separate struct switch_stack
> 
> (named so since switch_to() needs to save/restore the same registers)
> . Rules:  * on fork() et.al. we save those callee-saved registers in 
> struct switch_stack, right next to pt_regs.  We do that before 
> calling the actual sys_fork() and have copy_thread() copy these guys 
> into child.  Remember that newborns are first woken up in ret_from_fork
> and as with all context switches they go through switch_to().  So these
> registers are restored by the time the sucker wakes up.
>   * on signal delivery we save those registers in struct switch_stack
> and use it, along with pt_regs it lives next to, to fill sigcontext.
>   * ptrace counts on those suckers being next to pt_regs.  That allows
> tracer to modify tracee's registers, including callee-saved ones.  
> So we
> (1) restore them from switch_stack once we are done with do_signal() 
> and
> (2) save/restore them around another place where we can get stopped for
> tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
>   * on sigreturn/rt_sigreturn we need to restore all registers.
> So we reserve switch_stack on stack, next to pt_regs and have the C 
> part of sigreturn fill those along with pt_regs.  Once we are done,
>  read those registers from switch_stack.
> 
> That's more or less it; many other architectures are doing more or less
> similar things, but not all of them put that stuff into separate structure.
> 
> E.g. another valid solution is to leave space in pt_regs, fill only 
> a subset on entry and have switch_to() save stuff in task_struct 
> instead of putting it on kernel stack.
> 
> What it means for us is that saving all that crap on stack should *not*
> be done unless we have work to do.  OTOH, in situations when we have
> more than one pending signal it's bloody dumb to save/restore around
> each do_notify_resume() call separately.  OTTH, in situation when 
> we'd run out of timeslice and had nothing arrive until we'd regained 
> CPU save/restore around schedule() is pointless at the very least. 
>  So for things like alpha I'd do this:
> 
>   interrupts disabled
>   check thread flags
>   no work to do => bugger off to userland
>   just NEED_RESCHED?
>   schedule()
>   reread thread flags
>   no work to do => bugger off to userland
>   save callee-saved registers
>   call do_work_pending
>   restore callee-saved registers
>   if do_work_pendign returned 0 => bugger off to userland
>   deal with handlerless restart
> 
> Note that the loop around do_signal() and friends is in C and is fairly
> similar to what we've got on ARM.  x86 is in intermediate situation -
> the main complication there is v86 crap.

Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Dialup Jon Norstog
Hello!  I'm an Alpha user - I just want to thank you ll for working to keep
Linux current on this architecture.  I am still using the last working Alpha
Core release ... I hope to keep the old beast running for many more years!

Jon Norstog

www.thursdaybicycles.com


On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote
 On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:
 
   Um...  There's a bunch of architectures that are in the same situation.
   grep for do_notify_resume() and you'll see...
  
  And every do_notify_resume() should be changed anyway, do_signal() and
  tracehook_notify_resume() should be re-ordered.
 
 There's a bit more to it.  The thing is, we have quite a mess around
 the signal-handling loops, mixed with that regarding the signal restarts.
 On arm it's done about right by now:
   * looping until all signals had been handled is done in C;
 none of that loop in asm glue nonsense anymore.
   * prevention of double restarts is *also* there, TYVM.
   * do_work_pending() is called with interrupts disabled.
 It may return 0, in which case we are done, interrupts are disabled
 and the caller should proceed to userland without reenabling them
 until it leaves.  Otherwise we have a syscall restart to handle and
 no userland signal handler had been invoked.  Interrupts are enabled
 and we should simply reload arguments and syscall number from pt_regs
 and proceed to syscall entry, without returning to userland.  The 
 only twist is that negative return value means ERESTART_RESTARTBLOCK 
 kind of restart, in which case we need to use __NR_restart_syscall 
 for syscall number.
 
 Note that we do *not* go through return to userland and reentering 
 the kernel on handlerless syscall restarts.  S390 uses the same 
 model, but there it's done in assembler glue - for no good reason. 
  Should be in straight C.
 
 For alpha there's another twist, though - there we do _not_ save all
 registers in pt_regs; there's a fairly large chunk of callee-saved
 registers we don't need to protect from being messed by C parts of
 the kernel.  We do need to save them in sigcontext, though.  So alpha
 (and quite a few other architctures) has separate struct switch_stack
 
 (named so since switch_to() needs to save/restore the same registers)
 . Rules:  * on fork() et.al. we save those callee-saved registers in 
 struct switch_stack, right next to pt_regs.  We do that before 
 calling the actual sys_fork() and have copy_thread() copy these guys 
 into child.  Remember that newborns are first woken up in ret_from_fork
 and as with all context switches they go through switch_to().  So these
 registers are restored by the time the sucker wakes up.
   * on signal delivery we save those registers in struct switch_stack
 and use it, along with pt_regs it lives next to, to fill sigcontext.
   * ptrace counts on those suckers being next to pt_regs.  That allows
 tracer to modify tracee's registers, including callee-saved ones.  
 So we
 (1) restore them from switch_stack once we are done with do_signal() 
 and
 (2) save/restore them around another place where we can get stopped for
 tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
   * on sigreturn/rt_sigreturn we need to restore all registers.
 So we reserve switch_stack on stack, next to pt_regs and have the C 
 part of sigreturn fill those along with pt_regs.  Once we are done,
  read those registers from switch_stack.
 
 That's more or less it; many other architectures are doing more or less
 similar things, but not all of them put that stuff into separate structure.
 
 E.g. another valid solution is to leave space in pt_regs, fill only 
 a subset on entry and have switch_to() save stuff in task_struct 
 instead of putting it on kernel stack.
 
 What it means for us is that saving all that crap on stack should *not*
 be done unless we have work to do.  OTOH, in situations when we have
 more than one pending signal it's bloody dumb to save/restore around
 each do_notify_resume() call separately.  OTTH, in situation when 
 we'd run out of timeslice and had nothing arrive until we'd regained 
 CPU save/restore around schedule() is pointless at the very least. 
  So for things like alpha I'd do this:
 
   interrupts disabled
   check thread flags
   no work to do = bugger off to userland
   just NEED_RESCHED?
   schedule()
   reread thread flags
   no work to do = bugger off to userland
   save callee-saved registers
   call do_work_pending
   restore callee-saved registers
   if do_work_pendign returned 0 = bugger off to userland
   deal with handlerless restart
 
 Note that the loop around do_signal() and friends is in C and is fairly
 similar to what we've got on ARM.  x86 is in intermediate situation -
 the main complication there is v86 crap.
 
 I'd say that for now your variant should do, but we really need to 
 get that crap under control 

Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Dialup Jon Norstog
Hello!  I'm an Alpha user - I just want to thank you all for working to keep
Linux current on this architecture.  I am still using the last working Alpha
Core release ... I hope to keep the old beast running for many more years!

Jon Norstog

www.thursdaybicycles.com


On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote
 On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:
 
   Um...  There's a bunch of architectures that are in the same situation.
   grep for do_notify_resume() and you'll see...
  
  And every do_notify_resume() should be changed anyway, do_signal() and
  tracehook_notify_resume() should be re-ordered.
 
 There's a bit more to it.  The thing is, we have quite a mess around
 the signal-handling loops, mixed with that regarding the signal restarts.
 On arm it's done about right by now:
   * looping until all signals had been handled is done in C;
 none of that loop in asm glue nonsense anymore.
   * prevention of double restarts is *also* there, TYVM.
   * do_work_pending() is called with interrupts disabled.
 It may return 0, in which case we are done, interrupts are disabled
 and the caller should proceed to userland without reenabling them
 until it leaves.  Otherwise we have a syscall restart to handle and
 no userland signal handler had been invoked.  Interrupts are enabled
 and we should simply reload arguments and syscall number from pt_regs
 and proceed to syscall entry, without returning to userland.  The 
 only twist is that negative return value means ERESTART_RESTARTBLOCK 
 kind of restart, in which case we need to use __NR_restart_syscall 
 for syscall number.
 
 Note that we do *not* go through return to userland and reentering 
 the kernel on handlerless syscall restarts.  S390 uses the same 
 model, but there it's done in assembler glue - for no good reason. 
  Should be in straight C.
 
 For alpha there's another twist, though - there we do _not_ save all
 registers in pt_regs; there's a fairly large chunk of callee-saved
 registers we don't need to protect from being messed by C parts of
 the kernel.  We do need to save them in sigcontext, though.  So alpha
 (and quite a few other architctures) has separate struct switch_stack
 
 (named so since switch_to() needs to save/restore the same registers)
 . Rules:  * on fork() et.al. we save those callee-saved registers in 
 struct switch_stack, right next to pt_regs.  We do that before 
 calling the actual sys_fork() and have copy_thread() copy these guys 
 into child.  Remember that newborns are first woken up in ret_from_fork
 and as with all context switches they go through switch_to().  So these
 registers are restored by the time the sucker wakes up.
   * on signal delivery we save those registers in struct switch_stack
 and use it, along with pt_regs it lives next to, to fill sigcontext.
   * ptrace counts on those suckers being next to pt_regs.  That allows
 tracer to modify tracee's registers, including callee-saved ones.  
 So we
 (1) restore them from switch_stack once we are done with do_signal() 
 and
 (2) save/restore them around another place where we can get stopped for
 tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
   * on sigreturn/rt_sigreturn we need to restore all registers.
 So we reserve switch_stack on stack, next to pt_regs and have the C 
 part of sigreturn fill those along with pt_regs.  Once we are done,
  read those registers from switch_stack.
 
 That's more or less it; many other architectures are doing more or less
 similar things, but not all of them put that stuff into separate structure.
 
 E.g. another valid solution is to leave space in pt_regs, fill only 
 a subset on entry and have switch_to() save stuff in task_struct 
 instead of putting it on kernel stack.
 
 What it means for us is that saving all that crap on stack should *not*
 be done unless we have work to do.  OTOH, in situations when we have
 more than one pending signal it's bloody dumb to save/restore around
 each do_notify_resume() call separately.  OTTH, in situation when 
 we'd run out of timeslice and had nothing arrive until we'd regained 
 CPU save/restore around schedule() is pointless at the very least. 
  So for things like alpha I'd do this:
 
   interrupts disabled
   check thread flags
   no work to do = bugger off to userland
   just NEED_RESCHED?
   schedule()
   reread thread flags
   no work to do = bugger off to userland
   save callee-saved registers
   call do_work_pending
   restore callee-saved registers
   if do_work_pendign returned 0 = bugger off to userland
   deal with handlerless restart
 
 Note that the loop around do_signal() and friends is in C and is fairly
 similar to what we've got on ARM.  x86 is in intermediate situation -
 the main complication there is v86 crap.
 
 I'd say that for now your variant should do, but we really need to 
 get that crap under control 

Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Oleg Nesterov
On 10/07, Al Viro wrote:

 On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:

   Um...  There's a bunch of architectures that are in the same situation.
   grep for do_notify_resume() and you'll see...
 
  And every do_notify_resume() should be changed anyway, do_signal() and
  tracehook_notify_resume() should be re-ordered.

 There's a bit more to it.
 [...big snip...]

Thanks Al. I need to read your email carefully.

But what do you think we should do right now to fix this particular
bug?

So far I am going to send the patch below. This was always wrong,
even before task_works were added. key_replace_session_keyring()
was might_sleep() too but it also did lock_irq/unlock_irq, so this
was not noticed before.

Or do you think we should add local_irq_enable() into
arch/alpha/kernel/signal.c:do_notify_resume() before
tracehook_notify_resume and wait for other similar report?

Oleg.
---

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 65bd3c9..fd18bd7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -51,6 +51,10 @@ void task_work_run(void)
struct task_struct *task = current;
struct callback_head *work, *head, *next;
 
+   if (WARN_ONCE(irqs_disabled(),
+ do_notify_resume() with irqs disabled, fix this arch.))
+   local_irq_enable();
+
for (;;) {
/*
 * work-func() can do task_work_add(), do not set

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Geert Uytterhoeven
On Sun, Oct 7, 2012 at 9:16 PM, Oleg Nesterov o...@redhat.com wrote:
 On 10/07, Thorsten Kranzkowski wrote:
 On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
  On 10/07, Oleg Nesterov wrote:
   Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
   But _it seems_ to me that do_notify_resume() is called with irqs 
   disabled.
   If this is true, then imho arch/alpha should be fixed.
  
   Before this commit task_work_run() enabled irqs, but this was the side
   effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.
 
  Could you please test the debugging patch below?

 Of course. With that patch applied the kernel (ac3d0da) boots again. The 
 trace line
 is printed about once a second, with values '2' and '4'.

 Thanks a lot Thorsten!

 So I'll probably send the patch which enables interrupts in
 task_work_run(). I guess this needs if (irqs_disabled())
 for lockdep.

 The question is, should I add the warning to remind that this
 arch needs a fix?

Just wondering. As this is on an SMP system, perhaps the
read_barrier_depends() vs. smp_read_barrier_depends() matters
here?
http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-08 Thread Oleg Nesterov
On 10/08, Geert Uytterhoeven wrote:

 Just wondering. As this is on an SMP system, perhaps the
 read_barrier_depends() vs. smp_read_barrier_depends() matters
 here?
 http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html

Yes, thanks, I do remember about this ;)

This will come as 2/2 which also removes the unnecessary work = NULL
initialization.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Thorsten Kranzkowski
On Sun, Oct 07, 2012 at 09:16:27PM +0200, Oleg Nesterov wrote:
> On 10/07, Thorsten Kranzkowski wrote:
> > On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
> > > On 10/07, Oleg Nesterov wrote:
> > > >
> > > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> > > > But _it seems_ to me that do_notify_resume() is called with irqs 
> > > > disabled.
> > > > If this is true, then imho arch/alpha should be fixed.
> > > >
> > > > Before this commit task_work_run() enabled irqs, but this was the "side
> > > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.
> > >
> > > Could you please test the debugging patch below?
> >
> > Of course. With that patch applied the kernel (ac3d0da) boots again. The 
> > trace line
> > is printed about once a second, with values '2' and '4'.
> 
> Thanks a lot Thorsten!
> 
> So I'll probably send the patch which enables interrupts in
> task_work_run(). I guess this needs "if (irqs_disabled())"
> for lockdep.
> 
> The question is, should I add the warning to remind that this
> arch needs a fix?

If you do, then please warn only once, otherwise there will be a lot of
warnings :-)
 
Thorsten
 

-- 
| Thorsten KranzkowskiInternet: dl8...@dl8bcu.de  |
| Mobile: ++49 170 1876134   Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] |
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Al Viro
On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:

> > Um...  There's a bunch of architectures that are in the same situation.
> > grep for do_notify_resume() and you'll see...
> 
> And every do_notify_resume() should be changed anyway, do_signal() and
> tracehook_notify_resume() should be re-ordered.

There's a bit more to it.  The thing is, we have quite a mess around
the signal-handling loops, mixed with that regarding the signal restarts.
On arm it's done about right by now:
* looping until all signals had been handled is done in C;
none of that "loop in asm glue" nonsense anymore.
* prevention of double restarts is *also* there, TYVM.
* do_work_pending() is called with interrupts disabled.
It may return 0, in which case we are done, interrupts are disabled
and the caller should proceed to userland without reenabling them
until it leaves.  Otherwise we have a syscall restart to handle and
no userland signal handler had been invoked.  Interrupts are enabled
and we should simply reload arguments and syscall number from pt_regs
and proceed to syscall entry, without returning to userland.  The only
twist is that negative return value means ERESTART_RESTARTBLOCK kind
of restart, in which case we need to use __NR_restart_syscall for
syscall number.

Note that we do *not* go through return to userland and reentering the
kernel on handlerless syscall restarts.  S390 uses the same model, but
there it's done in assembler glue - for no good reason.  Should be in
straight C.

For alpha there's another twist, though - there we do _not_ save all
registers in pt_regs; there's a fairly large chunk of callee-saved
registers we don't need to protect from being messed by C parts of
the kernel.  We do need to save them in sigcontext, though.  So alpha
(and quite a few other architctures) has separate struct switch_stack
(named so since switch_to() needs to save/restore the same registers).
Rules:
* on fork() et.al. we save those callee-saved registers in
struct switch_stack, right next to pt_regs.  We do that before calling
the actual sys_fork() and have copy_thread() copy these guys into
child.  Remember that newborns are first woken up in ret_from_fork
and as with all context switches they go through switch_to().  So these
registers are restored by the time the sucker wakes up.
* on signal delivery we save those registers in struct switch_stack
and use it, along with pt_regs it lives next to, to fill sigcontext.
* ptrace counts on those suckers being next to pt_regs.  That allows
tracer to modify tracee's registers, including callee-saved ones.  So we
(1) restore them from switch_stack once we are done with do_signal() and
(2) save/restore them around another place where we can get stopped for
tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
* on sigreturn/rt_sigreturn we need to restore all registers.
So we reserve switch_stack on stack, next to pt_regs and have the C part of
sigreturn fill those along with pt_regs.  Once we are done, read those
registers from switch_stack.

That's more or less it; many other architectures are doing more or less
similar things, but not all of them put that stuff into separate structure.
E.g. another valid solution is to leave space in pt_regs, fill only a subset
on entry and have switch_to() save stuff in task_struct instead of putting
it on kernel stack.

What it means for us is that saving all that crap on stack should *not*
be done unless we have work to do.  OTOH, in situations when we have
more than one pending signal it's bloody dumb to save/restore around
each do_notify_resume() call separately.  OTTH, in situation when we'd
run out of timeslice and had nothing arrive until we'd regained CPU
save/restore around schedule() is pointless at the very least.  So for
things like alpha I'd do this:

interrupts disabled
check thread flags
no work to do => bugger off to userland
just NEED_RESCHED?
schedule()
reread thread flags
no work to do => bugger off to userland
save callee-saved registers
call do_work_pending
restore callee-saved registers
if do_work_pendign returned 0 => bugger off to userland
deal with handlerless restart

Note that the loop around do_signal() and friends is in C and is fairly
similar to what we've got on ARM.  x86 is in intermediate situation -
the main complication there is v86 crap.

I'd say that for now your variant should do, but we really need to get
that crap under control and out of asm glue.  Are you willing to participate?
Guys, we need a way to do cross-architecture work without going insane.
I've spent quite a bit of time this year crawling through that stuff.
And yes, it's getting better as the result, but it's not sustainable -
I have VFS work to do, after all.

Basically, we need more people willing to take part in that; ideally -

Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Thorsten Kranzkowski wrote:
>
> On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
> > On 10/07, Oleg Nesterov wrote:
> > >
> > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> > > But _it seems_ to me that do_notify_resume() is called with irqs disabled.
> > > If this is true, then imho arch/alpha should be fixed.
> > >
> > > Before this commit task_work_run() enabled irqs, but this was the "side
> > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.
> >
> > Could you please test the debugging patch below?
>
> Of course. With that patch applied the kernel (ac3d0da) boots again. The 
> trace line
> is printed about once a second, with values '2' and '4'.

Thanks a lot Thorsten!

So I'll probably send the patch which enables interrupts in
task_work_run(). I guess this needs "if (irqs_disabled())"
for lockdep.

The question is, should I add the warning to remind that this
arch needs a fix?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Thorsten Kranzkowski
On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
> On 10/07, Oleg Nesterov wrote:
> >
> > Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> > But _it seems_ to me that do_notify_resume() is called with irqs disabled.
> > If this is true, then imho arch/alpha should be fixed.
> >
> > Before this commit task_work_run() enabled irqs, but this was the "side
> > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.
> 
> Could you please test the debugging patch below?

Of course. With that patch applied the kernel (ac3d0da) boots again. The trace 
line
is printed about once a second, with values '2' and '4'.

Thorsten

> 
> Oleg.
> 
> 
> --- x/arch/alpha/kernel/signal.c
> +++ x/arch/alpha/kernel/signal.c
> @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct 
>   ptrace_set_bpt(current);/* re-set breakpoint */
>  }
>  
> +#include 
> +
>  void
>  do_notify_resume(struct pt_regs *regs, struct switch_stack *sw,
>unsigned long thread_info_flags,
>unsigned long r0, unsigned long r19)
>  {
> + if (irqs_disabled()) {
> + printk_ratelimited(KERN_WARNING
> + "NOTIFY with irqs_disabled:%lx\n", thread_info_flags);
> + local_irq_enable();
> + }
> +
>   if (thread_info_flags & _TIF_SIGPENDING)
>   do_signal(regs, sw, r0, r19);
>  
> 

-- 
| Thorsten KranzkowskiInternet: dl8...@dl8bcu.de  |
| Mobile: ++49 170 1876134   Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] |
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Al Viro wrote:
>
> On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote:
>
> > Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> > But _it seems_ to me that do_notify_resume() is called with irqs disabled.
> > If this is true, then imho arch/alpha should be fixed.
> >
> > Before this commit task_work_run() enabled irqs, but this was the "side
> > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.
>
> Um...  There's a bunch of architectures that are in the same situation.
> grep for do_notify_resume() and you'll see...

And every do_notify_resume() should be changed anyway, do_signal() and
tracehook_notify_resume() should be re-ordered.

> It needs to be dealt with sanely, and actually have patches for alpha
> going in that direction, but breaking a bunch of architectures is not a good
> thing, obviously.  So you've bought yourself a major PITA for coming
> weeks...

So perhaps the patch below until they are fixed?


--- x/kernel/task_work.c
+++ x/kernel/task_work.c
@@ -51,6 +51,9 @@ void task_work_run(void)
struct task_struct *task = current;
struct callback_head *work, *head, *next;
 
+   if (WARN_ONCE(irqs_disabled(), "notify_resume() with irqs_disabled"))
+   local_irq_enable();
+
for (;;) {
/*
 * work->func() can do task_work_add(), do not set

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Oleg Nesterov wrote:
>
> Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> But _it seems_ to me that do_notify_resume() is called with irqs disabled.
> If this is true, then imho arch/alpha should be fixed.
>
> Before this commit task_work_run() enabled irqs, but this was the "side
> effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.

Could you please test the debugging patch below?

Oleg.


--- x/arch/alpha/kernel/signal.c
+++ x/arch/alpha/kernel/signal.c
@@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct 
ptrace_set_bpt(current);/* re-set breakpoint */
 }
 
+#include 
+
 void
 do_notify_resume(struct pt_regs *regs, struct switch_stack *sw,
 unsigned long thread_info_flags,
 unsigned long r0, unsigned long r19)
 {
+   if (irqs_disabled()) {
+   printk_ratelimited(KERN_WARNING
+   "NOTIFY with irqs_disabled:%lx\n", thread_info_flags);
+   local_irq_enable();
+   }
+
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs, sw, r0, r19);
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Al Viro
On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote:

> Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
> But _it seems_ to me that do_notify_resume() is called with irqs disabled.
> If this is true, then imho arch/alpha should be fixed.
> 
> Before this commit task_work_run() enabled irqs, but this was the "side
> effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this.

Um...  There's a bunch of architectures that are in the same situation.
grep for do_notify_resume() and you'll see...

It needs to be dealt with sanely, and actually have patches for alpha
going in that direction, but breaking a bunch of architectures is not a good
thing, obviously.  So you've bought yourself a major PITA for coming
weeks...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Al Viro
On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote:

 Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
 But _it seems_ to me that do_notify_resume() is called with irqs disabled.
 If this is true, then imho arch/alpha should be fixed.
 
 Before this commit task_work_run() enabled irqs, but this was the side
 effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.

Um...  There's a bunch of architectures that are in the same situation.
grep for do_notify_resume() and you'll see...

It needs to be dealt with sanely, and actually have patches for alpha
going in that direction, but breaking a bunch of architectures is not a good
thing, obviously.  So you've bought yourself a major PITA for coming
weeks...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Oleg Nesterov wrote:

 Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
 But _it seems_ to me that do_notify_resume() is called with irqs disabled.
 If this is true, then imho arch/alpha should be fixed.

 Before this commit task_work_run() enabled irqs, but this was the side
 effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.

Could you please test the debugging patch below?

Oleg.


--- x/arch/alpha/kernel/signal.c
+++ x/arch/alpha/kernel/signal.c
@@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct 
ptrace_set_bpt(current);/* re-set breakpoint */
 }
 
+#include linux/ratelimit.h
+
 void
 do_notify_resume(struct pt_regs *regs, struct switch_stack *sw,
 unsigned long thread_info_flags,
 unsigned long r0, unsigned long r19)
 {
+   if (irqs_disabled()) {
+   printk_ratelimited(KERN_WARNING
+   NOTIFY with irqs_disabled:%lx\n, thread_info_flags);
+   local_irq_enable();
+   }
+
if (thread_info_flags  _TIF_SIGPENDING)
do_signal(regs, sw, r0, r19);
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Al Viro wrote:

 On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote:

  Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
  But _it seems_ to me that do_notify_resume() is called with irqs disabled.
  If this is true, then imho arch/alpha should be fixed.
 
  Before this commit task_work_run() enabled irqs, but this was the side
  effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.

 Um...  There's a bunch of architectures that are in the same situation.
 grep for do_notify_resume() and you'll see...

And every do_notify_resume() should be changed anyway, do_signal() and
tracehook_notify_resume() should be re-ordered.

 It needs to be dealt with sanely, and actually have patches for alpha
 going in that direction, but breaking a bunch of architectures is not a good
 thing, obviously.  So you've bought yourself a major PITA for coming
 weeks...

So perhaps the patch below until they are fixed?


--- x/kernel/task_work.c
+++ x/kernel/task_work.c
@@ -51,6 +51,9 @@ void task_work_run(void)
struct task_struct *task = current;
struct callback_head *work, *head, *next;
 
+   if (WARN_ONCE(irqs_disabled(), notify_resume() with irqs_disabled))
+   local_irq_enable();
+
for (;;) {
/*
 * work-func() can do task_work_add(), do not set

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Thorsten Kranzkowski
On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
 On 10/07, Oleg Nesterov wrote:
 
  Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
  But _it seems_ to me that do_notify_resume() is called with irqs disabled.
  If this is true, then imho arch/alpha should be fixed.
 
  Before this commit task_work_run() enabled irqs, but this was the side
  effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.
 
 Could you please test the debugging patch below?

Of course. With that patch applied the kernel (ac3d0da) boots again. The trace 
line
is printed about once a second, with values '2' and '4'.

Thorsten

 
 Oleg.
 
 
 --- x/arch/alpha/kernel/signal.c
 +++ x/arch/alpha/kernel/signal.c
 @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct 
   ptrace_set_bpt(current);/* re-set breakpoint */
  }
  
 +#include linux/ratelimit.h
 +
  void
  do_notify_resume(struct pt_regs *regs, struct switch_stack *sw,
unsigned long thread_info_flags,
unsigned long r0, unsigned long r19)
  {
 + if (irqs_disabled()) {
 + printk_ratelimited(KERN_WARNING
 + NOTIFY with irqs_disabled:%lx\n, thread_info_flags);
 + local_irq_enable();
 + }
 +
   if (thread_info_flags  _TIF_SIGPENDING)
   do_signal(regs, sw, r0, r19);
  
 

-- 
| Thorsten KranzkowskiInternet: dl8...@dl8bcu.de  |
| Mobile: ++49 170 1876134   Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] |
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Oleg Nesterov
On 10/07, Thorsten Kranzkowski wrote:

 On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
  On 10/07, Oleg Nesterov wrote:
  
   Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
   But _it seems_ to me that do_notify_resume() is called with irqs disabled.
   If this is true, then imho arch/alpha should be fixed.
  
   Before this commit task_work_run() enabled irqs, but this was the side
   effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.
 
  Could you please test the debugging patch below?

 Of course. With that patch applied the kernel (ac3d0da) boots again. The 
 trace line
 is printed about once a second, with values '2' and '4'.

Thanks a lot Thorsten!

So I'll probably send the patch which enables interrupts in
task_work_run(). I guess this needs if (irqs_disabled())
for lockdep.

The question is, should I add the warning to remind that this
arch needs a fix?

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Al Viro
On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote:

  Um...  There's a bunch of architectures that are in the same situation.
  grep for do_notify_resume() and you'll see...
 
 And every do_notify_resume() should be changed anyway, do_signal() and
 tracehook_notify_resume() should be re-ordered.

There's a bit more to it.  The thing is, we have quite a mess around
the signal-handling loops, mixed with that regarding the signal restarts.
On arm it's done about right by now:
* looping until all signals had been handled is done in C;
none of that loop in asm glue nonsense anymore.
* prevention of double restarts is *also* there, TYVM.
* do_work_pending() is called with interrupts disabled.
It may return 0, in which case we are done, interrupts are disabled
and the caller should proceed to userland without reenabling them
until it leaves.  Otherwise we have a syscall restart to handle and
no userland signal handler had been invoked.  Interrupts are enabled
and we should simply reload arguments and syscall number from pt_regs
and proceed to syscall entry, without returning to userland.  The only
twist is that negative return value means ERESTART_RESTARTBLOCK kind
of restart, in which case we need to use __NR_restart_syscall for
syscall number.

Note that we do *not* go through return to userland and reentering the
kernel on handlerless syscall restarts.  S390 uses the same model, but
there it's done in assembler glue - for no good reason.  Should be in
straight C.

For alpha there's another twist, though - there we do _not_ save all
registers in pt_regs; there's a fairly large chunk of callee-saved
registers we don't need to protect from being messed by C parts of
the kernel.  We do need to save them in sigcontext, though.  So alpha
(and quite a few other architctures) has separate struct switch_stack
(named so since switch_to() needs to save/restore the same registers).
Rules:
* on fork() et.al. we save those callee-saved registers in
struct switch_stack, right next to pt_regs.  We do that before calling
the actual sys_fork() and have copy_thread() copy these guys into
child.  Remember that newborns are first woken up in ret_from_fork
and as with all context switches they go through switch_to().  So these
registers are restored by the time the sucker wakes up.
* on signal delivery we save those registers in struct switch_stack
and use it, along with pt_regs it lives next to, to fill sigcontext.
* ptrace counts on those suckers being next to pt_regs.  That allows
tracer to modify tracee's registers, including callee-saved ones.  So we
(1) restore them from switch_stack once we are done with do_signal() and
(2) save/restore them around another place where we can get stopped for
tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling.
* on sigreturn/rt_sigreturn we need to restore all registers.
So we reserve switch_stack on stack, next to pt_regs and have the C part of
sigreturn fill those along with pt_regs.  Once we are done, read those
registers from switch_stack.

That's more or less it; many other architectures are doing more or less
similar things, but not all of them put that stuff into separate structure.
E.g. another valid solution is to leave space in pt_regs, fill only a subset
on entry and have switch_to() save stuff in task_struct instead of putting
it on kernel stack.

What it means for us is that saving all that crap on stack should *not*
be done unless we have work to do.  OTOH, in situations when we have
more than one pending signal it's bloody dumb to save/restore around
each do_notify_resume() call separately.  OTTH, in situation when we'd
run out of timeslice and had nothing arrive until we'd regained CPU
save/restore around schedule() is pointless at the very least.  So for
things like alpha I'd do this:

interrupts disabled
check thread flags
no work to do = bugger off to userland
just NEED_RESCHED?
schedule()
reread thread flags
no work to do = bugger off to userland
save callee-saved registers
call do_work_pending
restore callee-saved registers
if do_work_pendign returned 0 = bugger off to userland
deal with handlerless restart

Note that the loop around do_signal() and friends is in C and is fairly
similar to what we've got on ARM.  x86 is in intermediate situation -
the main complication there is v86 crap.

I'd say that for now your variant should do, but we really need to get
that crap under control and out of asm glue.  Are you willing to participate?
Guys, we need a way to do cross-architecture work without going insane.
I've spent quite a bit of time this year crawling through that stuff.
And yes, it's getting better as the result, but it's not sustainable -
I have VFS work to do, after all.

Basically, we need more people willing to take part in that; ideally -
architecture 

Re: [regression] boot failure on alpha, bisected

2012-10-07 Thread Thorsten Kranzkowski
On Sun, Oct 07, 2012 at 09:16:27PM +0200, Oleg Nesterov wrote:
 On 10/07, Thorsten Kranzkowski wrote:
  On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote:
   On 10/07, Oleg Nesterov wrote:
   
Hmm. I know nothing about arch/alpha and I can't understand its entry.S.
But _it seems_ to me that do_notify_resume() is called with irqs 
disabled.
If this is true, then imho arch/alpha should be fixed.
   
Before this commit task_work_run() enabled irqs, but this was the side
effect of spin_lock_irq/spin_unlock_irq, we should not rely on this.
  
   Could you please test the debugging patch below?
 
  Of course. With that patch applied the kernel (ac3d0da) boots again. The 
  trace line
  is printed about once a second, with values '2' and '4'.
 
 Thanks a lot Thorsten!
 
 So I'll probably send the patch which enables interrupts in
 task_work_run(). I guess this needs if (irqs_disabled())
 for lockdep.
 
 The question is, should I add the warning to remind that this
 arch needs a fix?

If you do, then please warn only once, otherwise there will be a lot of
warnings :-)
 
Thorsten
 

-- 
| Thorsten KranzkowskiInternet: dl8...@dl8bcu.de  |
| Mobile: ++49 170 1876134   Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] |
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/