Re: [regression] boot failure on alpha, bisected
On 10/08, Geert Uytterhoeven wrote: > > Just wondering. As this is on an SMP system, perhaps the > read_barrier_depends() vs. smp_read_barrier_depends() matters > here? > http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html Yes, thanks, I do remember about this ;) This will come as 2/2 which also removes the unnecessary "work = NULL" initialization. Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 7, 2012 at 9:16 PM, Oleg Nesterov wrote: > On 10/07, Thorsten Kranzkowski wrote: >> On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: >> > On 10/07, Oleg Nesterov wrote: >> > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. >> > > But _it seems_ to me that do_notify_resume() is called with irqs >> > > disabled. >> > > If this is true, then imho arch/alpha should be fixed. >> > > >> > > Before this commit task_work_run() enabled irqs, but this was the "side >> > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. >> > >> > Could you please test the debugging patch below? >> >> Of course. With that patch applied the kernel (ac3d0da) boots again. The >> trace line >> is printed about once a second, with values '2' and '4'. > > Thanks a lot Thorsten! > > So I'll probably send the patch which enables interrupts in > task_work_run(). I guess this needs "if (irqs_disabled())" > for lockdep. > > The question is, should I add the warning to remind that this > arch needs a fix? Just wondering. As this is on an SMP system, perhaps the read_barrier_depends() vs. smp_read_barrier_depends() matters here? http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Al Viro wrote: > > On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: > > > > Um... There's a bunch of architectures that are in the same situation. > > > grep for do_notify_resume() and you'll see... > > > > And every do_notify_resume() should be changed anyway, do_signal() and > > tracehook_notify_resume() should be re-ordered. > > There's a bit more to it. > [...big snip...] Thanks Al. I need to read your email carefully. But what do you think we should do right now to fix this particular bug? So far I am going to send the patch below. This was always wrong, even before task_works were added. key_replace_session_keyring() was might_sleep() too but it also did lock_irq/unlock_irq, so this was not noticed before. Or do you think we should add local_irq_enable() into arch/alpha/kernel/signal.c:do_notify_resume() before tracehook_notify_resume and wait for other similar report? Oleg. --- diff --git a/kernel/task_work.c b/kernel/task_work.c index 65bd3c9..fd18bd7 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -51,6 +51,10 @@ void task_work_run(void) struct task_struct *task = current; struct callback_head *work, *head, *next; + if (WARN_ONCE(irqs_disabled(), + "do_notify_resume() with irqs disabled, fix this arch.")) + local_irq_enable(); + for (;;) { /* * work->func() can do task_work_add(), do not set -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
Hello! I'm an Alpha user - I just want to thank you all for working to keep Linux current on this architecture. I am still using the last working Alpha Core release ... I hope to keep the old beast running for many more years! Jon Norstog www.thursdaybicycles.com On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote > On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: > > > > Um... There's a bunch of architectures that are in the same situation. > > > grep for do_notify_resume() and you'll see... > > > > And every do_notify_resume() should be changed anyway, do_signal() and > > tracehook_notify_resume() should be re-ordered. > > There's a bit more to it. The thing is, we have quite a mess around > the signal-handling loops, mixed with that regarding the signal restarts. > On arm it's done about right by now: > * looping until all signals had been handled is done in C; > none of that "loop in asm glue" nonsense anymore. > * prevention of double restarts is *also* there, TYVM. > * do_work_pending() is called with interrupts disabled. > It may return 0, in which case we are done, interrupts are disabled > and the caller should proceed to userland without reenabling them > until it leaves. Otherwise we have a syscall restart to handle and > no userland signal handler had been invoked. Interrupts are enabled > and we should simply reload arguments and syscall number from pt_regs > and proceed to syscall entry, without returning to userland. The > only twist is that negative return value means ERESTART_RESTARTBLOCK > kind of restart, in which case we need to use __NR_restart_syscall > for syscall number. > > Note that we do *not* go through return to userland and reentering > the kernel on handlerless syscall restarts. S390 uses the same > model, but there it's done in assembler glue - for no good reason. > Should be in straight C. > > For alpha there's another twist, though - there we do _not_ save all > registers in pt_regs; there's a fairly large chunk of callee-saved > registers we don't need to protect from being messed by C parts of > the kernel. We do need to save them in sigcontext, though. So alpha > (and quite a few other architctures) has separate struct switch_stack > > (named so since switch_to() needs to save/restore the same registers) > . Rules: * on fork() et.al. we save those callee-saved registers in > struct switch_stack, right next to pt_regs. We do that before > calling the actual sys_fork() and have copy_thread() copy these guys > into child. Remember that newborns are first woken up in ret_from_fork > and as with all context switches they go through switch_to(). So these > registers are restored by the time the sucker wakes up. > * on signal delivery we save those registers in struct switch_stack > and use it, along with pt_regs it lives next to, to fill sigcontext. > * ptrace counts on those suckers being next to pt_regs. That allows > tracer to modify tracee's registers, including callee-saved ones. > So we > (1) restore them from switch_stack once we are done with do_signal() > and > (2) save/restore them around another place where we can get stopped for > tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. > * on sigreturn/rt_sigreturn we need to restore all registers. > So we reserve switch_stack on stack, next to pt_regs and have the C > part of sigreturn fill those along with pt_regs. Once we are done, > read those registers from switch_stack. > > That's more or less it; many other architectures are doing more or less > similar things, but not all of them put that stuff into separate structure. > > E.g. another valid solution is to leave space in pt_regs, fill only > a subset on entry and have switch_to() save stuff in task_struct > instead of putting it on kernel stack. > > What it means for us is that saving all that crap on stack should *not* > be done unless we have work to do. OTOH, in situations when we have > more than one pending signal it's bloody dumb to save/restore around > each do_notify_resume() call separately. OTTH, in situation when > we'd run out of timeslice and had nothing arrive until we'd regained > CPU save/restore around schedule() is pointless at the very least. > So for things like alpha I'd do this: > > interrupts disabled > check thread flags > no work to do => bugger off to userland > just NEED_RESCHED? > schedule() > reread thread flags > no work to do => bugger off to userland > save callee-saved registers > call do_work_pending > restore callee-saved registers > if do_work_pendign returned 0 => bugger off to userland > deal with handlerless restart > > Note that the loop around do_signal() and friends is in C and is fairly > similar to what we've got on ARM. x86 is in intermediate situation - > the main complication there is v86 crap.
Re: [regression] boot failure on alpha, bisected
Hello! I'm an Alpha user - I just want to thank you ll for working to keep Linux current on this architecture. I am still using the last working Alpha Core release ... I hope to keep the old beast running for many more years! Jon Norstog www.thursdaybicycles.com On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote > On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: > > > > Um... There's a bunch of architectures that are in the same situation. > > > grep for do_notify_resume() and you'll see... > > > > And every do_notify_resume() should be changed anyway, do_signal() and > > tracehook_notify_resume() should be re-ordered. > > There's a bit more to it. The thing is, we have quite a mess around > the signal-handling loops, mixed with that regarding the signal restarts. > On arm it's done about right by now: > * looping until all signals had been handled is done in C; > none of that "loop in asm glue" nonsense anymore. > * prevention of double restarts is *also* there, TYVM. > * do_work_pending() is called with interrupts disabled. > It may return 0, in which case we are done, interrupts are disabled > and the caller should proceed to userland without reenabling them > until it leaves. Otherwise we have a syscall restart to handle and > no userland signal handler had been invoked. Interrupts are enabled > and we should simply reload arguments and syscall number from pt_regs > and proceed to syscall entry, without returning to userland. The > only twist is that negative return value means ERESTART_RESTARTBLOCK > kind of restart, in which case we need to use __NR_restart_syscall > for syscall number. > > Note that we do *not* go through return to userland and reentering > the kernel on handlerless syscall restarts. S390 uses the same > model, but there it's done in assembler glue - for no good reason. > Should be in straight C. > > For alpha there's another twist, though - there we do _not_ save all > registers in pt_regs; there's a fairly large chunk of callee-saved > registers we don't need to protect from being messed by C parts of > the kernel. We do need to save them in sigcontext, though. So alpha > (and quite a few other architctures) has separate struct switch_stack > > (named so since switch_to() needs to save/restore the same registers) > . Rules: * on fork() et.al. we save those callee-saved registers in > struct switch_stack, right next to pt_regs. We do that before > calling the actual sys_fork() and have copy_thread() copy these guys > into child. Remember that newborns are first woken up in ret_from_fork > and as with all context switches they go through switch_to(). So these > registers are restored by the time the sucker wakes up. > * on signal delivery we save those registers in struct switch_stack > and use it, along with pt_regs it lives next to, to fill sigcontext. > * ptrace counts on those suckers being next to pt_regs. That allows > tracer to modify tracee's registers, including callee-saved ones. > So we > (1) restore them from switch_stack once we are done with do_signal() > and > (2) save/restore them around another place where we can get stopped for > tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. > * on sigreturn/rt_sigreturn we need to restore all registers. > So we reserve switch_stack on stack, next to pt_regs and have the C > part of sigreturn fill those along with pt_regs. Once we are done, > read those registers from switch_stack. > > That's more or less it; many other architectures are doing more or less > similar things, but not all of them put that stuff into separate structure. > > E.g. another valid solution is to leave space in pt_regs, fill only > a subset on entry and have switch_to() save stuff in task_struct > instead of putting it on kernel stack. > > What it means for us is that saving all that crap on stack should *not* > be done unless we have work to do. OTOH, in situations when we have > more than one pending signal it's bloody dumb to save/restore around > each do_notify_resume() call separately. OTTH, in situation when > we'd run out of timeslice and had nothing arrive until we'd regained > CPU save/restore around schedule() is pointless at the very least. > So for things like alpha I'd do this: > > interrupts disabled > check thread flags > no work to do => bugger off to userland > just NEED_RESCHED? > schedule() > reread thread flags > no work to do => bugger off to userland > save callee-saved registers > call do_work_pending > restore callee-saved registers > if do_work_pendign returned 0 => bugger off to userland > deal with handlerless restart > > Note that the loop around do_signal() and friends is in C and is fairly > similar to what we've got on ARM. x86 is in intermediate situation - > the main complication there is v86 crap.
Re: [regression] boot failure on alpha, bisected
Hello! I'm an Alpha user - I just want to thank you ll for working to keep Linux current on this architecture. I am still using the last working Alpha Core release ... I hope to keep the old beast running for many more years! Jon Norstog www.thursdaybicycles.com On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. There's a bit more to it. The thing is, we have quite a mess around the signal-handling loops, mixed with that regarding the signal restarts. On arm it's done about right by now: * looping until all signals had been handled is done in C; none of that loop in asm glue nonsense anymore. * prevention of double restarts is *also* there, TYVM. * do_work_pending() is called with interrupts disabled. It may return 0, in which case we are done, interrupts are disabled and the caller should proceed to userland without reenabling them until it leaves. Otherwise we have a syscall restart to handle and no userland signal handler had been invoked. Interrupts are enabled and we should simply reload arguments and syscall number from pt_regs and proceed to syscall entry, without returning to userland. The only twist is that negative return value means ERESTART_RESTARTBLOCK kind of restart, in which case we need to use __NR_restart_syscall for syscall number. Note that we do *not* go through return to userland and reentering the kernel on handlerless syscall restarts. S390 uses the same model, but there it's done in assembler glue - for no good reason. Should be in straight C. For alpha there's another twist, though - there we do _not_ save all registers in pt_regs; there's a fairly large chunk of callee-saved registers we don't need to protect from being messed by C parts of the kernel. We do need to save them in sigcontext, though. So alpha (and quite a few other architctures) has separate struct switch_stack (named so since switch_to() needs to save/restore the same registers) . Rules: * on fork() et.al. we save those callee-saved registers in struct switch_stack, right next to pt_regs. We do that before calling the actual sys_fork() and have copy_thread() copy these guys into child. Remember that newborns are first woken up in ret_from_fork and as with all context switches they go through switch_to(). So these registers are restored by the time the sucker wakes up. * on signal delivery we save those registers in struct switch_stack and use it, along with pt_regs it lives next to, to fill sigcontext. * ptrace counts on those suckers being next to pt_regs. That allows tracer to modify tracee's registers, including callee-saved ones. So we (1) restore them from switch_stack once we are done with do_signal() and (2) save/restore them around another place where we can get stopped for tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. * on sigreturn/rt_sigreturn we need to restore all registers. So we reserve switch_stack on stack, next to pt_regs and have the C part of sigreturn fill those along with pt_regs. Once we are done, read those registers from switch_stack. That's more or less it; many other architectures are doing more or less similar things, but not all of them put that stuff into separate structure. E.g. another valid solution is to leave space in pt_regs, fill only a subset on entry and have switch_to() save stuff in task_struct instead of putting it on kernel stack. What it means for us is that saving all that crap on stack should *not* be done unless we have work to do. OTOH, in situations when we have more than one pending signal it's bloody dumb to save/restore around each do_notify_resume() call separately. OTTH, in situation when we'd run out of timeslice and had nothing arrive until we'd regained CPU save/restore around schedule() is pointless at the very least. So for things like alpha I'd do this: interrupts disabled check thread flags no work to do = bugger off to userland just NEED_RESCHED? schedule() reread thread flags no work to do = bugger off to userland save callee-saved registers call do_work_pending restore callee-saved registers if do_work_pendign returned 0 = bugger off to userland deal with handlerless restart Note that the loop around do_signal() and friends is in C and is fairly similar to what we've got on ARM. x86 is in intermediate situation - the main complication there is v86 crap. I'd say that for now your variant should do, but we really need to get that crap under control
Re: [regression] boot failure on alpha, bisected
Hello! I'm an Alpha user - I just want to thank you all for working to keep Linux current on this architecture. I am still using the last working Alpha Core release ... I hope to keep the old beast running for many more years! Jon Norstog www.thursdaybicycles.com On Sun, 7 Oct 2012 20:39:09 +0100, Al Viro wrote On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. There's a bit more to it. The thing is, we have quite a mess around the signal-handling loops, mixed with that regarding the signal restarts. On arm it's done about right by now: * looping until all signals had been handled is done in C; none of that loop in asm glue nonsense anymore. * prevention of double restarts is *also* there, TYVM. * do_work_pending() is called with interrupts disabled. It may return 0, in which case we are done, interrupts are disabled and the caller should proceed to userland without reenabling them until it leaves. Otherwise we have a syscall restart to handle and no userland signal handler had been invoked. Interrupts are enabled and we should simply reload arguments and syscall number from pt_regs and proceed to syscall entry, without returning to userland. The only twist is that negative return value means ERESTART_RESTARTBLOCK kind of restart, in which case we need to use __NR_restart_syscall for syscall number. Note that we do *not* go through return to userland and reentering the kernel on handlerless syscall restarts. S390 uses the same model, but there it's done in assembler glue - for no good reason. Should be in straight C. For alpha there's another twist, though - there we do _not_ save all registers in pt_regs; there's a fairly large chunk of callee-saved registers we don't need to protect from being messed by C parts of the kernel. We do need to save them in sigcontext, though. So alpha (and quite a few other architctures) has separate struct switch_stack (named so since switch_to() needs to save/restore the same registers) . Rules: * on fork() et.al. we save those callee-saved registers in struct switch_stack, right next to pt_regs. We do that before calling the actual sys_fork() and have copy_thread() copy these guys into child. Remember that newborns are first woken up in ret_from_fork and as with all context switches they go through switch_to(). So these registers are restored by the time the sucker wakes up. * on signal delivery we save those registers in struct switch_stack and use it, along with pt_regs it lives next to, to fill sigcontext. * ptrace counts on those suckers being next to pt_regs. That allows tracer to modify tracee's registers, including callee-saved ones. So we (1) restore them from switch_stack once we are done with do_signal() and (2) save/restore them around another place where we can get stopped for tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. * on sigreturn/rt_sigreturn we need to restore all registers. So we reserve switch_stack on stack, next to pt_regs and have the C part of sigreturn fill those along with pt_regs. Once we are done, read those registers from switch_stack. That's more or less it; many other architectures are doing more or less similar things, but not all of them put that stuff into separate structure. E.g. another valid solution is to leave space in pt_regs, fill only a subset on entry and have switch_to() save stuff in task_struct instead of putting it on kernel stack. What it means for us is that saving all that crap on stack should *not* be done unless we have work to do. OTOH, in situations when we have more than one pending signal it's bloody dumb to save/restore around each do_notify_resume() call separately. OTTH, in situation when we'd run out of timeslice and had nothing arrive until we'd regained CPU save/restore around schedule() is pointless at the very least. So for things like alpha I'd do this: interrupts disabled check thread flags no work to do = bugger off to userland just NEED_RESCHED? schedule() reread thread flags no work to do = bugger off to userland save callee-saved registers call do_work_pending restore callee-saved registers if do_work_pendign returned 0 = bugger off to userland deal with handlerless restart Note that the loop around do_signal() and friends is in C and is fairly similar to what we've got on ARM. x86 is in intermediate situation - the main complication there is v86 crap. I'd say that for now your variant should do, but we really need to get that crap under control
Re: [regression] boot failure on alpha, bisected
On 10/07, Al Viro wrote: On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. There's a bit more to it. [...big snip...] Thanks Al. I need to read your email carefully. But what do you think we should do right now to fix this particular bug? So far I am going to send the patch below. This was always wrong, even before task_works were added. key_replace_session_keyring() was might_sleep() too but it also did lock_irq/unlock_irq, so this was not noticed before. Or do you think we should add local_irq_enable() into arch/alpha/kernel/signal.c:do_notify_resume() before tracehook_notify_resume and wait for other similar report? Oleg. --- diff --git a/kernel/task_work.c b/kernel/task_work.c index 65bd3c9..fd18bd7 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -51,6 +51,10 @@ void task_work_run(void) struct task_struct *task = current; struct callback_head *work, *head, *next; + if (WARN_ONCE(irqs_disabled(), + do_notify_resume() with irqs disabled, fix this arch.)) + local_irq_enable(); + for (;;) { /* * work-func() can do task_work_add(), do not set -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 7, 2012 at 9:16 PM, Oleg Nesterov o...@redhat.com wrote: On 10/07, Thorsten Kranzkowski wrote: On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: On 10/07, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Of course. With that patch applied the kernel (ac3d0da) boots again. The trace line is printed about once a second, with values '2' and '4'. Thanks a lot Thorsten! So I'll probably send the patch which enables interrupts in task_work_run(). I guess this needs if (irqs_disabled()) for lockdep. The question is, should I add the warning to remind that this arch needs a fix? Just wondering. As this is on an SMP system, perhaps the read_barrier_depends() vs. smp_read_barrier_depends() matters here? http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say programmer or something like that. -- Linus Torvalds -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/08, Geert Uytterhoeven wrote: Just wondering. As this is on an SMP system, perhaps the read_barrier_depends() vs. smp_read_barrier_depends() matters here? http://lkml.indiana.edu/hypermail/linux/kernel/1209.3/00555.html Yes, thanks, I do remember about this ;) This will come as 2/2 which also removes the unnecessary work = NULL initialization. Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 09:16:27PM +0200, Oleg Nesterov wrote: > On 10/07, Thorsten Kranzkowski wrote: > > On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: > > > On 10/07, Oleg Nesterov wrote: > > > > > > > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > > > > But _it seems_ to me that do_notify_resume() is called with irqs > > > > disabled. > > > > If this is true, then imho arch/alpha should be fixed. > > > > > > > > Before this commit task_work_run() enabled irqs, but this was the "side > > > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. > > > > > > Could you please test the debugging patch below? > > > > Of course. With that patch applied the kernel (ac3d0da) boots again. The > > trace line > > is printed about once a second, with values '2' and '4'. > > Thanks a lot Thorsten! > > So I'll probably send the patch which enables interrupts in > task_work_run(). I guess this needs "if (irqs_disabled())" > for lockdep. > > The question is, should I add the warning to remind that this > arch needs a fix? If you do, then please warn only once, otherwise there will be a lot of warnings :-) Thorsten -- | Thorsten KranzkowskiInternet: dl8...@dl8bcu.de | | Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany | | Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] | -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: > > Um... There's a bunch of architectures that are in the same situation. > > grep for do_notify_resume() and you'll see... > > And every do_notify_resume() should be changed anyway, do_signal() and > tracehook_notify_resume() should be re-ordered. There's a bit more to it. The thing is, we have quite a mess around the signal-handling loops, mixed with that regarding the signal restarts. On arm it's done about right by now: * looping until all signals had been handled is done in C; none of that "loop in asm glue" nonsense anymore. * prevention of double restarts is *also* there, TYVM. * do_work_pending() is called with interrupts disabled. It may return 0, in which case we are done, interrupts are disabled and the caller should proceed to userland without reenabling them until it leaves. Otherwise we have a syscall restart to handle and no userland signal handler had been invoked. Interrupts are enabled and we should simply reload arguments and syscall number from pt_regs and proceed to syscall entry, without returning to userland. The only twist is that negative return value means ERESTART_RESTARTBLOCK kind of restart, in which case we need to use __NR_restart_syscall for syscall number. Note that we do *not* go through return to userland and reentering the kernel on handlerless syscall restarts. S390 uses the same model, but there it's done in assembler glue - for no good reason. Should be in straight C. For alpha there's another twist, though - there we do _not_ save all registers in pt_regs; there's a fairly large chunk of callee-saved registers we don't need to protect from being messed by C parts of the kernel. We do need to save them in sigcontext, though. So alpha (and quite a few other architctures) has separate struct switch_stack (named so since switch_to() needs to save/restore the same registers). Rules: * on fork() et.al. we save those callee-saved registers in struct switch_stack, right next to pt_regs. We do that before calling the actual sys_fork() and have copy_thread() copy these guys into child. Remember that newborns are first woken up in ret_from_fork and as with all context switches they go through switch_to(). So these registers are restored by the time the sucker wakes up. * on signal delivery we save those registers in struct switch_stack and use it, along with pt_regs it lives next to, to fill sigcontext. * ptrace counts on those suckers being next to pt_regs. That allows tracer to modify tracee's registers, including callee-saved ones. So we (1) restore them from switch_stack once we are done with do_signal() and (2) save/restore them around another place where we can get stopped for tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. * on sigreturn/rt_sigreturn we need to restore all registers. So we reserve switch_stack on stack, next to pt_regs and have the C part of sigreturn fill those along with pt_regs. Once we are done, read those registers from switch_stack. That's more or less it; many other architectures are doing more or less similar things, but not all of them put that stuff into separate structure. E.g. another valid solution is to leave space in pt_regs, fill only a subset on entry and have switch_to() save stuff in task_struct instead of putting it on kernel stack. What it means for us is that saving all that crap on stack should *not* be done unless we have work to do. OTOH, in situations when we have more than one pending signal it's bloody dumb to save/restore around each do_notify_resume() call separately. OTTH, in situation when we'd run out of timeslice and had nothing arrive until we'd regained CPU save/restore around schedule() is pointless at the very least. So for things like alpha I'd do this: interrupts disabled check thread flags no work to do => bugger off to userland just NEED_RESCHED? schedule() reread thread flags no work to do => bugger off to userland save callee-saved registers call do_work_pending restore callee-saved registers if do_work_pendign returned 0 => bugger off to userland deal with handlerless restart Note that the loop around do_signal() and friends is in C and is fairly similar to what we've got on ARM. x86 is in intermediate situation - the main complication there is v86 crap. I'd say that for now your variant should do, but we really need to get that crap under control and out of asm glue. Are you willing to participate? Guys, we need a way to do cross-architecture work without going insane. I've spent quite a bit of time this year crawling through that stuff. And yes, it's getting better as the result, but it's not sustainable - I have VFS work to do, after all. Basically, we need more people willing to take part in that; ideally -
Re: [regression] boot failure on alpha, bisected
On 10/07, Thorsten Kranzkowski wrote: > > On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: > > On 10/07, Oleg Nesterov wrote: > > > > > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > > > But _it seems_ to me that do_notify_resume() is called with irqs disabled. > > > If this is true, then imho arch/alpha should be fixed. > > > > > > Before this commit task_work_run() enabled irqs, but this was the "side > > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. > > > > Could you please test the debugging patch below? > > Of course. With that patch applied the kernel (ac3d0da) boots again. The > trace line > is printed about once a second, with values '2' and '4'. Thanks a lot Thorsten! So I'll probably send the patch which enables interrupts in task_work_run(). I guess this needs "if (irqs_disabled())" for lockdep. The question is, should I add the warning to remind that this arch needs a fix? Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: > On 10/07, Oleg Nesterov wrote: > > > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > > But _it seems_ to me that do_notify_resume() is called with irqs disabled. > > If this is true, then imho arch/alpha should be fixed. > > > > Before this commit task_work_run() enabled irqs, but this was the "side > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. > > Could you please test the debugging patch below? Of course. With that patch applied the kernel (ac3d0da) boots again. The trace line is printed about once a second, with values '2' and '4'. Thorsten > > Oleg. > > > --- x/arch/alpha/kernel/signal.c > +++ x/arch/alpha/kernel/signal.c > @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct > ptrace_set_bpt(current);/* re-set breakpoint */ > } > > +#include > + > void > do_notify_resume(struct pt_regs *regs, struct switch_stack *sw, >unsigned long thread_info_flags, >unsigned long r0, unsigned long r19) > { > + if (irqs_disabled()) { > + printk_ratelimited(KERN_WARNING > + "NOTIFY with irqs_disabled:%lx\n", thread_info_flags); > + local_irq_enable(); > + } > + > if (thread_info_flags & _TIF_SIGPENDING) > do_signal(regs, sw, r0, r19); > > -- | Thorsten KranzkowskiInternet: dl8...@dl8bcu.de | | Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany | | Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] | -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Al Viro wrote: > > On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote: > > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > > But _it seems_ to me that do_notify_resume() is called with irqs disabled. > > If this is true, then imho arch/alpha should be fixed. > > > > Before this commit task_work_run() enabled irqs, but this was the "side > > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. > > Um... There's a bunch of architectures that are in the same situation. > grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. > It needs to be dealt with sanely, and actually have patches for alpha > going in that direction, but breaking a bunch of architectures is not a good > thing, obviously. So you've bought yourself a major PITA for coming > weeks... So perhaps the patch below until they are fixed? --- x/kernel/task_work.c +++ x/kernel/task_work.c @@ -51,6 +51,9 @@ void task_work_run(void) struct task_struct *task = current; struct callback_head *work, *head, *next; + if (WARN_ONCE(irqs_disabled(), "notify_resume() with irqs_disabled")) + local_irq_enable(); + for (;;) { /* * work->func() can do task_work_add(), do not set -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Oleg Nesterov wrote: > > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > But _it seems_ to me that do_notify_resume() is called with irqs disabled. > If this is true, then imho arch/alpha should be fixed. > > Before this commit task_work_run() enabled irqs, but this was the "side > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Oleg. --- x/arch/alpha/kernel/signal.c +++ x/arch/alpha/kernel/signal.c @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct ptrace_set_bpt(current);/* re-set breakpoint */ } +#include + void do_notify_resume(struct pt_regs *regs, struct switch_stack *sw, unsigned long thread_info_flags, unsigned long r0, unsigned long r19) { + if (irqs_disabled()) { + printk_ratelimited(KERN_WARNING + "NOTIFY with irqs_disabled:%lx\n", thread_info_flags); + local_irq_enable(); + } + if (thread_info_flags & _TIF_SIGPENDING) do_signal(regs, sw, r0, r19); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote: > Hmm. I know nothing about arch/alpha and I can't understand its entry.S. > But _it seems_ to me that do_notify_resume() is called with irqs disabled. > If this is true, then imho arch/alpha should be fixed. > > Before this commit task_work_run() enabled irqs, but this was the "side > effect" of spin_lock_irq/spin_unlock_irq, we should not rely on this. Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... It needs to be dealt with sanely, and actually have patches for alpha going in that direction, but breaking a bunch of architectures is not a good thing, obviously. So you've bought yourself a major PITA for coming weeks... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... It needs to be dealt with sanely, and actually have patches for alpha going in that direction, but breaking a bunch of architectures is not a good thing, obviously. So you've bought yourself a major PITA for coming weeks... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Oleg. --- x/arch/alpha/kernel/signal.c +++ x/arch/alpha/kernel/signal.c @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct ptrace_set_bpt(current);/* re-set breakpoint */ } +#include linux/ratelimit.h + void do_notify_resume(struct pt_regs *regs, struct switch_stack *sw, unsigned long thread_info_flags, unsigned long r0, unsigned long r19) { + if (irqs_disabled()) { + printk_ratelimited(KERN_WARNING + NOTIFY with irqs_disabled:%lx\n, thread_info_flags); + local_irq_enable(); + } + if (thread_info_flags _TIF_SIGPENDING) do_signal(regs, sw, r0, r19); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Al Viro wrote: On Sun, Oct 07, 2012 at 06:55:34PM +0200, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. It needs to be dealt with sanely, and actually have patches for alpha going in that direction, but breaking a bunch of architectures is not a good thing, obviously. So you've bought yourself a major PITA for coming weeks... So perhaps the patch below until they are fixed? --- x/kernel/task_work.c +++ x/kernel/task_work.c @@ -51,6 +51,9 @@ void task_work_run(void) struct task_struct *task = current; struct callback_head *work, *head, *next; + if (WARN_ONCE(irqs_disabled(), notify_resume() with irqs_disabled)) + local_irq_enable(); + for (;;) { /* * work-func() can do task_work_add(), do not set -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: On 10/07, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Of course. With that patch applied the kernel (ac3d0da) boots again. The trace line is printed about once a second, with values '2' and '4'. Thorsten Oleg. --- x/arch/alpha/kernel/signal.c +++ x/arch/alpha/kernel/signal.c @@ -567,11 +567,19 @@ do_signal(struct pt_regs * regs, struct ptrace_set_bpt(current);/* re-set breakpoint */ } +#include linux/ratelimit.h + void do_notify_resume(struct pt_regs *regs, struct switch_stack *sw, unsigned long thread_info_flags, unsigned long r0, unsigned long r19) { + if (irqs_disabled()) { + printk_ratelimited(KERN_WARNING + NOTIFY with irqs_disabled:%lx\n, thread_info_flags); + local_irq_enable(); + } + if (thread_info_flags _TIF_SIGPENDING) do_signal(regs, sw, r0, r19); -- | Thorsten KranzkowskiInternet: dl8...@dl8bcu.de | | Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany | | Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] | -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On 10/07, Thorsten Kranzkowski wrote: On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: On 10/07, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Of course. With that patch applied the kernel (ac3d0da) boots again. The trace line is printed about once a second, with values '2' and '4'. Thanks a lot Thorsten! So I'll probably send the patch which enables interrupts in task_work_run(). I guess this needs if (irqs_disabled()) for lockdep. The question is, should I add the warning to remind that this arch needs a fix? Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 07:33:36PM +0200, Oleg Nesterov wrote: Um... There's a bunch of architectures that are in the same situation. grep for do_notify_resume() and you'll see... And every do_notify_resume() should be changed anyway, do_signal() and tracehook_notify_resume() should be re-ordered. There's a bit more to it. The thing is, we have quite a mess around the signal-handling loops, mixed with that regarding the signal restarts. On arm it's done about right by now: * looping until all signals had been handled is done in C; none of that loop in asm glue nonsense anymore. * prevention of double restarts is *also* there, TYVM. * do_work_pending() is called with interrupts disabled. It may return 0, in which case we are done, interrupts are disabled and the caller should proceed to userland without reenabling them until it leaves. Otherwise we have a syscall restart to handle and no userland signal handler had been invoked. Interrupts are enabled and we should simply reload arguments and syscall number from pt_regs and proceed to syscall entry, without returning to userland. The only twist is that negative return value means ERESTART_RESTARTBLOCK kind of restart, in which case we need to use __NR_restart_syscall for syscall number. Note that we do *not* go through return to userland and reentering the kernel on handlerless syscall restarts. S390 uses the same model, but there it's done in assembler glue - for no good reason. Should be in straight C. For alpha there's another twist, though - there we do _not_ save all registers in pt_regs; there's a fairly large chunk of callee-saved registers we don't need to protect from being messed by C parts of the kernel. We do need to save them in sigcontext, though. So alpha (and quite a few other architctures) has separate struct switch_stack (named so since switch_to() needs to save/restore the same registers). Rules: * on fork() et.al. we save those callee-saved registers in struct switch_stack, right next to pt_regs. We do that before calling the actual sys_fork() and have copy_thread() copy these guys into child. Remember that newborns are first woken up in ret_from_fork and as with all context switches they go through switch_to(). So these registers are restored by the time the sucker wakes up. * on signal delivery we save those registers in struct switch_stack and use it, along with pt_regs it lives next to, to fill sigcontext. * ptrace counts on those suckers being next to pt_regs. That allows tracer to modify tracee's registers, including callee-saved ones. So we (1) restore them from switch_stack once we are done with do_signal() and (2) save/restore them around another place where we can get stopped for tracer to examine us - PTRACE_SYSCALL-induced paths in syscall handling. * on sigreturn/rt_sigreturn we need to restore all registers. So we reserve switch_stack on stack, next to pt_regs and have the C part of sigreturn fill those along with pt_regs. Once we are done, read those registers from switch_stack. That's more or less it; many other architectures are doing more or less similar things, but not all of them put that stuff into separate structure. E.g. another valid solution is to leave space in pt_regs, fill only a subset on entry and have switch_to() save stuff in task_struct instead of putting it on kernel stack. What it means for us is that saving all that crap on stack should *not* be done unless we have work to do. OTOH, in situations when we have more than one pending signal it's bloody dumb to save/restore around each do_notify_resume() call separately. OTTH, in situation when we'd run out of timeslice and had nothing arrive until we'd regained CPU save/restore around schedule() is pointless at the very least. So for things like alpha I'd do this: interrupts disabled check thread flags no work to do = bugger off to userland just NEED_RESCHED? schedule() reread thread flags no work to do = bugger off to userland save callee-saved registers call do_work_pending restore callee-saved registers if do_work_pendign returned 0 = bugger off to userland deal with handlerless restart Note that the loop around do_signal() and friends is in C and is fairly similar to what we've got on ARM. x86 is in intermediate situation - the main complication there is v86 crap. I'd say that for now your variant should do, but we really need to get that crap under control and out of asm glue. Are you willing to participate? Guys, we need a way to do cross-architecture work without going insane. I've spent quite a bit of time this year crawling through that stuff. And yes, it's getting better as the result, but it's not sustainable - I have VFS work to do, after all. Basically, we need more people willing to take part in that; ideally - architecture
Re: [regression] boot failure on alpha, bisected
On Sun, Oct 07, 2012 at 09:16:27PM +0200, Oleg Nesterov wrote: On 10/07, Thorsten Kranzkowski wrote: On Sun, Oct 07, 2012 at 07:13:00PM +0200, Oleg Nesterov wrote: On 10/07, Oleg Nesterov wrote: Hmm. I know nothing about arch/alpha and I can't understand its entry.S. But _it seems_ to me that do_notify_resume() is called with irqs disabled. If this is true, then imho arch/alpha should be fixed. Before this commit task_work_run() enabled irqs, but this was the side effect of spin_lock_irq/spin_unlock_irq, we should not rely on this. Could you please test the debugging patch below? Of course. With that patch applied the kernel (ac3d0da) boots again. The trace line is printed about once a second, with values '2' and '4'. Thanks a lot Thorsten! So I'll probably send the patch which enables interrupts in task_work_run(). I guess this needs if (irqs_disabled()) for lockdep. The question is, should I add the warning to remind that this arch needs a fix? If you do, then please warn only once, otherwise there will be a lot of warnings :-) Thorsten -- | Thorsten KranzkowskiInternet: dl8...@dl8bcu.de | | Mobile: ++49 170 1876134 Snail: Kiebitzstr. 14, 49324 Melle, Germany | | Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8...@marvin.dl8bcu.ampr.org [44.130.8.19] | -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/