Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Fri, Oct 21, 2016 at 5:47 PM, Oleg Nesterovwrote: > On 10/20, Andy Lutomirski wrote: >> >> > --- a/kernel/sched/core.c >> > +++ b/kernel/sched/core.c >> > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) >> > * If a worker went to sleep, notify and ask >> > workqueue >> > * whether it wants to wake up a task to maintain >> > * concurrency. >> > +* >> > +* Also the following stack is possible: >> > +*oops_end() >> > +*do_exit() >> > +*schedule() >> > +* >> > +* If panic_on_oops is not set and oops happens on >> > +* a workqueue execution path, thread will be >> > killed. >> > +* That is definitly sad, but not to make the >> > situation >> > +* even worse we have to ignore dead tasks in >> > order not >> > +* to step on zeroed out members (e.g. >> > t->vfork_done is >> > +* already NULL on that path, since we were called >> > by >> > +* do_exit())) > > And we have more problems like this. Say, if blk_flush_plug_list() > crashes it will likely crash again and again recursively. I will send a patch if reproduce it :) > >> > */ >> > - if (prev->flags & PF_WQ_WORKER) { >> > + if (prev->flags & PF_WQ_WORKER && >> > + prev->state != TASK_DEAD) { > > I don't think we should change __schedule()... Can't we simply clear > PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks > into wq_worker_sleeping() and wq_worker_waking_up(). Yeah, probably handling this corner case in wq_worker_sleeping() func is much better. > > Or perhaps something like the change below. That's a nice stuff, thanks Oleg. I simply did not know about these callbacks. But the huge problem is that after commit 2deb4be28 by Andy Lutomirski we can't use stack when we are already in do_exit(). And putting this callback head inside a worker structure is a bloat. I will resend this with a simple task state check in a wq_worker_sleeping(). -- Roman > > Oleg. > > --- x/kernel/workqueue.c > +++ x/kernel/workqueue.c > @@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru > } > } > > +static void oops_handler(struct callback_head *oops_work) > +{ > + if (!(current->flags & PF_WQ_WORKER)) > + return; > + > + clear PF_WQ_WORKER, probably do more cleanups > +} > + > /** > * worker_thread - the worker thread function > * @__worker: self > @@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru > */ > static int worker_thread(void *__worker) > { > + struct callback_head oops_work; > struct worker *worker = __worker; > struct worker_pool *pool = worker->pool; > > /* tell the scheduler that this is a workqueue worker */ > worker->task->flags |= PF_WQ_WORKER; > + init_task_work(_work, oops_handler); > + task_work_add(current, _work, false); > woke_up: > spin_lock_irq(>lock); > >
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Fri, Oct 21, 2016 at 5:47 PM, Oleg Nesterov wrote: > On 10/20, Andy Lutomirski wrote: >> >> > --- a/kernel/sched/core.c >> > +++ b/kernel/sched/core.c >> > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) >> > * If a worker went to sleep, notify and ask >> > workqueue >> > * whether it wants to wake up a task to maintain >> > * concurrency. >> > +* >> > +* Also the following stack is possible: >> > +*oops_end() >> > +*do_exit() >> > +*schedule() >> > +* >> > +* If panic_on_oops is not set and oops happens on >> > +* a workqueue execution path, thread will be >> > killed. >> > +* That is definitly sad, but not to make the >> > situation >> > +* even worse we have to ignore dead tasks in >> > order not >> > +* to step on zeroed out members (e.g. >> > t->vfork_done is >> > +* already NULL on that path, since we were called >> > by >> > +* do_exit())) > > And we have more problems like this. Say, if blk_flush_plug_list() > crashes it will likely crash again and again recursively. I will send a patch if reproduce it :) > >> > */ >> > - if (prev->flags & PF_WQ_WORKER) { >> > + if (prev->flags & PF_WQ_WORKER && >> > + prev->state != TASK_DEAD) { > > I don't think we should change __schedule()... Can't we simply clear > PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks > into wq_worker_sleeping() and wq_worker_waking_up(). Yeah, probably handling this corner case in wq_worker_sleeping() func is much better. > > Or perhaps something like the change below. That's a nice stuff, thanks Oleg. I simply did not know about these callbacks. But the huge problem is that after commit 2deb4be28 by Andy Lutomirski we can't use stack when we are already in do_exit(). And putting this callback head inside a worker structure is a bloat. I will resend this with a simple task state check in a wq_worker_sleeping(). -- Roman > > Oleg. > > --- x/kernel/workqueue.c > +++ x/kernel/workqueue.c > @@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru > } > } > > +static void oops_handler(struct callback_head *oops_work) > +{ > + if (!(current->flags & PF_WQ_WORKER)) > + return; > + > + clear PF_WQ_WORKER, probably do more cleanups > +} > + > /** > * worker_thread - the worker thread function > * @__worker: self > @@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru > */ > static int worker_thread(void *__worker) > { > + struct callback_head oops_work; > struct worker *worker = __worker; > struct worker_pool *pool = worker->pool; > > /* tell the scheduler that this is a workqueue worker */ > worker->task->flags |= PF_WQ_WORKER; > + init_task_work(_work, oops_handler); > + task_work_add(current, _work, false); > woke_up: > spin_lock_irq(>lock); > >
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On 10/20, Andy Lutomirski wrote: > > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) > > * If a worker went to sleep, notify and ask > > workqueue > > * whether it wants to wake up a task to maintain > > * concurrency. > > +* > > +* Also the following stack is possible: > > +*oops_end() > > +*do_exit() > > +*schedule() > > +* > > +* If panic_on_oops is not set and oops happens on > > +* a workqueue execution path, thread will be > > killed. > > +* That is definitly sad, but not to make the > > situation > > +* even worse we have to ignore dead tasks in order > > not > > +* to step on zeroed out members (e.g. > > t->vfork_done is > > +* already NULL on that path, since we were called > > by > > +* do_exit())) And we have more problems like this. Say, if blk_flush_plug_list() crashes it will likely crash again and again recursively. > > */ > > - if (prev->flags & PF_WQ_WORKER) { > > + if (prev->flags & PF_WQ_WORKER && > > + prev->state != TASK_DEAD) { I don't think we should change __schedule()... Can't we simply clear PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks into wq_worker_sleeping() and wq_worker_waking_up(). Or perhaps something like the change below. Oleg. --- x/kernel/workqueue.c +++ x/kernel/workqueue.c @@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru } } +static void oops_handler(struct callback_head *oops_work) +{ + if (!(current->flags & PF_WQ_WORKER)) + return; + + clear PF_WQ_WORKER, probably do more cleanups +} + /** * worker_thread - the worker thread function * @__worker: self @@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru */ static int worker_thread(void *__worker) { + struct callback_head oops_work; struct worker *worker = __worker; struct worker_pool *pool = worker->pool; /* tell the scheduler that this is a workqueue worker */ worker->task->flags |= PF_WQ_WORKER; + init_task_work(_work, oops_handler); + task_work_add(current, _work, false); woke_up: spin_lock_irq(>lock);
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On 10/20, Andy Lutomirski wrote: > > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) > > * If a worker went to sleep, notify and ask > > workqueue > > * whether it wants to wake up a task to maintain > > * concurrency. > > +* > > +* Also the following stack is possible: > > +*oops_end() > > +*do_exit() > > +*schedule() > > +* > > +* If panic_on_oops is not set and oops happens on > > +* a workqueue execution path, thread will be > > killed. > > +* That is definitly sad, but not to make the > > situation > > +* even worse we have to ignore dead tasks in order > > not > > +* to step on zeroed out members (e.g. > > t->vfork_done is > > +* already NULL on that path, since we were called > > by > > +* do_exit())) And we have more problems like this. Say, if blk_flush_plug_list() crashes it will likely crash again and again recursively. > > */ > > - if (prev->flags & PF_WQ_WORKER) { > > + if (prev->flags & PF_WQ_WORKER && > > + prev->state != TASK_DEAD) { I don't think we should change __schedule()... Can't we simply clear PF_WQ_WORKER in complete_vfork_done() ? Or add the PF_EXITING checks into wq_worker_sleeping() and wq_worker_waking_up(). Or perhaps something like the change below. Oleg. --- x/kernel/workqueue.c +++ x/kernel/workqueue.c @@ -2157,6 +2157,14 @@ static void process_scheduled_works(stru } } +static void oops_handler(struct callback_head *oops_work) +{ + if (!(current->flags & PF_WQ_WORKER)) + return; + + clear PF_WQ_WORKER, probably do more cleanups +} + /** * worker_thread - the worker thread function * @__worker: self @@ -2171,11 +2179,14 @@ static void process_scheduled_works(stru */ static int worker_thread(void *__worker) { + struct callback_head oops_work; struct worker *worker = __worker; struct worker_pool *pool = worker->pool; /* tell the scheduler that this is a workqueue worker */ worker->task->flags |= PF_WQ_WORKER; + init_task_work(_work, oops_handler); + task_work_add(current, _work, false); woke_up: spin_lock_irq(>lock);
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Wed, Sep 21, 2016 at 05:43:50PM +0200, Roman Pen wrote: > If panic_on_oops is not set and oops happens inside workqueue kthread, > kernel kills this kthread. Current patch fixes recursive GPF which > happens in that case with the following stack: > The root cause is that zeroed task->vfork_done member is accessed from > wq_worker_sleeping() hook. This is the kthread_data() -> to_kthread() thing? Could've done with spelling out, now you had me searching all over :/ Urgh what a mess..
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Wed, Sep 21, 2016 at 05:43:50PM +0200, Roman Pen wrote: > If panic_on_oops is not set and oops happens inside workqueue kthread, > kernel kills this kthread. Current patch fixes recursive GPF which > happens in that case with the following stack: > The root cause is that zeroed task->vfork_done member is accessed from > wq_worker_sleeping() hook. This is the kthread_data() -> to_kthread() thing? Could've done with spelling out, now you had me searching all over :/ Urgh what a mess..
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Wed, Sep 21, 2016 at 8:43 AM, Roman Penwrote: > If panic_on_oops is not set and oops happens inside workqueue kthread, > kernel kills this kthread. Current patch fixes recursive GPF which > happens in that case with the following stack: Oleg, can you take a look at this? --Andy > > [] dump_stack+0x68/0x93 > [] ? do_exit+0x7ab/0xc10 > [] __schedule_bug+0x83/0xe0 > [] __schedule+0x7ea/0xba0 > [] ? vprintk_default+0x1f/0x30 > [] ? printk+0x48/0x50 > [] schedule+0x40/0x90 > [] do_exit+0x9ca/0xc10 > [] ? kmsg_dump+0x11d/0x190 > [] ? kmsg_dump+0x17/0x190 > [] oops_end+0x99/0xd0 > [] no_context+0x185/0x3e0 > [] __bad_area_nosemaphore+0x83/0x1c0 > [] ? vprintk_emit+0x25e/0x530 > [] bad_area_nosemaphore+0x14/0x20 > [] __do_page_fault+0xac/0x570 > [] ? console_trylock+0x1e/0xe0 > [] ? trace_hardirqs_off_thunk+0x1a/0x1c > [] do_page_fault+0xc/0x10 > [] page_fault+0x22/0x30 > [] ? kthread_data+0x33/0x40 > [] ? wq_worker_sleeping+0xe/0x80 > [] __schedule+0x47b/0xba0 > [] schedule+0x40/0x90 > [] do_exit+0x7dd/0xc10 > [] oops_end+0x99/0xd0 > > The root cause is that zeroed task->vfork_done member is accessed from > wq_worker_sleeping() hook. The zeroing out happens on the following > path: > >oops_end() >do_exit() >exit_mm() >mm_release() >complete_vfork_done() > > In order to fix a bug dead tasks must be ignored. > > Signed-off-by: Roman Pen > Cc: Andy Lutomirski > Cc: Josh Poimboeuf > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Denys Vlasenko > Cc: H. Peter Anvin > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Tejun Heo > Cc: linux-kernel@vger.kernel.org > --- > kernel/sched/core.c | 16 +++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2c303e7..50772e5 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) > * If a worker went to sleep, notify and ask workqueue > * whether it wants to wake up a task to maintain > * concurrency. > +* > +* Also the following stack is possible: > +*oops_end() > +*do_exit() > +*schedule() > +* > +* If panic_on_oops is not set and oops happens on > +* a workqueue execution path, thread will be killed. > +* That is definitly sad, but not to make the > situation > +* even worse we have to ignore dead tasks in order > not > +* to step on zeroed out members (e.g. t->vfork_done > is > +* already NULL on that path, since we were called by > +* do_exit())) > */ > - if (prev->flags & PF_WQ_WORKER) { > + if (prev->flags & PF_WQ_WORKER && > + prev->state != TASK_DEAD) { > struct task_struct *to_wakeup; > > to_wakeup = wq_worker_sleeping(prev); > -- > 2.9.3 > -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
On Wed, Sep 21, 2016 at 8:43 AM, Roman Pen wrote: > If panic_on_oops is not set and oops happens inside workqueue kthread, > kernel kills this kthread. Current patch fixes recursive GPF which > happens in that case with the following stack: Oleg, can you take a look at this? --Andy > > [] dump_stack+0x68/0x93 > [] ? do_exit+0x7ab/0xc10 > [] __schedule_bug+0x83/0xe0 > [] __schedule+0x7ea/0xba0 > [] ? vprintk_default+0x1f/0x30 > [] ? printk+0x48/0x50 > [] schedule+0x40/0x90 > [] do_exit+0x9ca/0xc10 > [] ? kmsg_dump+0x11d/0x190 > [] ? kmsg_dump+0x17/0x190 > [] oops_end+0x99/0xd0 > [] no_context+0x185/0x3e0 > [] __bad_area_nosemaphore+0x83/0x1c0 > [] ? vprintk_emit+0x25e/0x530 > [] bad_area_nosemaphore+0x14/0x20 > [] __do_page_fault+0xac/0x570 > [] ? console_trylock+0x1e/0xe0 > [] ? trace_hardirqs_off_thunk+0x1a/0x1c > [] do_page_fault+0xc/0x10 > [] page_fault+0x22/0x30 > [] ? kthread_data+0x33/0x40 > [] ? wq_worker_sleeping+0xe/0x80 > [] __schedule+0x47b/0xba0 > [] schedule+0x40/0x90 > [] do_exit+0x7dd/0xc10 > [] oops_end+0x99/0xd0 > > The root cause is that zeroed task->vfork_done member is accessed from > wq_worker_sleeping() hook. The zeroing out happens on the following > path: > >oops_end() >do_exit() >exit_mm() >mm_release() >complete_vfork_done() > > In order to fix a bug dead tasks must be ignored. > > Signed-off-by: Roman Pen > Cc: Andy Lutomirski > Cc: Josh Poimboeuf > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Denys Vlasenko > Cc: H. Peter Anvin > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Tejun Heo > Cc: linux-kernel@vger.kernel.org > --- > kernel/sched/core.c | 16 +++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2c303e7..50772e5 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) > * If a worker went to sleep, notify and ask workqueue > * whether it wants to wake up a task to maintain > * concurrency. > +* > +* Also the following stack is possible: > +*oops_end() > +*do_exit() > +*schedule() > +* > +* If panic_on_oops is not set and oops happens on > +* a workqueue execution path, thread will be killed. > +* That is definitly sad, but not to make the > situation > +* even worse we have to ignore dead tasks in order > not > +* to step on zeroed out members (e.g. t->vfork_done > is > +* already NULL on that path, since we were called by > +* do_exit())) > */ > - if (prev->flags & PF_WQ_WORKER) { > + if (prev->flags & PF_WQ_WORKER && > + prev->state != TASK_DEAD) { > struct task_struct *to_wakeup; > > to_wakeup = wq_worker_sleeping(prev); > -- > 2.9.3 > -- Andy Lutomirski AMA Capital Management, LLC
[PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
If panic_on_oops is not set and oops happens inside workqueue kthread, kernel kills this kthread. Current patch fixes recursive GPF which happens in that case with the following stack: [] dump_stack+0x68/0x93 [] ? do_exit+0x7ab/0xc10 [] __schedule_bug+0x83/0xe0 [] __schedule+0x7ea/0xba0 [] ? vprintk_default+0x1f/0x30 [] ? printk+0x48/0x50 [] schedule+0x40/0x90 [] do_exit+0x9ca/0xc10 [] ? kmsg_dump+0x11d/0x190 [] ? kmsg_dump+0x17/0x190 [] oops_end+0x99/0xd0 [] no_context+0x185/0x3e0 [] __bad_area_nosemaphore+0x83/0x1c0 [] ? vprintk_emit+0x25e/0x530 [] bad_area_nosemaphore+0x14/0x20 [] __do_page_fault+0xac/0x570 [] ? console_trylock+0x1e/0xe0 [] ? trace_hardirqs_off_thunk+0x1a/0x1c [] do_page_fault+0xc/0x10 [] page_fault+0x22/0x30 [] ? kthread_data+0x33/0x40 [] ? wq_worker_sleeping+0xe/0x80 [] __schedule+0x47b/0xba0 [] schedule+0x40/0x90 [] do_exit+0x7dd/0xc10 [] oops_end+0x99/0xd0 The root cause is that zeroed task->vfork_done member is accessed from wq_worker_sleeping() hook. The zeroing out happens on the following path: oops_end() do_exit() exit_mm() mm_release() complete_vfork_done() In order to fix a bug dead tasks must be ignored. Signed-off-by: Roman PenCc: Andy Lutomirski Cc: Josh Poimboeuf Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org --- kernel/sched/core.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2c303e7..50772e5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) * If a worker went to sleep, notify and ask workqueue * whether it wants to wake up a task to maintain * concurrency. +* +* Also the following stack is possible: +*oops_end() +*do_exit() +*schedule() +* +* If panic_on_oops is not set and oops happens on +* a workqueue execution path, thread will be killed. +* That is definitly sad, but not to make the situation +* even worse we have to ignore dead tasks in order not +* to step on zeroed out members (e.g. t->vfork_done is +* already NULL on that path, since we were called by +* do_exit())) */ - if (prev->flags & PF_WQ_WORKER) { + if (prev->flags & PF_WQ_WORKER && + prev->state != TASK_DEAD) { struct task_struct *to_wakeup; to_wakeup = wq_worker_sleeping(prev); -- 2.9.3
[PATCH 2/2] sched: do not call workqueue sleep hook if task is already dead
If panic_on_oops is not set and oops happens inside workqueue kthread, kernel kills this kthread. Current patch fixes recursive GPF which happens in that case with the following stack: [] dump_stack+0x68/0x93 [] ? do_exit+0x7ab/0xc10 [] __schedule_bug+0x83/0xe0 [] __schedule+0x7ea/0xba0 [] ? vprintk_default+0x1f/0x30 [] ? printk+0x48/0x50 [] schedule+0x40/0x90 [] do_exit+0x9ca/0xc10 [] ? kmsg_dump+0x11d/0x190 [] ? kmsg_dump+0x17/0x190 [] oops_end+0x99/0xd0 [] no_context+0x185/0x3e0 [] __bad_area_nosemaphore+0x83/0x1c0 [] ? vprintk_emit+0x25e/0x530 [] bad_area_nosemaphore+0x14/0x20 [] __do_page_fault+0xac/0x570 [] ? console_trylock+0x1e/0xe0 [] ? trace_hardirqs_off_thunk+0x1a/0x1c [] do_page_fault+0xc/0x10 [] page_fault+0x22/0x30 [] ? kthread_data+0x33/0x40 [] ? wq_worker_sleeping+0xe/0x80 [] __schedule+0x47b/0xba0 [] schedule+0x40/0x90 [] do_exit+0x7dd/0xc10 [] oops_end+0x99/0xd0 The root cause is that zeroed task->vfork_done member is accessed from wq_worker_sleeping() hook. The zeroing out happens on the following path: oops_end() do_exit() exit_mm() mm_release() complete_vfork_done() In order to fix a bug dead tasks must be ignored. Signed-off-by: Roman Pen Cc: Andy Lutomirski Cc: Josh Poimboeuf Cc: Borislav Petkov Cc: Brian Gerst Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Tejun Heo Cc: linux-kernel@vger.kernel.org --- kernel/sched/core.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2c303e7..50772e5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3380,8 +3380,22 @@ static void __sched notrace __schedule(bool preempt) * If a worker went to sleep, notify and ask workqueue * whether it wants to wake up a task to maintain * concurrency. +* +* Also the following stack is possible: +*oops_end() +*do_exit() +*schedule() +* +* If panic_on_oops is not set and oops happens on +* a workqueue execution path, thread will be killed. +* That is definitly sad, but not to make the situation +* even worse we have to ignore dead tasks in order not +* to step on zeroed out members (e.g. t->vfork_done is +* already NULL on that path, since we were called by +* do_exit())) */ - if (prev->flags & PF_WQ_WORKER) { + if (prev->flags & PF_WQ_WORKER && + prev->state != TASK_DEAD) { struct task_struct *to_wakeup; to_wakeup = wq_worker_sleeping(prev); -- 2.9.3