Re: [PATCH] khungtaskd: Kick stuck processes
On Tue 13-02-18 12:08:12, Chris Wilson wrote: > Quoting Michal Hocko (2018-02-13 11:56:42) > > On Thu 08-02-18 19:07:53, Chris Wilson wrote: > > > After spotting a stuck process, and having decided not to panic, give > > > the task a kick to see if that helps it to recover (e.g. to paper over a > > > missed wake up). > > > > huh, this is just no-no. watchdog is there to report problems not > > interfere. You cannot never know whether the sleeper is prepared for > > spurious wakeups. Do not paper over bugs... > > Aside from khungtaskd being a debug feature, we want to identify the bug > by kicking the stuck process and seeing what squeals. Being told that > khugepaged is stuck over and over again doesn't help resolve who is > holding onto that lock_page, or if it was just a missed wakeup as all > other processes are asleep. And how exactly does kicking helps here? If the waiter uses lock_page then it would go sleep again because of PG_locked. If the page is not locked and this is a missed wake up then either unlock_page is wrong (which doesn't seem to be the case AFAICS) or somebody messes up with the page locking and this patch doesn't achieve anything. > We are trying to paper over other bugs so that we can fix ours. But you do not want to break existing code which might be sensible to spurious wakeups. You could argue that such a code is broken already and I would tend to agree, but an artificial wake up is just nogo. -- Michal Hocko SUSE Labs
Re: [PATCH] khungtaskd: Kick stuck processes
On Tue 13-02-18 12:08:12, Chris Wilson wrote: > Quoting Michal Hocko (2018-02-13 11:56:42) > > On Thu 08-02-18 19:07:53, Chris Wilson wrote: > > > After spotting a stuck process, and having decided not to panic, give > > > the task a kick to see if that helps it to recover (e.g. to paper over a > > > missed wake up). > > > > huh, this is just no-no. watchdog is there to report problems not > > interfere. You cannot never know whether the sleeper is prepared for > > spurious wakeups. Do not paper over bugs... > > Aside from khungtaskd being a debug feature, we want to identify the bug > by kicking the stuck process and seeing what squeals. Being told that > khugepaged is stuck over and over again doesn't help resolve who is > holding onto that lock_page, or if it was just a missed wakeup as all > other processes are asleep. And how exactly does kicking helps here? If the waiter uses lock_page then it would go sleep again because of PG_locked. If the page is not locked and this is a missed wake up then either unlock_page is wrong (which doesn't seem to be the case AFAICS) or somebody messes up with the page locking and this patch doesn't achieve anything. > We are trying to paper over other bugs so that we can fix ours. But you do not want to break existing code which might be sensible to spurious wakeups. You could argue that such a code is broken already and I would tend to agree, but an artificial wake up is just nogo. -- Michal Hocko SUSE Labs
Re: [PATCH] khungtaskd: Kick stuck processes
On Thu 08-02-18 19:07:53, Chris Wilson wrote: > After spotting a stuck process, and having decided not to panic, give > the task a kick to see if that helps it to recover (e.g. to paper over a > missed wake up). huh, this is just no-no. watchdog is there to report problems not interfere. You cannot never know whether the sleeper is prepared for spurious wakeups. Do not paper over bugs... > References: https://bugs.freedesktop.org/show_bug.cgi?id=104009 > References: https://bugs.freedesktop.org/show_bug.cgi?id=104682 > Signed-off-by: Chris Wilson> Cc: Ingo Molnar > Cc: Tetsuo Handa > Cc: Andrew Morton > --- > kernel/hung_task.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index 751593ed7c0b..b32acb6bcc63 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) > trigger_all_cpu_backtrace(); > panic("hung_task: blocked tasks"); > } > + > + wake_up_process(t); > } > > /* > -- > 2.16.1 -- Michal Hocko SUSE Labs
Re: [PATCH] khungtaskd: Kick stuck processes
On Thu 08-02-18 19:07:53, Chris Wilson wrote: > After spotting a stuck process, and having decided not to panic, give > the task a kick to see if that helps it to recover (e.g. to paper over a > missed wake up). huh, this is just no-no. watchdog is there to report problems not interfere. You cannot never know whether the sleeper is prepared for spurious wakeups. Do not paper over bugs... > References: https://bugs.freedesktop.org/show_bug.cgi?id=104009 > References: https://bugs.freedesktop.org/show_bug.cgi?id=104682 > Signed-off-by: Chris Wilson > Cc: Ingo Molnar > Cc: Tetsuo Handa > Cc: Andrew Morton > --- > kernel/hung_task.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index 751593ed7c0b..b32acb6bcc63 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) > trigger_all_cpu_backtrace(); > panic("hung_task: blocked tasks"); > } > + > + wake_up_process(t); > } > > /* > -- > 2.16.1 -- Michal Hocko SUSE Labs
Re: [PATCH] khungtaskd: Kick stuck processes
Chris Wilson wrote: > Quoting Tetsuo Handa (2018-02-08 23:10:43) > > Chris Wilson wrote: > > > After spotting a stuck process, and having decided not to panic, give > > > the task a kick to see if that helps it to recover (e.g. to paper over a > > > missed wake up). > > > > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing > > io_schedule() be replaced with timeout version (e.g. dump_page() upon > > timeout > > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? > > Yes, this isn't for debugging who left the page locked (or the exact > root cause), this is just trying to allow the system to limp along > afterwards :) From personal experience, I know how easy it is to lose a > wakeup and the only thing to notice is khungtaskd shouting every 120s. Calling wake_up_process() does not sleep, does it? Then, I think you can do it using SystemTap, for SystemTap gives you ability to call exported functions at (almost) arbitrary line of (almost) arbitrary file. You can find https://events.static.linuxfound.org/sites/events/files/slides/LCJ2014-en_0.pdf for an example.
Re: [PATCH] khungtaskd: Kick stuck processes
Chris Wilson wrote: > Quoting Tetsuo Handa (2018-02-08 23:10:43) > > Chris Wilson wrote: > > > After spotting a stuck process, and having decided not to panic, give > > > the task a kick to see if that helps it to recover (e.g. to paper over a > > > missed wake up). > > > > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing > > io_schedule() be replaced with timeout version (e.g. dump_page() upon > > timeout > > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? > > Yes, this isn't for debugging who left the page locked (or the exact > root cause), this is just trying to allow the system to limp along > afterwards :) From personal experience, I know how easy it is to lose a > wakeup and the only thing to notice is khungtaskd shouting every 120s. Calling wake_up_process() does not sleep, does it? Then, I think you can do it using SystemTap, for SystemTap gives you ability to call exported functions at (almost) arbitrary line of (almost) arbitrary file. You can find https://events.static.linuxfound.org/sites/events/files/slides/LCJ2014-en_0.pdf for an example.
Re: [PATCH] khungtaskd: Kick stuck processes
Quoting Tetsuo Handa (2018-02-08 23:10:43) > Chris Wilson wrote: > > After spotting a stuck process, and having decided not to panic, give > > the task a kick to see if that helps it to recover (e.g. to paper over a > > missed wake up). > > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing > io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? Yes, this isn't for debugging who left the page locked (or the exact root cause), this is just trying to allow the system to limp along afterwards :) From personal experience, I know how easy it is to lose a wakeup and the only thing to notice is khungtaskd shouting every 120s. -Chris
Re: [PATCH] khungtaskd: Kick stuck processes
Quoting Tetsuo Handa (2018-02-08 23:10:43) > Chris Wilson wrote: > > After spotting a stuck process, and having decided not to panic, give > > the task a kick to see if that helps it to recover (e.g. to paper over a > > missed wake up). > > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing > io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? Yes, this isn't for debugging who left the page locked (or the exact root cause), this is just trying to allow the system to limp along afterwards :) From personal experience, I know how easy it is to lose a wakeup and the only thing to notice is khungtaskd shouting every 120s. -Chris
Re: [PATCH] khungtaskd: Kick stuck processes
Chris Wilson wrote: > After spotting a stuck process, and having decided not to panic, give > the task a kick to see if that helps it to recover (e.g. to paper over a > missed wake up). Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? > > References: https://bugs.freedesktop.org/show_bug.cgi?id=104009 > References: https://bugs.freedesktop.org/show_bug.cgi?id=104682 > Signed-off-by: Chris Wilson> Cc: Ingo Molnar > Cc: Tetsuo Handa > Cc: Andrew Morton > --- > kernel/hung_task.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index 751593ed7c0b..b32acb6bcc63 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) > trigger_all_cpu_backtrace(); > panic("hung_task: blocked tasks"); > } > + > + wake_up_process(t); > } > > /* > -- > 2.16.1
Re: [PATCH] khungtaskd: Kick stuck processes
Chris Wilson wrote: > After spotting a stuck process, and having decided not to panic, give > the task a kick to see if that helps it to recover (e.g. to paper over a > missed wake up). Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue? > > References: https://bugs.freedesktop.org/show_bug.cgi?id=104009 > References: https://bugs.freedesktop.org/show_bug.cgi?id=104682 > Signed-off-by: Chris Wilson > Cc: Ingo Molnar > Cc: Tetsuo Handa > Cc: Andrew Morton > --- > kernel/hung_task.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index 751593ed7c0b..b32acb6bcc63 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) > trigger_all_cpu_backtrace(); > panic("hung_task: blocked tasks"); > } > + > + wake_up_process(t); > } > > /* > -- > 2.16.1