Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-13 Thread Michal Hocko
On Tue 13-02-18 12:08:12, Chris Wilson wrote:
> Quoting Michal Hocko (2018-02-13 11:56:42)
> > On Thu 08-02-18 19:07:53, Chris Wilson wrote:
> > > After spotting a stuck process, and having decided not to panic, give
> > > the task a kick to see if that helps it to recover (e.g. to paper over a
> > > missed wake up).
> > 
> > huh, this is just no-no. watchdog is there to report problems not
> > interfere. You cannot never know whether the sleeper is prepared for
> > spurious wakeups. Do not paper over bugs...
> 
> Aside from khungtaskd being a debug feature, we want to identify the bug
> by kicking the stuck process and seeing what squeals. Being told that
> khugepaged is stuck over and over again doesn't help resolve who is
> holding onto that lock_page, or if it was just a missed wakeup as all
> other processes are asleep.

And how exactly does kicking helps here? If the waiter uses lock_page
then it would go sleep again because of PG_locked. If the page is not
locked and this is a missed wake up then either unlock_page is wrong
(which doesn't seem to be the case AFAICS) or somebody messes up with
the page locking and this patch doesn't achieve anything.

> We are trying to paper over other bugs so that we can fix ours.

But you do not want to break existing code which might be sensible to
spurious wakeups. You could argue that such a code is broken already and
I would tend to agree, but an artificial wake up is just nogo.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-13 Thread Michal Hocko
On Tue 13-02-18 12:08:12, Chris Wilson wrote:
> Quoting Michal Hocko (2018-02-13 11:56:42)
> > On Thu 08-02-18 19:07:53, Chris Wilson wrote:
> > > After spotting a stuck process, and having decided not to panic, give
> > > the task a kick to see if that helps it to recover (e.g. to paper over a
> > > missed wake up).
> > 
> > huh, this is just no-no. watchdog is there to report problems not
> > interfere. You cannot never know whether the sleeper is prepared for
> > spurious wakeups. Do not paper over bugs...
> 
> Aside from khungtaskd being a debug feature, we want to identify the bug
> by kicking the stuck process and seeing what squeals. Being told that
> khugepaged is stuck over and over again doesn't help resolve who is
> holding onto that lock_page, or if it was just a missed wakeup as all
> other processes are asleep.

And how exactly does kicking helps here? If the waiter uses lock_page
then it would go sleep again because of PG_locked. If the page is not
locked and this is a missed wake up then either unlock_page is wrong
(which doesn't seem to be the case AFAICS) or somebody messes up with
the page locking and this patch doesn't achieve anything.

> We are trying to paper over other bugs so that we can fix ours.

But you do not want to break existing code which might be sensible to
spurious wakeups. You could argue that such a code is broken already and
I would tend to agree, but an artificial wake up is just nogo.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-13 Thread Michal Hocko
On Thu 08-02-18 19:07:53, Chris Wilson wrote:
> After spotting a stuck process, and having decided not to panic, give
> the task a kick to see if that helps it to recover (e.g. to paper over a
> missed wake up).

huh, this is just no-no. watchdog is there to report problems not
interfere. You cannot never know whether the sleeper is prepared for
spurious wakeups. Do not paper over bugs...

> References: https://bugs.freedesktop.org/show_bug.cgi?id=104009
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104682
> Signed-off-by: Chris Wilson 
> Cc: Ingo Molnar 
> Cc: Tetsuo Handa 
> Cc: Andrew Morton 
> ---
>  kernel/hung_task.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 751593ed7c0b..b32acb6bcc63 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, 
> unsigned long timeout)
>   trigger_all_cpu_backtrace();
>   panic("hung_task: blocked tasks");
>   }
> +
> + wake_up_process(t);
>  }
>  
>  /*
> -- 
> 2.16.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-13 Thread Michal Hocko
On Thu 08-02-18 19:07:53, Chris Wilson wrote:
> After spotting a stuck process, and having decided not to panic, give
> the task a kick to see if that helps it to recover (e.g. to paper over a
> missed wake up).

huh, this is just no-no. watchdog is there to report problems not
interfere. You cannot never know whether the sleeper is prepared for
spurious wakeups. Do not paper over bugs...

> References: https://bugs.freedesktop.org/show_bug.cgi?id=104009
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104682
> Signed-off-by: Chris Wilson 
> Cc: Ingo Molnar 
> Cc: Tetsuo Handa 
> Cc: Andrew Morton 
> ---
>  kernel/hung_task.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 751593ed7c0b..b32acb6bcc63 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, 
> unsigned long timeout)
>   trigger_all_cpu_backtrace();
>   panic("hung_task: blocked tasks");
>   }
> +
> + wake_up_process(t);
>  }
>  
>  /*
> -- 
> 2.16.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-09 Thread Tetsuo Handa
Chris Wilson wrote:
> Quoting Tetsuo Handa (2018-02-08 23:10:43)
> > Chris Wilson wrote:
> > > After spotting a stuck process, and having decided not to panic, give
> > > the task a kick to see if that helps it to recover (e.g. to paper over a
> > > missed wake up).
> > 
> > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
> > io_schedule() be replaced with timeout version (e.g. dump_page() upon 
> > timeout
> > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?
> 
> Yes, this isn't for debugging who left the page locked (or the exact
> root cause), this is just trying to allow the system to limp along
> afterwards :) From personal experience, I know how easy it is to lose a
> wakeup and the only thing to notice is khungtaskd shouting every 120s.

Calling wake_up_process() does not sleep, does it? Then, I think you can
do it using SystemTap, for SystemTap gives you ability to call exported
functions at (almost) arbitrary line of (almost) arbitrary file. You can find
https://events.static.linuxfound.org/sites/events/files/slides/LCJ2014-en_0.pdf
for an example.


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-09 Thread Tetsuo Handa
Chris Wilson wrote:
> Quoting Tetsuo Handa (2018-02-08 23:10:43)
> > Chris Wilson wrote:
> > > After spotting a stuck process, and having decided not to panic, give
> > > the task a kick to see if that helps it to recover (e.g. to paper over a
> > > missed wake up).
> > 
> > Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
> > io_schedule() be replaced with timeout version (e.g. dump_page() upon 
> > timeout
> > if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?
> 
> Yes, this isn't for debugging who left the page locked (or the exact
> root cause), this is just trying to allow the system to limp along
> afterwards :) From personal experience, I know how easy it is to lose a
> wakeup and the only thing to notice is khungtaskd shouting every 120s.

Calling wake_up_process() does not sleep, does it? Then, I think you can
do it using SystemTap, for SystemTap gives you ability to call exported
functions at (almost) arbitrary line of (almost) arbitrary file. You can find
https://events.static.linuxfound.org/sites/events/files/slides/LCJ2014-en_0.pdf
for an example.


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-08 Thread Chris Wilson
Quoting Tetsuo Handa (2018-02-08 23:10:43)
> Chris Wilson wrote:
> > After spotting a stuck process, and having decided not to panic, give
> > the task a kick to see if that helps it to recover (e.g. to paper over a
> > missed wake up).
> 
> Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
> io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout
> if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?

Yes, this isn't for debugging who left the page locked (or the exact
root cause), this is just trying to allow the system to limp along
afterwards :) From personal experience, I know how easy it is to lose a
wakeup and the only thing to notice is khungtaskd shouting every 120s.
-Chris


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-08 Thread Chris Wilson
Quoting Tetsuo Handa (2018-02-08 23:10:43)
> Chris Wilson wrote:
> > After spotting a stuck process, and having decided not to panic, give
> > the task a kick to see if that helps it to recover (e.g. to paper over a
> > missed wake up).
> 
> Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
> io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout
> if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?

Yes, this isn't for debugging who left the page locked (or the exact
root cause), this is just trying to allow the system to limp along
afterwards :) From personal experience, I know how easy it is to lose a
wakeup and the only thing to notice is khungtaskd shouting every 120s.
-Chris


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-08 Thread Tetsuo Handa
Chris Wilson wrote:
> After spotting a stuck process, and having decided not to panic, give
> the task a kick to see if that helps it to recover (e.g. to paper over a
> missed wake up).

Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout
if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?

> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104009
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104682
> Signed-off-by: Chris Wilson 
> Cc: Ingo Molnar 
> Cc: Tetsuo Handa 
> Cc: Andrew Morton 
> ---
>  kernel/hung_task.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 751593ed7c0b..b32acb6bcc63 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, 
> unsigned long timeout)
>   trigger_all_cpu_backtrace();
>   panic("hung_task: blocked tasks");
>   }
> +
> + wake_up_process(t);
>  }
>  
>  /*
> -- 
> 2.16.1


Re: [PATCH] khungtaskd: Kick stuck processes

2018-02-08 Thread Tetsuo Handa
Chris Wilson wrote:
> After spotting a stuck process, and having decided not to panic, give
> the task a kick to see if that helps it to recover (e.g. to paper over a
> missed wake up).

Yes, we are seeing hangs at io_schedule(), but doesn't optionally allowing
io_schedule() be replaced with timeout version (e.g. dump_page() upon timeout
if io_schedule() was called for e.g. wait_on_page_bit()) give us more clue?

> 
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104009
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104682
> Signed-off-by: Chris Wilson 
> Cc: Ingo Molnar 
> Cc: Tetsuo Handa 
> Cc: Andrew Morton 
> ---
>  kernel/hung_task.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 751593ed7c0b..b32acb6bcc63 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -132,6 +132,8 @@ static void check_hung_task(struct task_struct *t, 
> unsigned long timeout)
>   trigger_all_cpu_backtrace();
>   panic("hung_task: blocked tasks");
>   }
> +
> + wake_up_process(t);
>  }
>  
>  /*
> -- 
> 2.16.1