Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-29 Thread Tetsuo Handa
Roman Gushchin wrote:
> On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > Oops, I misinterpreted. This is where a multithreaded OOM victim with or
> > without the OOM reaper can get stuck forever. Think about a process with
> > two threads which is selected by the OOM killer, where only one of these
> > two threads can get TIF_MEMDIE.
> >
> >   Thread-2:       Calls down_write(&current->mm->mmap_sem).
> >   Thread-1:       Enters __alloc_pages_slowpath().
> >   Thread-2:       Enters __alloc_pages_slowpath().
> >   The OOM killer: Takes oom_lock.
> >   The OOM killer: Calls out_of_memory().
> >   The OOM killer: Selects Thread-1 as an OOM victim.
> >   Thread-1:       Gets SIGKILL.
> >   Thread-2:       Gets SIGKILL.
> >   Thread-1:       Gets TIF_MEMDIE.
> >   The OOM killer: Releases oom_lock.
> >   Thread-1:       Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
> >   The OOM reaper: Takes oom_lock.
> >   The OOM reaper: Will do nothing because down_read_trylock() fails.
> >   The OOM reaper: Releases oom_lock.
> >   The OOM reaper: Gives up and sets MMF_OOM_SKIP after one second.
> >   Thread-2:       Takes oom_lock.
> >   Thread-2:       Calls out_of_memory().
> >   Thread-2:       Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= gets stuck waiting for Thread-1.
> >   Thread-2:       Releases oom_lock.
> >   Thread-2:       Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
> >   Thread-2:       Will not call up_write(&current->mm->mmap_sem).
> >   Thread-1:       Reaches do_exit().
> >   Thread-1:       Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= gets stuck waiting for Thread-2.
> >   Thread-1:       Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
> >   Thread-1:       Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().
> 
> That's interesting... Does it mean that we have to give access to the memory
> reserves to all threads to guarantee forward progress?

Yes, for we don't have a __GFP_KILLABLE flag.

> 
> What do you think about Michal's approach? He posted a link in the thread.

Please read that thread.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-29 Thread Roman Gushchin
On Fri, Jun 23, 2017 at 06:52:20AM +0900, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > > > Roman Gushchin wrote:
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > > > >  	if (oom_killer_disabled)
> > > > >  		return false;
> > > > >  
> > > > > +	/*
> > > > > +	 * If there are oom victims in flight, we don't need to select
> > > > > +	 * a new victim.
> > > > > +	 */
> > > > > +	if (atomic_read(&oom_victims) > 0)
> > > > > +		return true;
> > > > > +
> > > > >  	if (!is_memcg_oom(oc)) {
> > > > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > > > >  		if (freed > 0)
> > > > 
> > > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout-based
> > > > giveup is not permitted, but a multithreaded process might be selected as
> > > > an OOM victim. Not setting TIF_MEMDIE on all threads sharing an OOM victim's
> > > > mm increases the possibility of preventing some OOM victim thread from
> > > > terminating (e.g. one of them cannot leave __alloc_pages_slowpath() with
> > > > mmap_sem held for write due to waiting for the TIF_MEMDIE thread to call
> > > > exit_oom_victim() when the TIF_MEMDIE thread is waiting for the thread
> > > > with mmap_sem held for write).
> > > 
> > > I agree that CONFIG_MMU=n is a special case, and the proposed approach can't
> > > be used directly. But can you please explain why you find the first chunk wrong?
> > 
> > Since you are checking oom_victims before checking task_will_free_mem(current),
> > only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim
> > without the OOM reaper can get stuck forever.
> 
> Oops, I misinterpreted. This is where a multithreaded OOM victim with or
> without the OOM reaper can get stuck forever. Think about a process with
> two threads which is selected by the OOM killer, where only one of these
> two threads can get TIF_MEMDIE.
>
>   Thread-2:       Calls down_write(&current->mm->mmap_sem).
>   Thread-1:       Enters __alloc_pages_slowpath().
>   Thread-2:       Enters __alloc_pages_slowpath().
>   The OOM killer: Takes oom_lock.
>   The OOM killer: Calls out_of_memory().
>   The OOM killer: Selects Thread-1 as an OOM victim.
>   Thread-1:       Gets SIGKILL.
>   Thread-2:       Gets SIGKILL.
>   Thread-1:       Gets TIF_MEMDIE.
>   The OOM killer: Releases oom_lock.
>   Thread-1:       Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
>   The OOM reaper: Takes oom_lock.
>   The OOM reaper: Will do nothing because down_read_trylock() fails.
>   The OOM reaper: Releases oom_lock.
>   The OOM reaper: Gives up and sets MMF_OOM_SKIP after one second.
>   Thread-2:       Takes oom_lock.
>   Thread-2:       Calls out_of_memory().
>   Thread-2:       Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= gets stuck waiting for Thread-1.
>   Thread-2:       Releases oom_lock.
>   Thread-2:       Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
>   Thread-2:       Will not call up_write(&current->mm->mmap_sem).
>   Thread-1:       Reaches do_exit().
>   Thread-1:       Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= gets stuck waiting for Thread-2.
>   Thread-1:       Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
>   Thread-1:       Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().

That's interesting... Does it mean that we have to give access to the memory
reserves to all threads to guarantee forward progress?

What do you think about Michal's approach? He posted a link in the thread.

Thank you!

Roman


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-29 Thread Michal Hocko
On Wed 21-06-17 22:19:11, Roman Gushchin wrote:
> The OOM killer should avoid unnecessary kills. To prevent them, during
> the task list traversal we check for tasks which were previously
> selected as OOM victims. If there is such a task, a new victim
> is not selected.
> 
> This approach is sub-optimal (we're doing a costly iteration over the task
> list every time) and will not work for the cgroup-aware OOM killer.
> 
> We already have the oom_victims counter, which can be effectively used
> for this purpose.

A global counter will not work properly, I am afraid: a) you should
consider the OOM domain and not block OOM on unrelated domains, and b)
you have no guarantee that the OOM victim will terminate reasonably.
That is why we have the MMF_OOM_SKIP check in oom_evaluate_task().

I think you should have something similar for your memcg victim selection:
if you see a memcg in the OOM hierarchy with OOM victims which are still
alive and do not have MMF_OOM_SKIP set, you should abort the scanning.
-- 
Michal Hocko
SUSE Labs
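[Editor's note] The abort rule Michal describes can be sketched as a small userspace model. This is a minimal sketch under stated assumptions: `struct model_task` and `abort_victim_scan` are illustrative names, not kernel APIs, and the two flags stand in for `tsk_is_oom_victim()` and `MMF_OOM_SKIP`.

```c
/* Userspace model (not kernel code) of the suggested abort rule:
 * victim selection is aborted while some previously chosen victim is
 * still alive and has not yet been marked MMF_OOM_SKIP by the reaper. */
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	bool is_oom_victim;	/* stands in for tsk_is_oom_victim() */
	bool mmf_oom_skip;	/* stands in for MMF_OOM_SKIP on the mm */
};

/* Returns true if a new victim selection must be aborted. */
bool abort_victim_scan(const struct model_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].is_oom_victim && !tasks[i].mmf_oom_skip)
			return true;	/* victim may still release memory */
	}
	return false;	/* all victims reaped or none exist: pick anew */
}
```

The MMF_OOM_SKIP escape hatch is what keeps this from blocking forever: once the reaper gives up on a victim, the scan is allowed to proceed again.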


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-22 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > > Roman Gushchin wrote:
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > > >  	if (oom_killer_disabled)
> > > >  		return false;
> > > >  
> > > > +	/*
> > > > +	 * If there are oom victims in flight, we don't need to select
> > > > +	 * a new victim.
> > > > +	 */
> > > > +	if (atomic_read(&oom_victims) > 0)
> > > > +		return true;
> > > > +
> > > >  	if (!is_memcg_oom(oc)) {
> > > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > > >  		if (freed > 0)
> > > 
> > > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout-based
> > > giveup is not permitted, but a multithreaded process might be selected as
> > > an OOM victim. Not setting TIF_MEMDIE on all threads sharing an OOM victim's
> > > mm increases the possibility of preventing some OOM victim thread from
> > > terminating (e.g. one of them cannot leave __alloc_pages_slowpath() with
> > > mmap_sem held for write due to waiting for the TIF_MEMDIE thread to call
> > > exit_oom_victim() when the TIF_MEMDIE thread is waiting for the thread
> > > with mmap_sem held for write).
> > 
> > I agree that CONFIG_MMU=n is a special case, and the proposed approach can't
> > be used directly. But can you please explain why you find the first chunk wrong?
> 
> Since you are checking oom_victims before checking task_will_free_mem(current),
> only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim
> without the OOM reaper can get stuck forever.

Oops, I misinterpreted. This is where a multithreaded OOM victim with or
without the OOM reaper can get stuck forever. Think about a process with
two threads which is selected by the OOM killer, where only one of these
two threads can get TIF_MEMDIE.

  Thread-2:       Calls down_write(&current->mm->mmap_sem).
  Thread-1:       Enters __alloc_pages_slowpath().
  Thread-2:       Enters __alloc_pages_slowpath().
  The OOM killer: Takes oom_lock.
  The OOM killer: Calls out_of_memory().
  The OOM killer: Selects Thread-1 as an OOM victim.
  Thread-1:       Gets SIGKILL.
  Thread-2:       Gets SIGKILL.
  Thread-1:       Gets TIF_MEMDIE.
  The OOM killer: Releases oom_lock.
  Thread-1:       Leaves __alloc_pages_slowpath() because Thread-1 has TIF_MEMDIE.
  The OOM reaper: Takes oom_lock.
  The OOM reaper: Will do nothing because down_read_trylock() fails.
  The OOM reaper: Releases oom_lock.
  The OOM reaper: Gives up and sets MMF_OOM_SKIP after one second.
  Thread-2:       Takes oom_lock.
  Thread-2:       Calls out_of_memory().
  Thread-2:       Will not check MMF_OOM_SKIP because Thread-1 still has TIF_MEMDIE. // <= gets stuck waiting for Thread-1.
  Thread-2:       Releases oom_lock.
  Thread-2:       Will not leave __alloc_pages_slowpath() because Thread-2 does not have TIF_MEMDIE.
  Thread-2:       Will not call up_write(&current->mm->mmap_sem).
  Thread-1:       Reaches do_exit().
  Thread-1:       Calls down_read(&current->mm->mmap_sem) in exit_mm() in do_exit(). // <= gets stuck waiting for Thread-2.
  Thread-1:       Will not call up_read(&current->mm->mmap_sem) in exit_mm() in do_exit().
  Thread-1:       Will not clear TIF_MEMDIE in exit_oom_victim() in exit_mm() in do_exit().
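[Editor's note] The circular dependency in this timeline can be checked mechanically with a tiny userspace wait-for-graph model. This is illustrative code, not kernel code; `has_wait_cycle` is a made-up helper.

```c
/* Userspace model (not kernel code) of the wait-for cycle above:
 * Thread-1 blocks in exit_mm() on mmap_sem held for write by Thread-2,
 * while Thread-2 loops in __alloc_pages_slowpath() because
 * out_of_memory() keeps deferring to the TIF_MEMDIE Thread-1. */
#include <assert.h>

/* waits_on[i] = j means thread i cannot progress until thread j does;
 * -1 means thread i is runnable. Returns 1 if following the chain from
 * `start` reaches an already-visited thread, i.e. a deadlock cycle. */
int has_wait_cycle(const int *waits_on, int n, int start)
{
	int seen[16] = {0};
	int cur = start;

	assert(n <= 16);	/* fixed bound keeps the sketch simple */
	while (cur >= 0 && !seen[cur]) {
		seen[cur] = 1;
		cur = waits_on[cur];
	}
	return cur >= 0;	/* landed on a visited thread: cycle */
}
```

With waits_on = {1, 0} (Thread-1 waiting on Thread-2's mmap_sem, Thread-2 waiting on Thread-1's TIF_MEMDIE) a cycle is reported; granting TIF_MEMDIE to every thread sharing the victim's mm removes the second edge and breaks the cycle.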


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-22 Thread Tetsuo Handa
Roman Gushchin wrote:
> On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> > Roman Gushchin wrote:
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > >  	if (oom_killer_disabled)
> > >  		return false;
> > >  
> > > +	/*
> > > +	 * If there are oom victims in flight, we don't need to select
> > > +	 * a new victim.
> > > +	 */
> > > +	if (atomic_read(&oom_victims) > 0)
> > > +		return true;
> > > +
> > >  	if (!is_memcg_oom(oc)) {
> > >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > >  		if (freed > 0)
> > 
> > Above in this patch and below in patch 5 are wrong.
> > 
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> > >  	 * that TIF_MEMDIE tasks should be ignored.
> > >  	 */
> > >  	__thaw_task(tsk);
> > > -	atomic_inc(&oom_victims);
> > > +
> > > +	/*
> > > +	 * If there are no oom victims in flight,
> > > +	 * give the task an access to the memory reserves.
> > > +	 */
> > > +	if (atomic_inc_return(&oom_victims) == 1)
> > > +		set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > >  }
> > >  
> > >  /**
> > 
> > The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout-based
> > giveup is not permitted, but a multithreaded process might be selected as
> > an OOM victim. Not setting TIF_MEMDIE on all threads sharing an OOM victim's
> > mm increases the possibility of preventing some OOM victim thread from
> > terminating (e.g. one of them cannot leave __alloc_pages_slowpath() with
> > mmap_sem held for write due to waiting for the TIF_MEMDIE thread to call
> > exit_oom_victim() when the TIF_MEMDIE thread is waiting for the thread
> > with mmap_sem held for write).
> 
> I agree that CONFIG_MMU=n is a special case, and the proposed approach can't
> be used directly. But can you please explain why you find the first chunk wrong?

Since you are checking oom_victims before checking task_will_free_mem(current),
only one thread can get TIF_MEMDIE. This is where a multithreaded OOM victim
without the OOM reaper can get stuck forever.

> Basically, it's exactly what we do now: we increment oom_victims for every oom
> victim, and every process decrements this counter during its exit path.
> If there is at least one existing victim, we will select it again, so it's just
> an optimization. Am I missing something? Why should we start a new victim
> selection if there are processes that will likely quit and release memory soon?

If you check oom_victims like
http://lkml.kernel.org/r/201706220053.v5m0rmou078...@www262.sakura.ne.jp does,
all threads in a multithreaded OOM victim will have a chance to get TIF_MEMDIE.



By the way, the chunk below from patch 5 is also wrong.
You are not permitted to set TIF_MEMDIE if oom_killer_disabled == true.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -556,8 +556,18 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES &&
+	       !__oom_reap_task_mm(tsk, mm)) {
+
+		/*
+		 * If the task has no access to the memory reserves,
+		 * grant it to help the task to exit.
+		 */
+		if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+			set_tsk_thread_flag(tsk, TIF_MEMDIE);
+
 		schedule_timeout_idle(HZ/10);
+	}
 
 	if (attempts <= MAX_OOM_REAP_RETRIES)
 		goto done;


Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-22 Thread Roman Gushchin
On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> >  	if (oom_killer_disabled)
> >  		return false;
> >  
> > +	/*
> > +	 * If there are oom victims in flight, we don't need to select
> > +	 * a new victim.
> > +	 */
> > +	if (atomic_read(&oom_victims) > 0)
> > +		return true;
> > +
> >  	if (!is_memcg_oom(oc)) {
> >  		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> >  		if (freed > 0)
> 
> Above in this patch and below in patch 5 are wrong.
> 
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> >  	 * that TIF_MEMDIE tasks should be ignored.
> >  	 */
> >  	__thaw_task(tsk);
> > -	atomic_inc(&oom_victims);
> > +
> > +	/*
> > +	 * If there are no oom victims in flight,
> > +	 * give the task an access to the memory reserves.
> > +	 */
> > +	if (atomic_inc_return(&oom_victims) == 1)
> > +		set_tsk_thread_flag(tsk, TIF_MEMDIE);
> >  }
> >  
> >  /**
> 
> The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout-based
> giveup is not permitted, but a multithreaded process might be selected as
> an OOM victim. Not setting TIF_MEMDIE on all threads sharing an OOM victim's
> mm increases the possibility of preventing some OOM victim thread from
> terminating (e.g. one of them cannot leave __alloc_pages_slowpath() with
> mmap_sem held for write due to waiting for the TIF_MEMDIE thread to call
> exit_oom_victim() when the TIF_MEMDIE thread is waiting for the thread
> with mmap_sem held for write).

I agree that CONFIG_MMU=n is a special case, and the proposed approach can't
be used directly. But can you please explain why you find the first chunk wrong?
Basically, it's exactly what we do now: we increment oom_victims for every oom
victim, and every process decrements this counter during its exit path.
If there is at least one existing victim, we will select it again, so it's just
an optimization. Am I missing something? Why should we start a new victim
selection if there are processes that will likely quit and release memory soon?

Thanks!


[v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

2017-06-21 Thread Roman Gushchin
The OOM killer should avoid unnecessary kills. To prevent them, during
the task list traversal we check for tasks which were previously
selected as OOM victims. If there is such a task, a new victim
is not selected.

This approach is sub-optimal (we're doing a costly iteration over the task
list every time) and will not work for the cgroup-aware OOM killer.

We already have the oom_victims counter, which can be effectively used
for this purpose.

If there are victims in flight, we don't do anything; once the counter
falls to 0, there are no more OOM victims left, so it's a good time to
start looking for a new victim.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Tetsuo Handa 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..e3aaf5c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
 	if (oom_killer_disabled)
 		return false;
 
+	/*
+	 * If there are oom victims in flight, we don't need to select
+	 * a new victim.
+	 */
+	if (atomic_read(&oom_victims) > 0)
+		return true;
+
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
-- 
2.7.4
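[Editor's note] The counter gating in the hunk above can be sketched as a minimal userspace model. Names prefixed with `model_` are illustrative stand-ins, not the kernel's APIs: the real counter is an atomic incremented in mark_oom_victim() and decremented in exit_oom_victim().

```c
/* Userspace sketch of the patch's logic: out_of_memory() declines to
 * select a new victim while any victim is still in flight; selection
 * resumes only once the counter falls back to zero. */
static int oom_victims_model;

/* stands in for mark_oom_victim() raising the atomic counter */
void model_mark_oom_victim(void)
{
	oom_victims_model++;
}

/* stands in for exit_oom_victim() on the victim's exit path */
void model_exit_oom_victim(void)
{
	oom_victims_model--;
}

/* Mirrors the patched check: returns 1 if a new victim may be chosen. */
int model_may_select_new_victim(void)
{
	return oom_victims_model == 0;
}
```

This also makes Tetsuo's and Michal's objections concrete: the model has no notion of OOM domains, and nothing guarantees the counter ever reaches zero if a victim cannot exit.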
