Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-03 Thread Tetsuo Handa
David Rientjes wrote:
> On Tue, 2 Feb 2016, Michal Hocko wrote:
> > > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > > disrupting the global runqueue by running oom_reaper at a high priority.  
> > > The disruption itself is not only in first wakeup but also in how long the
> > > reaper can run and when it is rescheduled: for a lot of memory this is 
> > > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > > can't possibly be expected to be immediate.  This high priority should be 
> > > removed so memcg oom conditions are isolated and don't affect other loads.
> > 
> > If this is a concern then I would be tempted to simply disable oom
> > reaper for memcg oom altogether. For me it is much more important that
> > the reaper, even though a best effort, is guaranteed to schedule if
> > something goes terribly wrong on the machine.
> > 
> 
> I don't believe the higher priority guarantees it is able to schedule any 
> more than it was guaranteed to schedule before.  It will run, but it won't 
> preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
> only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
> nodemasks.  I think it would be disappointing to leave those out.  I think 
> the higher priority should simply be removed in terms of fairness.
> 
> Other than these issues, I don't see any reason why a refreshed series 
> wouldn't be immediately acked.  Thanks very much for continuing to work on 
> this!
> 

Excuse me, but I came to think that we should try to wake up the OOM reaper at

	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (!is_sysrq_oom(oc))
			return OOM_SCAN_ABORT;
	}

in oom_scan_process_thread() rather than in oom_kill_process() or in
mark_oom_victim(). Waking up the OOM reaper there makes it try to reap
task->mm and eventually give up, which in turn naturally allows the
OOM killer to choose the next OOM victim. The key point is PATCH 2/5 shown
below. What do you think?
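
For illustration only, here is a minimal userspace C model of that flow, under
the assumptions described above; try_reap(), scan_memdie_task() and the retry
budget are invented names for this sketch and are not the kernel code. The scan
keeps aborting while a reap attempt is still pending, then eventually gives up
so the OOM killer may select the next victim:

#include <stdbool.h>
#include <stdio.h>

enum scan_result { SCAN_ABORT, SCAN_CONTINUE };

/* Stand-in for a real reap attempt; succeeds once the retry budget runs out. */
static bool try_reap(int *attempts_left)
{
	if (*attempts_left > 0) {
		(*attempts_left)--;
		return false;	/* could not finish reaping yet; retry later */
	}
	return true;		/* reaped, or nothing left worth retrying */
}

static enum scan_result scan_memdie_task(int *attempts_left)
{
	/* Keep aborting the victim scan while a reap attempt is pending... */
	if (!try_reap(attempts_left))
		return SCAN_ABORT;
	/* ...but eventually give up so a new victim can be chosen. */
	return SCAN_CONTINUE;
}

int main(void)
{
	int attempts_left = 3;

	while (scan_memdie_task(&attempts_left) == SCAN_ABORT)
		printf("victim still holds TIF_MEMDIE, retrying reap\n");
	printf("gave up reaping; the OOM killer may select the next victim\n");
	return 0;
}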

PATCH 1/5 is (I think) a bug fix.
PATCH 2/5 is for waking up the OOM reaper from victim selection loop.
PATCH 3/5 is for helping the OOM killer to choose next OOM victim.
PATCH 4/5 is for handling corner cases.
PATCH 5/5 is for changing the OOM reaper to use default priority.

 include/linux/oom.h |    3
 mm/oom_kill.c       |  173
 2 files changed, 136 insertions(+), 40 deletions(-)


From e1c0a78fbfd0a76f367efac269cbcf22c7df9292 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Wed, 3 Feb 2016 14:18:19 +0900
Subject: [PATCH 1/5] mm,oom: Fix incorrect oom_task_origin check.

Currently, the OOM killer unconditionally selects p if oom_task_origin(p)
is true, but p should not be OOM-killed if p is marked as OOM-unkillable.

This patch does not fix the race condition where p is selected because it
happened to be between set_current_oom_origin() and the start of the
operations that might trigger an OOM event, while the OOM event was actually
triggered for some reason other than p's operations.

Signed-off-by: Tetsuo Handa 
---
 include/linux/oom.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..59481e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,7 +67,8 @@ static inline void clear_current_oom_origin(void)
 
 static inline bool oom_task_origin(const struct task_struct *p)
 {
-   return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
+   return (p->signal->oom_flags & OOM_FLAG_ORIGIN) &&
+   p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN;
 }
 
 extern void mark_oom_victim(struct task_struct *tsk);
-- 
1.8.3.1

From 76cf60d33e4e1daa475e4c1e39087415a309c6e9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa 
Date: Wed, 3 Feb 2016 14:20:07 +0900
Subject: [PATCH 2/5] mm,oom: Change timing of waking up the OOM reaper

Currently, the OOM reaper kernel thread is woken up when we set TIF_MEMDIE
on a task. But it is not easy to build a reliable OOM-reap queuing chain.

Since the OOM livelock problem occurs when we find TIF_MEMDIE on a task
which cannot terminate, waking up the OOM reaper when we find TIF_MEMDIE
on a task simplifies handling of the chain. Also, we don't need to wake
up the OOM reaper if the victim can terminate smoothly. Therefore, this
patch replaces wake_oom_reaper(), called from oom_kill_process(), with
try_oom_reap(), called from oom_scan_process_thread().

Signed-off-by: Tetsuo Handa 
---
 mm/oom_kill.c | 99 +++
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b51bcce..07c6389 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,8 @@ static enum 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-02 Thread David Rientjes
On Tue, 2 Feb 2016, Tetsuo Handa wrote:

> Maybe we all agree with introducing OOM reaper without queuing, but I do
> want to see a guarantee for scheduling for next OOM-kill operation before
> trying to build a reliable queuing chain.
> 

The race can be fixed in two ways which I've already enumerated, but the 
scheduling issue is tangential: the oom_reaper kthread is going to run; 
increasing its priority will only interfere with other innocent processes 
that are not attached to the oom memcg hierarchy, have disjoint cpuset 
mems, or are happily allocating from mempolicy nodes with free memory.


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-02 Thread David Rientjes
On Tue, 2 Feb 2016, Michal Hocko wrote:

> > Not exclude them, but I would have expected untrack_pfn().
> 
> My understanding is that vm_normal_page will do the right thing for
> those mappings - especially for CoW VM_PFNMAP which are normal pages
> > AFAIU. Wrt. untrack_pfn I was relying on the victim eventually entering
> > exit_mmap and doing the remaining housekeeping. Maybe I am missing
> something but untrack_pfn shouldn't lead to releasing a considerable
> amount of memory. So is this really necessary or we can wait for
> exit_mmap?
> 

I think if you move the code to mm/memory.c that you may find a greater 
opportunity to share code with the implementations there and this will 
take care of itself :)  I'm concerned about this also from a 
maintainability standpoint where a future patch might modify one 
implementation while forgetting about the other.  I think there's a great 
opportunity here for a really clean and shiny interface that doesn't 
introduce any more complexity.

> > The problem is that this is racy and quite easy to trigger: imagine if 
> > __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> > victim has been freed, and then another system-wide oom condition occurs 
> > before the oom reaper's mm_to_reap has been set to NULL.
> 
> Yes I realize this is potentially racy. I just didn't consider the race
> important enough to justify task queuing in the first submission. Tetsuo
> was pushing for this already and I tried to push back for simplicity in
> the first submission. But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.
> 

Ok, thanks!  It should probably be dropped from -mm in the interim until 
it has some acked-by's, but I think those will come pretty quickly once 
it's refreshed if all of this is handled.

> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

Hmm, I wasn't referring to oom context: it would be possible without 
queueing with an mm_to_reap_lock (or cmpxchg) in the oom reaper and when 
the final mmput() is done.  Set it when the mm is ready for reaping, clear 
it when the mm is being destroyed, and test it before calling the oom 
killer.  I think we'd want to defer the oom killer until potential reaping 
could be done anyway and I don't anticipate an issue where oom_reaper 
fails to schedule.
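
For what it's worth, a rough userspace model (C11 atomics) of that
set/clear/test protocol; mm_under_reap and the helper names are made up for
this sketch, and the real synchronization would of course live in the kernel:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic(void *) mm_under_reap;	/* NULL means nothing is being reaped */

/* Set when an mm becomes ready for reaping; fails if one is already pending. */
static int mark_ready_for_reap(void *mm)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&mm_under_reap, &expected, mm);
}

/* Clear when the mm is being destroyed (the final mmput() mentioned above). */
static void mark_destroyed(void *mm)
{
	void *expected = mm;

	atomic_compare_exchange_strong(&mm_under_reap, &expected, NULL);
}

/* Test before calling the OOM killer: defer while reaping is still in flight. */
static int reap_in_progress(void)
{
	return atomic_load(&mm_under_reap) != NULL;
}

int main(void)
{
	int token;

	mark_ready_for_reap(&token);
	printf("defer oom kill? %s\n", reap_in_progress() ? "yes" : "no");
	mark_destroyed(&token);
	printf("defer oom kill? %s\n", reap_in_progress() ? "yes" : "no");
	return 0;
}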

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.
> 

I don't believe the higher priority guarantees it is able to schedule any 
more than it was guaranteed to schedule before.  It will run, but it won't 
preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
nodemasks.  I think it would be disappointing to leave those out.  I think 
the higher priority should simply be removed in terms of fairness.

Other than these issues, I don't see any reason why a refreshed series 
wouldn't be immediately acked.  Thanks very much for continuing to work on 
this!


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-02 Thread Tetsuo Handa
Michal Hocko wrote:
> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

OK. But it is not easy to build a reliable OOM-reap queuing chain. I think
a dedicated kernel thread which performs both the OOM-kill operation and the
OOM-reap operation will be needed. That would also handle the "sleeping for
too long with oom_lock held after sending SIGKILL" problem.
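
Purely as an illustrative userspace sketch of that idea (the names and the
single pending_victim slot are invented for this example, not actual kernel
code), one dedicated worker serializes the kill and the reap so nobody has to
sleep with an oom_lock-like mutex held after sending SIGKILL:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;
static int pending_victim;	/* 0 means no work queued */

static void *oom_worker(void *arg)
{
	int victim;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!pending_victim)
		pthread_cond_wait(&kick, &lock);
	victim = pending_victim;
	pending_victim = 0;
	pthread_mutex_unlock(&lock);

	/* Both steps run here, outside any lock the allocators would need. */
	printf("kill victim %d (send SIGKILL)\n", victim);
	printf("reap victim %d (zap its address space)\n", victim);
	return NULL;
}

int main(void)
{
	pthread_t th;

	pthread_create(&th, NULL, oom_worker, NULL);

	pthread_mutex_lock(&lock);
	pending_victim = 42;
	pthread_cond_signal(&kick);
	pthread_mutex_unlock(&lock);

	pthread_join(th, NULL);
	return 0;
}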

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.

I think that if something goes terribly wrong on the machine, a guarantee of
scheduling the reaper will not help unless we build a reliable queuing chain.
Building a reliable queuing chain will break some of the assumptions provided
by the current behavior. For me, a guarantee that the next OOM-kill operation
can be scheduled (with some or all of the memory reserves opened globally)
before we build a reliable queuing chain is much more important.

>   But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.

Maybe we all agree with introducing OOM reaper without queuing, but I do
want to see a guarantee for scheduling for next OOM-kill operation before
trying to build a reliable queuing chain.


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-02 Thread Michal Hocko
On Mon 01-02-16 19:02:06, David Rientjes wrote:
> On Thu, 28 Jan 2016, Michal Hocko wrote:
> 
> > [...]
> > > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > > +{
> > > > +   struct mmu_gather tlb;
> > > > +   struct vm_area_struct *vma;
> > > > +   struct zap_details details = {.check_swap_entries = true,
> > > > + .ignore_dirty = true};
> > > > +   bool ret = true;
> > > > +
> > > > +   /* We might have raced with exit path */
> > > > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > > > +   return true;
> > > > +
> > > > +   if (!down_read_trylock(&mm->mmap_sem)) {
> > > > +   ret = false;
> > > > +   goto out;
> > > > +   }
> > > > +
> > > > +   tlb_gather_mmu(&tlb, mm, 0, -1);
> > > > +   for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > +   if (is_vm_hugetlb_page(vma))
> > > > +   continue;
> > > > +
> > > > +   /*
> > > > +* mlocked VMAs require explicit munlocking before unmap.
> > > > +* Let's keep it simple here and skip such VMAs.
> > > > +*/
> > > > +   if (vma->vm_flags & VM_LOCKED)
> > > > +   continue;
> > > 
> > > Shouldn't there be VM_PFNMAP handling here?
> > 
> > What would be the reason to exclude them?
> > 
> 
> Not exclude them, but I would have expected untrack_pfn().

My understanding is that vm_normal_page will do the right thing for
those mappings - especially for CoW VM_PFNMAP which are normal pages
AFAIU. Wrt. untrack_pfn I was relying on the victim eventually entering
exit_mmap and doing the remaining housekeeping. Maybe I am missing
something but untrack_pfn shouldn't lead to releasing a considerable
amount of memory. So is this really necessary or we can wait for
exit_mmap?

> > > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't
> > > used here for simplicity?
> > 
> > I didn't use zap_page_range because I wanted to have full control over
> > what and how gets torn down. E.g. it is much easier to skip over
> > hugetlb pages than relying on i_mmap_lock_write which might be blocked
> > and the whole oom_reaper will get stuck.
> > 
> 
> Let me be clear that I think the implementation is fine, minus the missing 
> handling for VM_PFNMAP.  However, I think this implementation is better 
> placed into mm/memory.c to do the iteration, selection criteria, and then 
> unmap_page_range().  I don't think we should be exposing 
> unmap_page_range() globally, but rather add a new function to do the 
> iteration in mm/memory.c with the others.

I do not have any objections to moving the code but I felt this is a
single purpose thingy which doesn't need wider exposure. The exclusion
criteria are tightly coupled to what the oom reaper is allowed to do. In
other words such a function wouldn't be reusable for say MADV_DONTNEED
because it has different criteria. Having all the selection criteria
close to __oom_reap_task on the other hand makes it easier to evaluate
their relevance. So I am not really convinced. I can move it if you feel
strongly about that, though.

> > [...]
> > > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > > +{
> > > > +   struct mm_struct *old_mm;
> > > > +
> > > > +   if (!oom_reaper_th)
> > > > +   return;
> > > > +
> > > > +   /*
> > > > +* Pin the given mm. Use mm_count instead of mm_users because
> > > > +* we do not want to delay the address space tear down.
> > > > +*/
> > > > +   atomic_inc(&mm->mm_count);
> > > > +
> > > > +   /*
> > > > +* Make sure that only a single mm is ever queued for the reaper
> > > > +* because multiple are not necessary and the operation might be
> > > > +* disruptive so better reduce it to the bare minimum.
> > > > +*/
> > > > +   old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > > +   if (!old_mm)
> > > > +   wake_up(&oom_reaper_wait);
> > > > +   else
> > > > +   mmdrop(mm);
> > > 
> > > This behavior is probably the only really significant concern I have about
> > > the patch: we just drop the mm and don't try any reaping if there is
> > > already reaping in progress.
> > 
> > This is based on the assumption that OOM killer will not select another
> > task to kill until the previous one drops its TIF_MEMDIE. Should this
> > change in the future we will have to come up with a queuing mechanism. I
> > didn't want to do it right away to make the change as simple as
> > possible.
> > 
> 
> The problem is that this is racy and quite easy to trigger: imagine if 
> __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> victim has been freed, and then another system-wide oom condition occurs 
> before the oom reaper's mm_to_reap has been set to NULL.

Yes I realize this is potentially racy. I just didn't 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-02-01 Thread David Rientjes
On Thu, 28 Jan 2016, Michal Hocko wrote:

> [...]
> > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > + struct mmu_gather tlb;
> > > + struct vm_area_struct *vma;
> > > + struct zap_details details = {.check_swap_entries = true,
> > > +   .ignore_dirty = true};
> > > + bool ret = true;
> > > +
> > > + /* We might have raced with exit path */
> > > + if (!atomic_inc_not_zero(&mm->mm_users))
> > > + return true;
> > > +
> > > + if (!down_read_trylock(&mm->mmap_sem)) {
> > > + ret = false;
> > > + goto out;
> > > + }
> > > +
> > > + tlb_gather_mmu(&tlb, mm, 0, -1);
> > > + for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > + if (is_vm_hugetlb_page(vma))
> > > + continue;
> > > +
> > > + /*
> > > +  * mlocked VMAs require explicit munlocking before unmap.
> > > +  * Let's keep it simple here and skip such VMAs.
> > > +  */
> > > + if (vma->vm_flags & VM_LOCKED)
> > > + continue;
> > 
> > Shouldn't there be VM_PFNMAP handling here?
> 
> What would be the reason to exclude them?
> 

Not exclude them, but I would have expected untrack_pfn().

> > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > used here for simplicity?
> 
> I didn't use zap_page_range because I wanted to have full control over
> what and how gets torn down. E.g. it is much easier to skip over
> hugetlb pages than relying on i_mmap_lock_write which might be blocked
> and the whole oom_reaper will get stuck.
> 

Let me be clear that I think the implementation is fine, minus the missing 
handling for VM_PFNMAP.  However, I think this implementation is better 
placed into mm/memory.c to do the iteration, selection criteria, and then 
unmap_page_range().  I don't think we should be exposing 
unmap_page_range() globally, but rather add a new function to do the 
iteration in mm/memory.c with the others.

> [...]
> > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > +{
> > > + struct mm_struct *old_mm;
> > > +
> > > + if (!oom_reaper_th)
> > > + return;
> > > +
> > > + /*
> > > +  * Pin the given mm. Use mm_count instead of mm_users because
> > > +  * we do not want to delay the address space tear down.
> > > +  */
> > > + atomic_inc(&mm->mm_count);
> > > +
> > > + /*
> > > +  * Make sure that only a single mm is ever queued for the reaper
> > > +  * because multiple are not necessary and the operation might be
> > > +  * disruptive so better reduce it to the bare minimum.
> > > +  */
> > > + old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > + if (!old_mm)
> > > + wake_up(&oom_reaper_wait);
> > > + else
> > > + mmdrop(mm);
> > 
> > This behavior is probably the only really significant concern I have about 
> > the patch: we just drop the mm and don't try any reaping if there is 
> > already reaping in progress.
> 
> This is based on the assumption that OOM killer will not select another
> task to kill until the previous one drops its TIF_MEMDIE. Should this
> change in the future we will have to come up with a queuing mechanism. I
> didn't want to do it right away to make the change as simple as
> possible.
> 

The problem is that this is racy and quite easy to trigger: imagine if 
__oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
victim has been freed, and then another system-wide oom condition occurs 
before the oom reaper's mm_to_reap has been set to NULL.  No 
synchronization prevents that from happening (not sure what the reference 
to TIF_MEMDIE is about).

In this case, the oom reaper has ignored the next victim and doesn't do 
anything; the simple race has prevented it from zapping memory and does 
not reduce the livelock probability.

This can be solved either by queueing mm's to reap or involving the oom 
reaper into the oom killer synchronization itself.
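
As a hedged illustration of the first option, a small userspace sketch of
queueing victims for a single reaper thread; the ring buffer, sizes and names
are all invented for this example and the kernel-side details would differ:

#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE 8

static int queue[QUEUE_SIZE];
static int head, tail, count, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Called from the oom-kill path: enqueue and return without blocking long. */
static int queue_victim(int victim)
{
	int queued = 0;

	pthread_mutex_lock(&lock);
	if (count < QUEUE_SIZE) {
		queue[tail] = victim;
		tail = (tail + 1) % QUEUE_SIZE;
		count++;
		queued = 1;
		pthread_cond_signal(&nonempty);
	}
	pthread_mutex_unlock(&lock);
	return queued;
}

/* The reaper thread drains queued victims one by one. */
static void *reaper(void *arg)
{
	(void)arg;
	for (;;) {
		int victim;

		pthread_mutex_lock(&lock);
		while (count == 0 && !done)
			pthread_cond_wait(&nonempty, &lock);
		if (count == 0 && done) {
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		victim = queue[head];
		head = (head + 1) % QUEUE_SIZE;
		count--;
		pthread_mutex_unlock(&lock);

		printf("reaping victim %d\n", victim);	/* stand-in for the real work */
	}
}

int main(void)
{
	pthread_t th;

	pthread_create(&th, NULL, reaper, NULL);
	queue_victim(1);
	queue_victim(2);

	pthread_mutex_lock(&lock);
	done = 1;
	pthread_cond_signal(&nonempty);
	pthread_mutex_unlock(&lock);

	pthread_join(th, NULL);
	return 0;
}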

> > > +static int __init oom_init(void)
> > > +{
> > > + oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > + if (IS_ERR(oom_reaper_th)) {
> > > + pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > + PTR_ERR(oom_reaper_th));
> > > + oom_reaper_th = NULL;
> > > + } else {
> > > + struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > +
> > > + /*
> > > +  * Make sure our oom reaper thread will get scheduled when
> > > +  * ASAP and that it won't get preempted by malicious userspace.
> > > +  */
> > > + sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > 
> > Eeek, do you really show this is necessary?  I would imagine that we would 
> > want to limit high priority processes system-wide and that we wouldn't 
> > want to be interferred with by memcg oom conditions that trigger the oom 
> > reaper, for example.
> 
> The idea was that we do not want to allow a high priority userspace to
> preempt this important operation. I do 


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-28 Thread Michal Hocko
On Wed 27-01-16 17:28:10, David Rientjes wrote:
> On Wed, 6 Jan 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko 
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> 
> Suggested-bys?

Sure, why not.
 
> > The OOM killer currently allows to kill only a single task in a good
> > hope that the task will terminate in a reasonable time and frees up its
> > memory.  Such a task (oom victim) will get an access to memory reserves
> > via mark_oom_victim to allow a forward progress should there be a need
> > for additional memory during exit path.
> > 
> > It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> > construct workloads which break the core assumption mentioned above and
> > the OOM victim might take unbounded amount of time to exit because it
> > might be blocked in the uninterruptible state waiting for on an event
> > (e.g. lock) which is blocked by another task looping in the page
> > allocator.
> > 
> 
> s/for on/for/

fixed
 
> I think it would be good to note in either of the two paragraphs above 
> that each victim is per-memcg hierarchy or system-wide and the oom reaper 
> is used for memcg oom conditions as well.  Otherwise, there's no mention 
> of the memcg usecase.

I didn't mention the memcg usecase because it doesn't suffer from the
deadlock issue; there the OOM killer is invoked from a lockless context. I
think mentioning it would just make the wording more confusing.

[...]
> > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +   struct mmu_gather tlb;
> > +   struct vm_area_struct *vma;
> > +   struct zap_details details = {.check_swap_entries = true,
> > + .ignore_dirty = true};
> > +   bool ret = true;
> > +
> > +   /* We might have raced with exit path */
> > +   if (!atomic_inc_not_zero(&mm->mm_users))
> > +   return true;
> > +
> > +   if (!down_read_trylock(&mm->mmap_sem)) {
> > +   ret = false;
> > +   goto out;
> > +   }
> > +
> > +   tlb_gather_mmu(&tlb, mm, 0, -1);
> > +   for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +   if (is_vm_hugetlb_page(vma))
> > +   continue;
> > +
> > +   /*
> > +* mlocked VMAs require explicit munlocking before unmap.
> > +* Let's keep it simple here and skip such VMAs.
> > +*/
> > +   if (vma->vm_flags & VM_LOCKED)
> > +   continue;
> 
> Shouldn't there be VM_PFNMAP handling here?

What would be the reason to exclude them?

> I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> used here for simplicity?

I didn't use zap_page_range because I wanted to have full control over
what and how gets torn down. E.g. it is much easier to skip over
hugetlb pages than relying on i_mmap_lock_write which might be blocked
and the whole oom_reaper will get stuck.

[...]
> > +static void wake_oom_reaper(struct mm_struct *mm)
> > +{
> > +   struct mm_struct *old_mm;
> > +
> > +   if (!oom_reaper_th)
> > +   return;
> > +
> > +   /*
> > +* Pin the given mm. Use mm_count instead of mm_users because
> > +* we do not want to delay the address space tear down.
> > +*/
> > +   atomic_inc(&mm->mm_count);
> > +
> > +   /*
> > +* Make sure that only a single mm is ever queued for the reaper
> > +* because multiple are not necessary and the operation might be
> > +* disruptive so better reduce it to the bare minimum.
> > +*/
> > +   old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > +   if (!old_mm)
> > +   wake_up(&oom_reaper_wait);
> > +   else
> > +   mmdrop(mm);
> 
> This behavior is probably the only really significant concern I have about 
> the patch: we just drop the mm and don't try any reaping if there is 
> already reaping in progress.

This is based on the assumption that OOM killer will not select another
task to kill until the previous one drops its TIF_MEMDIE. Should this
change in the future we will have to come up with a queuing mechanism. I
didn't want to do it right away to make the change as simple as
possible.

> We don't always have control over the amount of memory that can be reaped 
> from the victim, either because of oom kill prioritization through 
> /proc/pid/oom_score_adj or because the memory of the victim is not 
> eligible.
> 
> I'm imagining a scenario where the oom reaper has raced with a follow-up 
> oom kill before mm_to_reap has been set to NULL so there's no subsequent 
> reaping.  It's also possible that oom reaping of the first victim actually 
> freed little memory.
> 
> Would it really be difficult to queue mm's to reap from?  If memory has 
> already been freed before the reaper can get to it, the 
> find_lock_task_mm() should just fail and we're done.  I'm not sure why 
> this is being limited to a single mm system-wide.

It is not that complicated but I believe we can implement it on top once
we see this is really needed. 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-27 Thread David Rientjes
On Wed, 6 Jan 2016, Michal Hocko wrote:

> From: Michal Hocko 
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 

Suggested-bys?

> The OOM killer currently allows to kill only a single task in a good
> hope that the task will terminate in a reasonable time and frees up its
> memory.  Such a task (oom victim) will get an access to memory reserves
> via mark_oom_victim to allow a forward progress should there be a need
> for additional memory during exit path.
> 
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting for on an event
> (e.g. lock) which is blocked by another task looping in the page
> allocator.
> 

s/for on/for/

I think it would be good to note in either of the two paragraphs above 
that each victim is per-memcg hierarchy or system-wide and the oom reaper 
is used for memcg oom conditions as well.  Otherwise, there's no mention 
of the memcg usecase.

> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim under an assumption that such a memory won't
> be needed when its owner is killed and kicked from the userspace anyway.
> There is one notable exception to this, though, if the OOM victim was
> in the process of coredumping the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debugability of a particular application.
> 
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd which sounds
> like another good fit is not appropriate as well because it might get
> blocked on locks during reclaim as well.
> 

Very good points.  And I think this makes the case clear that oom_reaper 
is really a best-effort solution.

> oom_reaper has to take mmap_sem on the target task for reading so the
> solution is not 100% because the semaphore might be held or blocked for
> write but the probability is reduced considerably wrt. basically any
> lock blocking forward progress as described above. In order to prevent
> from blocking on the lock without any forward progress we are using only
> a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocations
> requests done with the lock held to absolute minimum to reduce the risk
> even further.
> 
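
(For illustration only: a tiny userspace model of the trylock-and-retry policy
described in the quoted paragraph, with a pthread rwlock standing in for
mmap_sem; the retry count and the short sleep come from the description above,
everything else is invented.)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_ATTEMPTS 10

static pthread_rwlock_t fake_mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

static int reap_with_retries(void)
{
	for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
		if (pthread_rwlock_tryrdlock(&fake_mmap_sem) == 0) {
			/* ... zap the victim's address space here ... */
			pthread_rwlock_unlock(&fake_mmap_sem);
			return 1;	/* reaped */
		}
		usleep(100 * 1000);	/* short sleep, then retry */
	}
	return 0;			/* a writer held it too long; give up */
}

int main(void)
{
	printf("reaped: %d\n", reap_with_retries());
	return 0;
}
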
> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
> and oom_reaper clear this atomically once it is done with the work. This
> means that only a single mm_struct can be reaped at the time. As the
> operation is potentially disruptive we are trying to limit it to the
> ncessary minimum and the reaper blocks any updates while it operates on
> an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
> race is detected by atomic_inc_not_zero(mm_users).
> 
> Changes since v3
> - many style/compile fixups by Andrew
> - unmap_mapping_range_tree needs full initialization of zap_details
>   to prevent from missing unmaps and follow up BUG_ON during truncate
>   resp. misaccounting - Kirill/Andrew
> - exclude mlocked pages because they need an explicit munlock by Kirill
> - use subsys_initcall instead of module_init - Paul Gortmaker
> Changes since v2
> - fix mm_count refernce leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> 
> Acked-by: Mel Gorman 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/mm.h |   2 +
>  mm/internal.h  |   5 ++
>  mm/memory.c|  17 +++---
>  mm/oom_kill.c  | 157 
> +++--
>  4 files changed, 170 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25cdec395f2c..d1ce03569942 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1061,6 +1061,8 @@ struct zap_details 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-07 Thread Michal Hocko
On Thu 07-01-16 20:23:04, Tetsuo Handa wrote:
[...]
> According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
> to kill init sharing same memory"), below patch is needed for avoid
> killing init process with SIGSEGV.
> 
> --
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9548dce..9832f3f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct 
> task_struct *p,
>   continue;
>   if (same_thread_group(p, victim))
>   continue;
> - if (is_global_init(p))
> - continue;
> - if (unlikely(p->flags & PF_KTHREAD) ||
> + if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
>   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>   /*
>* We cannot use oom_reaper for the mm shared by this
[...]
> [3.132836] [ pid ]   uid  tgid total_vm  rss nr_ptes nr_pmds swapents 
> oom_score_adj name
> [3.137232] [   98] 098   279607   244400 489   50 
> 0 init
> [3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice 
> child
> [3.145346] Killed process 98 (init) total-vm:1118428kB, 
> anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
> [3.416105] init[1]: segfault at 0 ip   (null) sp 7ffd484cf5f0 
> error 14 in init[40+1000]
> [3.439074] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x000b
> [3.439074]
> [3.450193] Kernel Offset: disabled
> [3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x000b
> [3.456259]

Ouch. You are right. The reaper will tear down the shared mm and the
global init will blow up. Very well spotted! The system would probably blow
up later anyway, I would guess, because killing the victim wouldn't release
much memory while it remains pinned by the global init. So a panic sounds
inevitable. The scenario is really insane but what you are proposing is
correct.

Updated patch below.
--- 
From 71c6f4135fe4a8d448d63d4904ba514787dea008 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 23 Nov 2015 18:20:57 +0100
Subject: [PATCH] mm, oom: introduce oom reaper

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for on an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocations
requests done with the lock held to absolute minimum to reduce the risk
even further.

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clear this atomically once it is done with the work. This
means that only a single mm_struct can be reaped at the time. As the
operation is 

[PATCH 1/2] mm, oom: introduce oom reaper

2016-01-06 Thread Michal Hocko
From: Michal Hocko 

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for on an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.
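
As a hedged illustration of the kthread choice above, a minimal sketch of
how such a reaper thread can be wired up, based on the changelog
(kthread_run plus wait_event_freezable) rather than the patch itself;
oom_reap_vmas() is the reaping helper sketched further below:

#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/init.h>
#include <linux/err.h>
#include <linux/printk.h>

static struct task_struct *oom_reaper_th;
static struct mm_struct *mm_to_reap;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

static void oom_reap_vmas(struct mm_struct *mm);	/* sketched below */

static int oom_reaper(void *unused)
{
	while (true) {
		struct mm_struct *mm;

		/* sleep until an mm is handed over; stay freezer friendly */
		wait_event_freezable(oom_reaper_wait,
				     (mm = READ_ONCE(mm_to_reap)) != NULL);
		oom_reap_vmas(mm);
		WRITE_ONCE(mm_to_reap, NULL);	/* open the slot again */
		mmdrop(mm);	/* drop the pin taken when mm was queued */
	}
	return 0;
}

static int __init oom_init(void)
{
	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
	if (IS_ERR(oom_reaper_th)) {
		pr_err("Unable to start OOM reaper: %ld\n",
		       PTR_ERR(oom_reaper_th));
		oom_reaper_th = NULL;	/* keep going without the reaper */
	}
	return 0;
}
subsys_initcall(oom_init)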

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocations
requests done with the lock held to absolute minimum to reduce the risk
even further.
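
A sketch of that trylock-and-retry policy; the helper names follow the
changelog wording and the actual VMA walk is elided:

#include <linux/mm.h>
#include <linux/delay.h>

/* Try to reap mm once; fail fast if somebody holds mmap_sem for write. */
static bool __oom_reap_vmas(struct mm_struct *mm)
{
	if (!down_read_trylock(&mm->mmap_sem))
		return false;	/* writer active, let the caller retry */

	/*
	 * ... walk the VMAs and unmap the anonymous/swapped out memory here;
	 * a parallel exit_mmap() is detected up front via
	 * atomic_inc_not_zero(&mm->mm_users) before touching anything ...
	 */

	up_read(&mm->mmap_sem);
	return true;
}

static void oom_reap_vmas(struct mm_struct *mm)
{
	int attempts = 0;

	/* up to 10 attempts with a short sleep in between, then give up */
	while (attempts++ < 10 && !__oom_reap_vmas(mm))
		msleep(100);
}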

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clear this atomically once it is done with the work. This
means that only a single mm_struct can be reaped at the time. As the
operation is potentially disruptive we are trying to limit it to the
ncessary minimum and the reaper blocks any updates while it operates on
an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
race is detected by atomic_inc_not_zero(mm_users).
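
The NULL->mm handover itself is tiny. Roughly, reusing the mm_to_reap and
oom_reaper_wait declarations from the sketch above (simplified; the real
wake_oom_reaper() also bails out when the reaper thread failed to start):

static void wake_oom_reaper(struct mm_struct *mm)
{
	/* only a single mm can be queued; further victims are skipped */
	if (cmpxchg(&mm_to_reap, NULL, mm))
		return;

	/* pin the mm_struct so a parallel exit_mmap() cannot free it */
	atomic_inc(&mm->mm_count);
	wake_up(&oom_reaper_wait);
}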

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent from missing unmaps and follow up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
Changes since v2
- fix mm_count refernce leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman 
Signed-off-by: Michal Hocko 
---
 include/linux/mm.h |   2 +
 mm/internal.h  |   5 ++
 mm/memory.c|  17 +++---
 mm/oom_kill.c  | 157 +++--
 4 files changed, 170 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
struct address_space *check_mapping;/* Check page->mapping if set */
pgoff_t first_index;/* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+   bool ignore_dirty;  /* Ignore dirty pages */
+   bool check_swap_entries;/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-06 Thread Michal Hocko
On Wed 06-01-16 09:26:12, Paul Gortmaker wrote:
> [Re: [PATCH 1/2] mm, oom: introduce oom reaper] On 06/01/2016 (Wed 10:10) 
> Michal Hocko wrote:
> 
> > On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
> > [...]
> > > ...use one of the non-modular initcalls here?   I'm trying to clean up 
> > > most of
> > > the non-modular uses of modular macros etc. since:
> > > 
> > >  (1) it is easy to accidentally code up an unused module_exit function
> > >  (2) it can be misleading when reading the source, thinking it can be
> > >   modular when the Makefile and/or Kconfig prohibit it
> > >  (3) it requires the include of the module.h header file which in turn
> > >  includes nearly everything else, thus increasing CPP overhead.
> > > 
> > > I figured no point in sending a follow on patch since this came in via
> > > the akpm tree into next and that gets rebased/updated regularly.
> > 
> > Sorry for the late reply. I was mostly offline throughout the last 2
> > weeks last year. Is the following what you would like to see? If yes I
> > will fold it into the original patch.
> 
> Yes, that looks fine.  Do note that subsys_initcall is earlier than the
> module_init that you were using previously though.

Yes, I have noticed that but quite honestly module_init choice was just
"look around and use what others are doing". So there was no particular
reason to stick with that order.

Thanks for double checking after me!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-06 Thread Paul Gortmaker
[Re: [PATCH 1/2] mm, oom: introduce oom reaper] On 06/01/2016 (Wed 10:10) 
Michal Hocko wrote:

> On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
> [...]
> > ...use one of the non-modular initcalls here?   I'm trying to clean up most 
> > of
> > the non-modular uses of modular macros etc. since:
> > 
> >  (1) it is easy to accidentally code up an unused module_exit function
> >  (2) it can be misleading when reading the source, thinking it can be
> >   modular when the Makefile and/or Kconfig prohibit it
> >  (3) it requires the include of the module.h header file which in turn
> >  includes nearly everything else, thus increasing CPP overhead.
> > 
> > I figured no point in sending a follow on patch since this came in via
> > the akpm tree into next and that gets rebased/updated regularly.
> 
> Sorry for the late reply. I was mostly offline throughout the last 2
> weeks last year. Is the following what you would like to see? If yes I
> will fold it into the original patch.

Yes, that looks fine.  Do note that subsys_initcall is earlier than the
module_init that you were using previously though.

Thanks,
Paul.
--

> 
> Thanks!
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7a9678c50edd..1ece40b94725 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -36,7 +36,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  
>  #include 
>  #include "internal.h"
> @@ -541,7 +541,7 @@ static int __init oom_init(void)
>   }
>   return 0;
>  }
> -module_init(oom_init)
> +subsys_initcall(oom_init)
>  #else
>  static void wake_oom_reaper(struct mm_struct *mm)
>  {
> -- 
> Michal Hocko
> SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2016-01-06 Thread Michal Hocko
On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
[...]
> ...use one of the non-modular initcalls here?   I'm trying to clean up most of
> the non-modular uses of modular macros etc. since:
> 
>  (1) it is easy to accidentally code up an unused module_exit function
>  (2) it can be misleading when reading the source, thinking it can be
>   modular when the Makefile and/or Kconfig prohibit it
>  (3) it requires the include of the module.h header file which in turn
>  includes nearly everything else, thus increasing CPP overhead.
> 
> I figured no point in sending a follow on patch since this came in via
> the akpm tree into next and that gets rebased/updated regularly.

Sorry for the late reply. I was mostly offline throughout the last 2
weeks last year. Is the following what you would like to see? If yes I
will fold it into the original patch.

Thanks!
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7a9678c50edd..1ece40b94725 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 #include 
 #include "internal.h"
@@ -541,7 +541,7 @@ static int __init oom_init(void)
}
return 0;
 }
-module_init(oom_init)
+subsys_initcall(oom_init)
 #else
 static void wake_oom_reaper(struct mm_struct *mm)
 {
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-25 Thread Michal Hocko
On Fri 25-12-15 12:35:37, Michal Hocko wrote:
[...]
> Thanks I will try to reproduce early next year. But so far I think this
> is just a general issue of MADV_DONTNEED vs. truncate and oom_reaper is
> just lucky to trigger it. There shouldn't be anything oom_reaper
> specific here. Maybe there is some additional locking missing?

Hmm, scratch that. I think Tetsuo has nailed it. It seems like
the missing initialization of the details structure during unmap
is the culprit. So there most probably was no OOM killing
invoked at all. It is just a side effect of the patch and the missing
http://marc.info/?l=linux-mm&m=145068666428057 follow-up fix.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-25 Thread Michal Hocko
On Thu 24-12-15 20:06:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
> 
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
> 
> > Is this somehow DAX related?
> 
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)

Hmm, I think you are right! Very well spotted! If ignore_dirty ends up
being true then we would simply skip over the dirty page and never do the
page_remove_rmap. I can see that the truncation code can later trip
over this page.
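
The v4 changelog item about fully initializing zap_details is exactly this
problem: if a caller builds the structure piecemeal, the new ignore_dirty
and check_swap_entries fields may hold garbage. An illustrative caller (a
generic sketch, not any specific kernel function) would look like:

#include <linux/fs.h>
#include <linux/mm.h>

static void unmap_range_of_mapping(struct address_space *mapping,
				   pgoff_t start, pgoff_t end)
{
	/* zero-init so ignore_dirty / check_swap_entries are well defined */
	struct zap_details details = { };

	details.check_mapping = mapping;
	details.first_index = start;
	details.last_index = end;
	/* ... hand &details to the unmap path ... */
}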

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-25 Thread Michal Hocko
On Thu 24-12-15 13:44:03, Ross Zwisler wrote:
> On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko  wrote:
> > On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> > [...]
> >> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> >> that bisected to this commit:
> >>
> >> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
> >
> > Thank you for the report and the bisection.
> >
> >> Here is a BUG produced by generic/029 when run against XFS:
> >>
> >> [  235.751723] [ cut here ]
> >> [  235.752194] kernel BUG at mm/filemap.c:208!
> >
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
> 
> The two stack traces were gathered with next-20151223, so the line numbers
> may have moved around a bit when compared to the actual "mm, oom: introduce
> oom reaper" commit.

I was looking at the same next tree, I believe
$ git describe
next-20151223

[...]
> > There was a warning before this triggered. The full kernel log would be
> > helpful as well.
> 
> Sure, I can gather full kernel logs, but it'll probably after the new year.

OK, I will wait for the logs. It is really interesting to see what was
the timing between OOM killer invocation and this trace.

> > [...]
> >> [  609.425325] Call Trace:
> >> [  609.425797]  [] invalidate_inode_pages2+0x17/0x20
> >> [  609.426971]  [] xfs_file_read_iter+0x297/0x300
> >> [  609.428097]  [] __vfs_read+0xc9/0x100
> >> [  609.429073]  [] vfs_read+0x89/0x130
> >> [  609.430010]  [] SyS_read+0x58/0xd0
> >> [  609.430943]  [] entry_SYSCALL_64_fastpath+0x12/0x76
> >> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> >> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> >> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> >> 74 15
> > [...]
> >> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> >> ramdisk test devices, one for the xfstest test disk and one for the
> >> scratch disk.
> >
> > Is this just a plain ramdisk device or it needs a special configuration?
> 
> Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
> for PMEM can be found here:
> 
> https://nvdimm.wiki.kernel.org/

Thanks I will try to reproduce early next year. But so far I think this
is just a general issue of MADV_DONTNEED vs. truncate and oom_reaper is
just lucky to trigger it. There shouldn't be anything oom_reaper
specific here. Maybe there is some additional locking missing?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Ross Zwisler
On Thu, Dec 24, 2015 at 4:06 AM, Tetsuo Handa
 wrote:
> Michal Hocko wrote:
>> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
>> the full kernel log? It all smells like a race when OOM reaper tears
>> down the mapping and there is a truncate still in progress. But hitting
>> the BUG_ON just because of that doesn't make much sense to me. OOM
>> reaper is essentially MADV_DONTNEED. I have to think about this some
>> more, though, but I am in a holiday mode until early next year so please
>> bear with me.
>
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
>
>> Is this somehow DAX related?
>
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)

No, DAX was not turned on, and while my fsync/msync patches were the initial
reason I was testing (hence the new_fsync_v6 kernel name) they were not
applied during the bisect, so I'm sure they are not related to this issue.

I will retest with the patch referenced above, but it probably won't
happen until
the new year.


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Ross Zwisler
On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko  wrote:
> On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> [...]
>> While running xfstests on next-20151223 I hit a pair of kernel BUGs
>> that bisected to this commit:
>>
>> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
>
> Thank you for the report and the bisection.
>
>> Here is a BUG produced by generic/029 when run against XFS:
>>
>> [  235.751723] [ cut here ]
>> [  235.752194] kernel BUG at mm/filemap.c:208!
>
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

The two stack traces were gathered with next-20151223, so the line numbers
may have moved around a bit when compared to the actual "mm, oom: introduce
oom reaper" commit.

> [...]
>> [  235.765638] Call Trace:
>> [  235.765903]  [] delete_from_page_cache+0x63/0xd0
>> [  235.766513]  [] truncate_inode_page+0xa5/0x120
>> [  235.767088]  [] truncate_inode_pages_range+0x1a8/0x7f0
>> [  235.767725]  [] ? sched_clock+0x9/0x10
>> [  235.768239]  [] ? local_clock+0x1c/0x20
>> [  235.768779]  [] ? unmap_mapping_range+0x64/0x130
>> [  235.769385]  [] ? unmap_mapping_range+0x74/0x130
>> [  235.770010]  [] ? up_write+0x1f/0x40
>> [  235.770501]  [] ? unmap_mapping_range+0x74/0x130
>> [  235.771092]  [] truncate_pagecache+0x48/0x70
>> [  235.771646]  [] truncate_setsize+0x32/0x40
>> [  235.772276]  [] xfs_setattr_size+0x232/0x470
>> [  235.772839]  [] xfs_vn_setattr+0xb4/0xc0
>> [  235.773369]  [] notify_change+0x237/0x350
>> [  235.773945]  [] do_truncate+0x77/0xc0
>> [  235.774446]  [] do_sys_ftruncate.constprop.15+0xef/0x150
>> [  235.775156]  [] SyS_ftruncate+0xe/0x10
>> [  235.775650]  [] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
>> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
>> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
>> e8 32
>> [  235.778695] RIP  [] __delete_from_page_cache+0x206/0x440
>> [  235.779350]  RSP 
>> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>>
>> And a different BUG produced by generic/095, also with XFS:
>>
>> [  609.398897] [ cut here ]
>> [  609.399843] kernel BUG at mm/truncate.c:629!
>
> Hmm, I do not see any BUG_ON at this line. But there is
> BUG_ON(page_mapped(page)) at line 620.

Ditto - check out next-20151223 for real line numbers.

>> [  609.400666] invalid opcode:  [#1] SMP
>> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
>> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: GW
>
> There was a warning before this triggered. The full kernel log would be
> helpful as well.

Sure, I can gather full kernel logs, but it'll probably after the new year.

> [...]
>> [  609.425325] Call Trace:
>> [  609.425797]  [] invalidate_inode_pages2+0x17/0x20
>> [  609.426971]  [] xfs_file_read_iter+0x297/0x300
>> [  609.428097]  [] __vfs_read+0xc9/0x100
>> [  609.429073]  [] vfs_read+0x89/0x130
>> [  609.430010]  [] SyS_read+0x58/0xd0
>> [  609.430943]  [] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
>> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
>> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
>> 74 15
> [...]
>> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
>> ramdisk test devices, one for the xfstest test disk and one for the
>> scratch disk.
>
> Is this just a plain ramdisk device or it needs a special configuration?

Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
for PMEM can be found here:

https://nvdimm.wiki.kernel.org/


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Tetsuo Handa
Michal Hocko wrote:
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

I don't know whether the OOM killer was invoked just before this
VM_BUG_ON_PAGE().

> Is this somehow DAX related?

4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
http://marc.info/?l=linux-mm&m=145068666428057 should be applied
when retesting. (20151223 does not have this fix.)

[  235.768779]  [] ? unmap_mapping_range+0x64/0x130
[  235.769385]  [] ? unmap_mapping_range+0x74/0x130
[  235.770010]  [] ? up_write+0x1f/0x40
[  235.770501]  [] ? unmap_mapping_range+0x74/0x130


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Michal Hocko
On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
[...]
> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> that bisected to this commit:
> 
> 1eb3a80d8239 ("mm, oom: introduce oom reaper")

Thank you for the report and the bisection.

> Here is a BUG produced by generic/029 when run against XFS:
> 
> [  235.751723] [ cut here ]
> [  235.752194] kernel BUG at mm/filemap.c:208!

This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
the full kernel log? It all smells like a race when OOM reaper tears
down the mapping and there is a truncate still in progress. But hitting
the BUG_ON just because of that doesn't make much sense to me. OOM
reaper is essentially MADV_DONTNEED. I have to think about this some
more, though, but I am in a holiday mode until early next year so please
bear with me.

[...]
> [  235.765638] Call Trace:
> [  235.765903]  [] delete_from_page_cache+0x63/0xd0
> [  235.766513]  [] truncate_inode_page+0xa5/0x120
> [  235.767088]  [] truncate_inode_pages_range+0x1a8/0x7f0
> [  235.767725]  [] ? sched_clock+0x9/0x10
> [  235.768239]  [] ? local_clock+0x1c/0x20
> [  235.768779]  [] ? unmap_mapping_range+0x64/0x130
> [  235.769385]  [] ? unmap_mapping_range+0x74/0x130
> [  235.770010]  [] ? up_write+0x1f/0x40
> [  235.770501]  [] ? unmap_mapping_range+0x74/0x130
> [  235.771092]  [] truncate_pagecache+0x48/0x70
> [  235.771646]  [] truncate_setsize+0x32/0x40
> [  235.772276]  [] xfs_setattr_size+0x232/0x470
> [  235.772839]  [] xfs_vn_setattr+0xb4/0xc0
> [  235.773369]  [] notify_change+0x237/0x350
> [  235.773945]  [] do_truncate+0x77/0xc0
> [  235.774446]  [] do_sys_ftruncate.constprop.15+0xef/0x150
> [  235.775156]  [] SyS_ftruncate+0xe/0x10
> [  235.775650]  [] entry_SYSCALL_64_fastpath+0x12/0x76
> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
> e8 32
> [  235.778695] RIP  [] __delete_from_page_cache+0x206/0x440
> [  235.779350]  RSP 
> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
> 
> And a different BUG produced by generic/095, also with XFS:
> 
> [  609.398897] [ cut here ]
> [  609.399843] kernel BUG at mm/truncate.c:629!

Hmm, I do not see any BUG_ON at this line. But there is
BUG_ON(page_mapped(page)) at line 620.

> [  609.400666] invalid opcode:  [#1] SMP
> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: GW

There was a warning before this triggered. The full kernel log would be
helpful as well.

[...]
> [  609.425325] Call Trace:
> [  609.425797]  [] invalidate_inode_pages2+0x17/0x20
> [  609.426971]  [] xfs_file_read_iter+0x297/0x300
> [  609.428097]  [] __vfs_read+0xc9/0x100
> [  609.429073]  [] vfs_read+0x89/0x130
> [  609.430010]  [] SyS_read+0x58/0xd0
> [  609.430943]  [] entry_SYSCALL_64_fastpath+0x12/0x76
> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> 74 15
[...]
> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> ramdisk test devices, one for the xfstest test disk and one for the
> scratch disk.

Is this just a plain ramdisk device or it needs a special configuration?
Is this somehow DAX related?

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Ross Zwisler
On Thu, Dec 24, 2015 at 4:06 AM, Tetsuo Handa
 wrote:
> Michal Hocko wrote:
>> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
>> the full kernel log? It all smells like a race when OOM reaper tears
>> down the mapping and there is a truncate still in progress. But hitting
>> the BUG_ON just because of that doesn't make much sense to me. OOM
>> reaper is essentially MADV_DONTNEED. I have to think about this some
>> more, though, but I am in a holiday mode until early next year so please
>> bear with me.
>
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
>
>> Is this somehow DAX related?
>
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)

No, DAX was not turned on, and while my fsync/msync patches were the initial
reason I was testing (hence the new_fsync_v6 kernel name) they were not
applied during the bisect, so I'm sure they are not related to this issue.

I will retest with the patch referenced above, but it probably won't
happen until the new year.


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-24 Thread Ross Zwisler
On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko  wrote:
> On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> [...]
>> While running xfstests on next-20151223 I hit a pair of kernel BUGs
>> that bisected to this commit:
>>
>> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
>
> Thank you for the report and the bisection.
>
>> Here is a BUG produced by generic/029 when run against XFS:
>>
>> [  235.751723] [ cut here ]
>> [  235.752194] kernel BUG at mm/filemap.c:208!
>
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

The two stack traces were gathered with next-20151223, so the line numbers
may have moved around a bit when compared to the actual "mm, oom: introduce
oom reaper" commit.

> [...]
>> [  235.765638] Call Trace:
>> [  235.765903]  [] delete_from_page_cache+0x63/0xd0
>> [  235.766513]  [] truncate_inode_page+0xa5/0x120
>> [  235.767088]  [] truncate_inode_pages_range+0x1a8/0x7f0
>> [  235.767725]  [] ? sched_clock+0x9/0x10
>> [  235.768239]  [] ? local_clock+0x1c/0x20
>> [  235.768779]  [] ? unmap_mapping_range+0x64/0x130
>> [  235.769385]  [] ? unmap_mapping_range+0x74/0x130
>> [  235.770010]  [] ? up_write+0x1f/0x40
>> [  235.770501]  [] ? unmap_mapping_range+0x74/0x130
>> [  235.771092]  [] truncate_pagecache+0x48/0x70
>> [  235.771646]  [] truncate_setsize+0x32/0x40
>> [  235.772276]  [] xfs_setattr_size+0x232/0x470
>> [  235.772839]  [] xfs_vn_setattr+0xb4/0xc0
>> [  235.773369]  [] notify_change+0x237/0x350
>> [  235.773945]  [] do_truncate+0x77/0xc0
>> [  235.774446]  [] do_sys_ftruncate.constprop.15+0xef/0x150
>> [  235.775156]  [] SyS_ftruncate+0xe/0x10
>> [  235.775650]  [] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
>> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
>> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
>> e8 32
>> [  235.778695] RIP  [] __delete_from_page_cache+0x206/0x440
>> [  235.779350]  RSP 
>> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>>
>> And a different BUG produced by generic/095, also with XFS:
>>
>> [  609.398897] [ cut here ]
>> [  609.399843] kernel BUG at mm/truncate.c:629!
>
> Hmm, I do not see any BUG_ON at this line. But there is
> BUG_ON(page_mapped(page)) at line 620.

Ditto - check out next-20151223 for real line numbers.

>> [  609.400666] invalid opcode:  [#1] SMP
>> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
>> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: GW
>
> There was a warning before this triggered. The full kernel log would be
> helpful as well.

Sure, I can gather full kernel logs, but it'll probably be after the new year.

> [...]
>> [  609.425325] Call Trace:
>> [  609.425797]  [] invalidate_inode_pages2+0x17/0x20
>> [  609.426971]  [] xfs_file_read_iter+0x297/0x300
>> [  609.428097]  [] __vfs_read+0xc9/0x100
>> [  609.429073]  [] vfs_read+0x89/0x130
>> [  609.430010]  [] SyS_read+0x58/0xd0
>> [  609.430943]  [] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
>> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
>> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
>> 74 15
> [...]
>> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
>> ramdisk test devices, one for the xfstest test disk and one for the
>> scratch disk.
>
> Is this just a plain ramdisk device or it needs a special configuration?

Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
for PMEM can be found here:

https://nvdimm.wiki.kernel.org/
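
(For reference, one common way to get such a legacy PMEM ramdisk in a qemu
guest -- an assumption about the setup, not necessarily Ross's exact
configuration -- is to carve it out of guest RAM with the memmap= boot
parameter, which the nd_e820/nd_pmem modules seen in the oops then expose
as /dev/pmem0:

    # reserve 4 GiB of guest RAM starting at the 12 GiB mark as e820
    # type-12 "persistent" memory
    memmap=4G!12G
)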


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-23 Thread Ross Zwisler
On Tue, Dec 15, 2015 at 11:36 AM, Michal Hocko  wrote:
> From: Michal Hocko 
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>
> The OOM killer currently allows killing only a single task, in the
> hope that the task will terminate in a reasonable time and free up its
> memory.  Such a task (oom victim) will get access to memory reserves
> via mark_oom_victim to allow forward progress should there be a need
> for additional memory during the exit path.
>
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take an unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting on an event
> (e.g. a lock) which is held by another task looping in the page
> allocator.
>
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim under an assumption that such a memory won't
> be needed when its owner is killed and kicked from the userspace anyway.
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debuggability of a particular application.
>
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd, which sounds
> like another good fit, is not appropriate either because it might get
> blocked on locks during reclaim as well.
>
> oom_reaper has to take mmap_sem on the target task for reading so the
> solution is not 100% because the semaphore might be held or blocked for
> write but the probability is reduced considerably wrt. basically any
> lock blocking forward progress as described above. In order to prevent
> from blocking on the lock without any forward progress we are using only
> a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocation
> requests done with the lock held to an absolute minimum to reduce the risk
> even further.
>
> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
> and oom_reaper clears this atomically once it is done with the work. This
> means that only a single mm_struct can be reaped at a time. As the
> operation is potentially disruptive we are trying to limit it to the
> necessary minimum and the reaper blocks any updates while it operates on
> an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
> race is detected by atomic_inc_not_zero(mm_users).
>
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
>
> Acked-by: Mel Gorman 
> Signed-off-by: Michal Hocko 

While running xfstests on next-20151223 I hit a pair of kernel BUGs
that bisected to this commit:

1eb3a80d8239 ("mm, oom: introduce oom reaper")

Here is a BUG produced by generic/029 when run against XFS:

[  235.751723] [ cut here ]
[  235.752194] kernel BUG at mm/filemap.c:208!
[  235.752595] invalid opcode:  [#1] SMP
[  235.753036] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  235.753681] CPU: 3 PID: 17586 Comm: xfs_io Not tainted
4.4.0-rc6-next-20151223_new_fsync_v6+ #8
[  235.754535] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140709_153950- 04/01/2014
[  235.755451] task: 88040bde19c0 ti: 8800bab8 task.ti:
8800bab8
[  235.756202] RIP: 0010:[]  []
__delete_from_page_cache+0x206/0x440
[  235.757151] RSP: 0018:8800bab83b60  EFLAGS: 00010082
[  235.757679] RAX: 0021 RBX: ea0007d37e00 RCX: 0006
[  235.758360] RDX:  RSI: 0001 RDI: 8804117ce380
[  235.759043] RBP: 8800bab83bb8 R08: 0001 R09: 0001
[  235.759749] R10:  R11: 00028dc0 R12: 8800b1e7db00
[  235.760444] R13: 
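
(A side note on the wake_oom_reaper()/mm_to_reap handoff described in the
quoted changelog: the "only NULL->mm transition" guarantee boils down to a
cmpxchg pattern roughly like the sketch below. Names follow the changelog;
the actual patch may differ in detail.

    static struct mm_struct *mm_to_reap;
    static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

    static void wake_oom_reaper(struct mm_struct *mm)
    {
            /* Pin the mm so a parallel exit_mmap() cannot free it under us. */
            atomic_inc(&mm->mm_count);

            /* Only the NULL -> mm transition hands the mm to the reaper. */
            if (cmpxchg(&mm_to_reap, NULL, mm) == NULL)
                    wake_up(&oom_reaper_wait);
            else
                    mmdrop(mm);     /* another mm is already being reaped */
    }

The reaper clears mm_to_reap and drops the reference once it is done, so at
most one mm_struct is reaped at a time.)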

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-21 Thread Paul Gortmaker
On Tue, Dec 15, 2015 at 1:36 PM, Michal Hocko  wrote:
> From: Michal Hocko 
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>

[...]

Since this is built-in always, can we

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5314b206caa5..48025a21f8c4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,6 +35,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 

...use  instead above, and then...

> +
> +#include 
> +#include "internal.h"
>

[...]

> +* Make sure our oom reaper thread will get scheduled when
> +* ASAP and that it won't get preempted by malicious 
> userspace.
> +*/
> +   sched_setscheduler(oom_reaper_th, SCHED_FIFO, );
> +   }
> +   return 0;
> +}
> +module_init(oom_init)

...use one of the non-modular initcalls here?   I'm trying to clean up most of
the non-modular uses of modular macros etc. since:

 (1) it is easy to accidentally code up an unused module_exit function
 (2) it can be misleading when reading the source, thinking it can be
  modular when the Makefile and/or Kconfig prohibit it
 (3) it requires the include of the module.h header file which in turn
 includes nearly everything else, thus increasing CPP overhead.

I figured no point in sending a follow on patch since this came in via
the akpm tree into next and that gets rebased/updated regularly.

Thanks,
Paul.
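
(For readers unfamiliar with the distinction Paul is drawing: for code that
can only ever be built in, module_init()/module.h can be replaced by one of
the plain initcall macros from linux/init.h. A minimal sketch of that
suggestion -- not necessarily the form the patch eventually took -- would be:

    /* mm/oom_kill.c is built-in only, so <linux/init.h> is enough */
    static int __init oom_init(void)
    {
            /* kthread_run() + sched_setscheduler() setup as in the patch */
            return 0;
    }
    subsys_initcall(oom_init);
)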


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-21 Thread Michal Hocko
On Fri 18-12-15 13:14:00, Andrew Morton wrote:
> On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko  wrote:
> 
> > /* Retry the down_read_trylock(mmap_sem) a few times */
> > -   while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > -   msleep_interruptible(100);
> > +   while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> > +   __set_task_state(current, TASK_IDLE);
> > +   schedule_timeout(HZ/10);
> > +   }
> 
> If you won't, I shall ;)
> 
> 
> From: Andrew Morton 
> Subject: sched: add schedule_timeout_idle()
> 
> This will be needed in the patch "mm, oom: introduce oom reaper".
> 
> Cc: Michal Hocko 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Signed-off-by: Andrew Morton 

Thanks! This makes more sense.

Acked-by: Michal Hocko 

> ---
> 
>  include/linux/sched.h |1 +
>  kernel/time/timer.c   |   11 +++
>  2 files changed, 12 insertions(+)
> 
> diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle 
> kernel/time/timer.c
> --- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
> +++ a/kernel/time/timer.c
> @@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
>  }
>  EXPORT_SYMBOL(schedule_timeout_uninterruptible);
>  
> +/*
> + * Like schedule_timeout_uninterruptible(), except this task will not 
> contribute
> + * to load average.
> + */
> +signed long __sched schedule_timeout_idle(signed long timeout)
> +{
> + __set_current_state(TASK_IDLE);
> + return schedule_timeout(timeout);
> +}
> +EXPORT_SYMBOL(schedule_timeout_idle);
> +
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head 
> *head)
>  {
> diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle 
> include/linux/sched.h
> --- a/include/linux/sched.h~sched-add-schedule_timeout_idle
> +++ a/include/linux/sched.h
> @@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
>  extern signed long schedule_timeout_interruptible(signed long timeout);
>  extern signed long schedule_timeout_killable(signed long timeout);
>  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> +extern signed long schedule_timeout_idle(signed long timeout);
>  asmlinkage void schedule(void);
>  extern void schedule_preempt_disabled(void);
>  
> _

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-19 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
> --
> [  438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, 
> anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
> [  439.318951] oom_reaper: attempts=11
> [  445.581171] MemAlloc-Info: 796 stalling task, 0 dying task, 0 victim task.
> [  618.955215] MemAlloc-Info: 979 stalling task, 0 dying task, 0 victim task.
> --
> 
> Yes, this is an insane program. But what is important is that we prepare for
> cases where oom_reap_vmas() gave up waiting. Silent hang-up is annoying.

s/gave up waiting/did not help/



I noticed yet another problem with this program.

The OOM victim (a child of the memory hog process) received SIGKILL at
uptime = 438 and terminated before uptime = 445, even though
oom_reap_vmas() gave up waiting at uptime = 439. However, the OOM killer
was not invoked again in order to kill the memory hog process before I
gave up waiting at uptime = 679. The OOM killer was needlessly kept
disabled for more than 234 seconds after the OOM victim terminated.

--
[  438.180596] oom_reaper-test invoked oom-killer: order=0, oom_score_adj=0, 
gfp_mask=0x26040c0(GFP_KERNEL|GFP_COMP|GFP_NOTRACK)
[  438.183524] oom_reaper-test cpuset=/ mems_allowed=0
[  438.185440] CPU: 0 PID: 13451 Comm: oom_reaper-test Not tainted 
4.4.0-rc5-next-20151217+ #248
(...snipped...)
[  438.301687] Out of memory: Kill process 12679 (oom_reaper-test) score 876 or 
sacrifice child
[  438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, 
anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  568.185582] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 
delay=7403
[  568.187593] oom_reaper-test R  running task0 13451   8130 0x0080
[  568.189546]  88007a4cb918 88007a5c4140 88007a4c 
88007a4cc000
[  568.191637]  88007a4cb950 88007fc10240 000100021bb8 
81c11730
[  568.193713]  88007a4cb930 816f5b77 88007fc10240 
88007a4cb9d0
[  568.195798] Call Trace:
[  568.196878]  [] schedule+0x37/0x90
[  568.198402]  [] schedule_timeout+0x117/0x1c0
[  568.200078]  [] ? init_timer_key+0x40/0x40
[  568.201725]  [] schedule_timeout_killable+0x24/0x30
[  568.203512]  [] out_of_memory+0x1f9/0x5a0
[  568.205144]  [] ? out_of_memory+0x2ad/0x5a0
[  568.206800]  [] __alloc_pages_nodemask+0xc43/0xc80
[  568.208564]  [] alloc_pages_current+0x96/0x1b0
[  568.210259]  [] ? new_slab+0x357/0x470
[  568.211811]  [] new_slab+0x3ce/0x470
[  568.213329]  [] ___slab_alloc+0x42a/0x5c0
[  568.214917]  [] ? seq_buf_alloc+0x35/0x40
[  568.216486]  [] ? mem_cgroup_end_page_stat+0x2d/0xb0
[  568.218240]  [] ? __lock_is_held+0x49/0x70
[  568.219815]  [] ? seq_buf_alloc+0x35/0x40
[  568.221388]  [] __slab_alloc+0x4a/0x81
[  568.222980]  [] ? seq_buf_alloc+0x35/0x40
[  568.224565]  [] __kmalloc+0x163/0x1b0
[  568.226075]  [] seq_buf_alloc+0x35/0x40
[  568.227710]  [] seq_read+0x31b/0x3c0
[  568.229184]  [] __vfs_read+0x32/0xf0
[  568.230673]  [] ? security_file_permission+0xa9/0xc0
[  568.232408]  [] ? rw_verify_area+0x4d/0xd0
[  568.234068]  [] vfs_read+0x7a/0x120
[  568.235561]  [] SyS_pread64+0x90/0xb0
[  568.237062]  [] entry_SYSCALL_64_fastpath+0x12/0x76
[  568.238766] 2 locks held by oom_reaper-test/13451:
[  568.240188]  #0:  (>lock){+.+.+.}, at: [] 
seq_read+0x47/0x3c0
[  568.242303]  #1:  (oom_lock){+.+...}, at: [] 
__alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[  658.711079] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 
delay=180777
[  658.713110] oom_reaper-test R  running task0 13451   8130 0x0080
[  658.715073]  88007a4cb918 88007a5c4140 88007a4c 
88007a4cc000
[  658.717166]  88007a4cb950 88007fc10240 000100021bb8 
81c11730
[  658.719248]  88007a4cb930 816f5b77 88007fc10240 
88007a4cb9d0
[  658.721345] Call Trace:
[  658.722426]  [] schedule+0x37/0x90
[  658.723950]  [] schedule_timeout+0x117/0x1c0
[  658.725636]  [] ? init_timer_key+0x40/0x40
[  658.727304]  [] schedule_timeout_killable+0x24/0x30
[  658.729113]  [] out_of_memory+0x1f9/0x5a0
[  658.730757]  [] ? out_of_memory+0x2ad/0x5a0
[  658.732444]  [] __alloc_pages_nodemask+0xc43/0xc80
[  658.734314]  [] alloc_pages_current+0x96/0x1b0
[  658.736041]  [] ? new_slab+0x357/0x470
[  658.737643]  [] new_slab+0x3ce/0x470
[  658.739239]  [] ___slab_alloc+0x42a/0x5c0
[  658.740899]  [] ? seq_buf_alloc+0x35/0x40
[  658.742617]  [] ? mem_cgroup_end_page_stat+0x2d/0xb0
[  658.744489]  [] ? __lock_is_held+0x49/0x70
[  658.746157]  [] ? seq_buf_alloc+0x35/0x40
[  658.747801]  [] __slab_alloc+0x4a/0x81
[  658.749392]  [] ? seq_buf_alloc+0x35/0x40
[  658.751021]  [] __kmalloc+0x163/0x1b0
[  658.752562]  [] seq_buf_alloc+0x35/0x40
[  658.754145]  [] seq_read+0x31b/0x3c0
[  658.755678]  [] __vfs_read+0x32/0xf0
[  658.757194]  [] ? security_file_permission+0xa9/0xc0
[  658.758959]  [] ? 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Andrew Morton
On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko  wrote:

>   /* Retry the down_read_trylock(mmap_sem) a few times */
> - while (attempts++ < 10 && !__oom_reap_vmas(mm))
> - msleep_interruptible(100);
> + while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> + __set_task_state(current, TASK_IDLE);
> + schedule_timeout(HZ/10);
> + }

If you won't, I shall ;)


From: Andrew Morton 
Subject: sched: add schedule_timeout_idle()

This will be needed in the patch "mm, oom: introduce oom reaper".

Cc: Michal Hocko 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
---

 include/linux/sched.h |1 +
 kernel/time/timer.c   |   11 +++
 2 files changed, 12 insertions(+)

diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle 
kernel/time/timer.c
--- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
+++ a/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
 }
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not 
contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+   __set_current_state(TASK_IDLE);
+   return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
 #ifdef CONFIG_HOTPLUG_CPU
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head 
*head)
 {
diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle 
include/linux/sched.h
--- a/include/linux/sched.h~sched-add-schedule_timeout_idle
+++ a/include/linux/sched.h
@@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
_
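
(With a helper like this in place, the reaper's retry loop can presumably be
reduced to something like the following -- a sketch of the intended caller,
not part of the patch above:

    /* Retry the down_read_trylock(mmap_sem) a few times */
    while (attempts++ < 10 && !__oom_reap_vmas(mm))
            schedule_timeout_idle(HZ/10);
)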



Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 13:13:56, Andrew Morton wrote:
[...]
> Also, re-reading your description:
> 
> : It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
> : construct workloads which break the core assumption mentioned above and
> : the OOM victim might take unbounded amount of time to exit because it
> : might be blocked in the uninterruptible state waiting for on an event
> : (e.g.  lock) which is blocked by another task looping in the page
> : allocator.
> 
> So the allocating task has done an oom-kill and is waiting for memory
> to become available.  The killed task is stuck on some lock, unable to
> free memory.
> 
> But the problematic lock will sometimes be the killed tasks's mmap_sem,
> so the reaper won't reap anything.  This scenario requires that the
> mmap_sem is held for writing, which sounds like it will be uncommon. 

Yes, I have mentioned that in the changelog:
"
oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above.
"

Another thing to do is to change down_write(mmap_sem) to
down_write_killable in most cases where we have a clear EINTR semantic.
This is on my todo list.
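
(A sketch of the killable-wait pattern being referred to, assuming a
down_write_killable() helper exists; purely illustrative:

    if (down_write_killable(&mm->mmap_sem))
            return -EINTR;  /* caller got a fatal signal; back out instead of blocking */
    /* ... modify the address space ... */
    up_write(&mm->mmap_sem);
)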

> hm.  sigh.  I hate the oom-killer.  Just buy some more memory already!

Tell me something about that...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> > On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:
> [...]
> > > +static void oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > + int attempts = 0;
> > > +
> > > + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > > + schedule_timeout(HZ/10);
> > 
> > schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> > msleep() or msleep_interruptible().  I can't decide which is more
> > appropriate - it only affects the load average display.
> 
> Ups. You are right. I will go with msleep_interruptible(100).
>  

I didn't know that. My testing was almost without oom_reap_vmas().

> > I guess it means that the __oom_reap_vmas() success rate is nice anud
> > high ;)
> 
> I had a debugging trace_printks around this and there were no reties
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.

Yes, you are lucky that you did not hit the mmap_sem contention.
I retested with

 static void oom_reap_vmas(struct mm_struct *mm)
 {
int attempts = 0;
 
while (attempts++ < 10 && !__oom_reap_vmas(mm))
-   schedule_timeout(HZ/10);
+   msleep_interruptible(100);
+   printk(KERN_WARNING "oom_reaper: attempts=%u\n", attempts);
 
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
 }

and I can hit the case where attempts becomes 11 (i.e. oom_reap_vmas() gives up
waiting) if I run a memory-stressing program with many contending
mmap_sem readers and writers, shown below.

--
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

static cpu_set_t set = { { 1 } }; /* Allow only CPU 0. */
static char filename[32] = { };

/* down_read(&mm->mmap_sem) requester. */
static int reader(void *unused)
{
const int fd = open(filename, O_RDONLY);
char buffer[128];
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (pread(fd, buffer, sizeof(buffer), 0) > 0);
while (1)
pause();
return 0;
}

/* down_write(&mm->mmap_sem) requester. */
static int writer(void *unused)
{
const int fd = open("/proc/self/exe", O_RDONLY);
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (1) {
void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
munmap(ptr, 4096);
}
return 0;
}

static void my_clone(int (*func) (void *))
{
char *stack = malloc(4096);
if (stack)
clone(func, stack + 4096,
  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
}

/* Memory consumer for invoking the OOM killer. */
static void memory_eater(void) {
char *buf = NULL;
unsigned long i;
unsigned long size = 0;
sleep(4);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
fprintf(stderr, "Start eating memory\n");
for (i = 0; i < size; i += 4096)
buf[i] = '\0'; /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
int i;
const pid_t pid = fork();
if (pid == 0) {
for (i = 0; i < 9; i++)
my_clone(writer);
writer(NULL);
_exit(0);
} else if (pid > 0) {
snprintf(filename, sizeof(filename), "/proc/%u/stat", pid);
for (i = 0; i < 1000; i++)
my_clone(reader);
}
memory_eater();
return *(char *) NULL; /* Not reached. */
}
--
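
(The reproducer should build as a plain C program, e.g.
"gcc -O2 -o oom_reaper-test oom_reaper-test.c"; the file name is an
assumption chosen to match the "oom_reaper-test" Comm in the logs. It will
deliberately OOM the machine, so run it only in a disposable VM.)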

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
--
[   90.790847] Killed process 9560 (oom_reaper-test) total-vm:4312kB, 
anon-rss:124kB, file-rss:0kB, shmem-rss:0kB
[   91.803154] oom_reaper: attempts=11
[  100.701494] MemAlloc-Info: 509 stalling task, 0 dying task, 1 victim task.
[  102.439082] Killed process 9559 (oom_reaper-test) total-vm:2170960kB, 
anon-rss:1564600kB, file-rss:0kB, shmem-rss:0kB
[  102.441937] Killed process 9561 (oom_reaper-test) total-vm:2170960kB, 
anon-rss:1564776kB, file-rss:0kB, shmem-rss:0kB
[  102.731326] oom_reaper: attempts=1
[  125.420727] Killed process 10573 (oom_reaper-test) total-vm:4340kB, 
anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  126.440392] oom_reaper: attempts=11
[  135.354193] MemAlloc-Info: 450 stalling task, 0 dying task, 0 victim task.
[  240.023256] MemAlloc-Info: 1016 stalling task, 0 dying task, 0 victim task.
[  302.246975] Killed process 10572 (oom_reaper-test) total-vm:2170960kB, 
anon-rss:1562128kB, file-rss:0kB, shmem-rss:0kB
[  302.263515] oom_reaper: attempts=1
[  382.961343] Killed process 11667 (oom_reaper-test) total-vm:4312kB, 
anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 12:00:04, Andrew Morton wrote:
> On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds 
>  wrote:
> 
> > On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko  wrote:
> > > Ups. You are right. I will go with msleep_interruptible(100).
> > 
> > I don't think that's right.
> > 
> > If a signal happens, that loop is now (again) just busy-looping.
> 
> It's called only by a kernel thread so no signal_pending().

Yes that was the thinking.

> This relationship is a bit unobvious and fragile, but we do it in
> quite a few places.

I guess Linus is right and __set_task_state(current, TASK_IDLE) would be
better and easier to read.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..eed99506b891 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,8 +472,10 @@ static void oom_reap_vmas(struct mm_struct *mm)
int attempts = 0;
 
/* Retry the down_read_trylock(mmap_sem) a few times */
-   while (attempts++ < 10 && !__oom_reap_vmas(mm))
-   msleep_interruptible(100);
+   while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
+   __set_task_state(current, TASK_IDLE);
+   schedule_timeout(HZ/10);
+   }
 
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 16:15:21, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:
> 
> > This patch reduces the probability of such a lockup by introducing a
> > specialized kernel thread (oom_reaper) 
> 
> CONFIG_MMU=n:
> 
> slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
> slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
> slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> 
> I did the below so I can get an mmotm out the door, but hopefully
> there's a cleaner way.

Sorry about that and thanks for your fixup! I am not very familiar with
the !MMU world and haven't heard about OOM deadlock issues there yet. So
I guess making this MMU-only makes some sense. I would just get rid of
the ifdefs in oom_kill_process and provide an empty wake_oom_reaper for
!CONFIG_MMU.  The following on top of yours:
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..56ff1ff18c0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,10 @@ static int __init oom_init(void)
return 0;
 }
 module_init(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
 #endif
 
 /**
@@ -648,9 +652,7 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
-#ifdef CONFIG_MMU
bool can_oom_reap = true;
-#endif
 
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -743,7 +745,6 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
continue;
if (is_global_init(p))
continue;
-#ifdef CONFIG_MMU
if (unlikely(p->flags & PF_KTHREAD) ||
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
/*
@@ -754,15 +755,12 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
can_oom_reap = false;
continue;
}
-#endif
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
 
-#ifdef CONFIG_MMU
if (can_oom_reap)
wake_oom_reaper(mm);
-#endif
 
mmdrop(mm);
put_task_struct(victim);
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 12:00:04, Andrew Morton wrote:
> On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds 
>  wrote:
> 
> > On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko  wrote:
> > > Ups. You are right. I will go with msleep_interruptible(100).
> > 
> > I don't think that's right.
> > 
> > If a signal happens, that loop is now (again) just busy-looping.
> 
> It's called only by a kernel thread so no signal_pending().

Yes that was the thinking.

> This relationship is a bit unobvious and fragile, but we do it in
> quite a few places.

I guess Linus is right and __set_task_state(current, TASK_IDLE) would be
better and easier to read.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..eed99506b891 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,8 +472,10 @@ static void oom_reap_vmas(struct mm_struct *mm)
int attempts = 0;
 
/* Retry the down_read_trylock(mmap_sem) a few times */
-   while (attempts++ < 10 && !__oom_reap_vmas(mm))
-   msleep_interruptible(100);
+   while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
+   __set_task_state(current, TASK_IDLE);
+   schedule_timeout(HZ/10);
+   }
 
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> > On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:
> [...]
> > > +static void oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > + int attempts = 0;
> > > +
> > > + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > > + schedule_timeout(HZ/10);
> > 
> > schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> > msleep() or msleep_interruptible().  I can't decide which is more
> > appropriate - it only affects the load average display.
> 
> Ups. You are right. I will go with msleep_interruptible(100).
>  

I didn't know that. My testing was almost without oom_reap_vmas().

> > I guess it means that the __oom_reap_vmas() success rate is nice anud
> > high ;)
> 
> I had a debugging trace_printks around this and there were no reties
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.

Yes, you are lucky that you did not hit the mmap_sem contention.
I retested with

 static void oom_reap_vmas(struct mm_struct *mm)
 {
int attempts = 0;
 
while (attempts++ < 10 && !__oom_reap_vmas(mm))
-   schedule_timeout(HZ/10);
+   msleep_interruptible(100);
+   printk(KERN_WARNING "oom_reaper: attempts=%u\n", attempts);
 
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
 }

and I can hit that attempts becomes 11 (i.e. oom_reap_vmas() gives up
waiting) if I ran a memory stressing program with many contending
mmap_sem readers and writers shown below.

--
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

static cpu_set_t set = { { 1 } }; /* Allow only CPU 0. */
static char filename[32] = { };

/* down_read(&mm->mmap_sem) requester. */
static int reader(void *unused)
{
const int fd = open(filename, O_RDONLY);
char buffer[128];
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (pread(fd, buffer, sizeof(buffer), 0) > 0);
while (1)
pause();
return 0;
}

/* down_write(&mm->mmap_sem) requester. */
static int writer(void *unused)
{
const int fd = open("/proc/self/exe", O_RDONLY);
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (1) {
void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
munmap(ptr, 4096);
}
return 0;
}

static void my_clone(int (*func) (void *))
{
char *stack = malloc(4096);
if (stack)
clone(func, stack + 4096,
  CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
}

/* Memory consumer for invoking the OOM killer. */
static void memory_eater(void) {
char *buf = NULL;
unsigned long i;
unsigned long size = 0;
sleep(4);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
fprintf(stderr, "Start eating memory\n");
for (i = 0; i < size; i += 4096)
buf[i] = '\0'; /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
int i;
const pid_t pid = fork();
if (pid == 0) {
for (i = 0; i < 9; i++)
my_clone(writer);
writer(NULL);
_exit(0);
} else if (pid > 0) {
snprintf(filename, sizeof(filename), "/proc/%u/stat", pid);
for (i = 0; i < 1000; i++)
my_clone(reader);
}
memory_eater();
return *(char *) NULL; /* Not reached. */
}
--

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
--
[   90.790847] Killed process 9560 (oom_reaper-test) total-vm:4312kB, anon-rss:124kB, file-rss:0kB, shmem-rss:0kB
[   91.803154] oom_reaper: attempts=11
[  100.701494] MemAlloc-Info: 509 stalling task, 0 dying task, 1 victim task.
[  102.439082] Killed process 9559 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564600kB, file-rss:0kB, shmem-rss:0kB
[  102.441937] Killed process 9561 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564776kB, file-rss:0kB, shmem-rss:0kB
[  102.731326] oom_reaper: attempts=1
[  125.420727] Killed process 10573 (oom_reaper-test) total-vm:4340kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  126.440392] oom_reaper: attempts=11
[  135.354193] MemAlloc-Info: 450 stalling task, 0 dying task, 0 victim task.
[  240.023256] MemAlloc-Info: 1016 stalling task, 0 dying task, 0 victim task.
[  302.246975] Killed process 10572 (oom_reaper-test) total-vm:2170960kB, anon-rss:1562128kB, file-rss:0kB, shmem-rss:0kB
[  302.263515] oom_reaper: attempts=1
[  382.961343] Killed process 11667 (oom_reaper-test) total-vm:4312kB, anon-rss:84kB, 

Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 16:15:21, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:
> 
> > This patch reduces the probability of such a lockup by introducing a
> > specialized kernel thread (oom_reaper) 
> 
> CONFIG_MMU=n:
> 
> slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
> slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
> slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> 
> I did the below so I can get an mmotm out the door, but hopefully
> there's a cleaner way.

Sorry about that and thanks for your fixup! I am not very familiar with
!MMU world and haven't heard about issues with the OOM deadlocks yet. So
I guess making this MMU only makes some sense. I would just get rid of
ifdefs in oom_kill_process and provide an empty wake_oom_reaper for
!CONFIG_MMU.  The following on top of yours:
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..56ff1ff18c0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,10 @@ static int __init oom_init(void)
return 0;
 }
 module_init(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
 #endif
 
 /**
@@ -648,9 +652,7 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
-#ifdef CONFIG_MMU
bool can_oom_reap = true;
-#endif
 
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -743,7 +745,6 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
continue;
if (is_global_init(p))
continue;
-#ifdef CONFIG_MMU
if (unlikely(p->flags & PF_KTHREAD) ||
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
/*
@@ -754,15 +755,12 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
can_oom_reap = false;
continue;
}
-#endif
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
 
-#ifdef CONFIG_MMU
if (can_oom_reap)
wake_oom_reaper(mm);
-#endif
 
mmdrop(mm);
put_task_struct(victim);
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Michal Hocko
On Thu 17-12-15 13:13:56, Andrew Morton wrote:
[...]
> Also, re-reading your description:
> 
> : It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
> : construct workloads which break the core assumption mentioned above and
> : the OOM victim might take unbounded amount of time to exit because it
> : might be blocked in the uninterruptible state waiting for on an event
> : (e.g.  lock) which is blocked by another task looping in the page
> : allocator.
> 
> So the allocating task has done an oom-kill and is waiting for memory
> to become available.  The killed task is stuck on some lock, unable to
> free memory.
> 
> But the problematic lock will sometimes be the killed task's mmap_sem,
> so the reaper won't reap anything.  This scenario requires that the
> mmap_sem is held for writing, which sounds like it will be uncommon. 

Yes, I have mentioned that in the changelog:
"
oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above.
"

Another thing to do is to change down_write(mmap_sem) to
down_write_killable() in most cases where we have a clear EINTR semantic.
This is on my todo list.
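
To illustrate, a conversion at a typical mmap_sem writer would look
roughly like this (a sketch only; the exact error handling depends on
the call site):

	/* Sketch: let an mmap_sem writer give up when the task is killed. */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;	/* fatal signal pending; back out instead of blocking */

	/* ... modify the address space ... */

	up_write(&mm->mmap_sem);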

> hm.  sigh.  I hate the oom-killer.  Just buy some more memory already!

Tell me something about that...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-18 Thread Andrew Morton
On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko  wrote:

>   /* Retry the down_read_trylock(mmap_sem) a few times */
> - while (attempts++ < 10 && !__oom_reap_vmas(mm))
> - msleep_interruptible(100);
> + while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> + __set_task_state(current, TASK_IDLE);
> + schedule_timeout(HZ/10);
> + }

If you won't, I shall ;)


From: Andrew Morton 
Subject: sched: add schedule_timeout_idle()

This will be needed in the patch "mm, oom: introduce oom reaper".

Cc: Michal Hocko 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
---

 include/linux/sched.h |1 +
 kernel/time/timer.c   |   11 +++
 2 files changed, 12 insertions(+)

diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle 
kernel/time/timer.c
--- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
+++ a/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
 }
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+   __set_current_state(TASK_IDLE);
+   return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
 #ifdef CONFIG_HOTPLUG_CPU
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head 
*head)
 {
diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle 
include/linux/sched.h
--- a/include/linux/sched.h~sched-add-schedule_timeout_idle
+++ a/include/linux/sched.h
@@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
_
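
With this helper the reaper's retry loop from the earlier hunks would
presumably reduce to something like (sketch only):

	/* Retry the down_read_trylock(mmap_sem) a few times */
	while (attempts++ < 10 && !__oom_reap_vmas(mm))
		schedule_timeout_idle(HZ/10);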



Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-17 Thread Andrew Morton
On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:

> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) 

CONFIG_MMU=n:

slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'

I did the below so I can get an mmotm out the door, but hopefully
there's a cleaner way.

--- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-fix-3
+++ a/mm/oom_kill.c
@@ -415,6 +415,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victi
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
@@ -517,6 +518,27 @@ static void wake_oom_reaper(struct mm_st
mmdrop(mm);
 }
 
+static int __init oom_init(void)
+{
+   oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+   if (IS_ERR(oom_reaper_th)) {
+   pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+   PTR_ERR(oom_reaper_th));
+   oom_reaper_th = NULL;
+   } else {
+   struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+   /*
+* Make sure our oom reaper thread will get scheduled when
+* ASAP and that it won't get preempted by malicious userspace.
+*/
+   sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+   }
+   return 0;
+}
+module_init(oom_init)
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -626,7 +648,9 @@ void oom_kill_process(struct oom_control
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
+#ifdef CONFIG_MMU
bool can_oom_reap = true;
+#endif
 
/*
 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -719,6 +743,7 @@ void oom_kill_process(struct oom_control
continue;
if (is_global_init(p))
continue;
+#ifdef CONFIG_MMU
if (unlikely(p->flags & PF_KTHREAD) ||
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
/*
@@ -729,13 +754,16 @@ void oom_kill_process(struct oom_control
can_oom_reap = false;
continue;
}
-
+#endif
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
 
+#ifdef CONFIG_MMU
if (can_oom_reap)
wake_oom_reaper(mm);
+#endif
+
mmdrop(mm);
put_task_struct(victim);
 }
@@ -887,23 +915,3 @@ void pagefault_out_of_memory(void)
 
mutex_unlock(_lock);
 }
-
-static int __init oom_init(void)
-{
-   oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-   if (IS_ERR(oom_reaper_th)) {
-   pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-   PTR_ERR(oom_reaper_th));
-   oom_reaper_th = NULL;
-   } else {
-   struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-   /*
-* Make sure our oom reaper thread will get scheduled when
-* ASAP and that it won't get preempted by malicious userspace.
-*/
-   sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
-   }
-   return 0;
-}
-module_init(oom_init)
_



Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-17 Thread Andrew Morton
On Thu, 17 Dec 2015 14:02:24 +0100 Michal Hocko  wrote:

> > I guess it means that the __oom_reap_vmas() success rate is nice and
> > high ;)
> 
> I had a debugging trace_printks around this and there were no retries
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 48025a21f8c4..f53f87cfd899 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
>   int attempts = 0;
>  
>   while (attempts++ < 10 && !__oom_reap_vmas(mm))
> - schedule_timeout(HZ/10);
> + msleep_interruptible(100);
>  
>   /* Drop a reference taken by wake_oom_reaper */
>   mmdrop(mm);

Timeliness matters here.  Over on the other CPU, direct reclaim is
pounding along, on its way to declaring oom.  Sometimes the oom_reaper
thread will end up scavenging memory on behalf of a caller who gave up
a long time ago.  But we shouldn't attempt to "fix" that unless we can
demonstrate that it's a problem.


Also, re-reading your description:

: It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
: construct workloads which break the core assumption mentioned above and
: the OOM victim might take unbounded amount of time to exit because it
: might be blocked in the uninterruptible state waiting for on an event
: (e.g.  lock) which is blocked by another task looping in the page
: allocator.

So the allocating task has done an oom-kill and is waiting for memory
to become available.  The killed task is stuck on some lock, unable to
free memory.

But the problematic lock will sometimes be the killed task's mmap_sem,
so the reaper won't reap anything.  This scenario requires that the
mmap_sem is held for writing, which sounds like it will be uncommon. 
hm.  sigh.  I hate the oom-killer.  Just buy some more memory already!




Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-17 Thread Andrew Morton
On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds 
 wrote:

> On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko  wrote:
> > Ups. You are right. I will go with msleep_interruptible(100).
> 
> I don't think that's right.
> 
> If a signal happens, that loop is now (again) just busy-looping.

It's called only by a kernel thread so no signal_pending().  This
relationship is a bit unobvious and fragile, but we do it in quite a
few places.



Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-17 Thread Linus Torvalds
On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko  wrote:
> Ups. You are right. I will go with msleep_interruptible(100).

I don't think that's right.

If a signal happens, that loop is now (again) just busy-looping. That
doesn't sound right, although with the maximum limit of 10 attempts,
maybe it's fine - the thing is technically "busylooping", but it will
definitely not busy-loop for very long.

So maybe that code is fine, but I think the signal case might at least
merit a comment?

Also, if you actually do want UNINTERRUPTIBLE (no reaction to signals
at all), but don't want to be seen as being "load" on the system, you
can use TASK_IDLE, which is a combination of TASK_UNINTERRUPTIBLE |
TASK_NOLOAD.

Because if you sleep interruptibly, you do generally need to handle
signals (although that limit count may make it ok in this case).

There's basically three levels:

 - TASK_UNINTERRUPTIBLE: no signal handling at all

 - TASK_KILLABLE: no normal signal handling, but ok to be killed
(needs to check fatal_signal_pending() and exit)

 - TASK_INTERRUPTIBLE: will react to signals

(and then that TASK_IDLE thing that is semantically the same as
uninterruptible, but doesn't count against the load average).

The main use for TASK_KILLABLE is in places where expected semantics
do not allow an EINTR return, but we know that because the process is
about to be killed, we can ignore that, for the simple reason that
nobody will ever *see* the EINTR.
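
For illustration, the existing schedule_timeout_*() helpers (see the
sched.h hunk elsewhere in this thread) map onto these levels; a rough
sketch of the alternatives for the reaper's 100ms retry sleep:

	schedule_timeout_uninterruptible(HZ/10);	/* TASK_UNINTERRUPTIBLE: ignores signals */
	schedule_timeout_killable(HZ/10);		/* TASK_KILLABLE: caller must check
							   fatal_signal_pending() and bail out */
	schedule_timeout_interruptible(HZ/10);		/* TASK_INTERRUPTIBLE: any signal wakes it */

	__set_current_state(TASK_IDLE);			/* TASK_UNINTERRUPTIBLE | TASK_NOLOAD */
	schedule_timeout(HZ/10);			/* sleeps, but does not count as load */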

Btw, I think you might want to re-run your test-case after this
change, since the whole "busy loop vs actually sleeping" might just
have changed the result..

  Linus


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-17 Thread Michal Hocko
On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:
[...]
> > +static void oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +   int attempts = 0;
> > +
> > +   while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > +   schedule_timeout(HZ/10);
> 
> schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> msleep() or msleep_interruptible().  I can't decide which is more
> appropriate - it only affects the load average display.

Ups. You are right. I will go with msleep_interruptible(100).
 
> Which prompts the obvious question: as the no-operativeness of this
> call wasn't noticed in testing, why do we have it there...

Well, the idea was that an interfering mmap_sem operation which holds
it for write might block us for a short time period - e.g. when not
depending on an allocation or accessing the memory reserves helps
to progress the allocation. If the holder of the semaphore is stuck
then the retry is pointless. On the other hand the retry shouldn't be
harmful. All in all this is just a heuristic and we do not depend on
it. I guess we can drop it and nobody would actually notice. Let me know
if you prefer that and I will respin the patch.


> I guess it means that the __oom_reap_vmas() success rate is nice and
> high ;)

I had a debugging trace_printks around this and there were no retries
during my testing so I was probably lucky to not trigger the mmap_sem
contention.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 48025a21f8c4..f53f87cfd899 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
int attempts = 0;
 
while (attempts++ < 10 && !__oom_reap_vmas(mm))
-   schedule_timeout(HZ/10);
+   msleep_interruptible(100);
 
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] mm, oom: introduce oom reaper

2015-12-16 Thread Andrew Morton
On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko  wrote:

> From: Michal Hocko 
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 
> The OOM killer currently allows to kill only a single task in a good
> hope that the task will terminate in a reasonable time and frees up its
> memory.  Such a task (oom victim) will get an access to memory reserves
> via mark_oom_victim to allow a forward progress should there be a need
> for additional memory during exit path.
>
> ...
>
> +static void oom_reap_vmas(struct mm_struct *mm)
> +{
> + int attempts = 0;
> +
> + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> + schedule_timeout(HZ/10);

schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
msleep() or msleep_interruptible().  I can't decide which is more
appropriate - it only affects the load average display.

Which prompts the obvious question: as the no-operativeness of this
call wasn't noticed in testing, why do we have it there...

I guess it means that the __oom_reap_vmas() success rate is nice and
high ;)


[PATCH 1/2] mm, oom: introduce oom reaper

2015-12-15 Thread Michal Hocko
From: Michal Hocko 

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task in the hope
that the task will terminate in a reasonable time and free up its
memory.  Such a task (oom victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should there be a need
for additional memory during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting on an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed once its owner is killed and kicked from userspace anyway.
There is one notable exception to this, though: if the OOM victim was
in the process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation; workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocation
requests done with the lock held to an absolute minimum to reduce the risk
even further.
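
In code terms the reaping attempt is guarded by a trylock, roughly like
the following (a sketch of the pattern only, not the full function from
the patch):

static bool __oom_reap_vmas(struct mm_struct *mm)
{
	if (!down_read_trylock(&mm->mmap_sem))
		return false;		/* a writer holds it; the caller will retry */

	/* ... walk the VMAs and unmap the private anonymous memory ... */

	up_read(&mm->mmap_sem);
	return true;
}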

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clears it atomically once it is done with the work. This
means that only a single mm_struct can be reaped at a time. As the
operation is potentially disruptive we are trying to limit it to the
necessary minimum, and the reaper blocks any updates while it operates on
an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
race is detected by atomic_inc_not_zero(mm_users).
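
Roughly, the wake-up side of this hand-off looks as follows (a sketch;
oom_reaper_wait is an assumed wait queue name, the exact code is in the
mm/oom_kill.c part of the patch):

static void wake_oom_reaper(struct mm_struct *mm)
{
	if (!oom_reaper_th)
		return;

	/* Pin mm so a parallel exit_mmap() cannot free it under us. */
	atomic_inc(&mm->mm_count);

	/* Only the NULL -> mm transition is allowed: one mm at a time. */
	if (cmpxchg(&mm_to_reap, NULL, mm))
		mmdrop(mm);			/* another mm is already queued */
	else
		wake_up(&oom_reaper_wait);	/* assumed wait queue name */
}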

Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman 
Signed-off-by: Michal Hocko 
---
Hi,
this is version 3 of the patch. Thanks to Tetsuo for his exhaustive
review.  There haven't been any fundamental objections to the approach
so I have dropped the RFC status. Any further feedback is welcome of
course.

 include/linux/mm.h |   2 +
 mm/internal.h  |   5 ++
 mm/memory.c|  12 ++---
 mm/oom_kill.c  | 139 +++--
 4 files changed, 149 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
struct address_space *check_mapping;/* Check page->mapping if set */
pgoff_t first_index;/* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+   bool ignore_dirty;  /* Ignore dirty pages */
+   bool check_swap_entries;/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct 
