Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-23 Thread Michal Hocko
On Sat 17-06-17 22:30:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > What does this disassemble to on your kernel? Care to post addr2line?
> 
[...]
> The __oom_reap_task_mm+0xa1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:472
> which is "struct vm_area_struct *vma;" line in __oom_reap_task_mm().
> The __oom_reap_task_mm+0xb1/0x160 is __oom_reap_task_mm at mm/oom_kill.c:519
> which is "if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))" line.
> The <49> 8b 46 50 is "vma->vm_flags" in can_madv_dontneed_vma(vma) from 
> __oom_reap_task_mm().

OK, I see what is going on here. I should have noticed earlier. Sorry,
my fault. We are simply accessing a stale mm->mmap. exit_mmap() does
remove_vma(), which frees all the vmas, but it doesn't reset mm->mmap
to NULL. Trivial to fix.

diff --git a/mm/mmap.c b/mm/mmap.c
index ca58f8a2a217..253808e716dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2979,6 +2979,7 @@ void exit_mmap(struct mm_struct *mm)
 		nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
 	up_write(&mm->mmap_sem);
 }

> Is it safe for the OOM reaper to call the
> tlb_gather_mmu()/unmap_page_range()/tlb_finish_mmu() sequence
> after the OOM victim already completed the
> tlb_gather_mmu()/unmap_vmas()/free_pgtables()/tlb_finish_mmu()/
> remove_vma() sequence from exit_mmap() from __mmput() from mmput() from
> exit_mm() from do_exit()?

It is safe to race up until unmap_vmas() because that path only needs
mmap_sem in read mode (e.g. madvise(MADV_DONTNEED)), and all the later
operations have to be serialized because we cannot tear down page
tables while the oom reaper is doing a pte walk. After we drop mmap_sem
for write in exit_mmap() there are no vmas left, so there is nothing
for the reaper to do.
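
To illustrate, a condensed sketch of the reaper side, following the
__oom_reap_task_mm() shape discussed in this thread (declarations and
error handling elided; treat this as an illustration, not the exact
code): once exit_mmap() holds mmap_sem for write the trylock fails, and
after the fix above the vma list is empty once the lock is dropped, so
the loop is a no-op.

	if (!down_read_trylock(&mm->mmap_sem))
		return false;	/* exit_mmap() still holds it for write */

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!can_madv_dontneed_vma(vma))
			continue;
		/* only unmaps pages, never touches the vma list itself */
		tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
		unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
		tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
	}
	up_read(&mm->mmap_sem);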

I will give the patch more testing next week. This week was busy as
hell (I was travelling and then the stack gap thingy...).
-- 
Michal Hocko
SUSE Labs


Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-17 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 16-06-17 23:26:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > And the patch you proposed is broken.
> > > > > 
> > > > > Thanks for your testing!
> > > > >  
> > > > > > --
> > > > > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 
> > > > > > or sacrifice child
> > > > > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, 
> > > > > > anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [  161.858503] [ cut here ]
> > > > > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > > > > 
> > > > > BUG_ON(addr >= end) suggests our vma has been trimmed. I guess
> > > > > I see what is going on here.
> > > > > __oom_reap_task_mm                  exit_mmap
> > > > >                                       free_pgtables
> > > > >                                       up_write(mm->mmap_sem)
> > > > >   down_read_trylock(&mm->mmap_sem)
> > > > >                                       remove_vma
> > > > >   unmap_page_range
> > > > > 
> > > > > So we need to extend the mmap_sem coverage. See the updated diff (not
> > > > > the full proper patch yet).
> > > > 
> > > > That diff is still wrong. We need to prevent __oom_reap_task_mm()
> > > > from calling unmap_page_range() when __mmput() already called
> > > > exit_mm(), by setting/checking MMF_OOM_SKIP like shown below.
> > > 
> > > Care to explain why?
> > 
> > I don't know. Your updated diff is causing the oops below.
> > 
> > --
> > [   90.621890] Out of memory: Kill process 2671 (a.out) score 999 or 
> > sacrifice child
> > [   90.624636] Killed process 2671 (a.out) total-vm:4172kB, anon-rss:84kB, 
> > file-rss:0kB, shmem-rss:0kB
> > [   90.861308] general protection fault:  [#1] PREEMPT SMP 
> > DEBUG_PAGEALLOC
> > [   90.863695] Modules linked in: coretemp pcspkr sg vmw_vmci shpchp 
> > i2c_piix4 sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper 
> > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi 
> > mptscsih ahci mptbase libahci drm e1000 ata_piix i2c_core libata ipv6
> > [   90.870672] CPU: 2 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #128
> > [   90.872929] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
> > Desktop Reference Platform, BIOS 6.00 07/02/2015
> > [   90.875995] task: 88007b6cd2c0 task.stack: 88007b6d
> > [   90.878290] RIP: 0010:__oom_reap_task_mm+0xa1/0x160
> 
> What does this disassemble to on your kernel? Care to post addr2line?

--
[  114.427451] Out of memory: Kill process 2876 (a.out) score 999 or sacrifice 
child
[  114.430208] Killed process 2876 (a.out) total-vm:4172kB, anon-rss:84kB, 
file-rss:0kB, shmem-rss:0kB
[  114.436753] general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  114.439129] Modules linked in: pcspkr coretemp sg vmw_vmci i2c_piix4 shpchp 
sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm ahci e1000 libahci mptspi 
scsi_transport_spi drm mptscsih mptbase i2c_core ata_piix libata ipv6
[  114.446220] CPU: 0 PID: 47 Comm: oom_reaper Not tainted 4.12.0-rc5+ #133
[  114.448705] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/02/2015
[  114.451695] task: 88007b6cd2c0 task.stack: 88007b6d
[  114.453703] RIP: 0010:__oom_reap_task_mm+0xa1/0x160
[  114.455422] RSP: :88007b6d3df0 EFLAGS: 00010202
[  114.457527] RAX: 6b6b6b6b6b6b6b6b RBX: 8800670eaa40 RCX: 
[  114.460002] RDX: 88007b6d3e18 RSI: 8800670eaa40 RDI: 88007b6d3df0
[  114.462206] RBP: 88007b6d3e98 R08: 88007b6cdb08 R09: 88007b6cdad0
[  114.464390] R10:  R11: 83f54a84 R12: 8800670eab00
[  114.466659] R13: 880067211bc0 R14: 6b6b6b6b6b6b6b6b R15: 8800670eaa40
[  114.469126] FS:  () GS:88007c20() 
knlGS:
[  114.471496] CS:  0010 DS:  ES:  CR0: 80050033
[  114.473540] CR2: 7f55d759d050 CR3: 79ff4000 CR4: 001406f0
[  114.475773] Call Trace:
[  114.477078]  oom_reaper+0xa2/0x1b0 /* oom_reap_task at mm/oom_kill.c:542 (inlined by) oom_reaper at mm/oom_kill.c:580 */
[  114.478569]  ? wake_up_bit+0x30/0x30
[  114.480058]  kthread+0x10d/0x140
[  114.481656]  ? __oom_reap_task_mm+0x160/0x160
[  114.483308]  ? kthread_create_on_node+0x60/0x60
[  114.485075]  ret_from_fork+0x27/0x40
[  114.486620] Code: c3 e8 54 82 f1 ff f0 80 8b 7a 04 00 00 40 48 8d bd 58 ff 
ff ff 48 83 c9 ff 31 d2 48 89 de e8 57 12 03 00 4c 8b 33 4d 85 f6 74 3b <49> 8b 
46 50 a9 00 24 40 00 75 27 49 83 be 90 00 00 00 00 74 04 
[  114.491819] RIP: __oom_reap_task_mm+0xa1/0x160 RSP: 88007b6d3df0
[  114.494520] 


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Michal Hocko
On Fri 16-06-17 21:22:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, could you play with the patch/idea suggested in
> > http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?
> 
> I think we don't need to worry about mmap_sem dependency inside
> __mmput(). Since the OOM killer checks for !MMF_OOM_SKIP mm rather
> than TIF_MEMDIE thread, we can keep the OOM killer disabled until we
> set MMF_OOM_SKIP to the victim's mm. That is, elevating mm_users
> throughout the reaping procedure does not cause premature victim
> selection, even after TIF_MEMDIE is cleared from the victim's thread.
> Then, we don't need to use down_write()/up_write() for non OOM
> victim's mm (nearly 100% of exit_mmap() calls), and can force partial
> reaping of OOM victim's mm (nearly 0% of exit_mmap() calls) before
> __mmput() starts doing exit_aio() etc. Patch is shown below. Only
> compile tested.

Yes, that would be another approach.
 
>  include/linux/sched/coredump.h |  1 +
>  mm/oom_kill.c                  | 80 --
>  2 files changed, 40 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 98ae0d0..6b6237b 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>   * on NFS restore
>   */
>  //#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */

A new flag is not really needed. We can increase mm_users for _each_
reapable oom victim.
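
A minimal sketch of that idea (an illustration, not a tested patch),
assuming the cmpxchg() on signal->oom_mm in the hunk below already
serves as the once-per-victim guard:

	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
		mmgrab(tsk->signal->oom_mm);
#ifdef CONFIG_MMU
		/* pin mm_users once per victim; the reaper drops it */
		mmget(mm);
#endif
	}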

> @@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
>  	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
>  		mmgrab(tsk->signal->oom_mm);
>  
> +#ifdef CONFIG_MMU
> +	if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
> +		set_bit(MMF_OOM_REAPING, &mm->flags);
> +		mmget(mm);
> +	}
> +#endif

This would really need a big fat warning explaining why we do not need
mmget_not_zero. We rely on exit_mm doing both mmput and tsk->mm = NULL
under the task_lock, mark_oom_victim is called under this lock as
well, and task_will_free_mem resp. find_lock_task_mm makes sure we do
not even consider tasks without mm.

I agree that a solution which is fully contained inside the oom proper
would be preferable to touching __mmput path.
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Tetsuo Handa
Michal Hocko wrote:
> OK, could you play with the patch/idea suggested in
> http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?

I think we don't need to worry about mmap_sem dependency inside
__mmput(). Since the OOM killer checks for !MMF_OOM_SKIP mm rather than
TIF_MEMDIE thread, we can keep the OOM killer disabled until we set
MMF_OOM_SKIP to the victim's mm. That is, elevating mm_users throughout
the reaping procedure does not cause premature victim selection, even
after TIF_MEMDIE is cleared from the victim's thread. Then, we don't
need to use down_write()/up_write() for non OOM victim's mm (nearly
100% of exit_mmap() calls), and can force partial reaping of OOM
victim's mm (nearly 0% of exit_mmap() calls) before __mmput() starts
doing exit_aio() etc. Patch is shown below. Only compile tested.

 include/linux/sched/coredump.h |  1 +
 mm/oom_kill.c  | 80 --
 2 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 98ae0d0..6b6237b 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
  * on NFS restore
  */
 //#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
+#define MMF_OOM_REAPING		18	/* mm is supposed to be reaped */
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..bdcf658 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,38 +470,9 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm           exit_mm
-	 *   mmget_not_zero
-	 *                                mmput
-	 *                                  atomic_dec_and_test
-	 *                                exit_oom_victim
-	 *                              [...]
-	 *                              out_of_memory
-	 *                                select_bad_process
-	 *                                  # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
-		goto unlock_oom;
-	}
-
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
+	if (!down_read_trylock(&mm->mmap_sem))
+		return false;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -537,16 +508,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -573,9 +535,32 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 *
+	 * Serialize setting of MMF_OOM_SKIP using oom_lock in order to
+	 * avoid race with select_bad_process() which causes premature
+	 * new oom victim selection.
+	 *
+	 * The OOM reaper:                An allocating task:
+	 *                                  Failed get_page_from_freelist().
+	 *                                  Enters into out_of_memory().
+	 *   Reaped memory enough to make get_page_from_freelist() succeed.
+	 *   Sets MMF_OOM_SKIP to mm.
+	 *                                  Enters into select_bad_process().
+	 *                                    # MMF_OOM_SKIP mm selects new victim.
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
+	/*
+	 * Drop our reference but make sure the mmput slow path is called from a
+	 * different context because we shouldn't risk we get stuck there and
+	 * put the oom_reaper out of the way.
+	 */
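
For context, a minimal sketch of the selection-side check this approach
leans on (the oom_evaluate_task() logic in mm/oom_kill.c of this era,
reproduced from memory, so treat it as an illustration rather than the
exact code): as long as the victim's mm lacks MMF_OOM_SKIP, the OOM
killer aborts instead of picking another victim, which is what makes it
safe to keep mm_users elevated until the reaper is done.

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* already reaped: eligible again */
		goto abort;		/* wait for the existing victim */
	}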


Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Michal Hocko
On Fri 16-06-17 19:27:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> > [...]
> > > And the patch you proposed is broken.
> > 
> > Thanks for your testing!
> >  
> > > --
> > > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or 
> > > sacrifice child
> > > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, 
> > > anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > > [  161.858503] [ cut here ]
> > > [  161.861512] kernel BUG at mm/memory.c:1381!
> > 
> > BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see
> > what is going on here.
> > __oom_reap_task_mm                  exit_mmap
> >                                       free_pgtables
> >                                       up_write(mm->mmap_sem)
> >   down_read_trylock(&mm->mmap_sem)
> >                                       remove_vma
> >   unmap_page_range
> > 
> > So we need to extend the mmap_sem coverage. See the updated diff (not
> > the full proper patch yet).
> 
> That diff is still wrong. We need to prevent __oom_reap_task_mm() from
> calling unmap_page_range() when __mmput() already called exit_mm(), by
> setting/checking MMF_OOM_SKIP like shown below.

Care to explain why?
[...]
 
> Since the OOM reaper does not reap hugepages, khugepaged_exit() part could be
> safe.

I think you are mixing hugetlb and THP pages here. khugepaged_exit is
about the latter and we do unmap those.

> But ksm_exit() part might interfere.

How?

> If it is guaranteed to be safe,
> what will go wrong if we move uprobe_clear_state()/exit_aio()/ksm_exit() etc.
> to just before mmdrop() (i.e. after setting MMF_OOM_SKIP) ?

I do not see why those matter and why they should be special. Unless
I am missing something, we really only care about the page table tear
down and address space modifications. They do none of that.

-- 
Michal Hocko
SUSE Labs


Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Tetsuo Handa
Michal Hocko wrote:
> On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> [...]
> > And the patch you proposed is broken.
> 
> Thanks for your testing!
>  
> > --
> > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or 
> > sacrifice child
> > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, 
> > file-rss:0kB, shmem-rss:0kB
> > [  161.858503] [ cut here ]
> > [  161.861512] kernel BUG at mm/memory.c:1381!
> 
> BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see
> what is going on here.
> __oom_reap_task_mm                  exit_mmap
>                                       free_pgtables
>                                       up_write(mm->mmap_sem)
>   down_read_trylock(&mm->mmap_sem)
>                                       remove_vma
>   unmap_page_range
> 
> So we need to extend the mmap_sem coverage. See the updated diff (not
> the full proper patch yet).

That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
unmap_page_range() when __mmput() already called exit_mm(), by setting/checking
MMF_OOM_SKIP like shown below.

diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..5ef715c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,6 +902,11 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
@@ -913,6 +918,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	up_write(&mm->mmap_sem);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..98cca19 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,12 +493,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		goto unlock_oom;
 	}
@@ -537,13 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;



> 
> Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> and clarify in your patch what the possible side effects of racing
> uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> __oom_reap_task_mm() are.
> 
> Yes that definitely needs to be checked. We basically rely on the racing
> part of the __mmput to not modify the address space. oom_reaper doesn't
> touch any vma state except it unmaps pages which can run in parallel.
> exit_aio->kill_ioctx seemingly does vm_munmap but it a) uses the
> mmap_sem for write and b) it doesn't actually unmap because exit_aio
> does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some housekeeping
> which is not modifying the address space. I hope I will find some more
> time to work on this next week. Additional testing would be highly
> appreciated of course.

Since the OOM reaper does not reap hugepages, the khugepaged_exit() part
could be safe. But the ksm_exit() part might interfere. If it is
guaranteed to be safe, what will go wrong if we move
uprobe_clear_state()/exit_aio()/ksm_exit() etc. to just before mmdrop()
(i.e. after setting MMF_OOM_SKIP)?


Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Michal Hocko
On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
[...]
> And the patch you proposed is broken.

Thanks for your testing!
 
> --
> [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or 
> sacrifice child
> [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, 
> file-rss:0kB, shmem-rss:0kB
> [  161.858503] [ cut here ]
> [  161.861512] kernel BUG at mm/memory.c:1381!

BUG_ON(addr >= end) suggests our vma has been trimmed. I guess I see
what is going on here.
__oom_reap_task_mm                  exit_mmap
                                      free_pgtables
                                      up_write(mm->mmap_sem)
  down_read_trylock(&mm->mmap_sem)
                                      remove_vma
  unmap_page_range

So we need to extend the mmap_sem coverage. See the updated diff (not
the full proper patch yet).

> Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> and clarify in your patch what the possible side effects of racing
> uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> __oom_reap_task_mm() are.

Yes that definitely needs to be checked. We basically rely on the racing
part of the __mmput to not modify the address space. oom_reaper doesn't
touch any vma state except it unmaps pages which can run in parallel.
exit_aio->kill_ioctx seemingly does vm_munmap but it a) uses the
mmap_sem for write and b) it doesn't actually unmap because exit_aio
does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some housekeeping
which is not modifying the address space. I hope I will find some more
time to work on this next week. Additional testing would be highly
appreciated of course.
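
To illustrate the exit_aio point, a rough sketch from memory of the
relevant fs/aio.c pieces in this era (an illustration of the mechanism,
not a verbatim quote): exit_aio() zeroes ctx->mmap_size before killing
the context, and the ring teardown only munmaps when mmap_size is
non-zero, so that path no longer modifies the address space.

	/* exit_aio(), per context */
	ctx->mmap_size = 0;		/* makes the vm_munmap() below a no-op */
	kill_ioctx(mm, ctx, &wait);

	/* aio_free_ring() */
	if (ctx->mmap_size)
		vm_munmap(ctx->mmap_base, ctx->mmap_size);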

---
diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..ca58f8a2a217 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,6 +2962,11 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -2975,6 +2980,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+	up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..3df464f0f48b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -494,16 +494,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		goto unlock_oom;
-	}
-
-	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
 	 * is no longer stable. No barriers really needed because unmapping
 	 * should imply barriers already and the reader would hit a page fault
@@ -537,13 +527,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-16 Thread Michal Hocko
On Thu 15-06-17 15:42:23, David Rientjes wrote:
> On Fri, 16 Jun 2017, Michal Hocko wrote:
> 
> > I am sorry but I have worked really hard to make the oom reaper a
> > reliable way to make all the potential oom lockups go away. I do not
> > want to reintroduce another potential lockup now.
> 
> Please show where this "potential lockup" ever existed in a bug report or 
> a testcase?

I am not aware of any specific bug report. But the main point of the
reaper is to close all _possible_ lockups due to an oom victim being
stuck somewhere. exit_aio waits for all kiocbs. Can we guarantee that
none of them will depend on an allocation (directly or via a lock
chain) to proceed? Likewise ksm_exit/khugepaged_exit depend on mmap_sem
for write to proceed. Are we _guaranteed_ nobody can hold mmap_sem for
read at that time and depend on an allocation? Can we guarantee that
the __mmput path will work without any dependency on allocation in the
future?

> I have never seen __mmput() block when trying to free the 
> memory it maps.
> 
> > I also do not see why any
> > solution should be rushed into. I have proposed a way to go and unless
> > it is clear that this is not a way forward then I simply do not agree
> > with any partial workarounds or shortcuts.
> 
> This is not a shortcut, it is a bug fix.  4.12 kills 1-4 processes 
> unnecessarily as a result of setting MMF_OOM_SKIP incorrectly before the 
> mm's memory can be freed.  If you have not seen this issue before, which 
> is why you asked if I ever observed it in practice, then you have not 
> stress tested oom reaping.  It is very observable and reproducible.  

I am not questioning that it works for your particular test. I just
argue that it reduces the robustness of the oom reaper because it
allows an oom victim to leave the reaper without MMF_OOM_SKIP set, and
that flag is the core mechanism guaranteeing forward progress. So we
should think about something more appropriate.

> I do 
> not agree that adding additional and obscure locking into __mmput() is the 
> solution to what is plainly and obviously fixed with this simple patch.

Well, __mmput path already depends on the mmap_sem for write. So this is
not a new concept. I am not saying using mmap_sem is the only way. I
will think about that more.
 
> 4.12 needs to stop killing 2-5 processes on every oom condition instead of 
> 1.

Believe me, I am not dismissing the issue nor the fact it _has_ to be
fixed. I just disagree we should make the oom reaper less robust.

-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Tetsuo Handa wrote:
> and clarify in your patch that there is no possibility
> of waiting for direct/indirect memory allocation inside free_pgtables(),
> in addition to fixing the bug above.

Oops, this part was wrong, for __oom_reap_task_mm() will give up after
waiting for about one second because down_read_trylock(&mm->mmap_sem)
keeps failing due to down_write(&mm->mmap_sem) by exit_mmap().
# This is after all moving the location of "give up by timeout", isn't it? ;-)

Thus, clarify in your patch that there is no possibility of waiting for
direct/indirect memory allocation outside down_write()/up_write() (e.g.
i_mmap_lock_write() inside unmap_vmas(&tlb, vma, 0, -1) just before
down_write()).
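
For reference, a sketch of the retry/give-up behaviour being discussed,
following the oom_reap_task() shape visible earlier in this thread
(illustrative, not a verbatim quote): every failed down_read_trylock()
burns one attempt plus a short sleep, so a writer holding mmap_sem for
about a second makes the reaper set MMF_OOM_SKIP without reaping.

	#define MAX_OOM_REAP_RETRIES 10

	/* oom_reap_task(), roughly */
	while (attempts++ < MAX_OOM_REAP_RETRIES &&
	       !__oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);	/* ~1 second total */

	/* reaped or not, hide the mm from further victim selection */
	set_bit(MMF_OOM_SKIP, &mm->flags);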


Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 15-06-17 15:03:17, David Rientjes wrote:
> > On Thu, 15 Jun 2017, Michal Hocko wrote:
> > 
> > > > Yes, quite a bit in testing.
> > > > 
> > > > One oom kill shows the system to be oom:
> > > > 
> > > > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > > > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > > > 
> > > > followed up by one or more unnecessary oom kills showing the oom killer 
> > > > racing with memory freeing of the victim:
> > > > 
> > > > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > > > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > > > 
> > > > The patch is absolutely required for us to prevent continuous oom 
> > > > killing 
> > > > of processes after a single process has been oom killed and its memory 
> > > > is 
> > > > in the process of being freed.
> > > 
> > > OK, could you play with the patch/idea suggested in
> > > http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?
> > > 
> > 
> > I cannot, I am trying to unblock a stable kernel release to my production 
> > that is obviously fixed with this patch and cannot experiment with 
> > uncompiled and untested patches that introduce otherwise unnecessary 
> > locking into the __mmput() path and is based on speculation rather than 
> > hard data that __mmput() for some reason stalls for the oom victim's mm.  
> > I was hoping that this fix could make it in time for 4.12 since 4.12 kills 
> > 1-4 processes unnecessarily for each oom condition and then can review any 
> > tested solution you may propose at a later time.
> 
> I am sorry but I have worked really hard to make the oom reaper a
> reliable way to make all the potential oom lockups go away. I do not
> want to reintroduce another potential lockup now. I also do not see
> why any solution should be rushed into. I have proposed a way to go
> and unless it is clear that this is not a way forward then I simply do
> not agree with any partial workarounds or shortcuts.

And the patch you proposed is broken.

--
[  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice 
child
[  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, 
file-rss:0kB, shmem-rss:0kB
[  161.858503] [ cut here ]
[  161.861512] kernel BUG at mm/memory.c:1381!
[  161.864154] invalid opcode:  [#1] SMP
[  161.866599] Modules linked in: nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security 
ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter 
ebtables ip6table_filter ip6_tables coretemp crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel vmw_balloon pcspkr ppdev shpchp parport_pc i2c_piix4 
parport vmw_vmci xfs libcrc32c vmwgfx crc32c_intel drm_kms_helper serio_raw ttm 
drm e1000 mptspi scsi_transport_spi mptscsih mptbase ata_generic pata_acpi 
floppy
[  161.896811] CPU: 1 PID: 43 Comm: oom_reaper Not tainted 4.12.0-rc5+ #221
[  161.900458] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/31/2013
[  161.905588] task: 937bb1c13200 task.stack: a13cc0b94000
[  161.908876] RIP: 0010:unmap_page_range+0xa19/0xa60
[  161.911739] RSP: :a13cc0b97d08 EFLAGS: 00010282
[  161.914767] RAX:  RBX: 937ba9e89300 RCX: 00401000
[  161.918543] RDX: 937baf707440 RSI: 937baf707680 RDI: a13cc0b97df0
[  161.922314] RBP: a13cc0b97de0 R08:  R09: 
[  161.926059] R10:  R11: 1f1e8b15 R12: 937ba9e893c0
[  161.929789] R13: 937ba4198000 R14: 937baf707440 R15: 937ba9e89300
[  161.933509] FS:  () GS:937bb380() 
knlGS:
[  161.937615] CS:  0010 DS:  ES:  CR0: 80050033
[  161.940774] CR2: 561fb93c1b00 CR3: 9ee11000 CR4: 001406e0
[  161.944477] Call Trace:
[  161.946333]  ? __mutex_lock+0x574/0x950
[  161.948678]  ? __mutex_lock+0xce/0x950
[  161.950996]  ? __oom_reap_task_mm+0x49/0x170
[  161.953485]  __oom_reap_task_mm+0xd8/0x170
[  161.955893]  oom_reaper+0xac/0x1c0
[  161.957992]  ? remove_wait_queue+0x60/0x60
[  161.960688]  kthread+0x117/0x150
[  161.962719]  ? trace_event_raw_event_oom_score_adj_update+0xe0/0xe0
[  161.965920]  ? kthread_create_on_node+0x70/0x70
[  161.968417]  ret_from_fork+0x2a/0x40
[  161.970530] Code: 13 fb ff ff e9 25 fc ff ff 48 83 e8 01 e9 77 fc ff ff 48 
83 e8 01 e9 62 fe ff ff e8 22 0a e6 ff 48 8b 7d 98 e8 09 ba ff ff 0f 0b <0f> 0b 
48 83 e9 01 e9 a1 fb ff ff e8 03 a5 06 00 48 83 e9 01 e9 
[  161.979386] RIP: unmap_page_range+0xa19/0xa60 RSP: a13cc0b97d08
[  161.982611] ---[ end trace 

Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread David Rientjes
On Fri, 16 Jun 2017, Michal Hocko wrote:

> I am sorry, but I have worked really hard to make the oom reaper a reliable
> way to make all the potential oom lockups go away. I do not want to
> reintroduce another potential lockup now.

Please show where this "potential lockup" ever existed in a bug report or 
a testcase?  I have never seen __mmput() block when trying to free the 
memory it maps.

> I also do not see why any
> solution should be rushed in. I have proposed a way to go, and unless
> it is clear that this is not a way forward, I simply do not agree
> with any partial workarounds or shortcuts.

This is not a shortcut, it is a bug fix.  4.12 kills 1-4 processes 
unnecessarily as a result of setting MMF_OOM_SKIP incorrectly before the 
mm's memory can be freed.  If you have not seen this issue before, which 
is why you asked if I ever observed it in practice, then you have not 
stress tested oom reaping.  It is very observable and reproducible.  I do 
not agree that adding additional and obscure locking into __mmput() is the 
solution to what is plainly and obviously fixed with this simple patch.

4.12 needs to stop killing 2-5 processes on every oom condition instead of 
1.


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 15:03:17, David Rientjes wrote:
> On Thu, 15 Jun 2017, Michal Hocko wrote:
> 
> > > Yes, quite a bit in testing.
> > > 
> > > One oom kill shows the system to be oom:
> > > 
> > > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > > 
> > > followed up by one or more unnecessary oom kills showing the oom killer 
> > > racing with memory freeing of the victim:
> > > 
> > > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > > 
> > > The patch is absolutely required for us to prevent continuous oom killing 
> > > of processes after a single process has been oom killed and its memory is 
> > > in the process of being freed.
> > 
> > OK, could you play with the patch/idea suggested in
> > http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?
> > 
> 
> I cannot, I am trying to unblock a stable kernel release to my production 
> that is obviously fixed with this patch and cannot experiment with 
> uncompiled and untested patches that introduce otherwise unnecessary 
> locking into the __mmput() path and are based on speculation rather than 
> hard data that __mmput() for some reason stalls for the oom victim's mm.  
> I was hoping that this fix could make it in time for 4.12, since 4.12 kills 
> 1-4 processes unnecessarily for each oom condition; any tested solution you 
> may propose can be reviewed at a later time.

I am sorry, but I have worked really hard to make the oom reaper a reliable
way to make all the potential oom lockups go away. I do not want to
reintroduce another potential lockup now. I also do not see why any
solution should be rushed in. I have proposed a way to go, and unless
it is clear that this is not a way forward, I simply do not agree
with any partial workarounds or shortcuts.
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread David Rientjes
On Thu, 15 Jun 2017, Michal Hocko wrote:

> > Yes, quite a bit in testing.
> > 
> > One oom kill shows the system to be oom:
> > 
> > [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> > [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> > 
> > followed up by one or more unnecessary oom kills showing the oom killer 
> > racing with memory freeing of the victim:
> > 
> > [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> > [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> > 
> > The patch is absolutely required for us to prevent continuous oom killing 
> > of processes after a single process has been oom killed and its memory is 
> > in the process of being freed.
> 
> OK, could you play with the patch/idea suggested in
> http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?
> 

I cannot, I am trying to unblock a stable kernel release to my production 
that is obviously fixed with this patch and cannot experiment with 
uncompiled and untested patches that introduce otherwise unnecessary 
locking into the __mmput() path and are based on speculation rather than 
hard data that __mmput() for some reason stalls for the oom victim's mm.  
I was hoping that this fix could make it in time for 4.12, since 4.12 kills 
1-4 processes unnecessarily for each oom condition; any tested solution you 
may propose can be reviewed at a later time.


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 15-06-17 22:01:33, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > > > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct 
> > > > > *tsk)
> > > > >   struct mm_struct *mm = tsk->signal->oom_mm;
> > > > >  
> > > > >   /* Retry the down_read_trylock(mmap_sem) a few times */
> > > > > - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > > > + while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > > > +&& attempts++ < MAX_OOM_REAP_RETRIES)
> > > > >   schedule_timeout_idle(HZ/10);
> > > > >  
> > > > > - if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > > - goto done;
> > > > > -
> > > > > -
> > > > > - pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > > - task_pid_nr(tsk), tsk->comm);
> > > > > - debug_show_all_locks();
> > > > > -
> > > > > -done:
> > > > > - tsk->oom_reaper_list = NULL;
> > > > > -
> > > > >   /*
> > > > >* Hide this mm from OOM killer because it has been either 
> > > > > reaped or
> > > > >* somebody can't call up_write(mmap_sem).
> > > > >*/
> > > > > - set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > > + if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > > > + pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > > + task_pid_nr(tsk), tsk->comm);
> > > > > + debug_show_all_locks();
> > > > > + }
> > > > > +
> > > > 
> > > > How does this _solve_ anything? Why would you even retry when you
> > > > _know_ that the reference count dropped to zero. It will never
> > > > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > > > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.
> > 
> > If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
> > to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
> > some memory, subsequent OOM killer invocation is automatically avoided.
> > If __mmput() did not release some memory, let the OOM killer invoke again.
> > 
> > > 
> > > Just to make myself more clear. The above assumes that the victim hasn't
> > > passed exit_mmap and MMF_OOM_SKIP in __mmput. Which is the case we want to
> > > address here.
> > 
> > David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper found that
> > mm->mm_users == 0. But we must not wait forever because __mmput() might fail
> > to release some memory immediately. If __mmput() did not release some memory
> > within schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM
> > killer invoke again. So, this is the case we want to address here, isn't it?
> 
> And we are back to a timeout-based approach... Sigh. Just imagine that
> you have a really large process which will take some time to tear down,
> while the memory it frees might be in a different oom domain. Now you
> propose to keep retrying and eventually give up, allowing a new oom
> victim from that oom domain.

We are already using a timeout-based approach for down_read_trylock(&mm->mmap_sem)
in __oom_reap_task_mm(). It is possible that down_read_trylock(&mm->mmap_sem)
would succeed if the OOM reaper waited one more second, for the thread which is
holding mmap_sem for write could just be failing to get TIF_MEMDIE due to
oom_lock contention among unrelated threads, but we allow the OOM reaper to
give up after one second.

Even if the victim is a really large process which will take some time to tear
down inside __mmput(), a subsequent OOM killer invocation will be _automatically
avoided_ if __mmput() released _some_ memory. Thus, giving up on __mmput() after
one second, just as we give up on down_read_trylock(&mm->mmap_sem) after one
second, is reasonable.

> 
> If we want to handle oom victims with mm_users == 0 then let's do it
> properly, please.
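
For concreteness, the one-second budget Tetsuo refers to above falls out of
the reaper's retry loop. A sketch assuming the 4.12-era constants in
mm/oom_kill.c (MAX_OOM_REAP_RETRIES is 10 there):

/* Each failed __oom_reap_task_mm() attempt sleeps
 * schedule_timeout_idle(HZ/10), i.e. about 100ms, so the reaper stops
 * retrying after roughly MAX_OOM_REAP_RETRIES * HZ/10 ==
 * 10 * 100ms == 1 second, and only then sets MMF_OOM_SKIP.
 */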


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 14:26:26, David Rientjes wrote:
> On Thu, 15 Jun 2017, Michal Hocko wrote:
> 
> > > If mm->mm_users is not incremented because it is already zero by the oom
> > > reaper, meaning the final refcount has been dropped, do not set
> > > MMF_OOM_SKIP prematurely.
> > > 
> > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > a previous oom victim is still mapped.
> > 
> > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > give a definitive answer to those questions. And we really _want_ to
> > have a guarantee of forward progress here. Killing an additional
> > process is a price to pay, and if that doesn't trigger normally it sounds
> > like a reasonable compromise to me.
> > 
> 
> I have not seen any issues where __mmput() stalls and exit_mmap() fails to 
> free its mapped memory once mm->mm_users has dropped to 0.
> 
> > > __mmput() naturally requires no
> > > references on mm->mm_users to do exit_mmap().
> > > 
> > > Without this, several processes can be oom killed unnecessarily and the
> > > oom log can show an abundance of memory available if exit_mmap() is in
> > > progress at the time the process is skipped.
> > 
> > Have you seen this happening in the real life?
> > 
> 
> Yes, quite a bit in testing.
> 
> One oom kill shows the system to be oom:
> 
> [22999.488705] Node 0 Normal free:90484kB min:90500kB ...
> [22999.488711] Node 1 Normal free:91536kB min:91948kB ...
> 
> followed up by one or more unnecessary oom kills showing the oom killer 
> racing with memory freeing of the victim:
> 
> [22999.510329] Node 0 Normal free:229588kB min:90500kB ...
> [22999.510334] Node 1 Normal free:600036kB min:91948kB ...
> 
> The patch is absolutely required for us to prevent continuous oom killing 
> of processes after a single process has been oom killed and its memory is 
> in the process of being freed.

OK, could you play with the patch/idea suggested in
http://lkml.kernel.org/r/20170615122031.gl1...@dhcp22.suse.cz?

-- 
Michal Hocko
SUSE Labs
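
For reference, the teardown steps Michal's questions above are about, in the
order __mmput() runs them; a trimmed sketch of the 4.12-era kernel/fork.c
(see the real function for the parts elided here):

static inline void __mmput(struct mm_struct *mm)
{
	uprobe_clear_state(mm);
	exit_aio(mm);           /* "can somebody block exit_aio?" */
	ksm_exit(mm);           /* may need mmap_sem */
	khugepaged_exit(mm);    /* may need mmap_sem; must run before exit_mmap */
	exit_mmap(mm);          /* this is what actually frees the memory */
	set_mm_exe_file(mm, NULL);
	/* ... mmlist / binfmt cleanup elided ... */
	mmdrop(mm);
}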


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread David Rientjes
On Thu, 15 Jun 2017, Tetsuo Handa wrote:

> David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper found that
> mm->mm_users == 0.

Yes, because MMF_OOM_SKIP enables the oom killer to select another process 
to kill and will do so without the original victim's mm being able to 
undergo exit_mmap().  So now we kill two or more processes when one would 
have sufficed; I have seen up to four processes killed unnecessarily 
without this patch.

> But we must not wait forever because __mmput() might fail to
> release some memory immediately. If __mmput() did not release some memory
> within schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM
> killer invoke again. So, this is the case we want to address here, isn't it?
> 

How long it would be sensible to wait for __mmput() to even get a chance to 
be called is obviously a function of the number of threads that share the mm 
with the oom victim, along with potentially allowing a non-zero number of 
those threads to allocate from memory reserves so that they can eventually 
drop mm->mmap_sem and make forward progress.

I have not witnessed any thread stalling in __mmput() that prevents the 
mm's memory from being freed.  I have witnessed several processes oom killed 
unnecessarily for a single oom condition where before MMF_OOM_SKIP was 
introduced, a single oom kill would have sufficed.
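
The selection logic David refers to, for reference: the oom killer only moves
past an existing victim once MMF_OOM_SKIP is set on its mm. A trimmed sketch
of the 4.12-era check in oom_evaluate_task() (mm/oom_kill.c):

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;      /* give up on this victim, pick another */
		goto abort;             /* otherwise keep waiting for it */
	}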


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread David Rientjes
On Thu, 15 Jun 2017, Michal Hocko wrote:

> > If mm->mm_users is not incremented because it is already zero by the oom
> > reaper, meaning the final refcount has been dropped, do not set
> > MMF_OOM_SKIP prematurely.
> > 
> > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > a previous oom victim is still mapped.
> 
> true and do we have a _guarantee_ it will do it? E.g. can somebody block
> exit_aio from completing? Or can somebody hold mmap_sem and thus block
> ksm_exit resp. khugepaged_exit from completing? The reason why I was
> conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> give a definitive answer to those questions. And we really _want_ to
> have a guarantee of forward progress here. Killing an additional
> process is a price to pay, and if that doesn't trigger normally it sounds
> like a reasonable compromise to me.
> 

I have not seen any issues where __mmput() stalls and exit_mmap() fails to 
free its mapped memory once mm->mm_users has dropped to 0.

> > __mmput() naturally requires no
> > references on mm->mm_users to do exit_mmap().
> > 
> > Without this, several processes can be oom killed unnecessarily and the
> > oom log can show an abundance of memory available if exit_mmap() is in
> > progress at the time the process is skipped.
> 
> Have you seen this happening in the real life?
> 

Yes, quite a bit in testing.

One oom kill shows the system to be oom:

[22999.488705] Node 0 Normal free:90484kB min:90500kB ...
[22999.488711] Node 1 Normal free:91536kB min:91948kB ...

followed up by one or more unnecessary oom kills showing the oom killer 
racing with memory freeing of the victim:

[22999.510329] Node 0 Normal free:229588kB min:90500kB ...
[22999.510334] Node 1 Normal free:600036kB min:91948kB ...

The patch is absolutely required for us to prevent continuous oom killing 
of processes after a single process has been oom killed and its memory is 
in the process of being freed.
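
Putting the two Node 0 snapshots above side by side makes the race visible
(same watermark in both):

/*
 * first kill:   free  90484kB <  min 90500kB   -- genuinely below the
 *                                                 watermark; kill justified
 * second kill:  free 229588kB >> min 90500kB   -- far above the watermark;
 *                                                 the first victim's memory
 *                                                 was still being freed when
 *                                                 the next victim was picked
 */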


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 22:01:33, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > > > struct mm_struct *mm = tsk->signal->oom_mm;
> > > >  
> > > > /* Retry the down_read_trylock(mmap_sem) a few times */
> > > > -   while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > > +   while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > > +  && attempts++ < MAX_OOM_REAP_RETRIES)
> > > > schedule_timeout_idle(HZ/10);
> > > >  
> > > > -   if (attempts <= MAX_OOM_REAP_RETRIES)
> > > > -   goto done;
> > > > -
> > > > -
> > > > -   pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > -   task_pid_nr(tsk), tsk->comm);
> > > > -   debug_show_all_locks();
> > > > -
> > > > -done:
> > > > -   tsk->oom_reaper_list = NULL;
> > > > -
> > > > /*
> > > >  * Hide this mm from OOM killer because it has been either 
> > > > reaped or
> > > >  * somebody can't call up_write(mmap_sem).
> > > >  */
> > > > -   set_bit(MMF_OOM_SKIP, &mm->flags);
> > > > +   if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > > +   pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > > +   task_pid_nr(tsk), tsk->comm);
> > > > +   debug_show_all_locks();
> > > > +   }
> > > > +
> > > 
> > > How does this _solve_ anything? Why would you even retry when you
> > > _know_ that the reference count dropped to zero. It will never
> > > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.
> 
> If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
> to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
> some memory, subsequent OOM killer invocation is automatically avoided.
> If __mmput() did not release some memory, let the OOM killer invoke again.
> 
> > 
> > Just to make myself more clear. The above assumes that the victim hasn't
> > passed exit_mmap and MMF_OOM_SKIP in __mmput. Which is the case we want to
> > address here.
> 
> David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper found that
> mm->mm_users == 0. But we must not wait forever because __mmput() might fail to
> release some memory immediately. If __mmput() did not release some memory
> within schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM
> killer invoke again. So, this is the case we want to address here, isn't it?

And we are back to a timeout-based approach... Sigh. Just imagine that
you have a really large process which will take some time to tear down,
while the memory it frees might be in a different oom domain. Now you
propose to keep retrying and eventually give up, allowing a new oom
victim from that oom domain.

If we want to handle oom victims with mm_users == 0 then let's do it
properly, please.

-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> > On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > >   struct mm_struct *mm = tsk->signal->oom_mm;
> > >  
> > >   /* Retry the down_read_trylock(mmap_sem) a few times */
> > > - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > > + while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > > +&& attempts++ < MAX_OOM_REAP_RETRIES)
> > >   schedule_timeout_idle(HZ/10);
> > >  
> > > - if (attempts <= MAX_OOM_REAP_RETRIES)
> > > - goto done;
> > > -
> > > -
> > > - pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > - task_pid_nr(tsk), tsk->comm);
> > > - debug_show_all_locks();
> > > -
> > > -done:
> > > - tsk->oom_reaper_list = NULL;
> > > -
> > >   /*
> > >* Hide this mm from OOM killer because it has been either reaped or
> > >* somebody can't call up_write(mmap_sem).
> > >*/
> > > - set_bit(MMF_OOM_SKIP, &mm->flags);
> > > + if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > > + pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > > + task_pid_nr(tsk), tsk->comm);
> > > + debug_show_all_locks();
> > > + }
> > > +
> > 
> > How does this _solve_ anything? Why would you even retry when you
> > _know_ that the reference count dropped to zero. It will never
> > increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> > MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.

If the OOM reaper knows that mm->mm_users == 0, it gives __mmput() some time
to "complete exit_mmap() etc. and set MMF_OOM_SKIP". If __mmput() released
some memory, subsequent OOM killer invocation is automatically avoided.
If __mmput() did not release some memory, let the OOM killer invoke again.

> 
> Just to make myself more clear. The above assumes that the victim hasn't
> passed exit_mmap and MMF_OOM_SKIP in __mmput. Which is the case we want to
> address here.

David is trying to avoid setting MMF_OOM_SKIP when the OOM reaper found that
mm->mm_users == 0. But we must not wait forever because __mmput() might fail to
release some memory immediately. If __mmput() did not release some memory within
schedule_timeout_idle(HZ/10) * MAX_OOM_REAP_RETRIES sleep, let the OOM killer
invoke again. So, this is the case we want to address here, isn't it?


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 13:01:19, Michal Hocko wrote:
> On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > > If mm->mm_users is not incremented because it is already zero by the oom
> > > > reaper, meaning the final refcount has been dropped, do not set
> > > > MMF_OOM_SKIP prematurely.
> > > > 
> > > > __mmput() may not have had a chance to do exit_mmap() yet, so memory 
> > > > from
> > > > a previous oom victim is still mapped.
> > > 
> > > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > > give a definitive answer to those questions. And we really _want_ to
> > > have a guarantee of a forward progress here. Killing an additional
> > > proecess is a price to pay and if that doesn't trigger normall it sounds
> > > like a reasonable compromise to me.
> > 
> > Right. If you want this patch, __oom_reap_task_mm() must not return true 
> > without
> > setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
> > does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
> > guarantee that the OOM killer is re-enabled within finite time, for 
> > __mmput()
> > cannot guarantee that MMF_OOM_SKIP is set within finite time.
> 
> An alternative would be to allow reaping and exit_mmap race. The unmap
> part should just work I guess. We just have to be careful to not race
> with free_pgtables and that shouldn't be too hard to implement (e.g.
> (ab)use mmap_sem for write there). I haven't thought that through
> completely though so I might miss something of course.

Just to illustrate what I've had in mind. This is completely untested
(not even compile tested) and it may be completely broken but let's try
to evaluate this approach.
---
diff --git a/mm/mmap.c b/mm/mmap.c
index 3bd5ecd20d4d..761ba1dc9ec6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,7 +2962,13 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
 
+   /*
+* oom reaper might race with exit_mmap so make sure we won't free
+* page tables under its feet
+*/
+   down_write(&mm->mmap_sem);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+   up_write(&mm->mmap_sem);
tlb_finish_mmu(&tlb, 0, -1);
 
/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925e7826..3df464f0f48b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -494,16 +494,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
}
 
/*
-* increase mm_users only after we know we will reap something so
-* that the mmput_async is called only when we have reaped something
-* and delayed __mmput doesn't matter that much
-*/
-   if (!mmget_not_zero(mm)) {
-   up_read(&mm->mmap_sem);
-   goto unlock_oom;
-   }
-
-   /*
 * Tell all users of get_user/copy_from_user etc... that the content
 * is no longer stable. No barriers really needed because unmapping
 * should imply barriers already and the reader would hit a page fault
@@ -537,13 +527,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
up_read(&mm->mmap_sem);
-
-   /*
-* Drop our reference but make sure the mmput slow path is called from a
-* different context because we shouldn't risk we get stuck there and
-* put the oom_reaper out of the way.
-*/
-   mmput_async(mm);
 unlock_oom:
mutex_unlock(&oom_lock);
return ret;
-- 
Michal Hocko
SUSE Labs
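
For context, the reaper-side code the patch above is serializing against, as
a heavily trimmed sketch of the 4.12-era __oom_reap_task_mm() (not part of
the posted diff): the reaper only walks the mm under mmap_sem held for read,
so taking it for write around free_pgtables() linearizes the two paths.

	if (!down_read_trylock(&mm->mmap_sem)) {
		ret = false;    /* exit_mmap() side holds it for write */
		goto unlock_oom;
	}
	tlb_gather_mmu(&tlb, mm, 0, -1);
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		if (can_madv_dontneed_vma(vma))
			unmap_page_range(&tlb, vma, vma->vm_start,
					 vma->vm_end, NULL);
	tlb_finish_mmu(&tlb, 0, -1);
	up_read(&mm->mmap_sem);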


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 14:03:35, Michal Hocko wrote:
> On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > An alternative would be to allow reaping and exit_mmap race. The unmap
> > > part should just work I guess. We just have to be careful to not race
> > > with free_pgtables and that shouldn't be too hard to implement (e.g.
> > > (ab)use mmap_sem for write there). I haven't thought that through
> > > completely though so I might miss something of course.
> > 
> > I think below one is simpler.
> [...]
> > @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
> > struct mm_struct *mm = tsk->signal->oom_mm;
> >  
> > /* Retry the down_read_trylock(mmap_sem) a few times */
> > -   while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> > +   while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> > +  && attempts++ < MAX_OOM_REAP_RETRIES)
> > schedule_timeout_idle(HZ/10);
> >  
> > -   if (attempts <= MAX_OOM_REAP_RETRIES)
> > -   goto done;
> > -
> > -
> > -   pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > -   task_pid_nr(tsk), tsk->comm);
> > -   debug_show_all_locks();
> > -
> > -done:
> > -   tsk->oom_reaper_list = NULL;
> > -
> > /*
> >  * Hide this mm from OOM killer because it has been either reaped or
> >  * somebody can't call up_write(mmap_sem).
> >  */
> > -   set_bit(MMF_OOM_SKIP, &mm->flags);
> > +   if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> > +   pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> > +   task_pid_nr(tsk), tsk->comm);
> > +   debug_show_all_locks();
> > +   }
> > +
> 
> How does this _solve_ anything? Why would you even retry when you
> _know_ that the reference count dropped to zero. It will never
> increment. So the above is basically just schedule_timeout_idle(HZ/10) *
> MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP.

Just to make myself more clear. The above assumes that the victim hasn't
passed exit_mmap and MMF_OOM_SKIP in __mmput. Which is the case we want to
address here.
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 20:32:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > An alternative would be to allow reaping and exit_mmap race. The unmap
> > part should just work I guess. We just have to be careful to not race
> > with free_pgtables and that shouldn't be too hard to implement (e.g.
> > (ab)use mmap_sem for write there). I haven't thought that through
> > completely though so I might miss something of course.
> 
> I think below one is simpler.
[...]
> @@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
>   struct mm_struct *mm = tsk->signal->oom_mm;
>  
>   /* Retry the down_read_trylock(mmap_sem) a few times */
> - while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
> + while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
> +&& attempts++ < MAX_OOM_REAP_RETRIES)
>   schedule_timeout_idle(HZ/10);
>  
> - if (attempts <= MAX_OOM_REAP_RETRIES)
> - goto done;
> -
> -
> - pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> - task_pid_nr(tsk), tsk->comm);
> - debug_show_all_locks();
> -
> -done:
> - tsk->oom_reaper_list = NULL;
> -
>   /*
>* Hide this mm from OOM killer because it has been either reaped or
>* somebody can't call up_write(mmap_sem).
>*/
> - set_bit(MMF_OOM_SKIP, &mm->flags);
> + if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
> + pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> + task_pid_nr(tsk), tsk->comm);
> + debug_show_all_locks();
> + }
> +

How does this _solve_ anything? Why would you even retry when you
_know_ that the reference count dropped to zero. It will never
increment. So the above is basically just schedule_timeout_idle(HZ/10) *
MAX_OOM_REAP_RETRIES before we set MMF_OOM_SKIP. This might be enough
for the victim to finish exit_mmap but it is more a hack^Wworkaround
than anything else. You could very well do the sleep without any
obfuscation...
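
In other words, once mm_users has dropped to zero the loop above
degenerates to a fixed sleep before MMF_OOM_SKIP is set. Untested sketch,
just to spell out what the control flow amounts to in that case:

        /* nothing can be reaped anymore; wait for __mmput and give up */
        schedule_timeout_idle(MAX_OOM_REAP_RETRIES * HZ / 10);
        set_bit(MMF_OOM_SKIP, &mm->flags);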
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > > If mm->mm_users is not incremented by the oom reaper because it is
> > > > already zero, meaning the final refcount has been dropped, do not set
> > > > MMF_OOM_SKIP prematurely.
> > > > 
> > > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > > a previous oom victim is still mapped.
> > > 
> > > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > > give a definitive answer to those questions. And we really _want_ to
> > > have a guarantee of a forward progress here. Killing an additional
> > > process is a price to pay and if that doesn't trigger normally it sounds
> > > like a reasonable compromise to me.
> > 
> > Right. If you want this patch, __oom_reap_task_mm() must not return true
> > without setting MMF_OOM_SKIP (in other words, return false if
> > __oom_reap_task_mm() does not set MMF_OOM_SKIP). The most important role of
> > the OOM reaper is to guarantee that the OOM killer is re-enabled within
> > finite time, for __mmput() cannot guarantee that MMF_OOM_SKIP is set within
> > finite time.
> 
> An alternative would be to allow reaping and exit_mmap race. The unmap
> part should just work I guess. We just have to be careful to not race
> with free_pgtables and that shouldn't be too hard to implement (e.g.
> (ab)use mmap_sem for write there). I haven't thought that through
> completely though so I might miss something of course.

I think below one is simpler.

 mm/oom_kill.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..c63f514 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -466,11 +466,10 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
        struct mmu_gather tlb;
        struct vm_area_struct *vma;
-       bool ret = true;
 
        /*
         * We have to make sure to not race with the victim exit path
@@ -488,10 +487,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
         */
        mutex_lock(&oom_lock);
 
-       if (!down_read_trylock(&mm->mmap_sem)) {
-               ret = false;
+       if (!down_read_trylock(&mm->mmap_sem))
                goto unlock_oom;
-       }
 
        /*
         * increase mm_users only after we know we will reap something so
@@ -531,6 +528,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
                                         NULL);
        }
        tlb_finish_mmu(&tlb, 0, -1);
+       set_bit(MMF_OOM_SKIP, &mm->flags);
        pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
                        task_pid_nr(tsk), tsk->comm,
                        K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -546,7 +544,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
        mmput_async(mm);
 unlock_oom:
        mutex_unlock(&oom_lock);
-       return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
@@ -556,25 +553,21 @@ static void oom_reap_task(struct task_struct *tsk)
        struct mm_struct *mm = tsk->signal->oom_mm;
 
        /* Retry the down_read_trylock(mmap_sem) a few times */
-       while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+       while (__oom_reap_task_mm(tsk, mm), !test_bit(MMF_OOM_SKIP, &mm->flags)
+              && attempts++ < MAX_OOM_REAP_RETRIES)
                schedule_timeout_idle(HZ/10);
 
-       if (attempts <= MAX_OOM_REAP_RETRIES)
-               goto done;
-
-
-       pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-               task_pid_nr(tsk), tsk->comm);
-       debug_show_all_locks();
-
-done:
-       tsk->oom_reaper_list = NULL;
-
        /*
        * Hide this mm from OOM killer because it has been either reaped or
        * somebody can't call up_write(mmap_sem).
        */
-       set_bit(MMF_OOM_SKIP, &mm->flags);
+       if (!test_and_set_bit(MMF_OOM_SKIP, &mm->flags)) {
+               pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+                       task_pid_nr(tsk), tsk->comm);
+               debug_show_all_locks();
+       }
+
+       tsk->oom_reaper_list = NULL;
 
        /* Drop a reference taken by wake_oom_reaper */
        put_task_struct(tsk);
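
To spell out the comma operator in the new while condition:
__oom_reap_task_mm() now runs unconditionally before each test, so the
loop is equivalent to the following more verbose form (illustration only,
not part of the patch):

        for (;;) {
                __oom_reap_task_mm(tsk, mm);    /* always attempt to reap */
                if (test_bit(MMF_OOM_SKIP, &mm->flags))
                        break;  /* reap succeeded (flag set on success) */
                if (attempts++ >= MAX_OOM_REAP_RETRIES)
                        break;  /* give up; the flag is set below */
                schedule_timeout_idle(HZ/10);
        }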


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Thu 15-06-17 19:53:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > > If mm->mm_users is not incremented by the oom reaper because it is
> > > already zero, meaning the final refcount has been dropped, do not set
> > > MMF_OOM_SKIP prematurely.
> > > 
> > > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > > a previous oom victim is still mapped.
> > 
> > true and do we have a _guarantee_ it will do it? E.g. can somebody block
> > exit_aio from completing? Or can somebody hold mmap_sem and thus block
> > ksm_exit resp. khugepaged_exit from completing? The reason why I was
> > conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> > give a definitive answer to those questions. And we really _want_ to
> > have a guarantee of a forward progress here. Killing an additional
> > process is a price to pay and if that doesn't trigger normally it sounds
> > like a reasonable compromise to me.
> 
> Right. If you want this patch, __oom_reap_task_mm() must not return true without
> setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
> does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
> guarantee that the OOM killer is re-enabled within finite time, for __mmput()
> cannot guarantee that MMF_OOM_SKIP is set within finite time.

An alternative would be to allow reaping and exit_mmap race. The unmap
part should just work I guess. We just have to be careful to not race
with free_pgtables and that shouldn't be too hard to implement (e.g.
(ab)use mmap_sem for write there). I haven't thought that through
completely though so I might miss something of course.
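
Roughly something like the below in exit_mmap(), completely untested and
only a sketch of the idea; mm_is_oom_victim() is a made up predicate for
"this mm is an oom victim being reaped":

        /* after unmap_vmas(), before tearing down page tables */
        if (unlikely(mm_is_oom_victim(mm))) {
                /*
                 * Assuming the reaper takes mmap_sem for read and rechecks
                 * MMF_OOM_SKIP before touching page tables, setting the
                 * flag and cycling the lock for write guarantees no reaper
                 * is still walking them once free_pgtables() starts.
                 */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);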
-- 
Michal Hocko
SUSE Labs


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 14-06-17 16:43:03, David Rientjes wrote:
> > If mm->mm_users is not incremented by the oom reaper because it is
> > already zero, meaning the final refcount has been dropped, do not set
> > MMF_OOM_SKIP prematurely.
> > 
> > __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> > a previous oom victim is still mapped.
> 
> true and do we have a _guarantee_ it will do it? E.g. can somebody block
> exit_aio from completing? Or can somebody hold mmap_sem and thus block
> ksm_exit resp. khugepaged_exit from completing? The reason why I was
> conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
> give a definitive answer to those questions. And we really _want_ to
> have a guarantee of a forward progress here. Killing an additional
> process is a price to pay and if that doesn't trigger normally it sounds
> like a reasonable compromise to me.

Right. If you want this patch, __oom_reap_task_mm() must not return true without
setting MMF_OOM_SKIP (in other words, return false if __oom_reap_task_mm()
does not set MMF_OOM_SKIP). The most important role of the OOM reaper is to
guarantee that the OOM killer is re-enabled within finite time, for __mmput()
cannot guarantee that MMF_OOM_SKIP is set within finite time.
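
Expressed as code, the contract I am asking for would look like this
(hypothetical debug wrapper, only to make the invariant explicit):

        static bool oom_reap_task_mm_checked(struct task_struct *tsk,
                                             struct mm_struct *mm)
        {
                bool ret = __oom_reap_task_mm(tsk, mm);

                /* success must imply MMF_OOM_SKIP or the OOM killer stalls */
                WARN_ON(ret && !test_bit(MMF_OOM_SKIP, &mm->flags));
                return ret;
        }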

> 
> > __mmput() naturally requires no
> > references on mm->mm_users to do exit_mmap().
> > 
> > Without this, several processes can be oom killed unnecessarily and the
> > oom log can show an abundance of memory available if exit_mmap() is in
> > progress at the time the process is skipped.
> 
> Have you seen this happening in real life?
> 
> > Signed-off-by: David Rientjes 


Re: [patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-15 Thread Michal Hocko
On Wed 14-06-17 16:43:03, David Rientjes wrote:
> If mm->mm_users is not incremented by the oom reaper because it is already
> zero, meaning the final refcount has been dropped, do not set
> MMF_OOM_SKIP prematurely.
> 
> __mmput() may not have had a chance to do exit_mmap() yet, so memory from
> a previous oom victim is still mapped.

true and do we have a _guarantee_ it will do it? E.g. can somebody block
exit_aio from completing? Or can somebody hold mmap_sem and thus block
ksm_exit resp. khugepaged_exit from completing? The reason why I was
conservative and set such a mm as MMF_OOM_SKIP was because I couldn't
give a definitive answer to those questions. And we really _want_ to
have a guarantee of a forward progress here. Killing an additional
process is a price to pay and if that doesn't trigger normally it sounds
like a reasonable compromise to me.

> __mmput() naturally requires no
> references on mm->mm_users to do exit_mmap().
> 
> Without this, several processes can be oom killed unnecessarily and the
> oom log can show an abundance of memory available if exit_mmap() is in
> progress at the time the process is skipped.

Have you seen this happening in real life?

> Signed-off-by: David Rientjes 
> ---
>  mm/oom_kill.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -531,6 +531,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>                                          NULL);
>         }
>         tlb_finish_mmu(&tlb, 0, -1);
> +       set_bit(MMF_OOM_SKIP, &mm->flags);
>         pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>                         task_pid_nr(tsk), tsk->comm,
>                         K(get_mm_counter(mm, MM_ANONPAGES)),
> @@ -562,7 +563,11 @@ static void oom_reap_task(struct task_struct *tsk)
>         if (attempts <= MAX_OOM_REAP_RETRIES)
>                 goto done;
>  
> -
> +       /*
> +        * Hide this mm from OOM killer because it cannot be reaped since
> +        * mm->mmap_sem cannot be acquired.
> +        */
> +       set_bit(MMF_OOM_SKIP, &mm->flags);
>         pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>                 task_pid_nr(tsk), tsk->comm);
>         debug_show_all_locks();
> @@ -570,12 +575,6 @@ static void oom_reap_task(struct task_struct *tsk)
>  done:
>         tsk->oom_reaper_list = NULL;
>  
> -       /*
> -        * Hide this mm from OOM killer because it has been either reaped or
> -        * somebody can't call up_write(mmap_sem).
> -        */
> -       set_bit(MMF_OOM_SKIP, &mm->flags);
> -
>         /* Drop a reference taken by wake_oom_reaper */
>         put_task_struct(tsk);
>  }

-- 
Michal Hocko
SUSE Labs


[patch] mm, oom: prevent additional oom kills before memory is freed

2017-06-14 Thread David Rientjes
If mm->mm_users is not incremented by the oom reaper because it is already
zero, meaning the final refcount has been dropped, do not set
MMF_OOM_SKIP prematurely.

__mmput() may not have had a chance to do exit_mmap() yet, so memory from
a previous oom victim is still mapped.  __mmput() naturally requires no
references on mm->mm_users to do exit_mmap().

Without this, several processes can be oom killed unnecessarily and the
oom log can show an abundance of memory available if exit_mmap() is in
progress at the time the process is skipped.

Signed-off-by: David Rientjes 
---
 mm/oom_kill.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -531,6 +531,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
                                         NULL);
        }
        tlb_finish_mmu(&tlb, 0, -1);
+       set_bit(MMF_OOM_SKIP, &mm->flags);
        pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
                task_pid_nr(tsk), tsk->comm,
                K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -562,7 +563,11 @@ static void oom_reap_task(struct task_struct *tsk)
        if (attempts <= MAX_OOM_REAP_RETRIES)
                goto done;
 
-
+       /*
+        * Hide this mm from OOM killer because it cannot be reaped since
+        * mm->mmap_sem cannot be acquired.
+        */
+       set_bit(MMF_OOM_SKIP, &mm->flags);
        pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
                task_pid_nr(tsk), tsk->comm);
        debug_show_all_locks();
@@ -570,12 +575,6 @@ static void oom_reap_task(struct task_struct *tsk)
 done:
        tsk->oom_reaper_list = NULL;
 
-       /*
-        * Hide this mm from OOM killer because it has been either reaped or
-        * somebody can't call up_write(mmap_sem).
-        */
-       set_bit(MMF_OOM_SKIP, &mm->flags);
-
        /* Drop a reference taken by wake_oom_reaper */
        put_task_struct(tsk);
 }
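
For context, the refcount check the changelog refers to is the one at the
top of __oom_reap_task_mm(), paraphrased below from that era's
mm/oom_kill.c (details may differ slightly):

        /*
         * Increase mm_users only after we know we will reap something so
         * that mmput_async is called only when something was reaped. If
         * mm_users already dropped to zero, __mmput() owns the teardown,
         * but it may not have reached exit_mmap() yet.
         */
        if (!mmget_not_zero(mm)) {
                up_read(&mm->mmap_sem);
                goto unlock_oom;
        }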

