Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-06 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > I assert that we should fix af5679fbc669f31f.
> > 
> > If you can come up with a reasonable patch which doesn't complicate the
> > code and is a clear win for both this particular workload as well as
> > others then why not.
> 
> Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
> prevent from races" ?
> 

Well, serializing the setting of MMF_OOM_SKIP using oom_lock was not sufficient.
It seems that some more time is needed to allow the last-second allocation
attempt at __alloc_pages_may_oom() to succeed. We need to move to the
"mm, oom: fix unnecessary killing of additional processes" thread.



[  702.895936] a.out invoked oom-killer: 
gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, 
oom_score_adj=0
[  702.900717] a.out cpuset=/ mems_allowed=0
[  702.903210] CPU: 1 PID: 3359 Comm: a.out Tainted: GT 
4.19.0-rc2+ #692
[  702.906630] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  702.910771] Call Trace:
[  702.912681]  dump_stack+0x85/0xcb
[  702.914918]  dump_header+0x69/0x2fe
[  702.917387]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  702.921267]  oom_kill_process+0x307/0x390
[  702.923809]  out_of_memory+0x2f3/0x5d0
[  702.926208]  __alloc_pages_slowpath+0xc01/0x1030
[  702.928817]  __alloc_pages_nodemask+0x333/0x390
[  702.931270]  alloc_pages_vma+0x77/0x1f0
[  702.933618]  __handle_mm_fault+0x81c/0xf40
[  702.935962]  handle_mm_fault+0x1b7/0x3c0
[  702.938215]  __do_page_fault+0x2a6/0x580
[  702.940481]  do_page_fault+0x32/0x270
[  702.942753]  ? page_fault+0x8/0x30
[  702.944860]  page_fault+0x1e/0x30
[  702.947138] RIP: 0033:0x4008d8
[  702.949722] Code: Bad RIP value.
[  702.952000] RSP: 002b:7ffc21fd99c0 EFLAGS: 00010206
[  702.954570] RAX: 7fb3457cc010 RBX: 0001 RCX: 
[  702.957631] RDX: b0b24000 RSI: 0002 RDI: 00020050
[  702.960599] RBP: 7fb3457cc010 R08: 00021000 R09: 00021000
[  702.963531] R10: 0022 R11: 1000 R12: 0006
[  702.966518] R13: 7ffc21fd9ab0 R14:  R15: 
[  702.971186] Mem-Info:
[  702.976959] active_anon:788641 inactive_anon:3457 isolated_anon:0
[  702.976959]  active_file:0 inactive_file:77 isolated_file:0
[  702.976959]  unevictable:0 dirty:0 writeback:0 unstable:0
[  702.976959]  slab_reclaimable:8152 slab_unreclaimable:24616
[  702.976959]  mapped:2806 shmem:3704 pagetables:4355 bounce:0
[  702.976959]  free:20831 free_pcp:136 free_cma:0
[  703.007374] Node 0 active_anon:3154564kB inactive_anon:13828kB 
active_file:304kB inactive_file:112kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:11028kB dirty:0kB writeback:0kB shmem:14816kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2846720kB writeback_tmp:0kB 
unstable:0kB all_unreclaimable? no
[  703.029994] Node 0 DMA free:13816kB min:308kB low:384kB high:460kB 
active_anon:1976kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB 
kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  703.055331] lowmem_reserve[]: 0 2717 3378 3378
[  703.062881] Node 0 DMA32 free:58152kB min:54108kB low:67632kB high:81156kB 
active_anon:2718364kB inactive_anon:12kB active_file:424kB inactive_file:852kB 
unevictable:0kB writepending:0kB present:3129152kB managed:2782296kB 
mlocked:0kB kernel_stack:352kB pagetables:388kB bounce:0kB free_pcp:728kB 
local_pcp:0kB free_cma:0kB
[  703.091529] lowmem_reserve[]: 0 0 661 661
[  703.096552] Node 0 Normal free:12900kB min:13164kB low:16452kB high:19740kB 
active_anon:434248kB inactive_anon:13816kB active_file:0kB inactive_file:48kB 
unevictable:0kB writepending:0kB present:1048576kB managed:676908kB mlocked:0kB 
kernel_stack:6720kB pagetables:17028kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  703.123046] lowmem_reserve[]: 0 0 0 0
[  703.127490] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (U) 3*32kB (U) 4*64kB (UM) 
1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 0*2048kB 3*4096kB (M) = 13816kB
[  703.134763] Node 0 DMA32: 85*4kB (UM) 59*8kB (UM) 61*16kB (UM) 65*32kB (UME) 
47*64kB (UE) 44*128kB (UME) 41*256kB (UME) 27*512kB (UME) 20*1024kB (ME) 
0*2048kB 0*4096kB = 57308kB
[  703.144076] Node 0 Normal: 119*4kB (UM) 69*8kB (UM) 156*16kB (UME) 179*32kB 
(UME) 44*64kB (UE) 9*128kB (UE) 1*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB 
= 13476kB
[  703.151746] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
hugepages_size=2048kB
[  703.156537] 3744 total pagecache pages
[  703.159017] 0 pages in swap cache
[  703.163209] Swap cache stats: add 0, delete 0, find 0/0
[  703.167254] Free swap  = 0kB
[  703.170577] Total swap = 0kB
[  703.173758] 1048422 pages RAM
[  703.176843] 0 pages HighMem/MovableOnly
[  703.180380] 179652 pages reserved
[  703.183718] 0 

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-06 Thread Tetsuo Handa
Michal Hocko wrote:
> > I assert that we should fix af5679fbc669f31f.
> 
> If you can come up with a reasonable patch which doesn't complicate the
> code and is a clear win for both this particular workload as well as
> others then why not.

Why can't we do "at least MMF_OOM_SKIP should be set under the lock to
prevent from races" ?

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b1..e096bb8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3065,7 +3065,9 @@ void exit_mmap(struct mm_struct *mm)
 	 */
 	(void)__oom_reap_task_mm(mm);
 
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 	down_write(&mm->mmap_sem);
 	up_write(&mm->mmap_sem);
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..b2a94c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -606,7 +606,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
 	 */
+	mutex_lock(&oom_lock);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	mutex_unlock(&oom_lock);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Michal Hocko
On Thu 06-09-18 10:00:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > > On 2018/09/05 22:40, Michal Hocko wrote:
> > > > Changelog said 
> > > > 
> > > > "Although this is possible in principle let's wait for it to actually
> > > > happen in real life before we make the locking more complex again."
> > > > 
> > > > So what is the real life workload that hits it? The log you have pasted
> > > > below doesn't tell much.
> > > 
> > > Nothing special. I just ran a multi-threaded memory eater on a 
> > > CONFIG_PREEMPT=y kernel.
> > 
> > I strongly suspect that your test doesn't really represent or simulate
> > any real and useful workload. Sure it triggers a rare race and we kill
> > another oom victim. Does this warrant making the code more complex?
> > Well, I am not convinced, as I've said countless times.
> 
> Yes. Below is an example from a machine running Apache Web server/Tomcat AP 
> server/PostgreSQL DB server.
> A memory eater needlessly killed Tomcat due to this race.

What prevents you from modifying your mem eater in a way that prevents
Tomcat resp. others from being the primary oom victim choice? In other
words, yeah it is not optimal to lose the race but if it is rare enough
then this is something to live with because it can hardly be considered
a new DoS vector AFAICS. Remember that this is always going to be racy
land and we are not going to plug all possible races because this is
simply not viable. But I am pretty sure we have been through all this
many times already. Oh well...

> I assert that we should fix af5679fbc669f31f.

If you can come up with a reasonable patch which doesn't complicate the
code and is a clear win for both this particular workload as well as
others then why not.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > On 2018/09/05 22:40, Michal Hocko wrote:
> > > Changelog said 
> > > 
> > > "Although this is possible in principle let's wait for it to actually
> > > happen in real life before we make the locking more complex again."
> > > 
> > > So what is the real life workload that hits it? The log you have pasted
> > > below doesn't tell much.
> > 
> > Nothing special. I just ran a multi-threaded memory eater on a 
> > CONFIG_PREEMPT=y kernel.
> 
> I strongly suspect that your test doesn't really represent or simulate
> any real and useful workload. Sure it triggers a rare race and we kill
> another oom victim. Does this warrant making the code more complex?
> Well, I am not convinced, as I've said countless times.

Yes. Below is an example from a machine running Apache Web server/Tomcat AP 
server/PostgreSQL DB server.
A memory eater needlessly killed Tomcat due to this race. I assert that we 
should fix af5679fbc669f31f.



Before:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
   | |-{NetworkManager}(698)
   | `-{NetworkManager}(702)
   |-abrtd(630)
   |-agetty(1007)
   |-atd(653)
   |-auditd(600)---{auditd}(601)
   |-avahi-daemon(625)---avahi-daemon(631)
   |-crond(657)
   |-dbus-daemon(638)
   |-firewalld(661)---{firewalld}(788)
   |-httpd(1169)-+-httpd(1170)
   | |-httpd(1171)
   | |-httpd(1172)
   | |-httpd(1173)
   | `-httpd(1174)
   |-irqbalance(628)
   |-java(1074)-+-{java}(1092)
   ||-{java}(1093)
   ||-{java}(1094)
   ||-{java}(1095)
   ||-{java}(1096)
   ||-{java}(1097)
   ||-{java}(1098)
   ||-{java}(1099)
   ||-{java}(1100)
   ||-{java}(1101)
   ||-{java}(1102)
   ||-{java}(1103)
   ||-{java}(1104)
   ||-{java}(1105)
   ||-{java}(1106)
   ||-{java}(1107)
   ||-{java}(1108)
   ||-{java}(1109)
   ||-{java}(1110)
   ||-{java}()
   ||-{java}(1114)
   ||-{java}(1115)
   ||-{java}(1116)
   ||-{java}(1117)
   ||-{java}(1118)
   ||-{java}(1119)
   ||-{java}(1120)
   ||-{java}(1121)
   ||-{java}(1122)
   ||-{java}(1123)
   ||-{java}(1124)
   ||-{java}(1125)
   ||-{java}(1126)
   ||-{java}(1127)
   ||-{java}(1128)
   ||-{java}(1129)
   ||-{java}(1130)
   ||-{java}(1131)
   ||-{java}(1132)
   ||-{java}(1133)
   ||-{java}(1134)
   ||-{java}(1135)
   ||-{java}(1136)
   ||-{java}(1137)
   |`-{java}(1138)
   |-ksmtuned(659)---sleep(1727)
   |-login(1006)---bash(1052)---pstree(1728)
   |-polkitd(624)-+-{polkitd}(633)
   |  |-{polkitd}(642)
   |  |-{polkitd}(643)
   |  |-{polkitd}(645)
   |  `-{polkitd}(650)
   |-postgres(1154)-+-postgres(1155)
   ||-postgres(1157)
   ||-postgres(1158)
   ||-postgres(1159)
   ||-postgres(1160)
   |`-postgres(1161)
   |-rsyslogd(986)-+-{rsyslogd}(997)
   |   `-{rsyslogd}(999)
   |-sendmail(1008)
   |-sendmail(1023)
   |-smbd(983)-+-cleanupd(1027)
   |   |-lpqd(1032)
   |   `-smbd-notifyd(1026)
   |-sshd(981)
   |-systemd-journal(529)
   |-systemd-logind(627)
   |-systemd-udevd(560)
   `-tuned(980)-+-{tuned}(1030)
|-{tuned}(1031)
|-{tuned}(1033)
`-{tuned}(1047)



After:

systemd(1)-+-NetworkManager(693)-+-dhclient(791)
   | |-{NetworkManager}(698)
   | `-{NetworkManager}(702)
   |-abrtd(630)
   |-agetty(1007)
   |-atd(653)
   |-auditd(600)---{auditd}(601)
   |-avahi-daemon(625)---avahi-daemon(631)
   |-crond(657)
   |-dbus-daemon(638)
   |-firewalld(661)---{firewalld}(788)
   

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Michal Hocko
On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> On 2018/09/05 22:40, Michal Hocko wrote:
> > Changelog said 
> > 
> > "Although this is possible in principle let's wait for it to actually
> > happen in real life before we make the locking more complex again."
> > 
> > So what is the real life workload that hits it? The log you have pasted
> > below doesn't tell much.
> 
> Nothing special. I just ran a multi-threaded memory eater on a 
> CONFIG_PREEMPT=y kernel.

I strongly suspect that your test doesn't really represent or simulate
any real and useful workload. Sure it triggers a rare race and we kill
another oom victim. Does this warrant making the code more complex?
Well, I am not convinced, as I've said countless times.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
On 2018/09/05 22:40, Michal Hocko wrote:
> Changelog said 
> 
> "Although this is possible in principle let's wait for it to actually
> happen in real life before we make the locking more complex again."
> 
> So what is the real life workload that hits it? The log you have pasted
> below doesn't tell much.

Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y
kernel. The OOM reaper succeeded in reclaiming all memory, but the OOM killer
still triggered just because somebody was inside that race window.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/prctl.h>

static int memory_eater(void *unused)
{
const int fd = open("/proc/self/exe", O_RDONLY);
static cpu_set_t set = { { 1 } };
sched_setaffinity(0, sizeof(set), &set);
srand(getpid());
while (1) {
int size = rand() % 1048576;
void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
munmap(ptr, size);
}
return 0;
}

static int child(void *unused)
{
static char *stack[256] = { };
char buf[32] = { };
int i;
sleep(1);
snprintf(buf, sizeof(buf), "tgid=%u", getpid());
prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
for (i = 0; i < 256; i++)
stack[i] = malloc(4096 * 2);
for (i = 1; i < 128; i++)
if (clone(memory_eater, stack[i] + 8192, CLONE_THREAD | 
CLONE_SIGHAND | CLONE_VM, NULL) == -1)
_exit(1);
for (; i < 254; i++)
if (clone(memory_eater, stack[i] + 8192, CLONE_VM, NULL) == -1)
_exit(1);
return 0;
}

int main(int argc, char *argv[])
{
char *stack = malloc(1048576);
char *buf = NULL;
unsigned long size;
unsigned long i;
if (clone(child, stack + 1048576, CLONE_VM, NULL) == -1)
_exit(1);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
/* Will cause OOM due to overcommit */
for (i = 0; i < size; i += 4096)
buf[i] = 0;
while (1)
pause();
return 0;
}

> 
>> [  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or 
>> sacrifice child
>> [  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, 
>> anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
>> [  278.151586] vmtoolsd invoked oom-killer: 
>> gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, 
>> oom_score_adj=0
> [...]
>> [  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or 
>> sacrifice child
>> [  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, 
>> anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
>> [  278.336430] oom_reaper: reaped process 8790 (firewalld), now 
>> anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 

Another example from a different machine.

[  765.523676] a.out invoked oom-killer: 
gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, 
oom_score_adj=0
[  765.528172] a.out cpuset=/ mems_allowed=0
[  765.530603] CPU: 5 PID: 4540 Comm: a.out Tainted: GT 
4.19.0-rc2+ #689
[  765.534307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  765.538975] Call Trace:
[  765.540920]  dump_stack+0x85/0xcb
[  765.543088]  dump_header+0x69/0x2fe
[  765.545375]  ? _raw_spin_unlock_irqrestore+0x41/0x70
[  765.547983]  oom_kill_process+0x307/0x390
[  765.550394]  out_of_memory+0x2f5/0x5b0
[  765.552818]  __alloc_pages_slowpath+0xc01/0x1030
[  765.555442]  __alloc_pages_nodemask+0x333/0x390
[  765.557966]  alloc_pages_vma+0x77/0x1f0
[  765.560292]  __handle_mm_fault+0x81c/0xf40
[  765.562736]  handle_mm_fault+0x1b7/0x3c0
[  765.565038]  __do_page_fault+0x2a6/0x580
[  765.567420]  do_page_fault+0x32/0x270
[  765.569670]  ? page_fault+0x8/0x30
[  765.571833]  page_fault+0x1e/0x30
[  765.573934] RIP: 0033:0x4008d8
[  765.575924] Code: Bad RIP value.
[  765.577842] RSP: 002b:7ffec6f7d420 EFLAGS: 00010206
[  765.580221] RAX: 7f6a201f4010 RBX: 0001 RCX: 
[  765.583253] RDX: bde23000 RSI: 0002 RDI: 00020050
[  765.586207] RBP: 7f6a201f4010 R08: 00021000 R09: 00021000
[  765.589047] R10: 0022 R11: 1000 R12: 0006
[  765.591996] R13: 7ffec6f7d510 R14:  R15: 
[  765.595732] Mem-Info:
[  765.597580] active_anon:794622 inactive_anon:2126 isolated_anon:0
[  765.597580]  active_file:7 inactive_file:11 isolated_file:0
[  765.597580]  unevictable:0 dirty:0 writeback:0 unstable:0
[  765.597580]  

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Michal Hocko
On Wed 05-09-18 22:20:58, Tetsuo Handa wrote:
> On 2018/08/24 9:31, Tetsuo Handa wrote:
> > For now, I don't think we need to add af5679fbc669f31f to the list for
> > CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim
> > selection (especially with CONFIG_PREEMPT=y kernels) due to
> > 
> >__alloc_pages_may_oom():   oom_reap_task():
> > 
>  mutex_trylock(&oom_lock) succeeds.
> >  get_page_from_freelist() fails.
> >  Preempted to other process.
> > oom_reap_task_mm() succeeds.
> > Sets MMF_OOM_SKIP.
> >  Returned from preemption.
> >  Finds that MMF_OOM_SKIP was already set.
> >  Selects next OOM victim and kills it.
>  mutex_unlock(&oom_lock) is called.
> > 
> > a race window like the one described as
> > 
> > Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the 
> > lock
> > to prevent from races when the page allocator didn't manage to get the
> > freed (reaped) memory in __alloc_pages_may_oom but it sees the flag 
> > later
> > on and move on to another victim.  Although this is possible in 
> > principle
> > let's wait for it to actually happen in real life before we make the
> > locking more complex again.
> > 
> > in that commit.
> > 
> 
> Yes, that race window is real. We can needlessly select next OOM victim.
> I think that af5679fbc669f31f was too optimistic.

Changelog said 

"Although this is possible in principle let's wait for it to actually
happen in real life before we make the locking more complex again."

So what is the real life workload that hits it? The log you have pasted
below doesn't tell much.

> [  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or 
> sacrifice child
> [  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, 
> anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
> [  278.151586] vmtoolsd invoked oom-killer: 
> gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, 
> oom_score_adj=0
[...]
> [  278.331527] Out of memory: Kill process 8790 (firewalld) score 5 or 
> sacrifice child
> [  278.333267] Killed process 8790 (firewalld) total-vm:358012kB, 
> anon-rss:21928kB, file-rss:0kB, shmem-rss:0kB
> [  278.336430] oom_reaper: reaped process 8790 (firewalld), now anon-rss:0kB, 
> file-rss:0kB, shmem-rss:0kB

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-09-05 Thread Tetsuo Handa
On 2018/08/24 9:31, Tetsuo Handa wrote:
> For now, I don't think we need to add af5679fbc669f31f to the list for
> CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim
> selection (especially with CONFIG_PREEMPT=y kernels) due to
> 
>__alloc_pages_may_oom():   oom_reap_task():
> 
>  mutex_trylock(&oom_lock) succeeds.
>  get_page_from_freelist() fails.
>  Preempted to other process.
> oom_reap_task_mm() succeeds.
> Sets MMF_OOM_SKIP.
>  Returned from preemption.
>  Finds that MMF_OOM_SKIP was already set.
>  Selects next OOM victim and kills it.
>  mutex_unlock(&oom_lock) is called.
> 
> a race window like the one described as
> 
> Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
> to prevent from races when the page allocator didn't manage to get the
> freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
> on and move on to another victim.  Although this is possible in principle
> let's wait for it to actually happen in real life before we make the
> locking more complex again.
> 
> in that commit.
> 

Yes, that race window is real. We can needlessly select next OOM victim.
I think that af5679fbc669f31f was too optimistic.

[  278.147280] Out of memory: Kill process 9943 (a.out) score 919 or sacrifice 
child
[  278.148927] Killed process 9943 (a.out) total-vm:4267252kB, 
anon-rss:3430056kB, file-rss:0kB, shmem-rss:0kB
[  278.151586] vmtoolsd invoked oom-killer: 
gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, 
oom_score_adj=0
[  278.156642] vmtoolsd cpuset=/ mems_allowed=0
[  278.158884] CPU: 2 PID: 8916 Comm: vmtoolsd Kdump: loaded Not tainted 
4.19.0-rc2+ #465
[  278.162252] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 05/19/2017
[  278.165499] Call Trace:
[  278.166693]  dump_stack+0x99/0xdc
[  278.167922]  dump_header+0x70/0x2fa
[  278.169414] oom_reaper: reaped process 9943 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[  278.169629]  ? _raw_spin_unlock_irqrestore+0x6a/0x8c
[  278.169633]  oom_kill_process+0x2ee/0x380
[  278.169635]  out_of_memory+0x136/0x540
[  278.169636]  ? out_of_memory+0x1fc/0x540
[  278.169640]  __alloc_pages_slowpath+0x986/0xce4
[  278.169641]  ? get_page_from_freelist+0x16b/0x1600
[  278.169646]  __alloc_pages_nodemask+0x398/0x3d0
[  278.180594]  alloc_pages_current+0x65/0xb0
[  278.182173]  __page_cache_alloc+0x154/0x190
[  278.184200]  ? pagecache_get_page+0x27/0x250
[  278.185410]  filemap_fault+0x4df/0x880
[  278.186282]  ? filemap_fault+0x31b/0x880
[  278.187395]  ? xfs_ilock+0x1bd/0x220
[  278.188264]  ? __xfs_filemap_fault+0x76/0x270
[  278.189268]  ? down_read_nested+0x48/0x80
[  278.190229]  ? xfs_ilock+0x1bd/0x220
[  278.191061]  __xfs_filemap_fault+0x89/0x270
[  278.192059]  xfs_filemap_fault+0x27/0x30
[  278.192967]  __do_fault+0x1f/0x70
[  278.193777]  __handle_mm_fault+0xfbd/0x1470
[  278.194743]  handle_mm_fault+0x1f2/0x400
[  278.195679]  ? handle_mm_fault+0x47/0x400
[  278.196618]  __do_page_fault+0x217/0x4b0
[  278.197504]  do_page_fault+0x3c/0x21e
[  278.198303]  ? page_fault+0x8/0x30
[  278.199092]  page_fault+0x1e/0x30
[  278.199821] RIP: 0033:0x7f2322ebbfb0
[  278.200605] Code: Bad RIP value.
[  278.201370] RSP: 002b:7ffda96e7648 EFLAGS: 00010246
[  278.202518] RAX:  RBX: 7f23220f26b0 RCX: 0010
[  278.204280] RDX:  RSI:  RDI: 7f2321ecfb5b
[  278.205838] RBP: 02504b70 R08: 7f2321ecfb60 R09: 0250bd20
[  278.207426] R10: 383836312d646c69 R11:  R12: 7ffda96e76b0
[  278.208982] R13: 7f2322ea8540 R14: 0250ba90 R15: 7f2323173920
[  278.210840] Mem-Info:
[  278.211462] active_anon:18629 inactive_anon:2390 isolated_anon:0
[  278.211462]  active_file:19 inactive_file:1565 isolated_file:0
[  278.211462]  unevictable:0 dirty:0 writeback:0 unstable:0
[  278.211462]  slab_reclaimable:5820 slab_unreclaimable:8964
[  278.211462]  mapped:2128 shmem:2493 pagetables:1826 bounce:0
[  278.211462]  free:878043 free_pcp:909 free_cma:0
[  278.218830] Node 0 active_anon:74516kB inactive_anon:9560kB active_file:76kB 
inactive_file:6260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:8512kB dirty:0kB writeback:0kB shmem:9972kB shmem_thp: 0kB 
shmem_pmdmapped: 0kB anon_thp: 43008kB writeback_tmp:0kB unstable:0kB 
all_unreclaimable? yes
[  278.224997] Node 0 DMA free:15888kB min:288kB low:360kB high:432kB 
active_anon:32kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB 
kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB
[  278.230602] lowmem_reserve[]: 0 2663 3610 3610
[  278.231887] Node 0 DMA32 free:2746332kB min:49636kB low:62044kB 

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread Tetsuo Handa
David Rientjes wrote:
> On Fri, 24 Aug 2018, Tetsuo Handa wrote:
> 
> > > For those of us who are tracking CVE-2016-10723 which has persistently 
> > > been 
> > > labeled as "disputed" and with no clear indication of what patches 
> > > address 
> > > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> > > under oom_lock") and this patch are the intended mitigations?
> > > 
> > > A list of SHA1s for merged fixes and links to proposed patches to address 
> > > this issue would be appreciated.
> > > 
> > 
> > Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
> > mitigation for CVE-2016-10723.
> > 
> > "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
> > should_reclaim_retry()." is independent from CVE-2016-10723.
> > 
> 
> Thanks, Tetsuo.  Should commit af5679fbc669 ("mm, oom: remove oom_lock 
> from oom_reaper") also be added to the list for CVE-2016-10723?
> 

Commit af5679fbc669f31f ("mm, oom: remove oom_lock from oom_reaper")
negated one of the two rationales for commit 9bfe5ded054b8e28 ("mm, oom:
remove sleep from under oom_lock"). Had we not applied af5679fbc669f31f,
we could guarantee that CPU time is given to the owner of oom_lock by
replacing mutex_trylock() in __alloc_pages_may_oom() with mutex_lock().
But now that af5679fbc669f31f has been applied, we have no way to hand
CPU time to the OOM reaper / exit_mmap(). Direct OOM reaping might get
us there, but we have not reached that point yet...

For now, I don't think we need to add af5679fbc669f31f to the list for
CVE-2016-10723, for af5679fbc669f31f might cause premature next OOM victim
selection (especially with CONFIG_PREEMPT=y kernels) due to

   __alloc_pages_may_oom():   oom_reap_task():

 mutex_trylock(&oom_lock) succeeds.
 get_page_from_freelist() fails.
 Preempted to other process.
oom_reap_task_mm() succeeds.
Sets MMF_OOM_SKIP.
 Returned from preemption.
 Finds that MMF_OOM_SKIP was already set.
 Selects next OOM victim and kills it.
 mutex_unlock(&oom_lock) is called.

a race window like the one described as

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim.  Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

in that commit.


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread David Rientjes
On Fri, 24 Aug 2018, Tetsuo Handa wrote:

> > For those of us who are tracking CVE-2016-10723 which has persistently been 
> > labeled as "disputed" and with no clear indication of what patches address 
> > it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> > under oom_lock") and this patch are the intended mitigations?
> > 
> > A list of SHA1s for merged fixes and links to proposed patches to address 
> > this issue would be appreciated.
> > 
> 
> Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
> mitigation for CVE-2016-10723.
> 
> "[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
> should_reclaim_retry()." is independent from CVE-2016-10723.
> 

Thanks, Tetsuo.  Should commit af5679fbc669 ("mm, oom: remove oom_lock 
from oom_reaper") also be added to the list for CVE-2016-10723?


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread Tetsuo Handa
On 2018/08/24 5:06, David Rientjes wrote:
> For those of us who are tracking CVE-2016-10723 which has persistently been 
> labeled as "disputed" and with no clear indication of what patches address 
> it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
> under oom_lock") and this patch are the intended mitigations?
> 
> A list of SHA1s for merged fixes and links to proposed patches to address 
> this issue would be appreciated.
> 

Commit 9bfe5ded054b ("mm, oom: remove sleep from under oom_lock") is a
mitigation for CVE-2016-10723.

"[PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
should_reclaim_retry()." is independent from CVE-2016-10723.

We haven't made sure that the OOM reaper / exit_mmap() will get enough CPU
resources. For example, under a cluster of concurrently allocating realtime
scheduling priority threads, the OOM reaper takes about 1800 milliseconds
whereas direct OOM reaping takes only a few milliseconds.

Regards.


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-23 Thread David Rientjes
On Wed, 22 Aug 2018, Tetsuo Handa wrote:

> On 2018/08/03 15:16, Michal Hocko wrote:
> > On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
> >> On 2018/07/31 14:09, Michal Hocko wrote:
> >>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>  On 2018/07/31 4:10, Michal Hocko wrote:
> > Since should_reclaim_retry() should be a natural reschedule point,
> > let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> > order to guarantee that other pending work items are started. This will
> > workaround this problem and it is less fragile than hunting down when
> > the sleep is missed. E.g. we used to have a sleeping point in the oom
> > path but this has been removed recently because it caused other issues.
> > Having a single sleeping point is more robust.
> 
>  linux.git has not removed the sleeping point in the OOM path yet. Since 
>  removing the
>  sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>  immediately.
> >>>
> >>> is this an {Acked,Reviewed,Tested}-by?
> >>>
> >>> I will send the patch to Andrew if the patch is ok. 
> >>>
>  (And that change will conflict with Roman's cgroup aware OOM killer 
>  patchset. But it
>  should be easy to rebase.)
> >>>
> >>> That is still a WIP so I would lose sleep over it.
> >>>
> >>
> >> Now that Roman's cgroup aware OOM killer patchset will be dropped from 
> >> linux-next.git ,
> >> linux-next.git will get the sleeping point removed. Please send this patch 
> >> to linux-next.git .
> > 
> > I still haven't heard any explicit confirmation that the patch works for
> > your workload. Should I beg for it? Or you simply do not want to have
> > your stamp on the patch? If yes, I can live with that but this playing
> > hide and catch is not really a lot of fun.
> > 
> 
> I noticed that the patch has not been sent to linux-next.git yet.
> Please send to linux-next.git without my stamp on the patch.
> 

For those of us who are tracking CVE-2016-10723 which has persistently been 
labeled as "disputed" and with no clear indication of what patches address 
it, I am assuming that commit 9bfe5ded054b ("mm, oom: remove sleep from 
under oom_lock") and this patch are the intended mitigations?

A list of SHA1s for merged fixes and links to proposed patches to address 
this issue would be appreciated.


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-22 Thread Michal Hocko
On Wed 22-08-18 06:07:40, Tetsuo Handa wrote:
> On 2018/08/03 15:16, Michal Hocko wrote:
[...]
> >> Now that Roman's cgroup aware OOM killer patchset will be dropped from 
> >> linux-next.git ,
> >> linux-next.git will get the sleeping point removed. Please send this patch 
> >> to linux-next.git .
> > 
> > I still haven't heard any explicit confirmation that the patch works for
> > your workload. Should I beg for it? Or you simply do not want to have
> > your stamp on the patch? If yes, I can live with that but this playing
> > hide and catch is not really a lot of fun.
> > 
> 
> I noticed that the patch has not been sent to linux-next.git yet.
> Please send to linux-next.git without my stamp on the patch.

I plan to do so after merge window closes.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-21 Thread Tetsuo Handa
On 2018/08/03 15:16, Michal Hocko wrote:
> On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
>> On 2018/07/31 14:09, Michal Hocko wrote:
>>> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
 On 2018/07/31 4:10, Michal Hocko wrote:
> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> order to guarantee that other pending work items are started. This will
> workaround this problem and it is less fragile than hunting down when
> the sleep is missed. E.g. we used to have a sleeping point in the oom
> path but this has been removed recently because it caused other issues.
> Having a single sleeping point is more robust.

 linux.git has not removed the sleeping point in the OOM path yet. Since 
 removing the
 sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
 immediately.
>>>
>>> is this an {Acked,Reviewed,Tested}-by?
>>>
>>> I will send the patch to Andrew if the patch is ok. 
>>>
 (And that change will conflict with Roman's cgroup aware OOM killer 
 patchset. But it
 should be easy to rebase.)
>>>
>>> That is still a WIP so I wouldn't lose sleep over it.
>>>
>>
>> Now that Roman's cgroup aware OOM killer patchset will be dropped from 
>> linux-next.git ,
>> linux-next.git will get the sleeping point removed. Please send this patch 
>> to linux-next.git .
> 
> I still haven't heard any explicit confirmation that the patch works for
> your workload. Should I beg for it? Or you simply do not want to have
> your stamp on the patch? If yes, I can live with that but this playing
> hide and seek is not really a lot of fun.
> 

I noticed that the patch has not been sent to linux-next.git yet.
Please send to linux-next.git without my stamp on the patch.

[   44.863590] Out of memory: Kill process 1071 (a.out) score 865 or sacrifice 
child
[   44.867666] Killed process 1817 (a.out) total-vm:5244kB, anon-rss:1040kB, 
file-rss:0kB, shmem-rss:0kB
[   44.872176] oom_reaper: reaped process 1817 (a.out), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB
[   91.698761] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 48s!
[   91.702313] Showing busy workqueues and worker pools:
[   91.705011] workqueue events: flags=0x0
[   91.707482]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[   91.710524] pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], 
e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, 
mmdrop_async_fn, check_corruption
[   91.717439] workqueue events_freezable: flags=0x4
[   91.720161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.723304] pending: vmballoon_work [vmw_balloon]
[   91.726167] workqueue events_power_efficient: flags=0x80
[   91.729139]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[   91.732253] pending: fb_flashcursor, gc_worker [nf_conntrack], 
neigh_periodic_work, neigh_periodic_work
[   91.736471] workqueue events_freezable_power_: flags=0x84
[   91.739546]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.742696] in-flight: 2097:disk_events_workfn
[   91.745517] workqueue mm_percpu_wq: flags=0x8
[   91.748069]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[   91.751182] pending: drain_local_pages_wq BAR(1830), vmstat_update
[   91.754661] workqueue mpt_poll_0: flags=0x8
[   91.757161]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.759958] pending: mpt_fault_reset_work [mptbase]
[   91.762696] workqueue xfs-data/sda1: flags=0xc
[   91.765353]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[   91.768248] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[   91.771589] workqueue xfs-cil/sda1: flags=0xc
[   91.774009]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.776800] pending: xlog_cil_push_work [xfs] BAR(703)
[   91.779464] workqueue xfs-reclaim/sda1: flags=0xc
[   91.782017]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.784599] pending: xfs_reclaim_worker [xfs]
[   91.786930] workqueue xfs-sync/sda1: flags=0x4
[   91.789289]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   91.792075] pending: xfs_log_worker [xfs]
[   91.794213] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=48s workers=4 idle: 
52 13 5
[  121.906640] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 
stuck for 78s!
[  121.909572] Showing busy workqueues and worker pools:
[  121.911703] workqueue events: flags=0x0
[  121.913531]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[  121.915873] pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], 
e1000_watchdog [e1000], vmstat_shepherd, free_work, mmdrop_async_fn, 
mmdrop_async_fn, check_corruption
[  121.921962] workqueue events_freezable: flags=0x4
[  121.924336]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.926941] pending: vmballoon_work [vmw_balloon]
[  121.929226] workqueue 

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-03 Thread Michal Hocko
On Fri 03-08-18 07:05:54, Tetsuo Handa wrote:
> On 2018/07/31 14:09, Michal Hocko wrote:
> > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> >> On 2018/07/31 4:10, Michal Hocko wrote:
> >>> Since should_reclaim_retry() should be a natural reschedule point,
> >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> >>> order to guarantee that other pending work items are started. This will
> >>> workaround this problem and it is less fragile than hunting down when
> >>> the sleep is missed. E.g. we used to have a sleeping point in the oom
> >>> path but this has been removed recently because it caused other issues.
> >>> Having a single sleeping point is more robust.
> >>
> >> linux.git has not removed the sleeping point in the OOM path yet. Since 
> >> removing the
> >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
> >> immediately.
> > 
> > is this an {Acked,Reviewed,Tested}-by?
> > 
> > I will send the patch to Andrew if the patch is ok. 
> > 
> >> (And that change will conflict with Roman's cgroup aware OOM killer 
> >> patchset. But it
> >> should be easy to rebase.)
> > 
> > That is still a WIP so I wouldn't lose sleep over it.
> > 
> 
> Now that Roman's cgroup aware OOM killer patchset will be dropped from 
> linux-next.git ,
> linux-next.git will get the sleeping point removed. Please send this patch to 
> linux-next.git .

I still haven't heard any explicit confirmation that the patch works for
your workload. Should I beg for it? Or you simply do not want to have
your stamp on the patch? If yes, I can live with that but this playing
hide and seek is not really a lot of fun.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-08-02 Thread Tetsuo Handa
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> workaround this problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since 
>> removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>> immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?
> 
> I will send the patch to Andrew if the patch is ok. 
> 
>> (And that change will conflict with Roman's cgroup aware OOM killer 
>> patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I wouldn't lose sleep over it.
> 

Now that Roman's cgroup aware OOM killer patchset will be dropped from 
linux-next.git ,
linux-next.git will get the sleeping point removed. Please send this patch to 
linux-next.git .


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Michal Hocko
On Tue 31-07-18 20:30:08, Tetsuo Handa wrote:
> On 2018/07/31 20:15, Michal Hocko wrote:
> >>> I will send the patch to Andrew if the patch is ok. 
> >>
> >> Andrew, can we send the "we used to have a sleeping point in the oom path 
> >> but this has
> >> been removed recently" patch to linux.git ?
> > 
> > This can really wait for the next merge window IMHO.
> > 
> 
> "mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that 
> sleeping point.
> Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in 
> the next
> merge window? I'm confused...

This has nothing to do with cgroup-aware OOM killer.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Tetsuo Handa
On 2018/07/31 20:15, Michal Hocko wrote:
>>> I will send the patch to Andrew if the patch is ok. 
>>
>> Andrew, can we send the "we used to have a sleeping point in the oom path 
>> but this has
>> been removed recently" patch to linux.git ?
> 
> This can really wait for the next merge window IMHO.
> 

"mm, oom: cgroup-aware OOM killer" in linux-next.git is reviving that sleeping 
point.
Current "mm, oom: cgroup-aware OOM killer" will not be sent to linux.git in the 
next
merge window? I'm confused...


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Michal Hocko
On Tue 31-07-18 19:47:45, Tetsuo Handa wrote:
> On 2018/07/31 14:09, Michal Hocko wrote:
> > On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> >> On 2018/07/31 4:10, Michal Hocko wrote:
> >>> Since should_reclaim_retry() should be a natural reschedule point,
> >>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> >>> order to guarantee that other pending work items are started. This will
> >>> workaround this problem and it is less fragile than hunting down when
> >>> the sleep is missed. E.g. we used to have a sleeping point in the oom
> >>> path but this has been removed recently because it caused other issues.
> >>> Having a single sleeping point is more robust.
> >>
> >> linux.git has not removed the sleeping point in the OOM path yet. Since 
> >> removing the
> >> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
> >> immediately.
> > 
> > is this an {Acked,Reviewed,Tested}-by?
> 
> I'm saying that "we used to have a sleeping point in the oom path but this 
> has been
> removed recently" is not true. You need to send that patch to linux.git first 
> if you
> want to refer to that patch in this patch.

That patch is already sitting in mmotm tree and this one will go on top.
I do not really see any reason to rush it to Linus tree. A dubious CVE
doesn't really raise the priority if you ask me.

On the other hand, having a confirmation, either of the above tags would
help to raise the credibility of the change.

> > I will send the patch to Andrew if the patch is ok. 
> 
> Andrew, can we send the "we used to have a sleeping point in the oom path but 
> this has
> been removed recently" patch to linux.git ?

This can really wait for the next merge window IMHO.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-31 Thread Tetsuo Handa
On 2018/07/31 14:09, Michal Hocko wrote:
> On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
>> On 2018/07/31 4:10, Michal Hocko wrote:
>>> Since should_reclaim_retry() should be a natural reschedule point,
>>> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
>>> order to guarantee that other pending work items are started. This will
>>> workaround this problem and it is less fragile than hunting down when
>>> the sleep is missed. E.g. we used to have a sleeping point in the oom
>>> path but this has been removed recently because it caused other issues.
>>> Having a single sleeping point is more robust.
>>
>> linux.git has not removed the sleeping point in the OOM path yet. Since 
>> removing the
>> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
>> immediately.
> 
> is this an {Acked,Reviewed,Tested}-by?

I'm saying that "we used to have a sleeping point in the oom path but this has 
been
removed recently" is not true. You need to send that patch to linux.git first 
if you
want to refer to that patch in this patch.

> 
> I will send the patch to Andrew if the patch is ok. 

Andrew, can we send the "we used to have a sleeping point in the oom path but 
this has
been removed recently" patch to linux.git ?

> 
>> (And that change will conflict with Roman's cgroup aware OOM killer 
>> patchset. But it
>> should be easy to rebase.)
> 
> That is still a WIP so I wouldn't lose sleep over it.
> 


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Michal Hocko
On Tue 31-07-18 06:01:48, Tetsuo Handa wrote:
> On 2018/07/31 4:10, Michal Hocko wrote:
> > Since should_reclaim_retry() should be a natural reschedule point,
> > let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> > order to guarantee that other pending work items are started. This will
> > workaround this problem and it is less fragile than hunting down when
> > the sleep is missed. E.g. we used to have a sleeping point in the oom
> > path but this has been removed recently because it caused other issues.
> > Having a single sleeping point is more robust.
> 
> linux.git has not removed the sleeping point in the OOM path yet. Since 
> removing the
> sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
> immediately.

is this an {Acked,Reviewed,Tested}-by?

I will send the patch to Andrew if the patch is ok. 

> (And that change will conflict with Roman's cgroup aware OOM killer patchset. 
> But it
> should be easy to rebase.)

That is still a WIP so I wouldn't lose sleep over it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/31 4:10, Michal Hocko wrote:
> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally in
> order to guarantee that other pending work items are started. This will
> workaround this problem and it is less fragile than hunting down when
> the sleep is missed. E.g. we used to have a sleeping point in the oom
> path but this has been removed recently because it caused other issues.
> Having a single sleeping point is more robust.

linux.git has not removed the sleeping point in the OOM path yet. Since 
removing the
sleeping point in the OOM path can mitigate CVE-2016-10723, please do so 
immediately.
(And that change will conflict with Roman's cgroup aware OOM killer patchset. 
But it
should be easy to rebase.)
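The behavior being argued over here can be illustrated with a toy model. The sketch below is a deliberately simplified, single-threaded simulation, not kernel code, and every name in it is invented for illustration: in a concurrency-managed per-CPU workqueue, the next pending work item is only dispatched when the running worker actually goes to sleep, so a worker that merely yields (cond_resched()-style) while waiting for memory starves the very work item, such as a page drain, that could free it.

```python
# Toy model (NOT kernel code) of why a PF_WQ_WORKER allocation loop must
# really sleep: the pool only dispatches the next pending item when the
# running worker sleeps; a bare yield never triggers that dispatch.

class WorkerPool:
    def __init__(self):
        self.pending = []          # queued work items (callables)

    def queue(self, fn):
        self.pending.append(fn)

    def cond_resched(self):
        pass                       # yields the CPU, but starts no new worker

    def sleep(self):
        # A real sleep lets the pool dispatch the next pending item on
        # this CPU before the sleeper resumes.
        if self.pending:
            self.pending.pop(0)()

def reclaim_retry_loop(pool, state, use_sleep, max_retries=10):
    """Models a worker thread looping in the allocator slowpath."""
    for _ in range(max_retries):
        if state["pages_drained"]:      # memory became available
            return "allocated"
        if use_sleep:
            pool.sleep()                # lets the drain item run
        else:
            pool.cond_resched()         # never lets the drain item run
    return "livelock"

def run(use_sleep):
    pool, state = WorkerPool(), {"pages_drained": False}
    # The pending item that would free memory (think drain_local_pages_wq).
    pool.queue(lambda: state.update(pages_drained=True))
    return reclaim_retry_loop(pool, state, use_sleep)

print(run(use_sleep=False))  # livelock
print(run(use_sleep=True))   # allocated
```

Under this model, an unconditional short sleep at the retry point is exactly what turns the first run's livelock into the second run's forward progress.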


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tejun Heo
Hello, Michal.

On Mon, Jul 30, 2018 at 08:51:10PM +0200, Michal Hocko wrote:
> > Yeah, workqueue can choke on things like that and kthread indefinitely
> > busy looping doesn't do anybody any good.
> 
> Yeah, I do agree. But this is much easier said than done ;) Sure
> we have that hack that does sleep rather than cond_resched in the
> page allocator. We can and will "fix" it to be unconditional in the
> should_reclaim_retry [1] but this whole thing is really subtle. It just
> takes one misbehaving worker and something which is really important to
> run will get stuck.

Oh yeah, I'm not saying the current behavior is ideal or anything, but
since the behavior was put in many years ago, it has only become a
problem a couple of times, and all cases were rather easy and obvious
fixes on the wq user side.  It shouldn't be difficult to add a timer
mechanism on top.  We might be able to simply extend the hang
detection mechanism to kick off all pending rescuers after detecting a
wq stall.  I'm wary about making it a part of normal operation
(i.e. a silent timeout).  Per-cpu kworkers really shouldn't busy loop
for an extended period of time.

Thanks.

-- 
tejun
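Tejun's suggestion of extending the hang detection can also be sketched as a toy model. Everything below is invented for illustration and is far simpler than the real workqueue watchdog: a monitor thread notices that a pool has made no progress within a stall timeout and only then kicks a "rescuer" thread to run the pending items, rather than making a silent timeout part of normal operation.

```python
# Toy sketch of stall detection kicking a rescuer (all names invented).
import threading
import time

class Pool:
    def __init__(self):
        self.pending = []
        self.lock = threading.Lock()
        self.last_progress = time.monotonic()

    def queue(self, fn):
        with self.lock:
            self.pending.append(fn)

    def run_pending(self):          # what a rescuer does
        while True:
            with self.lock:
                if not self.pending:
                    return
                fn = self.pending.pop(0)
            fn()
            self.last_progress = time.monotonic()

def watchdog(pool, stall_timeout, stop):
    # Modeled loosely on the wq watchdog: if no item completed within the
    # timeout and work is still queued, kick a rescuer for the pool.
    while not stop.is_set():
        stalled = time.monotonic() - pool.last_progress > stall_timeout
        if stalled and pool.pending:
            threading.Thread(target=pool.run_pending).start()  # "rescuer"
        stop.wait(stall_timeout / 2)

done = threading.Event()
pool = Pool()
pool.queue(done.set)            # a work item stuck behind a busy worker
stop = threading.Event()
threading.Thread(target=watchdog, args=(pool, 0.05, stop), daemon=True).start()
# Simulate the normal per-cpu worker being busy elsewhere: nobody calls
# run_pending(); the watchdog detects the stall and runs the item.
done.wait(timeout=2)
stop.set()
print("rescued" if done.is_set() else "stalled")
```

The point of the design, as stated above, is that the rescuer kick happens only after an explicit stall is detected, so normal operation is unchanged.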


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Michal Hocko
This change has been posted several times with some concerns about the
changelog. Originally I thought it was more of a "nice to have" thing
rather than a bug fix; later Tetsuo took it over, but the changelog
was not really comprehensible, so I have reworded it. Let's see if this
is better.


From 9bbea6516bb99615aff5ba5699865aa2d48333cc Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 26 Jul 2018 14:40:03 +0900
Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at
 should_reclaim_retry().

Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and lock up the system if OOM.

The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to
run even when they have a rescuer available. Those workers might be
essential for reclaim to make forward progress, however. If we are
too unlucky, all the allocation requests can get stuck waiting for a
WQ_MEM_RECLAIM work item and the system is essentially stuck in an OOM
condition without much hope to move on. Tetsuo has seen the reclaim
stuck on drain_local_pages_wq or xlog_cil_push_work (xfs). There might
be others.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally in
order to guarantee that other pending work items are started. This will
work around the problem and is less fragile than hunting down when
the sleep is missed. E.g. we used to have a sleeping point in the oom
path but this has been removed recently because it caused other issues.
Having a single sleeping point is more robust.

Reported-and-debugged-by: Tetsuo Handa 
Signed-off-by: Michal Hocko 
Cc: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: David Rientjes 
Cc: Tejun Heo 
---
 mm/page_alloc.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..f56cc0958d09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3922,6 +3922,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
struct zone *zone;
struct zoneref *z;
+   bool ret = false;
 
/*
 * Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3986,26 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
}
}
 
-   /*
-* Memory allocation/reclaim might be called from a WQ
-* context and the current implementation of the WQ
-* concurrency control doesn't recognize that
-* a particular WQ is congested if the worker thread is
-* looping without ever sleeping. Therefore we have to
-* do a short sleep here rather than calling
-* cond_resched().
-*/
-   if (current->flags & PF_WQ_WORKER)
-   schedule_timeout_uninterruptible(1);
-   else
-   cond_resched();
-
-   return true;
+   ret = true;
+   goto out;
}
}
 
-   return false;
+out:
+   /*
+* Memory allocation/reclaim might be called from a WQ
+* context and the current implementation of the WQ
+* concurrency control doesn't recognize that
+* a particular WQ is congested if the worker thread is
+* looping without ever sleeping. Therefore we have to
+* do a short sleep here rather than calling
+* cond_resched().
+*/
+   if (current->flags & PF_WQ_WORKER)
+   schedule_timeout_uninterruptible(1);
+   else
+   cond_resched();
+   return ret;
 }
 
 static inline bool
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Michal Hocko
On Mon 30-07-18 08:44:24, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> > WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But
> > WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon
> > as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues 
> > and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e.
> > when schedule_timeout_*() is called).
> > 
> > Is this correct?
> 
> WQ_MEM_RECLAIM guarantees that there's always gonna exist at least one
> kworker running the workqueue.  But all per-cpu kworkers are subject
> to concurrency limiting execution - ie. if there are any per-cpu kworkers
> actively running on a cpu, no further kworkers will be scheduled.

Well, in an ideal world we would _use_ that pre-allocated kworker if
no others are available because they are doing something that takes
a long time to accomplish. The page allocator can spend a lot of time
if we are struggling to death to get some memory.

> > >  We can add timeout mechanism to workqueue so that it
> > > kicks off other kworkers if one of them is in running state for too
> > > long, but idk, if there's an indefinite busy loop condition in kernel
> > > threads, we really should get rid of them and hung task watchdog is
> > > pretty effective at finding these cases (at least with preemption
> > > disabled).
> > 
> > Currently the page allocator has a path which can loop forever with
> > only cond_resched().
> 
> Yeah, workqueue can choke on things like that and kthread indefinitely
> busy looping doesn't do anybody any good.

Yeah, I do agree. But this is much easier said than done ;) Sure
we have that hack that does sleep rather than cond_resched in the
page allocator. We can and will "fix" it to be unconditional in the
should_reclaim_retry [1] but this whole thing is really subtle. It just
takes one misbehaving worker and something which is really important to
run will get stuck.

That being said I will post the patch with updated changelog recording
this.

[1] 
http://lkml.kernel.org/r/ca3da8b8-1bb5-c302-b190-fa6cebab5...@i-love.sakura.ne.jp
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tejun Heo
Hello,

On Tue, Jul 31, 2018 at 12:25:04AM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But
> WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon
> as an item was queued. Same rule applies to both WQ_MEM_RECLAIM workqueues 
> and !WQ_MEM_RECLAIM workqueues regarding when to start a pending work (i.e.
> when schedule_timeout_*() is called).
> 
> Is this correct?

WQ_MEM_RECLAIM guarantees that there's always gonna exist at least one
kworker running the workqueue.  But all per-cpu kworkers are subject
to concurrency limiting execution - ie. if there are any per-cpu kworkers
actively running on a cpu, no further kworkers will be scheduled.

> >  We can add timeout mechanism to workqueue so that it
> > kicks off other kworkers if one of them is in running state for too
> > long, but idk, if there's an indefinite busy loop condition in kernel
> > threads, we really should get rid of them and hung task watchdog is
> > pretty effective at finding these cases (at least with preemption
> > disabled).
> 
> Currently the page allocator has a path which can loop forever with
> only cond_resched().

Yeah, workqueue can choke on things like that and kthread indefinitely
busy looping doesn't do anybody any good.

Thanks.

-- 
tejun


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/30 23:54, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
>> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
>>> On 2018/07/30 18:32, Michal Hocko wrote:
>> [...]
>>>> This one is waiting for draining and we are in mm_percpu_wq WQ context
>>>> which has its rescuer so no other activity can block us for ever. So
>>>> this certainly shouldn't deadlock. It can be dead slow but well, this is
>>>> what you will get when you shoot your system to death.
>>>
>>> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to 
>>> wake up. (Tejun,
>>> is my understanding correct?) Lack of schedule_timeout_*() does block 
>>> WQ_MEM_RECLAIM
>>> workqueues forever.
>>
>> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
>> guarantees. If you are right then the whole thing sounds quite fragile
>> to me TBH.
> 
> Workqueue doesn't think the cpu is stalled as long as one of the
> per-cpu kworkers is running.  The assumption is that kernel threads
> are not supposed to be busy-looping indefinitely (and they really
> shouldn't).

WQ_MEM_RECLAIM guarantees that "struct task_struct" is preallocated. But
WQ_MEM_RECLAIM does not guarantee that the pending work is started as soon
as an item is queued. The same rule applies to both WQ_MEM_RECLAIM workqueues
and !WQ_MEM_RECLAIM workqueues regarding when a pending work item is started (i.e.
when schedule_timeout_*() is called).

Is this correct?

>  We can add timeout mechanism to workqueue so that it
> kicks off other kworkers if one of them is in running state for too
> long, but idk, if there's an indefinite busy loop condition in kernel
> threads, we really should get rid of them and hung task watchdog is
> pretty effective at finding these cases (at least with preemption
> disabled).

Currently the page allocator has a path which can loop forever with
only cond_resched().

> 
> Thanks.
> 



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tejun Heo
Hello,

On Mon, Jul 30, 2018 at 04:46:47PM +0200, Michal Hocko wrote:
> On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> > On 2018/07/30 18:32, Michal Hocko wrote:
> [...]
> > > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > > which has its rescuer so no other activity can block us for ever. So
> > > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > > what you will get when you shoot your system to death.
> > 
> > We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to 
> > wake up. (Tejun,
> > is my understanding correct?) Lack of schedule_timeout_*() does block 
> > WQ_MEM_RECLAIM
> > workqueues forever.
> 
> Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
> guarantees. If you are right then the whole thing sounds quite fragile
> to me TBH.

Workqueue doesn't think the cpu is stalled as long as one of the
per-cpu kworkers is running.  The assumption is that kernel threads
are not supposed to be busy-looping indefinitely (and they really
shouldn't).  We can add a timeout mechanism to workqueue so that it
kicks off other kworkers if one of them is in running state for too
long, but idk, if there's an indefinite busy loop condition in kernel
threads, we really should get rid of them and hung task watchdog is
pretty effective at finding these cases (at least with preemption
disabled).

Thanks.

-- 
tejun


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Michal Hocko
On Mon 30-07-18 23:34:23, Tetsuo Handa wrote:
> On 2018/07/30 18:32, Michal Hocko wrote:
[...]
> > This one is waiting for draining and we are in mm_percpu_wq WQ context
> > which has its rescuer so no other activity can block us for ever. So
> > this certainly shouldn't deadlock. It can be dead slow but well, this is
> > what you will get when you shoot your system to death.
> 
> We need schedule_timeout_*() to allow such WQ_MEM_RECLAIM workqueues to wake 
> up. (Tejun,
> is my understanding correct?) Lack of schedule_timeout_*() does block 
> WQ_MEM_RECLAIM
> workqueues forever.

Hmm. This doesn't match my understanding of what WQ_MEM_RECLAIM actually
guarantees. If you are right then the whole thing sounds quite fragile
to me TBH.

Anyway we would at least have an explanation for what you are seeing.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Tetsuo Handa
On 2018/07/30 18:32, Michal Hocko wrote:
 Since the patch shown below was suggested by Michal Hocko at
 https://marc.info/?l=linux-mm=152723708623015 , it is from Michal Hocko.

 >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
 From: Michal Hocko 
 Date: Thu, 26 Jul 2018 14:40:03 +0900
 Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at 
 should_reclaim_retry().

 Tetsuo Handa has reported that it is possible to bypass the short sleep
 for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
 any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
 only when there are too many dirty and writeback pages") and lock up the
 system if OOM.

 This is because we are implicitly counting on falling back to
 schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
 schedule_timeout_uninterruptible() in should_reclaim_retry() was not
 called due to __zone_watermark_ok() == false.
>>>
>>> How do we rely on that?
>>
>> Unless disk_events_workfn() (PID=5) sleeps at 
>> schedule_timeout_uninterruptible()
>> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing 
>> __GFP_FS
>> allocation (PID=498) is waiting for cannot be started.
> 
> Now you are losing me again. What is going on about
> drain_local_pages_wq? I can see that all __GFP_FS allocations are
> waiting for a completion which depends on the kworker context to run but
> I fail to see why drain_local_pages_wq matters. The memory on the pcp
> lists is not accounted to NR_FREE_PAGES IIRC but that should be a
> relatively small amount of memory so this cannot be a fundamental
> problem here.

If you look at "busy workqueues and worker pools", you will find that not only
vmstat_update and drain_local_pages_wq (which is WQ_MEM_RECLAIM) but also
other work items such as xlog_cil_push_work and xfs_reclaim_worker (which is also
WQ_MEM_RECLAIM) remain in the "pending:" state.

[  324.960731] Showing busy workqueues and worker pools:
[  324.962577] workqueue events: flags=0x0
[  324.964137]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=13/256
[  324.966231] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, 
vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, 
mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, 
mmdrop_async_fn, check_corruption, console_callback
[  324.973425] workqueue events_freezable: flags=0x4
[  324.975247]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.977393] pending: vmballoon_work [vmw_balloon]
[  324.979310] workqueue events_power_efficient: flags=0x80
[  324.981298]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  324.983543] pending: gc_worker [nf_conntrack], fb_flashcursor, 
neigh_periodic_work, neigh_periodic_work, check_lifetime
[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482] in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498)
[  324.77] workqueue ipv6_addrconf: flags=0x40008
[  325.001899]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[  325.004092] pending: addrconf_verify_work
[  325.005911] workqueue mpt_poll_0: flags=0x8
[  325.007686]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.009914] pending: mpt_fault_reset_work [mptbase]
[  325.012044] workqueue xfs-cil/sda1: flags=0xc
[  325.013897]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.016190] pending: xlog_cil_push_work [xfs] BAR(2344)
[  325.018354] workqueue xfs-reclaim/sda1: flags=0xc
[  325.020293]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.022549] pending: xfs_reclaim_worker [xfs]
[  325.024540] workqueue xfs-sync/sda1: flags=0x4
[  325.026425]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  325.028691] pending: xfs_log_worker [xfs]
[  325.030546] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=189s workers=4 idle: 
977 65 13

[  444.206334] Showing busy workqueues and worker pools:
[  444.208472] workqueue events: flags=0x0
[  444.210193]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=15/256
[  444.212389] pending: vmw_fb_dirty_flush [vmwgfx], vmstat_shepherd, 
vmpressure_work_fn, free_work, mmdrop_async_fn, mmdrop_async_fn, 
mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, 
mmdrop_async_fn, check_corruption, console_callback, sysrq_reinject_alt_sysrq, 
moom_callback
[  444.220547] workqueue events_freezable: flags=0x4
[  444.222562]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.224852] pending: vmballoon_work [vmw_balloon]
[  444.227022] workqueue events_power_efficient: 

mmdrop_async_fn, mmdrop_async_fn, e1000_watchdog [e1000], mmdrop_async_fn, 
mmdrop_async_fn, check_corruption, console_callback, sysrq_reinject_alt_sysrq, 
moom_callback
[  444.220547] workqueue events_freezable: flags=0x4
[  444.222562]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.224852] pending: vmballoon_work [vmw_balloon]
[  444.227022] workqueue events_power_efficient: 

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-30 Thread Michal Hocko
On Sat 28-07-18 00:47:40, Tetsuo Handa wrote:
> On 2018/07/26 20:39, Michal Hocko wrote:
> > On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
> >> Before applying "an OOM lockup mitigation patch", I want to apply this
> >> "another OOM lockup avoidance" patch.
> > 
> > I do not really see why. All these are borderline interesting as the
> > system is basically dead by the time you reach this state.
> 
> I don't like your "borderline interesting". We still don't have a watchdog
> which tells something went wrong. Thus, "borderline interesting" has to be
> examined and fixed.

No question about that. Bugs should be fixed. But this doesn't look like
something we should panic about and rush half-baked or not fully
understood fixes for.
 
> >> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
> >> (which was captured with
> >>
> >>   --- a/mm/oom_kill.c
> >>   +++ b/mm/oom_kill.c
> >>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
> >>{
> >>unsigned long freed = 0;
> >>bool delay = false; /* if set, delay next allocation attempt */
> >>   +static unsigned long last_warned;
> >>   +if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) 
> >> {
> >>   +pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", 
> >> current->comm,
> >>   +current->pid, oc->gfp_mask, &oc->gfp_mask, 
> >> oc->order);
> >>   +last_warned = jiffies;
> >>   +}
> >>
> >>oc->constraint = CONSTRAINT_NONE;
> >>if (oom_killer_disabled)
> >>
> >> in order to demonstrate that the GFP_NOIO allocation from 
> >> disk_events_workfn() is
> >> calling out_of_memory() rather than by error failing to give up direct 
> >> reclaim).
> >>
> >> [  258.619119] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> >> [  268.622732] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> >> [  278.635344] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> >> [  288.639360] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> >> [  298.642715] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> > 
> > Hmm, so there is no other memory allocation to trigger the oom or they
> > all just back off on the oom_lock trylock? In other words what is
> > preventing from the oom killer invocation?
> 
> All __GFP_FS allocations got stuck at direct reclaim or workqueue.

OK, I see. This is important information which was missing in the
previous examination.

[...]

> >> Since the patch shown below was suggested by Michal Hocko at
> >> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
> >>
> >> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
> >> From: Michal Hocko 
> >> Date: Thu, 26 Jul 2018 14:40:03 +0900
> >> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at 
> >> should_reclaim_retry().
> >>
> >> Tetsuo Handa has reported that it is possible to bypass the short sleep
> >> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
> >> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> >> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
> >> only when there are too many dirty and writeback pages") and lock up the
> >> system if OOM.
> >>
> >> This is because we are implicitly counting on falling back to
> >> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
> >> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
> >> called due to __zone_watermark_ok() == false.
> > 
> > How do we rely on that?
> 
> Unless disk_events_workfn() (PID=5) sleeps at 
> schedule_timeout_uninterruptible()
> in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing 
> __GFP_FS
> allocation (PID=498) is waiting for cannot be started.

Now you are losing me again. What is going on with
drain_local_pages_wq? I can see that all __GFP_FS allocations are
waiting for a completion which depends on the kworker context to run but
I fail to see why drain_local_pages_wq matters. The memory on the pcp
lists is not accounted to NR_FREE_PAGES IIRC but that should be a
relatively small amount of memory so this cannot be a fundamental
problem here.

> >> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
> >> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
> >> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
> >> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
> >> thread is doing __GFP_FS allocation.
> > 
> > I have really hard time to parse and understand this.
> 
> You can write as you like.

I can as soon as I understand what is going on.

> >> Tetsuo is observing that GFP_NOIO
> >> allocation request from disk_events_workfn() is preventing other pending
> >> works from starting.
> > 
> > What about any other allocation from !PF_WQ_WORKER context? Why those 

Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-27 Thread Tetsuo Handa
On 2018/07/26 20:39, Michal Hocko wrote:
> On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
>> Before applying "an OOM lockup mitigation patch", I want to apply this
>> "another OOM lockup avoidance" patch.
> 
> I do not really see why. All these are borderline interesting as the
> system is basically dead by the time you reach this state.

I don't like your "borderline interesting". We still don't have a watchdog
which tells something went wrong. Thus, "borderline interesting" has to be
examined and fixed.

> 
>> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
>> (which was captured with
>>
>>   --- a/mm/oom_kill.c
>>   +++ b/mm/oom_kill.c
>>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>>{
>>  unsigned long freed = 0;
>>  bool delay = false; /* if set, delay next allocation attempt */
>>   +  static unsigned long last_warned;
>>   +  if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>>   +  pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>>   +  current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>>   +  last_warned = jiffies;
>>   +  }
>>
>>  oc->constraint = CONSTRAINT_NONE;
>>  if (oom_killer_disabled)
>>
>> in order to demonstrate that the GFP_NOIO allocation from 
>> disk_events_workfn() is
>> calling out_of_memory() rather than by error failing to give up direct 
>> reclaim).
>>
>> [  258.619119] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
>> [  268.622732] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
>> [  278.635344] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
>> [  288.639360] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
>> [  298.642715] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> 
> Hmm, so there is no other memory allocation to trigger the oom or they
> all just back off on the oom_lock trylock? In other words what is
> preventing from the oom killer invocation?

All __GFP_FS allocations got stuck at direct reclaim or workqueue.

[  310.265293] systemd D12240 1  0 0x
[  310.268118] Call Trace:
[  310.269894]  ? __schedule+0x245/0x7f0
[  310.271901]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.274187]  schedule+0x23/0x80
[  310.275965]  schedule_preempt_disabled+0x5/0x10
[  310.278107]  __mutex_lock+0x3f5/0x9b0
[  310.280070]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.282451]  xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.284730]  ? lock_acquire+0x51/0x70
[  310.286678]  ? xfs_ail_push_all+0x13/0x60 [xfs]
[  310.288844]  xfs_reclaim_inodes_nr+0x2c/0x40 [xfs]
[  310.291092]  super_cache_scan+0x14b/0x1a0
[  310.293131]  do_shrink_slab+0xfc/0x190
[  310.295082]  shrink_slab+0x240/0x2c0
[  310.297028]  shrink_node+0xe3/0x460
[  310.298899]  do_try_to_free_pages+0xcb/0x380
[  310.300945]  try_to_free_pages+0xbb/0xf0
[  310.302874]  __alloc_pages_slowpath+0x3c1/0xc50
[  310.304931]  ? lock_acquire+0x51/0x70
[  310.306753]  ? set_page_refcounted+0x40/0x40
[  310.308686]  __alloc_pages_nodemask+0x2a6/0x2c0
[  310.310663]  filemap_fault+0x437/0x8e0
[  310.312446]  ? lock_acquire+0x51/0x70
[  310.314150]  ? xfs_ilock+0x86/0x190 [xfs]
[  310.315915]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  310.318089]  __do_fault+0x13/0x126
[  310.319696]  __handle_mm_fault+0xc8d/0x11c0
[  310.321493]  handle_mm_fault+0x17a/0x390
[  310.323156]  __do_page_fault+0x24c/0x4d0
[  310.324782]  do_page_fault+0x2a/0x70
[  310.326289]  ? page_fault+0x8/0x30
[  310.327745]  page_fault+0x1e/0x30

[  313.714537] systemd-journal D12600   498  1 0x
[  313.716538] Call Trace:
[  313.717683]  ? __schedule+0x245/0x7f0
[  313.719221]  schedule+0x23/0x80
[  313.720586]  schedule_timeout+0x21f/0x400
[  313.722163]  wait_for_completion+0xb2/0x130
[  313.723750]  ? wake_up_q+0x70/0x70
[  313.725134]  flush_work+0x1db/0x2b0
[  313.726535]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  313.728336]  ? page_alloc_cpu_dead+0x30/0x30
[  313.729936]  drain_all_pages+0x174/0x1e0
[  313.731446]  __alloc_pages_slowpath+0x428/0xc50
[  313.733120]  __alloc_pages_nodemask+0x2a6/0x2c0
[  313.734826]  filemap_fault+0x437/0x8e0
[  313.736296]  ? lock_acquire+0x51/0x70
[  313.737769]  ? xfs_ilock+0x86/0x190 [xfs]
[  313.739309]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  313.741291]  __do_fault+0x13/0x126
[  313.742667]  __handle_mm_fault+0xc8d/0x11c0
[  313.744245]  handle_mm_fault+0x17a/0x390
[  313.745755]  __do_page_fault+0x24c/0x4d0
[  313.747290]  do_page_fault+0x2a/0x70
[  313.748728]  ? page_fault+0x8/0x30
[  313.750148]  page_fault+0x1e/0x30

[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482] in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498)



Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

2018-07-26 Thread Michal Hocko
On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
> Before applying "an OOM lockup mitigation patch", I want to apply this
> "another OOM lockup avoidance" patch.

I do not really see why. All these are borderline interesting as the
system is basically dead by the time you reach this state.

> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
> (which was captured with
> 
>   --- a/mm/oom_kill.c
>   +++ b/mm/oom_kill.c
>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>{
>   unsigned long freed = 0;
>   bool delay = false; /* if set, delay next allocation attempt */
>   +   static unsigned long last_warned;
>   +   if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>   +   pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>   +   current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>   +   last_warned = jiffies;
>   +   }
>
>   oc->constraint = CONSTRAINT_NONE;
>   if (oom_killer_disabled)
> 
> in order to demonstrate that the GFP_NOIO allocation from 
> disk_events_workfn() is
> calling out_of_memory() rather than by error failing to give up direct 
> reclaim).
> 
> [  258.619119] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> [  268.622732] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> [  278.635344] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> [  288.639360] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0
> [  298.642715] kworker/0:0(5) gfp_mask=0x60(GFP_NOIO), order=0

Hmm, so there is no other memory allocation to trigger the oom or they
all just back off on the oom_lock trylock? In other words what is
preventing the oom killer invocation?
 
[...]

> Since the patch shown below was suggested by Michal Hocko at
> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
> 
> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 26 Jul 2018 14:40:03 +0900
> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at 
> should_reclaim_retry().
> 
> Tetsuo Handa has reported that it is possible to bypass the short sleep
> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
> only when there are too many dirty and writeback pages") and lock up the
> system if OOM.
> 
> This is because we are implicitly counting on falling back to
> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
> called due to __zone_watermark_ok() == false.

How do we rely on that?

> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
> thread is doing __GFP_FS allocation.

I have a really hard time parsing and understanding this.

> Tetsuo is observing that GFP_NOIO
> allocation request from disk_events_workfn() is preventing other pending
> works from starting.

What about any other allocation from a !PF_WQ_WORKER context? Why do those
not jump in?

> Since should_reclaim_retry() should be a natural reschedule point,
> let's do the short sleep for PF_WQ_WORKER threads unconditionally
> in order to guarantee that other pending works are started.

OK, this finally makes some sense. But it doesn't explain how it
resolves the livelock.

> Reported-by: Tetsuo Handa 
> Signed-off-by: Michal Hocko 

Your s-o-b is missing again. I have already told you that previously
when you were posting the patch.

I do not mind this change per se but I am not happy about _your_ changelog.
It doesn't explain the underlying problem IMHO. Having a natural and
unconditional scheduling point in should_reclaim_retry is a reasonable
thing. But how the heck does it relate to the livelock you are seeing?
Namely, the changelog should explain
1) why nobody is able to make forward progress during direct reclaim
2) why nobody is able to trigger oom killer as the last resort

> Cc: Roman Gushchin 
> Cc: Johannes Weiner 
> Cc: Vladimir Davydov 
> Cc: David Rientjes 
> Cc: Tejun Heo 
> ---
>  mm/page_alloc.c | 34 ++
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a790ef4..0c2c0a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3922,6 +3922,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
>   struct zone *zone;
>   struct zoneref *z;
> + bool ret = false;
>  
>   /*
>* Costly allocations might have made a progress but this doesn't mean
> @@ -3985,25 +3986,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  
