Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > > My question is, how can users know whether somebody was
> > > > > > > > OOM-killed needlessly by allowing MMF_OOM_SKIP to race.
> > > > > > > 
> > > > > > > Is it really important to know that the race is due to 
> > > > > > > MMF_OOM_SKIP?
> > > > > > 
> > > > > > Yes, it is really important. Needlessly selecting even one OOM
> > > > > > victim is a pain which is difficult to explain to some of our
> > > > > > customers, and hard to persuade them to accept.
> > > > > 
> > > > > How is this any different from a race with a task exiting and releasing
> > > > > some memory after we have crossed the point of no return and will kill
> > > > > something?
> > > > 
> > > > I'm not complaining about an exiting task releasing some memory after
> > > > we have crossed the point of no return.
> > > > 
> > > > What I'm saying is that we can postpone "the point of no return" if we
> > > > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > > > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > > > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > > > we can avoid without a crystal ball.
> > > 
> > > If those races are really that common, then we can handle them even
> > > without "try once more" tricks. Really, this is just an ugly hack. If you
> > > really care, then make sure that we always try to allocate from memory
> > > reserves before going down the oom path. In other words, try to find a
> > > robust solution rather than tweaks around a problem.
> > 
> > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> > oom_lock serialization from the OOM reaper, the possibility of calling
> > out_of_memory() due to a successful mutex_trylock(&oom_lock) would increase
> > when the OOM reaper sets MMF_OOM_SKIP quickly.
> > 
> > What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> > on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
> > out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?
> > 
> > Excuse me? Are you suggesting to try memory reserves before
> > task_is_oom_victim(current) becomes true?
> 
> No, what I've tried to say is that if this really is a real problem,
> which I am not sure about, then the proper way to handle that is to
> attempt to allocate from memory reserves for an oom victim. I would be
> even willing to take the oom_lock back into the oom reaper path if the
> former turns out to be awkward to implement. But all this assumes this
> is a _real_ problem.

Aren't we back to square one? My question is, how can users know whether
somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race.

You don't want to call get_page_from_freelist() from out_of_memory(), do you?
But without passing a flag "whether get_page_from_freelist() with memory
reserves was already attempted if the current thread is an OOM victim" to
task_will_free_mem() in out_of_memory(), and a flag "whether
get_page_from_freelist() without memory reserves was already attempted if the
current thread is not an OOM victim" to the test_bit(MMF_OOM_SKIP) check in
oom_evaluate_task(), we won't be able to know whether somebody was OOM-killed
needlessly by allowing MMF_OOM_SKIP to race.

Will you accept passing such flags (something like the incomplete patch shown
below)?

--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -35,6 +35,8 @@ struct oom_control {
 	 */
 	const int order;
 
+	const bool reserves_tried;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -303,8 +303,10 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * any memory is quite low.
 	 */
 	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+			WARN_ON(!oc->reserves_tried); // can't represent correctly
 			goto next;
+		}
 		goto abort;
 	}
 
@@ -762,7 +764,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool task_will_free_mem(struct task_struct *task, const bool reserves_tried)
 {
 	struct mm_struct *mm = task->mm;
 	struct task_struct *p;
@@ -783,8 +785,10 @@ static bool
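
To illustrate how the reserves_tried field above could be filled in, here is a
minimal caller-side sketch against the 4.13-era __alloc_pages_may_oom()
(abridged; the placement and the unconditional "true" are assumptions for
illustration, not part of the posted patch):

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		      const struct alloc_context *ac,
		      unsigned long *did_some_progress)
{
	struct oom_control oc = {
		.zonelist = ac->zonelist,
		.nodemask = ac->nodemask,
		.gfp_mask = gfp_mask,
		.order = order,
		/*
		 * The high-watermark retry below always runs before
		 * out_of_memory() consumes oc, so record it here.
		 */
		.reserves_tried = true,
	};
	struct page *page;

	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/* last-minute retry with a very high watermark */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
	if (!page && out_of_memory(&oc))
		*did_some_progress = 1;

	mutex_unlock(&oom_lock);
	return page;
}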


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Michal Hocko
On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > > > 
> > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > > 
> > > > > Yes, it is really important. Needlessly selecting even one OOM victim
> > > > > is a pain which is difficult to explain to some of our customers, and
> > > > > hard to persuade them to accept.
> > > > 
> > > > How is this any different from a race with a task exiting and releasing
> > > > some memory after we have crossed the point of no return and will kill
> > > > something?
> > > 
> > > I'm not complaining about an exiting task releasing some memory after
> > > we have crossed the point of no return.
> > > 
> > > What I'm saying is that we can postpone "the point of no return" if we
> > > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > > we can avoid without a crystal ball.
> > 
> > If those races are really that common, then we can handle them even
> > without "try once more" tricks. Really, this is just an ugly hack. If you
> > really care, then make sure that we always try to allocate from memory
> > reserves before going down the oom path. In other words, try to find a
> > robust solution rather than tweaks around a problem.
> 
> Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> oom_lock serialization from the OOM reaper, the possibility of calling
> out_of_memory() due to a successful mutex_trylock(&oom_lock) would increase
> when the OOM reaper sets MMF_OOM_SKIP quickly.
> 
> What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
> out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?
> 
> Excuse me? Are you suggesting to try memory reserves before
> task_is_oom_victim(current) becomes true?

No, what I've tried to say is that if this really is a real problem,
which I am not sure about, then the proper way to handle that is to
attempt to allocate from memory reserves for an oom victim. I would be
even willing to take the oom_lock back into the oom reaper path if the
former turns out to be awkward to implement. But all this assumes this
is a _real_ problem.

> > [...]
> > > > Yes that is possible. Once you are in the shrinker land then you have to
> > > > reckon with everything. And if you want to imply that
> > > > get_page_from_freelist inside __alloc_pages_may_oom may lock up while
> > > > holding the oom_lock then you might be right, but I haven't checked that
> > > > too deeply. It might very well be possible that the node reclaim bails
> > > > out early when we are under OOM.
> > > 
> > > Yes, I worry that get_page_from_freelist() with oom_lock held might
> > > lock up.
> > > 
> > > If we are about to invoke the OOM killer for the first time, it is likely
> > > that __node_reclaim() finds nothing to reclaim and will bail out
> > > immediately. But if we are about to invoke the OOM killer again, it is
> > > possible that a small amount of memory was reclaimed by the OOM
> > > killer/reaper, that all of the reclaimed memory was assigned to things
> > > which __node_reclaim() will find and try to reclaim, and that any thread
> > > which took oom_lock will then call __node_reclaim() and find something
> > > reclaimable, if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> > > is involved.
> > > 
> > > We should consider such a situation volatile (i.e. we should not assume
> > > that get_page_from_freelist() with oom_lock held will bail out
> > > immediately) if shrinkers which (directly or indirectly) involve
> > > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation are permitted.
> > 
> > Well, I think you are so focused on details that you most probably miss
> > the larger picture here. Just think about the purpose of the node reclaim.
> > It is there to _prefer_ local allocations over going to a distant NUMA
> > node. So rather than speculating about details, maybe it makes sense to
> > consider whether it actually makes any sense to even try node reclaim
> > when we are OOM. In other words, why do an additional reclaim when we
> > just found out that all reclaim attempts have failed...
> 
> Below is what I will propose if there is a possibility of lockup.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index be5bd60..718b2e7 100644
> --- a/mm/page_alloc.c
> +++ 


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > > 
> > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > 
> > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > a pain which is difficult to explain to some of our customers, and hard
> > > > to persuade them to accept.
> > > 
> > > How is this any different from a race with a task exiting and releasing
> > > some memory after we have crossed the point of no return and will kill
> > > something?
> > 
> > I'm not complaining about an exiting task releasing some memory after
> > we have crossed the point of no return.
> > 
> > What I'm saying is that we can postpone "the point of no return" if we
> > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > we can avoid without a crystal ball.
> 
> If those races are really that common, then we can handle them even
> without "try once more" tricks. Really, this is just an ugly hack. If you
> really care, then make sure that we always try to allocate from memory
> reserves before going down the oom path. In other words, try to find a
> robust solution rather than tweaks around a problem.

Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
oom_lock serialization from the OOM reaper, the possibility of calling
out_of_memory() due to a successful mutex_trylock(&oom_lock) would increase
when the OOM reaper sets MMF_OOM_SKIP quickly.

What if task_is_oom_victim(current) became true and MMF_OOM_SKIP was set
on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?

Excuse me? Are you suggesting to try memory reserves before
task_is_oom_victim(current) becomes true?

> 
> [...]
> > > Yes that is possible. Once you are in the shrinker land then you have to
> > > reckon with everything. And if you want to imply that
> > > get_page_from_freelist inside __alloc_pages_may_oom may lock up while
> > > holding the oom_lock then you might be right, but I haven't checked that
> > > too deeply. It might very well be possible that the node reclaim bails
> > > out early when we are under OOM.
> > 
> > Yes, I worry that get_page_from_freelist() with oom_lock held might lock up.
> > 
> > If we are about to invoke the OOM killer for the first time, it is likely
> > that __node_reclaim() finds nothing to reclaim and will bail out
> > immediately. But if we are about to invoke the OOM killer again, it is
> > possible that a small amount of memory was reclaimed by the OOM
> > killer/reaper, that all of the reclaimed memory was assigned to things
> > which __node_reclaim() will find and try to reclaim, and that any thread
> > which took oom_lock will then call __node_reclaim() and find something
> > reclaimable, if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> > is involved.
> > 
> > We should consider such a situation volatile (i.e. we should not assume
> > that get_page_from_freelist() with oom_lock held will bail out immediately)
> > if shrinkers which (directly or indirectly) involve __GFP_DIRECT_RECLAIM &&
> > !__GFP_NORETRY memory allocation are permitted.
> 
> Well, I think you are so focused on details that you most probably miss
> the larger picture here. Just think about the purpose of the node reclaim.
> It is there to _prefer_ local allocations over going to a distant NUMA
> node. So rather than speculating about details, maybe it makes sense to
> consider whether it actually makes any sense to even try node reclaim
> when we are OOM. In other words, why do an additional reclaim when we
> just found out that all reclaim attempts have failed...

Below is what I will propose if there is a possibility of lockup.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index be5bd60..718b2e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3271,9 +3271,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure.
+	 * we're still under heavy pressure. But make sure that this reclaim
+	 * attempt shall not involve __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
+	 * allocation which will never fail due to oom_lock already held.
 	 */
-	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
+
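
(The archive truncates the replacement line of the hunk above. Judging from
the new comment text, the replacement presumably masks __GFP_DIRECT_RECLAIM
out of the retry; a guess at its shape, not the author's verbatim line:)

	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
				      ~__GFP_DIRECT_RECLAIM, order,
				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);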


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Michal Hocko
On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > > > > So, how can we verify that the above race is a real problem?
> > > > > > 
> > > > > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > > > > than necessary. 
> > > > > 
> > > > > Whether it is a _real_ workload or not cannot become an answer.
> > > > > 
> > > > > If somebody is trying to allocate hundreds/thousands of pages after
> > > > > memory of an OOM victim was reaped, avoiding this race window makes no
> > > > > sense; the next OOM victim will be selected anyway. But if somebody is
> > > > > trying to allocate only one page and then is planning to release a lot
> > > > > of memory, avoiding this race window can save somebody from being
> > > > > OOM-killed needlessly. This race window depends on what the threads
> > > > > are about to do, not whether the workload is natural or artificial.
> > > > 
> > > > And with a desperate lack of a crystal ball we cannot do much about that
> > > > really.
> > > > 
> > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > 
> > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > 
> > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > a pain which is difficult to explain to some of our customers, and hard
> > > to persuade them to accept.
> > 
> > How is this any different from a race with a task exiting and releasing
> > some memory after we have crossed the point of no return and will kill
> > something?
> 
> I'm not complaining about an exiting task releasing some memory after we have
> crossed the point of no return.
> 
> What I'm saying is that we can postpone "the point of no return" if we ignore
> MMF_OOM_SKIP for once (both this "oom_reaper: close race without using
> oom_lock" thread and the "mm, oom: task_will_free_mem(current) should ignore
> MMF_OOM_SKIP for once." thread). These are race conditions we can avoid
> without a crystal ball.

If those races are really that common, then we can handle them even
without "try once more" tricks. Really, this is just an ugly hack. If you
really care, then make sure that we always try to allocate from memory
reserves before going down the oom path. In other words, try to find a
robust solution rather than tweaks around a problem.

[...]
> > Yes that is possible. Once you are in the shrinker land then you have to
> > reckon with everything. And if you want to imply that
> > get_page_from_freelist inside __alloc_pages_may_oom may lock up while
> > holding the oom_lock then you might be right, but I haven't checked that
> > too deeply. It might very well be possible that the node reclaim bails
> > out early when we are under OOM.
> 
> Yes, I worry that get_page_from_freelist() with oom_lock held might lock up.
> 
> If we are about to invoke the OOM killer for the first time, it is likely
> that __node_reclaim() finds nothing to reclaim and will bail out immediately.
> But if we are about to invoke the OOM killer again, it is possible that a
> small amount of memory was reclaimed by the OOM killer/reaper, that all of
> the reclaimed memory was assigned to things which __node_reclaim() will find
> and try to reclaim, and that any thread which took oom_lock will then call
> __node_reclaim() and find something reclaimable, if __GFP_DIRECT_RECLAIM &&
> !__GFP_NORETRY memory allocation is involved.
> 
> We should consider such a situation volatile (i.e. we should not assume that
> get_page_from_freelist() with oom_lock held will bail out immediately) if
> shrinkers which (directly or indirectly) involve __GFP_DIRECT_RECLAIM &&
> !__GFP_NORETRY memory allocation are permitted.

Well, I think you are so focused on details that you most probably miss
the larger picture here. Just think about the purpose of the node reclaim.
It is there to _prefer_ local allocations over going to a distant NUMA
node. So rather than speculating about details, maybe it makes sense to
consider whether it actually makes any sense to even try node reclaim
when we are OOM. In other words, why do an additional reclaim when we
just found out that all reclaim attempts have failed...

-- 
Michal Hocko
SUSE Labs
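
One concrete shape this suggestion could take is an early bail-out in node
reclaim while the OOM killer is running. A sketch against the 4.13-era
node_reclaim() in mm/vmscan.c; the mutex_is_locked() test is purely
illustrative, not a patch that was posted in this thread:

int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
	/*
	 * Illustrative: if the OOM killer is already running, every regular
	 * reclaim attempt has failed, so an extra node-local reclaim pass is
	 * pointless and only risks stalling while oom_lock is held.
	 */
	if (mutex_is_locked(&oom_lock))
		return NODE_RECLAIM_NOSCAN;

	/* ... existing node reclaim logic follows unchanged ... */
}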


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-07 Thread Michal Hocko
On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > > So, how can we verify that the above race is a real problem?
> > > > 
> > > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > > than necessary. 
> > > 
> > > Whether it is a _real_ workload or not cannot become an answer.
> > > 
> > > If somebody is trying to allocate hundreds/thousands of pages after
> > > memory of an OOM victim was reaped, avoiding this race window makes no
> > > sense; the next OOM victim will be selected anyway. But if somebody is
> > > trying to allocate only one page and then is planning to release a lot of
> > > memory, avoiding this race window can save somebody from being OOM-killed
> > > needlessly. This race window depends on what the threads are about to do,
> > > not whether the workload is natural or artificial.
> > 
> > And with a desperate lack of a crystal ball we cannot do much about that
> > really.
> > 
> > > My question is, how can users know whether somebody was OOM-killed
> > > needlessly by allowing MMF_OOM_SKIP to race.
> > 
> > Is it really important to know that the race is due to MMF_OOM_SKIP?
> 
> Yes, it is really important. Needlessly selecting even one OOM victim is
> a pain which is difficult to explain to some of our customers, and hard to
> persuade them to accept.

How is this any different from a race with a task exiting and releasing
some memory after we have crossed the point of no return and will kill
something?

> > Isn't it sufficient to see that we kill too many tasks and then debug it
> > further once something hits that?
> 
> It is not sufficient.
> 
> > 
> > [...]
> > > Is it guaranteed that __node_reclaim() never (even indirectly) waits for
> > > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation?
> > 
> > this is a direct reclaim which can go down to slab shrinkers with all
> > the usual fun...
> 
> Excuse me, but does that mean "Yes, it is" ?
> 
> As far as I checked, most shrinkers use non-scheduling operations other than
> cond_resched(). But some shrinkers use lock_page()/down_write() etc. I worry
> that such shrinkers might wait for __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> memory allocation (i.e. "No, it isn't").

Yes that is possible. Once you are in the shrinker land then you have to
reckon with everything. And if you want to imply that
get_page_from_freelist inside __alloc_pages_may_oom may lock up while
holding the oom_lock then you might be right, but I haven't checked that
too deeply. It might very well be possible that the node reclaim bails
out early when we are under OOM.

-- 
Michal Hocko
SUSE Labs
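
For reference, an early bail-out does exist for non-blocking contexts:
node_reclaim() refuses to scan when the allocation must not sleep (abridged
from 4.13-era mm/vmscan.c). It does not, however, special-case being under
OOM:

int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
	/* ... */

	/*
	 * Do not scan if the allocation should not be delayed: this is what
	 * keeps !__GFP_DIRECT_RECLAIM callers (and PF_MEMALLOC contexts)
	 * from sleeping here, and why the concern is limited to blocking
	 * allocations reaching the shrinkers.
	 */
	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
		return NODE_RECLAIM_NOSCAN;

	/* ... */
}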


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-04 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > So, how can we verify that the above race is a real problem?
> > > 
> > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > than necessary. 
> > 
> > Whether it is a _real_ workload or not cannot become an answer.
> > 
> > If somebody is trying to allocate hundreds/thousands of pages after memory
> > of an OOM victim was reaped, avoiding this race window makes no sense; the
> > next OOM victim will be selected anyway. But if somebody is trying to
> > allocate only one page and then is planning to release a lot of memory,
> > avoiding this race window can save somebody from being OOM-killed
> > needlessly. This race window depends on what the threads are about to do,
> > not whether the workload is natural or artificial.
> 
> And with a desperate lack of a crystal ball we cannot do much about that
> really.
> 
> > My question is, how can users know whether somebody was OOM-killed
> > needlessly by allowing MMF_OOM_SKIP to race.
> 
> Is it really important to know that the race is due to MMF_OOM_SKIP?

Yes, it is really important. Needlessly selecting even one OOM victim is
a pain which is difficult to explain to some of our customers, and hard to
persuade them to accept.

> Isn't it sufficient to see that we kill too many tasks and then debug it
> further once something hits that?

It is not sufficient.

> 
> [...]
> > Is it guaranteed that __node_reclaim() never (even indirectly) waits for
> > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation?
> 
> this is a direct reclaim which can go down to slab shrinkers with all
> the usual fun...

Excuse me, but does that mean "Yes, it is" ?

As far as I checked, most shrinkers use non-scheduling operations other than
cond_resched(). But some shrinkers use lock_page()/down_write() etc. I worry
that such shrinkers might wait for __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
memory allocation (i.e. "No, it isn't").
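
To make the concern concrete, a self-contained sketch of a shrinker whose scan
callback takes a sleeping lock, using the 4.13-era shrinker API; the cache and
its lock are hypothetical:

#include <linux/shrinker.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(demo_cache_lock);
static unsigned long demo_cached_objects;

static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	return demo_cached_objects;
}

static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	unsigned long freed;

	/*
	 * The worrying pattern: if the current holder of demo_cache_lock is
	 * itself blocked in a __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
	 * allocation, then this scan, and therefore any node reclaim running
	 * with oom_lock held, waits on that allocation.
	 */
	mutex_lock(&demo_cache_lock);
	freed = demo_cached_objects;
	demo_cached_objects = 0;
	mutex_unlock(&demo_cache_lock);

	return freed ? freed : SHRINK_STOP;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered with register_shrinker(&demo_shrinker) from an init path */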


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-26 Thread Michal Hocko
On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > So, how can we verify that the above race is a real problem?
> > 
> > Try to simulate a _real_ workload and see whether we kill more tasks
> > than necessary. 
> 
> Whether it is a _real_ workload or not cannot become an answer.
> 
> If somebody is trying to allocate hundreds/thousands of pages after memory of
> an OOM victim was reaped, avoiding this race window makes no sense; the next
> OOM victim will be selected anyway. But if somebody is trying to allocate
> only one page and then is planning to release a lot of memory, avoiding this
> race window can save somebody from being OOM-killed needlessly. This race
> window depends on what the threads are about to do, not whether the workload
> is natural or artificial.

And with a desperate lack of a crystal ball we cannot do much about that
really.

> My question is, how can users know whether somebody was OOM-killed needlessly
> by allowing MMF_OOM_SKIP to race.

Is it really important to know that the race is due to MMF_OOM_SKIP?
Isn't it sufficient to see that we kill too many tasks and then debug it
further once something hits that?

[...]
> Is it guaranteed that __node_reclaim() never (even indirectly) waits for
> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation?

this is a direct reclaim which can go down to slab shrinkers with all
the usual fun...

> >  Such races are unfortunate but
> > unavoidable unless we synchronize oom kill with any memory freeing, which
> > smells like a no-go to me. We can try a last allocation attempt right
> > before we go and kill something (which still wouldn't be race free) but
> > that might cause other issues - e.g. prolonged thrashing without ever
> > killing something - but I haven't evaluated those to be honest.
> 
> Yes, postponing the last get_page_from_freelist() attempt till
> oom_kill_process() is the best we can afford.

as I've said, this would have to be evaluated very carefully and a strong
use case would have to be shown.
-- 
Michal Hocko
SUSE Labs
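
For reference, a sketch of the "last allocation attempt right before we go and
kill something" idea mentioned above; no such patch was posted in this thread,
and the oc->ac member is assumed here (the real struct oom_control carries no
alloc_context), so this only illustrates the shape:

static bool oom_last_alloc_attempt(struct oom_control *oc)
{
	struct page *page;

	/*
	 * Mirror the ALLOC_WMARK_HIGH retry from __alloc_pages_may_oom():
	 * only succeed if a parallel kill or exit has freed a meaningful
	 * amount of memory.
	 */
	page = get_page_from_freelist(oc->gfp_mask | __GFP_HARDWALL,
				      oc->order, ALLOC_WMARK_HIGH|ALLOC_CPUSET,
				      oc->ac /* assumed member */);
	if (page) {
		__free_pages(page, oc->order);
		return true;	/* memory came back; skip the kill */
	}
	return false;
}

oom_kill_process() would bail out when this returns true. As noted above, the
risk is prolonged thrashing near the high watermark without ever killing
anything.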


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-26 Thread Tetsuo Handa
Michal Hocko wrote:
> On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > So, how can we verify that the above race is a real problem?
> 
> Try to simulate a _real_ workload and see whether we kill more tasks
> than necessary. 

Whether it is a _real_ workload or not cannot become an answer.

If somebody is trying to allocate hundreds/thousands of pages after memory of
an OOM victim was reaped, avoiding this race window makes no sense; the next
OOM victim will be selected anyway. But if somebody is trying to allocate only
one page and then is planning to release a lot of memory, avoiding this race
window can save somebody from being OOM-killed needlessly. This race window
depends on what the threads are about to do, not whether the workload is
natural or artificial.

My question is, how can users know whether somebody was OOM-killed needlessly
by allowing MMF_OOM_SKIP to race.

> Anyway, the change you are proposing is wrong for two reasons. First,
> you are in non-preemptible context in oom_evaluate_task so you cannot
> call into get_page_from_freelist (node_reclaim) and secondly it is a
> very specific hack while there is a whole category of possible races
> where someone frees memory (e.g. an exiting task, which smells like what
> you see in your testing) while we are selecting an oom victim, which
> can be quite an expensive operation.

Oh, I didn't know that get_page_from_freelist() might sleep.
I was assuming that get_page_from_freelist() never sleeps because it is
called from !can_direct_reclaim context. But looking into that function,
it is the gfpflags_allow_blocking() check in node_reclaim(), called from
get_page_from_freelist(), that prevents a !can_direct_reclaim context from
sleeping.

OK. I have to either mask __GFP_DIRECT_RECLAIM or postpone till
oom_kill_process(). Well, I came to worry about get_page_from_freelist()
at __alloc_pages_may_oom() which is called after oom_lock is taken.

Is it guaranteed that __node_reclaim() never (even indirectly) waits for
__GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation? If it is not
guaranteed, calling __alloc_pages_may_oom(__GFP_DIRECT_RECLAIM) with oom_lock
taken can prevent __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation from
completing (because did_some_progress will be forever set to 1 due to oom_lock
already taken). This is a possible location of an OOM lockup unless that
guarantee holds.

>  Such races are unfortunate but
> unavoidable unless we synchronize oom kill with any memory freeing, which
> smells like a no-go to me. We can try a last allocation attempt right
> before we go and kill something (which still wouldn't be race free) but
> that might cause other issues - e.g. prolonged thrashing without ever
> killing something - but I haven't evaluated those to be honest.

Yes, postponing the last get_page_from_freelist() attempt till
oom_kill_process() is the best we can afford.
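
The potential lockup is easiest to see in the 4.13-era mm/page_alloc.c
(abridged excerpts): a thread that fails the oom_lock trylock reports progress
unconditionally, so its caller retries for as long as the oom_lock holder is
stuck, e.g. sleeping in a shrinker reached via node_reclaim():

	/* in __alloc_pages_may_oom(): losing the trylock still counts as
	 * progress, on the assumption that the lock holder will soon free
	 * memory by killing something */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/* in __alloc_pages_slowpath(): reported progress means retry */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}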


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-24 Thread Michal Hocko
On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 22-07-17 00:18:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > OK, so let's say you have another task just about to jump into
> > > > out_of_memory and ... end up in the same situation.
> > > 
> > > Right.
> > > 
> > > > 
> > > > This race is just
> > > > unavoidable.
> > > 
> > > There is no perfect way (always timing dependent). But
> > 
> > I would rather not add code which _pretends_ it solves something. If
> > we see the above race as a real problem out there then we should think
> > about how to fix it. I definitely do not want to add more hacks into an
> > already complicated code base.
> 
> So, how can we verify that the above race is a real problem?

Try to simulate a _real_ workload and see whether we kill more tasks
than necessary. 

> I consider that
> it is impossible. The " free:%lukB" field by show_free_areas() is too
> random/inaccurate/racy/outdated for evaluating this race window.
> 
> Only actually calling get_page_from_freelist() immediately after
> MMF_OOM_SKIP test (like Patch1 shown below) can evaluate this race window,
> but I know that you won't allow me to add such code to the OOM killer layer.

Sigh. It is not about _me_ allowing you something or not. It is about
what makes sense, under which circumstances, and the usual cost-benefit
evaluation. In other words, any patch has to be _justified_. I am really
tired of repeating this simple thing over and over again.

Anyway, the change you are proposing is wrong for two reasons. First,
you are in non-preemptible context in oom_evaluate_task so you cannot
call into get_page_from_freelist (node_reclaim) and secondly it is a
very specific hack while there is a whole category of possible races
where someone frees memory (e.g. an exiting task, which smells like what
you see in your testing) while we are selecting an oom victim which
can be quite an expensive operation. Such races are unfortunate but
unavoidable unless we synchronize oom kill with any memory freeing which
smells like a no-go to me. We can try a last allocation attempt right
before we go and kill something (which still wouldn't be race free) but
that might cause other issues - e.g. prolonged thrashing without ever
killing something - but I haven't evaluated those to be honest.

[...]

> The result shows that this race is highly timing dependent, but it
> at least shows that it is not rare case that get_page_from_freelist()
> can succeed after we checked that victim's mm already has MMF_OOM_SKIP.

It might not be rare for the extreme test case you are using. Do not
forget you spawn many tasks, and their exiting might race with the oom
selection. I am really skeptical this reflects a real use case.

> So, how can we check whether the above race is a real problem? I consider that
> it is impossible.

And so I would be rather reluctant to add more hacks^Wheuristics...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-22 Thread Tetsuo Handa
Tetsuo Handa wrote:
> Log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170722.txt.xz .

Oops, I forgot to remove mmput_async() in Patch2. Below is the updated result.
Though, the situation (i.e. we can't tell without Patch1 whether we raced with
MMF_OOM_SKIP) is the same.

Patch1:

 include/linux/oom.h |  4 
 mm/internal.h   |  4 
 mm/oom_kill.c   | 28 +++-
 mm/page_alloc.c | 10 +++---
 4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..1b0bbb6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -11,6 +11,7 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
 
 /*
 * Details of the page allocation that triggered the oom killer that are used to
@@ -39,6 +40,9 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+
+   const struct alloc_context *alloc_context;
+   unsigned int alloc_flags;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/internal.h b/mm/internal.h
index 24d88f0..95a08b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -522,4 +522,8 @@ static inline bool is_migrate_highatomic_page(struct page *page)
return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
 }
 
+struct page *get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+   int alloc_flags,
+   const struct alloc_context *ac);
+
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..fb7b2c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,6 +288,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
 }
 
+static unsigned int mmf_oom_skip_raced;
+static unsigned int mmf_oom_skip_not_raced;
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
struct oom_control *oc = arg;
@@ -303,8 +306,21 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 * any memory is quite low.
 */
if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+   const struct alloc_context *ac = oc->alloc_context;
+
+   if (ac) {
+   struct page *page = get_page_from_freelist
+   (oc->gfp_mask, oc->order,
+oc->alloc_flags, ac);
+   if (page) {
+   __free_pages(page, oc->order);
+   mmf_oom_skip_raced++;
+   } else
+   mmf_oom_skip_not_raced++;
+   }
goto next;
+   }
goto abort;
}
 
@@ -1059,6 +1075,16 @@ bool out_of_memory(struct oom_control *oc)
 */
schedule_timeout_killable(1);
}
+   {
+   static unsigned long last;
+   unsigned long now = jiffies;
+
+   if (!last || time_after(now, last + 5 * HZ)) {
+   last = now;
+   pr_info("MMF_OOM_SKIP: raced=%u not_raced=%u\n",
+   mmf_oom_skip_raced, mmf_oom_skip_not_raced);
+   }
+   }
return !!oc->chosen;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..4cf2861 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3054,7 +3054,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
-static struct page *
+struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac)
 {
@@ -3245,7 +3245,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-   const struct alloc_context *ac, unsigned long *did_some_progress)
+ unsigned int alloc_flags, const struct alloc_context *ac,
+ unsigned long *did_some_progress)
 {
struct oom_control oc = {
.zonelist = ac->zonelist,
@@ -3253,6 +3254,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
.memcg = NULL,
.gfp_mask = gfp_mask,
.order = order,
+   .alloc_context = ac,
+   .alloc_flags = alloc_flags,
};
struct page *page;
 
@@ -3955,7 +3958,8 @@ bool 

Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-22 Thread Tetsuo Handa
Michal Hocko wrote:
> On Sat 22-07-17 00:18:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, so let's say you have another task just about to jump into
> > > out_of_memory and ... end up in the same situation.
> > 
> > Right.
> > 
> > > 
> > > This race is just
> > > unavoidable.
> > 
> > There is no perfect way (always timing dependent). But
> 
> I would rather not add code which _pretends_ it solves something. If
> we see the above race as a real problem out there then we should think
> about how to fix it. I definitely do not want to add more hacks into an
> already complicated code base.

So, how can we verify that the above race is a real problem? I consider that
it is impossible. The " free:%lukB" field by show_free_areas() is too
random/inaccurate/racy/outdated for evaluating this race window.

Only actually calling get_page_from_freelist() immediately after
MMF_OOM_SKIP test (like Patch1 shown below) can evaluate this race window,
but I know that you won't allow me to add such code to the OOM killer layer.

Your "[RFC PATCH] mm, oom: allow oom reaper to race with exit_mmap" patch
is shown below as Patch2.

My "ignore MMF_OOM_SKIP once" patch is shown below as Patch3.

My "wait for oom_lock" patch is shown below as Patch4.

Patch1:

 include/linux/oom.h |  4 
 mm/internal.h   |  4 
 mm/oom_kill.c   | 28 +++-
 mm/page_alloc.c | 10 +++---
 4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2..1b0bbb6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -11,6 +11,7 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
 
 /*
 * Details of the page allocation that triggered the oom killer that are used to
@@ -39,6 +40,9 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+
+   const struct alloc_context *alloc_context;
+   unsigned int alloc_flags;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/internal.h b/mm/internal.h
index 24d88f0..95a08b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -522,4 +522,8 @@ static inline bool is_migrate_highatomic_page(struct page *page)
return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
 }
 
+struct page *get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+   int alloc_flags,
+   const struct alloc_context *ac);
+
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f0..fb7b2c8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,6 +288,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
 }
 
+static unsigned int mmf_oom_skip_raced;
+static unsigned int mmf_oom_skip_not_raced;
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
struct oom_control *oc = arg;
@@ -303,8 +306,21 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 * any memory is quite low.
 */
if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+   if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+   const struct alloc_context *ac = oc->alloc_context;
+
+   if (ac) {
+   struct page *page = get_page_from_freelist
+   (oc->gfp_mask, oc->order,
+oc->alloc_flags, ac);
+   if (page) {
+   __free_pages(page, oc->order);
+   mmf_oom_skip_raced++;
+   } else
+   mmf_oom_skip_not_raced++;
+   }
goto next;
+   }
goto abort;
}
 
@@ -1059,6 +1075,16 @@ bool out_of_memory(struct oom_control *oc)
 */
schedule_timeout_killable(1);
}
+   {
+   static unsigned long last;
+   unsigned long now = jiffies;
+
+   if (!last || time_after(now, last + 5 * HZ)) {
+   last = now;
+   pr_info("MMF_OOM_SKIP: raced=%u not_raced=%u\n",
+   mmf_oom_skip_raced, mmf_oom_skip_not_raced);
+   }
+   }
return !!oc->chosen;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..4cf2861 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3054,7 +3054,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * get_page_from_freelist goes through the zonelist trying to allocate
  

Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-21 Thread Michal Hocko
On Sat 22-07-17 00:18:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > If we ignore MMF_OOM_SKIP once, we can avoid the sequence above.
> > 
> > But we set MMF_OOM_SKIP _after_ the process lost its address space (well,
> > after the patch which allows the oom reaper to race with exit_mmap).
> > 
> > > 
> > > Process-1                                  Process-2
> > > 
> > > Takes oom_lock.
> > > Fails get_page_from_freelist().
> > > Enters out_of_memory().
> > > Gets SIGKILL.
> > > Gets TIF_MEMDIE.
> > > Leaves out_of_memory().
> > > Releases oom_lock.
> > > Enters do_exit().
> > > Calls __mmput().
> > >                                        Takes oom_lock.
> > >                                        Fails get_page_from_freelist().
> > > Releases some memory.
> > > Sets MMF_OOM_SKIP.
> > >                                        Enters out_of_memory().
> > >                                        Ignores MMF_OOM_SKIP mm once.
> > >                                        Leaves out_of_memory().
> > >                                        Releases oom_lock.
> > >                                        Succeeds get_page_from_freelist().
> > 
> > OK, so let's say you have another task just about to jump into
> > out_of_memory and ... end up in the same situation.
> 
> Right.
> 
> > 
> > This race is just
> > unavoidable.
> 
> There is no perfect way (always timing dependent). But

I would rather not add code which _pretends_ it solves something. If
we see the above race as a real problem out there then we should think
about how to fix it. I definitely do not want to add more hacks into an
already complicated code base.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-21 Thread Tetsuo Handa
Michal Hocko wrote:
> > If we ignore MMF_OOM_SKIP once, we can avoid the sequence above.
> 
> But we set MMF_OOM_SKIP _after_ the process lost its address space (well,
> after the patch which allows the oom reaper to race with exit_mmap).
> 
> > 
> > Process-1                                  Process-2
> > 
> > Takes oom_lock.
> > Fails get_page_from_freelist().
> > Enters out_of_memory().
> > Gets SIGKILL.
> > Gets TIF_MEMDIE.
> > Leaves out_of_memory().
> > Releases oom_lock.
> > Enters do_exit().
> > Calls __mmput().
> >                                        Takes oom_lock.
> >                                        Fails get_page_from_freelist().
> > Releases some memory.
> > Sets MMF_OOM_SKIP.
> >                                        Enters out_of_memory().
> >                                        Ignores MMF_OOM_SKIP mm once.
> >                                        Leaves out_of_memory().
> >                                        Releases oom_lock.
> >                                        Succeeds get_page_from_freelist().
> 
> OK, so let's say you have another task just about to jump into
> out_of_memory and ... end up in the same situation.

Right.

> 
> This race is just
> unavoidable.

There is no perfect way (always timing dependent). But

> 
> > Strictly speaking, this patch is independent of the OOM reaper.
> > This patch increases the possibility of get_page_from_freelist() succeeding
> > without sending SIGKILL. Your patch is trying to drop that possibility silently.

we can try to reduce the possibility of ending up in the same situation with
this proposal, while your proposal is irrelevant to reducing that possibility
because

> > 
> > Serializing setting of MMF_OOM_SKIP with oom_lock is one approach,
> > and ignoring MMF_OOM_SKIP once without oom_lock is another approach.
> 
> Or simply making sure that we only set the flag _after_ the address
> space is gone, which is what I am proposing.

the address space being gone does not guarantee that get_page_from_freelist()
will be called before entering out_of_memory() (e.g. the thread might be
preempted for seconds between "Fails get_page_from_freelist()." and "Enters
out_of_memory().").


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-21 Thread Michal Hocko
On Fri 21-07-17 06:47:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 19-07-17 05:51:03, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 18-07-17 23:06:50, Tetsuo Handa wrote:
> > > > > Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> > > > > guarded whole OOM reaping operations using oom_lock. But there was no
> > > > > need to guard whole operations. We needed to guard only setting of
> > > > > MMF_OOM_REAPED flag because get_page_from_freelist() in
> > > > > __alloc_pages_may_oom() is called with oom_lock held.
> > > > > 
> > > > > If we change to guard only setting of MMF_OOM_SKIP flag, the OOM 
> > > > > reaper
> > > > > can start reaping operations as soon as wake_oom_reaper() is called.
> > > > > But since setting of MMF_OOM_SKIP flag at __mmput() is not guarded 
> > > > > with
> > > > > oom_lock, guarding only the OOM reaper side is not sufficient.
> > > > > 
> > > > > If we change the OOM killer side to ignore MMF_OOM_SKIP flag once,
> > > > > there is no need to guard setting of MMF_OOM_SKIP flag, and we can
> > > > > guarantee a chance to call get_page_from_freelist() in
> > > > > __alloc_pages_may_oom() without depending on oom_lock serialization.
> > > > > 
> > > > > This patch makes MMF_OOM_SKIP act as if MMF_OOM_REAPED, and adds a new
> > > > > flag which acts as if MMF_OOM_SKIP, in order to close both race windows
> > > > > (the OOM reaper side and __mmput() side) without using oom_lock.
> > > > 
> > > > Why do we need this patch when
> > > > http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> > > > already removes the lock and solves another problem at once?
> > > 
> > > We haven't got an answer from Hugh and/or Andrea whether that patch is 
> > > safe.
> > 
> > So what? I haven't seen anybody disputing the correctness. And to be
> > honest I really dislike your patch. "Yet another round" kind of solutions
> > are usually just very ugly hacks because they are highly timing
> > sensitive.
> 
> Yes, OOM killer is highly timing sensitive.
> 
> > 
> > > Even if that patch is safe, this patch still helps with CONFIG_MMU=n case.
> > 
> > Could you explain how?
> 
> Nothing prevents the sequence below.
> 
> Process-1                                  Process-2
> 
> Takes oom_lock.
> Fails get_page_from_freelist().
> Enters out_of_memory().
> Gets SIGKILL.
> Gets TIF_MEMDIE.
> Leaves out_of_memory().
> Releases oom_lock.
> Enters do_exit().
> Calls __mmput().
>                                        Takes oom_lock.
>                                        Fails get_page_from_freelist().
> Releases some memory.
> Sets MMF_OOM_SKIP.
>                                        Enters out_of_memory().
>                                        Selects next victim because there is no !MMF_OOM_SKIP mm.
>                                        Sends SIGKILL needlessly.
> 
> If we ignore MMF_OOM_SKIP once, we can avoid the sequence above.

But we set MMF_OOM_SKIP _after_ the process lost its address space (well,
after the patch which allows the oom reaper to race with exit_mmap).

> 
> Process-1                                  Process-2
> 
> Takes oom_lock.
> Fails get_page_from_freelist().
> Enters out_of_memory().
> Gets SIGKILL.
> Gets TIF_MEMDIE.
> Leaves out_of_memory().
> Releases oom_lock.
> Enters do_exit().
> Calls __mmput().
>                                        Takes oom_lock.
>                                        Fails get_page_from_freelist().
> Releases some memory.
> Sets MMF_OOM_SKIP.
>                                        Enters out_of_memory().
>                                        Ignores MMF_OOM_SKIP mm once.
>                                        Leaves out_of_memory().
>                                        Releases oom_lock.
>                                        Succeeds get_page_from_freelist().

OK, so let's say you have another task just about to jump into
out_of_memory and ... end up in the same situation. This race is just
unavoidable.

> Strictly speaking, this patch is independent of the OOM reaper.
> This patch increases the possibility of get_page_from_freelist() succeeding
> without sending SIGKILL. Your patch is trying to drop that possibility silently.
> 
> Serializing setting of MMF_OOM_SKIP with oom_lock is one approach,
> and ignoring MMF_OOM_SKIP once without oom_lock is another approach.

Or simply making sure that we only set the flag _after_ the address
space is gone, which is what I am proposing.
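
For illustration, a minimal sketch of that ordering in the exit path
(__mmput_sketch() is a hypothetical name; the actual proposal wires this
into the oom reaper/exit_mmap interaction rather than adding a helper):

/* sketch: publish MMF_OOM_SKIP only once the memory is really gone */
static inline void __mmput_sketch(struct mm_struct *mm)
{
	exit_mmap(mm);		/* the address space is released here */
	/* only now may the OOM killer treat this mm as already reaped */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	mmdrop(mm);
}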

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-20 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 19-07-17 05:51:03, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 18-07-17 23:06:50, Tetsuo Handa wrote:
> > > > Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> > > > guarded whole OOM reaping operations using oom_lock. But there was no
> > > > need to guard whole operations. We needed to guard only setting of
> > > > MMF_OOM_REAPED flag because get_page_from_freelist() in
> > > > __alloc_pages_may_oom() is called with oom_lock held.
> > > > 
> > > > If we change to guard only setting of MMF_OOM_SKIP flag, the OOM reaper
> > > > can start reaping operations as soon as wake_oom_reaper() is called.
> > > > But since setting of MMF_OOM_SKIP flag at __mmput() is not guarded with
> > > > oom_lock, guarding only the OOM reaper side is not sufficient.
> > > > 
> > > > If we change the OOM killer side to ignore MMF_OOM_SKIP flag once,
> > > > there is no need to guard setting of MMF_OOM_SKIP flag, and we can
> > > > guarantee a chance to call get_page_from_freelist() in
> > > > __alloc_pages_may_oom() without depending on oom_lock serialization.
> > > > 
> > > > This patch makes MMF_OOM_SKIP act as if MMF_OOM_REAPED, and adds a new
> > > > flag which acts as if MMF_OOM_SKIP, in order to close both race windows
> > > > (the OOM reaper side and __mmput() side) without using oom_lock.
> > > 
> > > Why do we need this patch when
> > > http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> > > already removes the lock and solves another problem at once?
> > 
> > We haven't got an answer from Hugh and/or Andrea whether that patch is safe.
> 
> So what? I haven't seen anybody disputing the correctness. And to be
> honest I really dislike your patch. "Yet another round" kind of solutions
> are usually just very ugly hacks because they are highly timing
> sensitive.

Yes, OOM killer is highly timing sensitive.

> 
> > Even if that patch is safe, this patch still helps with CONFIG_MMU=n case.
> 
> Could you explain how?

Nothing prevents the sequence below.

Process-1                                  Process-2

Takes oom_lock.
Fails get_page_from_freelist().
Enters out_of_memory().
Gets SIGKILL.
Gets TIF_MEMDIE.
Leaves out_of_memory().
Releases oom_lock.
Enters do_exit().
Calls __mmput().
                                           Takes oom_lock.
                                           Fails get_page_from_freelist().
Releases some memory.
Sets MMF_OOM_SKIP.
                                           Enters out_of_memory().
                                           Selects next victim because there is no !MMF_OOM_SKIP mm.
                                           Sends SIGKILL needlessly.

If we ignore MMF_OOM_SKIP once, we can avoid the sequence above.

Process-1                                  Process-2

Takes oom_lock.
Fails get_page_from_freelist().
Enters out_of_memory().
Gets SIGKILL.
Gets TIF_MEMDIE.
Leaves out_of_memory().
Releases oom_lock.
Enters do_exit().
Calls __mmput().
                                           Takes oom_lock.
                                           Fails get_page_from_freelist().
Releases some memory.
Sets MMF_OOM_SKIP.
                                           Enters out_of_memory().
                                           Ignores MMF_OOM_SKIP mm once.
                                           Leaves out_of_memory().
                                           Releases oom_lock.
                                           Succeeds get_page_from_freelist().

Strictly speaking, this patch is independent of the OOM reaper.
This patch increases the possibility of get_page_from_freelist() succeeding
without sending SIGKILL. Your patch is trying to drop that possibility silently.

Serializing setting of MMF_OOM_SKIP with oom_lock is one approach,
and ignoring MMF_OOM_SKIP once without oom_lock is another approach.
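
For concreteness, a hedged sketch of the first approach (assuming the flag
setter runs in a sleepable context): since __alloc_pages_may_oom() calls
get_page_from_freelist() and out_of_memory() with oom_lock held, publishing
the flag under the same lock closes the window between those two calls.

	/* sketch of approach 1: serialize the flag with oom_lock */
	mutex_lock(&oom_lock);
	set_bit(MMF_OOM_SKIP, &mm->flags);
	mutex_unlock(&oom_lock);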


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-20 Thread Michal Hocko
On Wed 19-07-17 05:51:03, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 18-07-17 23:06:50, Tetsuo Handa wrote:
> > > Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> > > guarded whole OOM reaping operations using oom_lock. But there was no
> > > need to guard whole operations. We needed to guard only setting of
> > > MMF_OOM_REAPED flag because get_page_from_freelist() in
> > > __alloc_pages_may_oom() is called with oom_lock held.
> > > 
> > > If we change to guard only setting of MMF_OOM_SKIP flag, the OOM reaper
> > > can start reaping operations as soon as wake_oom_reaper() is called.
> > > But since setting of MMF_OOM_SKIP flag at __mmput() is not guarded with
> > > oom_lock, guarding only the OOM reaper side is not sufficient.
> > > 
> > > If we change the OOM killer side to ignore MMF_OOM_SKIP flag once,
> > > there is no need to guard setting of MMF_OOM_SKIP flag, and we can
> > > guarantee a chance to call get_page_from_freelist() in
> > > __alloc_pages_may_oom() without depending on oom_lock serialization.
> > > 
> > > This patch makes MMF_OOM_SKIP act as if MMF_OOM_REAPED, and adds a new
> > > flag which acts as if MMF_OOM_SKIP, in order to close both race windows
> > > (the OOM reaper side and __mmput() side) without using oom_lock.
> > 
> > Why do we need this patch when
> > http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> > already removes the lock and solves another problem at once?
> 
> We haven't got an answer from Hugh and/or Andrea whether that patch is safe.

So what? I haven't seen anybody disputing the correctness. And to be
honest I really dislike your patch. "Yet another round" kind of solutions
are usually just very ugly hacks because they are highly timing
sensitive.

> Even if that patch is safe, this patch still helps with CONFIG_MMU=n case.

Could you explain how?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-18 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 18-07-17 23:06:50, Tetsuo Handa wrote:
> > Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> > guarded whole OOM reaping operations using oom_lock. But there was no
> > need to guard whole operations. We needed to guard only setting of
> > MMF_OOM_REAPED flag because get_page_from_freelist() in
> > __alloc_pages_may_oom() is called with oom_lock held.
> > 
> > If we change to guard only setting of MMF_OOM_SKIP flag, the OOM reaper
> > can start reaping operations as soon as wake_oom_reaper() is called.
> > But since setting of MMF_OOM_SKIP flag at __mmput() is not guarded with
> > oom_lock, guarding only the OOM reaper side is not sufficient.
> > 
> > If we change the OOM killer side to ignore MMF_OOM_SKIP flag once,
> > there is no need to guard setting of MMF_OOM_SKIP flag, and we can
> > guarantee a chance to call get_page_from_freelist() in
> > __alloc_pages_may_oom() without depending on oom_lock serialization.
> > 
> > This patch makes MMF_OOM_SKIP act as if MMF_OOM_REAPED, and adds a new
> > flag which acts as if MMF_OOM_SKIP, in order to close both race windows
> > (the OOM reaper side and __mmput() side) without using oom_lock.
> 
> Why do we need this patch when
> http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
> already removes the lock and solves another problem at once?

We haven't got an answer from Hugh and/or Andrea whether that patch is safe.
Even if that patch is safe, this patch still helps with CONFIG_MMU=n case.


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-18 Thread Johannes Weiner
On Tue, Jul 18, 2017 at 11:06:50PM +0900, Tetsuo Handa wrote:
> Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> guarded whole OOM reaping operations using oom_lock. But there was no
> need to guard whole operations. We needed to guard only setting of
> MMF_OOM_REAPED flag because get_page_from_freelist() in
> __alloc_pages_may_oom() is called with oom_lock held.
>
> If we change to guard only setting of MMF_OOM_SKIP flag, the OOM reaper
> can start reaping operations as soon as wake_oom_reaper() is called.
> But since setting of MMF_OOM_SKIP flag at __mmput() is not guarded with
> oom_lock, guarding only the OOM reaper side is not sufficient.
> 
> If we change the OOM killer side to ignore MMF_OOM_SKIP flag once,
> there is no need to guard setting of MMF_OOM_SKIP flag, and we can
> guarantee a chance to call get_page_from_freelist() in
> __alloc_pages_may_oom() without depending on oom_lock serialization.
> 
> This patch makes MMF_OOM_SKIP act as if MMF_OOM_REAPED, and adds a new
> flag which acts as if MMF_OOM_SKIP, in order to close both race windows
> (the OOM reaper side and __mmput() side) without using oom_lock.

I have no idea what this is about - a race window fix? A performance
optimization? A code simplification?

Users and vendors are later going to read through these changelogs and
have to decide whether they want this patch or upgrade to a kernel
containing it. Please keep these people in mind when writing the
subject and first paragraph of the changelogs.


Re: [PATCH] oom_reaper: close race without using oom_lock

2017-07-18 Thread Michal Hocko
On Tue 18-07-17 23:06:50, Tetsuo Handa wrote:
> Commit e2fe14564d3316d1 ("oom_reaper: close race with exiting task")
> guarded the whole OOM reaping operation with oom_lock. But there was no
> need to guard the whole operation; we only needed to guard the setting
> of the MMF_OOM_REAPED flag, because get_page_from_freelist() in
> __alloc_pages_may_oom() is called with oom_lock held.
> 
> If we change the code to guard only the setting of the MMF_OOM_SKIP
> flag, the OOM reaper can start reaping as soon as wake_oom_reaper() is
> called. But since the setting of the MMF_OOM_SKIP flag at __mmput() is
> not guarded with oom_lock, guarding only the OOM reaper side is not
> sufficient.
> 
> If we instead change the OOM killer side to ignore the MMF_OOM_SKIP
> flag once, there is no need to guard the setting of the MMF_OOM_SKIP
> flag at all, and we can guarantee a chance to call
> get_page_from_freelist() in __alloc_pages_may_oom() without depending
> on oom_lock serialization.
> 
> This patch makes MMF_OOM_SKIP behave as MMF_OOM_REAPED used to, and
> adds a new flag which behaves as MMF_OOM_SKIP used to, in order to
> close both race windows (the OOM reaper side and the __mmput() side)
> without using oom_lock.

Why do we need this patch when
http://lkml.kernel.org/r/20170626130346.26314-1-mho...@kernel.org
already removes the lock and solves another problem at once?

> Signed-off-by: Tetsuo Handa 
> ---
>  include/linux/mm_types.h |  1 +
>  mm/oom_kill.c            | 42 +++++++++++++++---------------------------
>  2 files changed, 16 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index ff15181..3184b7a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -495,6 +495,7 @@ struct mm_struct {
>  	 */
>  	bool tlb_flush_pending;
>  #endif
> +	bool oom_killer_synchronized;
>  	struct uprobes_state uprobes_state;
>  #ifdef CONFIG_HUGETLB_PAGE
>  	atomic_long_t hugetlb_usage;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e8b4f0..1710133 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -300,11 +300,17 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves unless
>  	 * the task has MMF_OOM_SKIP because chances that it would release
> -	 * any memory is quite low.
> +	 * any memory is quite low. But ignore MMF_OOM_SKIP once, for there is
> +	 * still a possibility that get_page_from_freelist() with oom_lock held
> +	 * succeeds, because MMF_OOM_SKIP is set without oom_lock held.
>  	 */
>  	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
> -		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
> +		struct mm_struct *mm = task->signal->oom_mm;
> +
> +		if (mm->oom_killer_synchronized)
>  			goto next;
> +		if (test_bit(MMF_OOM_SKIP, &mm->flags))
> +			mm->oom_killer_synchronized = true;
>  		goto abort;
>  	}
>  
> @@ -470,28 +476,10 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>  	struct mmu_gather tlb;
>  	struct vm_area_struct *vma;
> -	bool ret = true;
> -
> -	/*
> -	 * We have to make sure to not race with the victim exit path
> -	 * and cause premature new oom victim selection:
> -	 * __oom_reap_task_mm		exit_mm
> -	 *   mmget_not_zero
> -	 *				  mmput
> -	 *				    atomic_dec_and_test
> -	 *				  exit_oom_victim
> -	 *				  [...]
> -	 *				  out_of_memory
> -	 *				    select_bad_process
> -	 *				      # no TIF_MEMDIE task selects new victim
> -	 *   unmap_page_range		# frees some memory
> -	 */
> -	mutex_lock(&oom_lock);
>  
>  	if (!down_read_trylock(&mm->mmap_sem)) {
> -		ret = false;
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		return false;
>  	}
>  
>   /*
> @@ -502,7 +490,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	if (!mmget_not_zero(mm)) {
>  		up_read(&mm->mmap_sem);
>  		trace_skip_task_reaping(tsk->pid);
> -		goto unlock_oom;
> +		return true;
>  	}
>  
>   trace_start_task_reaping(tsk->pid);
> @@ -549,9 +537,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	 */
>  	mmput_async(mm);
>  	trace_finish_task_reaping(tsk->pid);
> -unlock_oom:
> -	mutex_unlock(&oom_lock);
> -	return ret;
> +	return true;
>  }
>  
>  #define MAX_OOM_REAP_RETRIES 10
> @@ -661,8 +647,10 @@ static void mark_oom_victim(struct task_struct *tsk)
>  		return;
>  
>  	/* oom_mm is