Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > > > >
> > > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > > >
> > > > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > > > a pain which is difficult to explain to some of our customers and to
> > > > > > persuade them to accept.
> > > > > 
> > > > > How is this any different from a race with a task exiting and releasing
> > > > > some memory after we have crossed the point of no return and will kill
> > > > > something?
> > > > 
> > > > I'm not complaining about an exiting task releasing some memory after we
> > > > have crossed the point of no return.
> > > >
> > > > What I'm saying is that we can postpone "the point of no return" if we
> > > > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > > > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > > > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > > > we can avoid without a crystal ball.
> > > 
> > > If those races are really that common then we can handle them even
> > > without "try once more" tricks. Really this is just an ugly hack. If you
> > > really care then make sure that we always try to allocate from memory
> > > reserves before going down the oom path. In other words, try to find a
> > > robust solution rather than tweaks around a problem.
> > 
> > Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> > oom_lock serialization from the OOM reaper, the possibility of calling
> > out_of_memory() due to a successful mutex_trylock(&oom_lock) increases when
> > the OOM reaper sets MMF_OOM_SKIP quickly.
> > 
> > What if tsk_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> > on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
> > out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?
> >
> > Excuse me? Are you suggesting trying memory reserves before
> > tsk_is_oom_victim(current) becomes true?
> 
> No, what I've tried to say is that if this really is a real problem,
> which I am not sure about, then the proper way to handle it is to
> attempt to allocate from memory reserves for an oom victim. I would be
> even willing to take the oom_lock back into the oom reaper path if the
> former turns out to be awkward to implement. But all this assumes this
> is a _real_ problem.

Aren't we back to square one? My question is, how can users know whether
somebody was OOM-killed needlessly by allowing MMF_OOM_SKIP to race?

You don't want to call get_page_from_freelist() from out_of_memory(), do you?
But without passing a flag "whether get_page_from_freelist() with memory
reserves was already attempted if the current thread is an OOM victim" to
task_will_free_mem() in out_of_memory(), and a flag "whether
get_page_from_freelist() without memory reserves was already attempted if the
current thread is not an OOM victim" to the test_bit(MMF_OOM_SKIP) check in
oom_evaluate_task(), we won't be able to tell whether somebody was OOM-killed
needlessly by allowing MMF_OOM_SKIP to race.

Will you accept passing such flags (something like the incomplete patch shown
below)?

--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -35,6 +35,8 @@ struct oom_control {
          */
         const int order;
 
+        const bool reserves_tried;
+
         /* Used by oom implementation, do not set */
         unsigned long totalpages;
         struct task_struct *chosen;
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -303,8 +303,10 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
          * any memory is quite low.
          */
         if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
-                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
+                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
+                        WARN_ON(!oc->reserves_tried); // can't represent correctly
                         goto next;
+                }
                 goto abort;
         }
 
@@ -762,7 +764,7 @@ static inline bool __task_will_free_mem(struct task_struct *task)
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool task_will_free_mem(struct task_struct *task, const bool reserves_tried)
 {
         struct mm_struct *mm = task->mm;
         struct task_struct *p;
@@ -783,8 +785,10 @@ stati

Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Michal Hocko
On Thu 10-08-17 21:10:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > > >
> > > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > > >
> > > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > > a pain which is difficult to explain to some of our customers and to
> > > > > persuade them to accept.
> > > > 
> > > > How is this any different from a race with a task exiting and releasing
> > > > some memory after we have crossed the point of no return and will kill
> > > > something?
> > > 
> > > I'm not complaining about an exiting task releasing some memory after we
> > > have crossed the point of no return.
> > >
> > > What I'm saying is that we can postpone "the point of no return" if we
> > > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > > we can avoid without a crystal ball.
> > 
> > If those races are really that common then we can handle them even
> > without "try once more" tricks. Really this is just an ugly hack. If you
> > really care then make sure that we always try to allocate from memory
> > reserves before going down the oom path. In other words, try to find a
> > robust solution rather than tweaks around a problem.
> 
> Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
> oom_lock serialization from the OOM reaper, the possibility of calling
> out_of_memory() due to a successful mutex_trylock(&oom_lock) increases when
> the OOM reaper sets MMF_OOM_SKIP quickly.
> 
> What if tsk_is_oom_victim(current) became true and MMF_OOM_SKIP was set
> on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
> out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?
>
> Excuse me? Are you suggesting trying memory reserves before
> tsk_is_oom_victim(current) becomes true?

No, what I've tried to say is that if this really is a real problem,
which I am not sure about, then the proper way to handle it is to
attempt to allocate from memory reserves for an oom victim. I would be
even willing to take the oom_lock back into the oom reaper path if the
former turns out to be awkward to implement. But all this assumes this
is a _real_ problem.
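
To be concrete, what I have in mind is something along these lines inside
__alloc_pages_may_oom(), before out_of_memory() is invoked (untested sketch;
the exact placement and the ALLOC_NO_WATERMARKS flag are illustrative only):

        /*
         * Untested sketch: let an already-selected OOM victim dip into the
         * memory reserves once before it is allowed to trigger yet another
         * OOM kill.
         */
        if (tsk_is_oom_victim(current)) {
                page = get_page_from_freelist(gfp_mask, order,
                                              ALLOC_NO_WATERMARKS, ac);
                if (page)
                        goto out;
        }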

> > [...]
> > > > Yes that is possible. Once you are in the shrinker land then you have to
> > > > expect anything. And if you want to imply that get_page_from_freelist
> > > > inside __alloc_pages_may_oom may lock up while holding the oom_lock then
> > > > you might be right, but I haven't checked that too deeply. It might very
> > > > well be possible that the node reclaim bails out early when we are under
> > > > OOM.
> > > 
> > > Yes, I worry that get_page_from_freelist() with oom_lock held might lock up.
> > >
> > > If we are about to invoke the OOM killer for the first time, it is likely
> > > that __node_reclaim() finds nothing to reclaim and will bail out
> > > immediately. But if we are about to invoke the OOM killer again, it is
> > > possible that a small amount of memory was reclaimed by the OOM
> > > killer/reaper, that all of the reclaimed memory was assigned to things
> > > which __node_reclaim() will find and try to reclaim, and that any thread
> > > which took oom_lock will call __node_reclaim() and find something
> > > reclaimable, if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> > > is involved.
> > >
> > > We should consider such a situation volatile (i.e. we should not assume
> > > that get_page_from_freelist() with oom_lock held will bail out
> > > immediately) if shrinkers which (directly or indirectly) involve
> > > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation are permitted.
> > 
> > Well, I think you are so focused on details that you most probably miss
> > the larger picture here. Just think about the purpose of node reclaim.
> > It is there to _prefer_ local allocations rather than going to a distant
> > NUMA node. So rather than speculating about details, maybe it makes sense
> > to consider whether it actually makes any sense to even try node reclaim
> > when we are OOM. In other words, why do an additional reclaim when we
> > have just found out that all reclaim attempts have failed...
> 
> Below is what I will propose if there is a possibility of lockup.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index be5bd60..718b2e7 100644
> --- a/mm/page_alloc.c
> 

Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Tetsuo Handa
Michal Hocko wrote:
> On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > >
> > > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > >
> > > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > > a pain which is difficult to explain to some of our customers and to
> > > > persuade them to accept.
> > > 
> > > How is this any different from a race with a task exiting and releasing
> > > some memory after we have crossed the point of no return and will kill
> > > something?
> > 
> > I'm not complaining about an exiting task releasing some memory after we
> > have crossed the point of no return.
> >
> > What I'm saying is that we can postpone "the point of no return" if we
> > ignore MMF_OOM_SKIP for once (both this "oom_reaper: close race without
> > using oom_lock" thread and the "mm, oom: task_will_free_mem(current)
> > should ignore MMF_OOM_SKIP for once." thread). These are race conditions
> > we can avoid without a crystal ball.
> 
> If those races are really that common then we can handle them even
> without "try once more" tricks. Really this is just an ugly hack. If you
> really care then make sure that we always try to allocate from memory
> reserves before going down the oom path. In other words, try to find a
> robust solution rather than tweaks around a problem.

Since your "mm, oom: allow oom reaper to race with exit_mmap" patch removes
oom_lock serialization from the OOM reaper, the possibility of calling
out_of_memory() due to a successful mutex_trylock(&oom_lock) increases when
the OOM reaper sets MMF_OOM_SKIP quickly.

What if tsk_is_oom_victim(current) became true and MMF_OOM_SKIP was set
on current->mm after __gfp_pfmemalloc_flags() returned 0 but before
out_of_memory() is called (due to a successful mutex_trylock(&oom_lock))?

Excuse me? Are you suggesting trying memory reserves before
tsk_is_oom_victim(current) becomes true?
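
To spell out why this window leads to an extra victim, these are the checks
involved, roughly (simplified excerpts; the real code is in mm/oom_kill.c):

        /*
         * In task_will_free_mem(current): the "current is already dying, no
         * need to kill anything" shortcut is not taken once the OOM reaper
         * has set MMF_OOM_SKIP on current->mm.
         */
        if (test_bit(MMF_OOM_SKIP, &mm->flags))
                return false;

        /*
         * In oom_evaluate_task(): the existing victim (current) is skipped
         * for the same reason, so select_bad_process() goes on to choose
         * yet another victim.
         */
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;
                goto abort;
        }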

> 
> [...]
> > > Yes that is possible. Once you are in the shrinker land then you have to
> > > expect anything. And if you want to imply that get_page_from_freelist
> > > inside __alloc_pages_may_oom may lock up while holding the oom_lock then
> > > you might be right, but I haven't checked that too deeply. It might very
> > > well be possible that the node reclaim bails out early when we are under
> > > OOM.
> > 
> > Yes, I worry that get_page_from_freelist() with oom_lock held might lock up.
> >
> > If we are about to invoke the OOM killer for the first time, it is likely
> > that __node_reclaim() finds nothing to reclaim and will bail out
> > immediately. But if we are about to invoke the OOM killer again, it is
> > possible that a small amount of memory was reclaimed by the OOM
> > killer/reaper, that all of the reclaimed memory was assigned to things
> > which __node_reclaim() will find and try to reclaim, and that any thread
> > which took oom_lock will call __node_reclaim() and find something
> > reclaimable, if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> > is involved.
> >
> > We should consider such a situation volatile (i.e. we should not assume
> > that get_page_from_freelist() with oom_lock held will bail out
> > immediately) if shrinkers which (directly or indirectly) involve
> > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation are permitted.
> 
> Well, I think you are so focused on details that you most probably miss
> the larger picture here. Just think about the purpose of node reclaim.
> It is there to _prefer_ local allocations rather than going to a distant
> NUMA node. So rather than speculating about details, maybe it makes sense
> to consider whether it actually makes any sense to even try node reclaim
> when we are OOM. In other words, why do an additional reclaim when we
> have just found out that all reclaim attempts have failed...

Below is what I will propose if there is a possibility of lockup.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index be5bd60..718b2e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3271,9 +3271,11 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
         /*
          * Go through the zonelist yet one more time, keep very high watermark
          * here, this is only to catch a parallel oom killing, we must fail if
-         * we're still under heavy pressure.
+         * we're still under heavy pressure. But make sure that this reclaim
+         * attempt shall not involve __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
+         * allocation which will never fail due to oom_lock already held.
          */
-        page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
+
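
The intent of that hunk is to make this one retry unable to start another
blocking reclaim pass while oom_lock is held. Purely for illustration (this is
not necessarily the exact replacement line), such a call could look like:

        page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
                                        ~__GFP_DIRECT_RECLAIM, order,
                                        ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);

With __GFP_DIRECT_RECLAIM cleared, node_reclaim() should bail out early via its
gfpflags_allow_blocking() check, so no shrinkers can be entered from this call
while oom_lock is held.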

Re: Re: [PATCH] oom_reaper: close race without using oom_lock

2017-08-10 Thread Michal Hocko
On Tue 08-08-17 11:14:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > > > > So, how can we verify that the above race is a real problem?
> > > > > > 
> > > > > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > > > > than necessary. 
> > > > > 
> > > > > Whether it is a _real_ workload or not cannot be the answer.
> > > > >
> > > > > If somebody is trying to allocate hundreds or thousands of pages after
> > > > > the memory of an OOM victim was reaped, avoiding this race window makes
> > > > > no sense; the next OOM victim will be selected anyway. But if somebody
> > > > > is trying to allocate only one page and is then planning to release a
> > > > > lot of memory, avoiding this race window can save somebody from being
> > > > > OOM-killed needlessly. This race window depends on what the threads are
> > > > > about to do, not on whether the workload is natural or artificial.
> > > > 
> > > > And with a desperate lack of a crystal ball we cannot do much about that,
> > > > really.
> > > > 
> > > > > My question is, how can users know whether somebody was OOM-killed
> > > > > needlessly by allowing MMF_OOM_SKIP to race.
> > > > 
> > > > Is it really important to know that the race is due to MMF_OOM_SKIP?
> > > 
> > > Yes, it is really important. Needlessly selecting even one OOM victim is
> > > a pain which is difficult to explain to some of our customers and to
> > > persuade them to accept.
> > 
> > How is this any different from a race with a task exiting and releasing
> > some memory after we have crossed the point of no return and will kill
> > something?
> 
> I'm not complaining about an exiting task releasing some memory after we have
> crossed the point of no return.
>
> What I'm saying is that we can postpone "the point of no return" if we ignore
> MMF_OOM_SKIP for once (both this "oom_reaper: close race without using
> oom_lock" thread and the "mm, oom: task_will_free_mem(current) should ignore
> MMF_OOM_SKIP for once." thread). These are race conditions we can avoid
> without a crystal ball.

If those races are really that common then we can handle them even
without "try once more" tricks. Really this is just an ugly hack. If you
really care then make sure that we always try to allocate from memory
reserves before going down the oom path. In other words, try to find a
robust solution rather than tweaks around a problem.

[...]
> > Yes that is possible. Once you are in the shrinker land then you have to
> > expect anything. And if you want to imply that get_page_from_freelist
> > inside __alloc_pages_may_oom may lock up while holding the oom_lock then
> > you might be right, but I haven't checked that too deeply. It might very
> > well be possible that the node reclaim bails out early when we are under
> > OOM.
> 
> Yes, I worry that get_page_from_freelist() with oom_lock held might lock up.
>
> If we are about to invoke the OOM killer for the first time, it is likely
> that __node_reclaim() finds nothing to reclaim and will bail out
> immediately. But if we are about to invoke the OOM killer again, it is
> possible that a small amount of memory was reclaimed by the OOM
> killer/reaper, that all of the reclaimed memory was assigned to things
> which __node_reclaim() will find and try to reclaim, and that any thread
> which took oom_lock will call __node_reclaim() and find something
> reclaimable, if __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation
> is involved.
>
> We should consider such a situation volatile (i.e. we should not assume
> that get_page_from_freelist() with oom_lock held will bail out
> immediately) if shrinkers which (directly or indirectly) involve
> __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation are permitted.

Well, I think you are so focused on details that you most probably miss
the larger picture here. Just think about the purpose of node reclaim.
It is there to _prefer_ local allocations rather than going to a distant
NUMA node. So rather than speculating about details, maybe it makes sense
to consider whether it actually makes any sense to even try node reclaim
when we are OOM. In other words, why do an additional reclaim when we
have just found out that all reclaim attempts have failed...
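
Just to illustrate the direction (a hypothetical, untested check; whether
oom_lock is the right condition to test is an open question), node reclaim
could simply refuse to run once somebody is already executing the OOM path:

        /*
         * Hypothetical sketch for node_reclaim() in mm/vmscan.c: if the OOM
         * path is already running, there is nothing useful left for node
         * reclaim to find, so do not even start scanning.
         */
        if (mutex_is_locked(&oom_lock))
                return NODE_RECLAIM_NOSCAN;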

-- 
Michal Hocko
SUSE Labs