Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-14 Thread Michal Hocko
On Tue 14-04-15 06:36:25, Johannes Weiner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
[...]
> > AFAIU, David wasn't asking for the OOM killer as much as he was
> > interested in getting access to a small amount of reserves in order to
> > make a progress. __GFP_HIGH is there for this purpose.
> 
> That's not just any reserve pool available to the generic caller, it's
> the reserve pool for interrupts, which can not wait and replenish it.
> It relies on kswapd to run soon after the interrupt, or right away on
> SMP.  But locks held in the filesystem can hold up kswapd (the reason
> we even still perform direct reclaim) so NOFS allocs shouldn't use it.
> 
> [hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
> 39
> [hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
> 4324
> 
> Interrupts have *no other option*. 

Atomic context in general gets ALLOC_HARDER, so it has access to additional
reserves beyond plain __GFP_HIGH|__GFP_WAIT.

> It's misguided to deplete their
> reserves, cause loss of network packets, loss of input events, from
> allocations that can actually perform reclaim and have perfectly
> acceptable fallback strategies in the caller.

OK, I thought that it was clear that the proposed __GFP_HIGH is a
fallback strategy for those paths which cannot do much better, not a
random solution for "this shouldn't fail too eagerly".

> Generally, for any reserve system there must be a way to replenish it.
> For interrupts it's kswapd, for the OOM reserves I proposed it's the
> OOM victim exiting soon after the allocation, if not right away.

And my understanding was that the fallback mode would be used in a
context which would lead to release of the fs pressure, thus releasing
memory as well.

> __GFP_NOFAIL is the odd one out here because accessing the system's
> emergency reserves without any prospect of near-future replenishing is
> just slightly better than deadlocking right away.  Which is why this
> reserve access can not be separated out: if you can do *anything*
> better than hanging, do it.  If not, use __GFP_NOFAIL.

Agreed.
 
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some other fs-specific shrinkers (and heavy
> > metadata loads)?
> 
> My bad, I forgot about those.  But it doesn't really change the basic
> question of whether we want to change the GFP_NOFS default or merely
> annotate individual sites that want to try harder.

My understanding was the latter. If you look at page cache allocations
which use mapping_gfp_mask (e.g. xfs uses GFP_NOFS for that context all
the time), those do not really have to try harder.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer.
> 
> We *are* talking about the last resort for these allocations!  There
> is nothing else we can do to avoid allocation failure at this point.
> Absent a reservation system, we have the choice between failing after
> reclaim - which Dave said was too fragile for XFS - or OOM killing.

As per other emails in this thread (e.g.
http://marc.info/?l=linux-mm&m=142897087230385&w=2), I understood that
access to a small portion of the emergency pool would be sufficient to
release the pressure, and that sounds preferable to me over destructive
reclaim attempts.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-14 Thread Johannes Weiner
On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > > [...]
> > > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > > the allocator, even though many of them seem to have fallback code.
> > > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > > allocations that is smarter than hanging, we should probably use that.
> > > > 
> > > > We already do that for allocations where we can handle failure in
> > > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > > tell the allocator to try really hard if we've already had a failure
> > > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > > trying to clean dirty objects so they can be reclaimed).
> > > > 
> > > > From that perspective, I think that this patch set aims to force us
> > > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > > more likely to fail, and b) provides no mechanism to "try harder"
> > > > when we really need the allocation to succeed.
> > > 
> > > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > > in your fallback case?
> > 
> > I would think __GFP_REPEAT would be more suitable here.  From the doc:
> > 
> >  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> >  * _might_ fail.  This depends upon the particular VM implementation.
> > 
> > so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> > are allowed to use the OOM killer and dip into the OOM reserves.
> 
> __GFP_REPEAT is quite subtle already.  It makes a difference only
> for high order allocations

That's an implementation detail, owed to the fact that smaller orders
already imply that behavior.  That doesn't change the semantics.  And
people currently *use* it all over the tree for small orders, because
of how the flag is defined in gfp.h; not because of how it's currently
implemented.

> and it is not clear to me why it should imply OOM killer for small
> orders now.  Or did you suggest making it special only with
> GFP_NOFS?  That sounds even more ugly.

Small orders already invoke the OOM killer.  I suggested using this
flag to override the specialness of GFP_NOFS, not OOM killing, in
response to whether we can provide an annotation to make some GFP_NOFS
sites more robust.

This is exactly what __GFP_REPEAT is: try the allocation harder than
you would without this flag.  It identifies a caller that is willing
to put in extra effort or be more aggressive because the allocation is
more important than other allocations of the otherwise same gfp_mask.

> AFAIU, David wasn't asking for the OOM killer as much as he was
> interested in getting access to a small amount of reserves in order to
> make a progress. __GFP_HIGH is there for this purpose.

That's not just any reserve pool available to the generic caller, it's
the reserve pool for interrupts, which can not wait and replenish it.
It relies on kswapd to run soon after the interrupt, or right away on
SMP.  But locks held in the filesystem can hold up kswapd (the reason
we even still perform direct reclaim) so NOFS allocs shouldn't use it.

[hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
39
[hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
4324

Interrupts have *no other option*.  It's misguided to deplete their
reserves, cause loss of network packets, loss of input events, from
allocations that can actually perform reclaim and have perfectly
acceptable fallback strategies in the caller.

Generally, for any reserve system there must be a way to replenish it.
For interrupts it's kswapd, for the OOM reserves I proposed it's the
OOM victim exiting soon after the allocation, if not right away.

__GFP_NOFAIL is the odd one out here because accessing the system's
emergency reserves without any prospect of near-future replenishing is
just slightly better than deadlocking right away.  Which is why this
reserve access can not be separated out: if you can do *anything*
better than hanging, do it.  If not, use __GFP_NOFAIL.

> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about 

Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-14 Thread Michal Hocko
On Tue 14-04-15 10:11:18, Dave Chinner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> > [Sorry for a late reply]
> > 
> > On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some other fs-specific shrinkers (and heavy
> > metadata loads)?
> 
> We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
> either.

Yeah, but we invoke fs shrinkers for the _regular_ direct reclaim (with
__GFP_FS), which was the point I was trying to make here.

> *HOWEVER*
> 
> The shrinker reclaim we can not execute is deferred to the next
> context that can do the reclaim, which is usually kswapd. So the
> reclaim gets done according to the GFP_NOFS memory pressure that is
> occurring, it is just done in a different context...

Right, deferring to kswapd is the reason why I think direct reclaim
shouldn't invoke the OOM killer in this context: that would be
premature, as kswapd can still make some progress. Sorry for not being
clearer.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> > get access to memory reserves if it can fail or __GFP_NOFAIL if it
> > cannot. With your patches the NOFAIL case would get an access to memory
> > reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> > OOM killer semantic.
> 
> So, really, what we want is something like:
> 
> #define __GFP_USE_LOWMEM_RESERVE  __GFP_HIGH
> 
> So that it documents the code that is using it effectively and we
> can find them easily with cscope/grep?

I wouldn't be opposed. To be honest, I was never fond of __GFP_HIGH; the
naming is counterintuitive. So I would rather go with renaming it. We do
not have that many users in the tree:
git grep "GFP_HIGH\>" | wc -l
40
-- 
Michal Hocko
SUSE Labs


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-13 Thread Dave Chinner
On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about [di]cache and some other fs-specific shrinkers (and heavy
> metadata loads)?

We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
either.

*HOWEVER*

The shrinker reclaim we can not execute is deferred to the next
context that can do the reclaim, which is usually kswapd. So the
reclaim gets done according to the GFP_NOFS memory pressure that is
occurring, it is just done in a different context...

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> OOM killer is hardly acceptable by most users I've heard from. OOM
> killer is the _last_ resort and if the allocation is restricted then
> we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> get access to memory reserves if it can fail or __GFP_NOFAIL if it
> cannot. With your patches the NOFAIL case would get an access to memory
> reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> OOM killer semantic.

So, really, what we want is something like:

#define __GFP_USE_LOWMEM_RESERVE  __GFP_HIGH

So that it documents the code that is using it effectively and we
can find them easily with cscope/grep?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-13 Thread Michal Hocko
On Sat 11-04-15 16:29:26, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim.  The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 47981c5e54c3..fe3cb2b0b85b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int 
> > order, int alloc_flags,
> > /* The OOM killer does not needlessly kill tasks for lowmem */
> > if (ac->high_zoneidx < ZONE_NORMAL)
> > goto out;
> > -   /* The OOM killer does not compensate for IO-less reclaim */
> > -   if (!(gfp_mask & __GFP_FS)) {
> > -   /*
> > -* XXX: Page reclaim didn't yield anything,
> > -* and the OOM killer can't be invoked, but
> > -* keep looping as per tradition.
> > -*/
> > -   *did_some_progress = 1;
> > -   goto out;
> > -   }
> > if (pm_suspended_storage())
> > goto out;
> > /* The OOM killer may not free memory on a specific node */
> > 
> 
> I think this change will allow calling out_of_memory() which results in
> "oom_kill_process() is trivially called via pagefault_out_of_memory()"
> problem described in https://lkml.org/lkml/2015/3/18/219 .
> 
> I myself think that we should trigger the OOM killer for !__GFP_FS
> allocations in order to make forward progress in case the OOM victim is
> blocked. So, my question about this change is whether we can accept
> invoking the OOM killer from the page fault path, no matter how trivially
> the OOM killer will kill some process.

We have been triggering the OOM killer from the page fault path for ages.
In fact, memcg will trigger the memcg OOM killer _only_ from the page
fault path, because that context is safe: we do not sit on any locks at
the time.
-- 
Michal Hocko
SUSE Labs


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-13 Thread Michal Hocko
[Sorry for a late reply]

On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > [...]
> > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > the allocator, even though many of them seem to have fallback code.
> > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > allocations that is smarter than hanging, we should probably use that.
> > > 
> > > We already do that for allocations where we can handle failure in
> > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > tell the allocator to try really hard if we've already had a failure
> > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > trying to clean dirty objects so they can be reclaimed).
> > > 
> > > From that perspective, I think that this patch set aims to force us
> > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > more likely to fail, and b) provides no mechanism to "try harder"
> > > when we really need the allocation to succeed.
> > 
> > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > in your fallback case?
> 
> I would think __GFP_REPEAT would be more suitable here.  From the doc:
> 
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  * _might_ fail.  This depends upon the particular VM implementation.
> 
> so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> are allowed to use the OOM killer and dip into the OOM reserves.

__GFP_REPEAT is quite subtle already. It makes a difference only for
high-order allocations, and it is not clear to me why it should imply the
OOM killer for small orders now. Or did you suggest making it special
only in combination with GFP_NOFS? That sounds even uglier.

AFAIU, David wasn't asking for the OOM killer as much as he was
interested in getting access to a small amount of reserves in order to
make progress. __GFP_HIGH is there for this purpose.

> My question here would be: are there any NOFS allocations that *don't*
> want this behavior?  Does it even make sense to require this separate
> annotation or should we just make it the default?
> 
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim. 

What about [di]cache and some other fs-specific shrinkers (and heavy
metadata loads)?

> The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.

The OOM killer is hardly acceptable to most users I've heard from. The
OOM killer is the _last_ resort, and if the allocation is restricted then
we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
get access to memory reserves if it can fail, or __GFP_NOFAIL if it
cannot. With your patches the NOFAIL case would get access to memory
reserves as well. So I do not really see a reason to change the GFP_NOFS
vs. OOM killer semantics.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int 
> order, int alloc_flags,
>   /* The OOM killer does not needlessly kill tasks for lowmem */
>   if (ac->high_zoneidx < ZONE_NORMAL)
>   goto out;
> - /* The OOM killer does not compensate for IO-less reclaim */
> - if (!(gfp_mask & __GFP_FS)) {
> - /*
> -  * XXX: Page reclaim didn't yield anything,
> -  * and the OOM killer can't be invoked, but
> -  * keep looping as per tradition.
> -  */
> - *did_some_progress = 1;
> - goto out;
> - }
>   if (pm_suspended_storage())
>   goto out;
>   /* The OOM killer may not free memory on a specific node */

-- 
Michal Hocko
SUSE Labs


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-13 Thread Dave Chinner
On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim.
> 
> What about [di]cache and some other fs-specific shrinkers (and heavy
> metadata loads)?

We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
either.

*HOWEVER*

The shrinker reclaim we cannot execute is deferred to the next
context that can do the reclaim, which is usually kswapd. So the
reclaim gets done according to the GFP_NOFS memory pressure that is
occurring; it is just done in a different context...

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> The OOM killer is hardly acceptable to most users I've heard from. The
> OOM killer is the _last_ resort, and if the allocation is restricted then
> we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> get access to memory reserves if it can fail, or __GFP_NOFAIL if it
> cannot. With your patches the NOFAIL case would get access to memory
> reserves as well. So I do not really see a reason to change the GFP_NOFS
> vs. OOM killer semantics.

So, really, what we want is something like:

#define __GFP_USE_LOWMEM_RESERVE	__GFP_HIGH

so that it documents the code that is using it effectively, and we
can find the users easily with cscope/grep?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-13 Thread Michal Hocko
On Sat 11-04-15 16:29:26, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim.  The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 47981c5e54c3..fe3cb2b0b85b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > 	/* The OOM killer does not needlessly kill tasks for lowmem */
> > 	if (ac->high_zoneidx < ZONE_NORMAL)
> > 		goto out;
> > -	/* The OOM killer does not compensate for IO-less reclaim */
> > -	if (!(gfp_mask & __GFP_FS)) {
> > -		/*
> > -		 * XXX: Page reclaim didn't yield anything,
> > -		 * and the OOM killer can't be invoked, but
> > -		 * keep looping as per tradition.
> > -		 */
> > -		*did_some_progress = 1;
> > -		goto out;
> > -	}
> > 	if (pm_suspended_storage())
> > 		goto out;
> > 	/* The OOM killer may not free memory on a specific node */
> 
> I think this change will allow calling out_of_memory() which results in
> the "oom_kill_process() is trivially called via pagefault_out_of_memory()"
> problem described in https://lkml.org/lkml/2015/3/18/219 .
> 
> I myself think that we should trigger the OOM killer for !__GFP_FS
> allocations in order to make forward progress in case the OOM victim is
> blocked. So, my question about this change is whether we can accept
> invoking the OOM killer from the page fault path, no matter how trivially
> it may end up killing some process.

We have been triggering the OOM killer from the page fault path for
ages. In fact memcg will trigger its OOM killer _only_ from the page
fault path because that context is safe: we do not sit on any locks at
the time.
-- 
Michal Hocko
SUSE Labs


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-11 Thread Tetsuo Handa
Johannes Weiner wrote:
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim.  The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int 
> order, int alloc_flags,
>   /* The OOM killer does not needlessly kill tasks for lowmem */
>   if (ac->high_zoneidx < ZONE_NORMAL)
>   goto out;
> - /* The OOM killer does not compensate for IO-less reclaim */
> - if (!(gfp_mask & __GFP_FS)) {
> - /*
> -  * XXX: Page reclaim didn't yield anything,
> -  * and the OOM killer can't be invoked, but
> -  * keep looping as per tradition.
> -  */
> - *did_some_progress = 1;
> - goto out;
> - }
>   if (pm_suspended_storage())
>   goto out;
>   /* The OOM killer may not free memory on a specific node */
> 

I think this change will allow calling out_of_memory() which results in
the "oom_kill_process() is trivially called via pagefault_out_of_memory()"
problem described in https://lkml.org/lkml/2015/3/18/219 .

I myself think that we should trigger the OOM killer for !__GFP_FS
allocations in order to make forward progress in case the OOM victim is
blocked. So, my question about this change is whether we can accept
invoking the OOM killer from the page fault path, no matter how trivially
it may end up killing some process.


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-07 Thread Johannes Weiner
On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> in your fallback case?

I would think __GFP_REPEAT would be more suitable here.  From the doc:

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.

so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
are allowed to use the OOM killer and dip into the OOM reserves.

My question here would be: are there any NOFS allocations that *don't*
want this behavior?  Does it even make sense to require this separate
annotation or should we just make it the default?

The argument here was always that NOFS allocations are very limited in
their reclaim powers and will trigger OOM prematurely.  However, the
way we limit dirty memory these days forces most cache to be clean at
all times, and direct reclaim in general hasn't been allowed to issue
page writeback for quite some time.  So these days, NOFS reclaim isn't
really weaker than regular direct reclaim.  The only exception is that
it might block writeback, so we'd go OOM if the only reclaimables left
were dirty pages against that filesystem.  That should be acceptable.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47981c5e54c3..fe3cb2b0b85b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int 
order, int alloc_flags,
/* The OOM killer does not needlessly kill tasks for lowmem */
if (ac->high_zoneidx < ZONE_NORMAL)
goto out;
-   /* The OOM killer does not compensate for IO-less reclaim */
-   if (!(gfp_mask & __GFP_FS)) {
-   /*
-* XXX: Page reclaim didn't yield anything,
-* and the OOM killer can't be invoked, but
-* keep looping as per tradition.
-*/
-   *did_some_progress = 1;
-   goto out;
-   }
if (pm_suspended_storage())
goto out;
/* The OOM killer may not free memory on a specific node */


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-02 Thread Michal Hocko
On Thu 02-04-15 08:39:02, Dave Chinner wrote:
> On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > [...]
> > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > the allocator, even though many of them seem to have fallback code.
> > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > allocations that is smarter than hanging, we should probably use that.
> > > 
> > > We already do that for allocations where we can handle failure in
> > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > tell the allocator to try really hard if we've already had a failure
> > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > trying to clean dirty objects so they can be reclaimed).
> > > 
> > > From that perspective, I think that this patch set aims force us
> > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > more likely to fail, and b) provides no mechanism to "try harder"
> > > when we really need the allocation to succeed.
> > 
> > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > in your fallback case?
> 
> That dips into GFP_ATOMIC reserves, right? What is the impact on the
> GFP_ATOMIC allocations that need it?

Yes, the memory reserve is shared, but the flag would be used only after
a previous GFP_NOFS allocation has failed, which means that the system
is close to OOM and the chances of GFP_ATOMIC allocations (which are
GFP_NOWAIT and cannot perform any reclaim) succeeding are quite low
already.

> We typically see network cards fail GFP_ATOMIC allocations before XFS
> starts complaining about allocation failures, so i suspect that this
> might just make things worse rather than better...

My understanding is that a GFP_ATOMIC allocation would fall back to a
__GFP_WAIT type of allocation in a deferred context in the networking
code. There would be some performance hit, but again we are talking
about close-to-OOM conditions here.
-- 
Michal Hocko
SUSE Labs


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-01 Thread Dave Chinner
On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> in your fallback case?

That dips into GFP_ATOMIC reserves, right? What is the impact on the
GFP_ATOMIC allocations that need it? We typically see network cards
fail GFP_ATOMIC allocations before XFS starts complaining about
allocation failures, so I suspect that this might just make things
worse rather than better...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-01 Thread Michal Hocko
On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
[...]
> > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > the allocator, even though many of them seem to have fallback code.
> > My reasoning here is that if you *have* an exit strategy for failing
> > allocations that is smarter than hanging, we should probably use that.
> 
> We already do that for allocations where we can handle failure in
> GFP_NOFS conditions. It is, however, somewhat useless if we can't
> tell the allocator to try really hard if we've already had a failure
> and we are already in memory reclaim conditions (e.g. a shrinker
> trying to clean dirty objects so they can be reclaimed).
> 
> From that perspective, I think that this patch set aims force us
> away from handling fallbacks ourselves because a) it makes GFP_NOFS
> more likely to fail, and b) provides no mechanism to "try harder"
> when we really need the allocation to succeed.

You can ask for this "try harder" by __GFP_HIGH flag. Would that help
in your fallback case?
-- 
Michal Hocko
SUSE Labs



Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-04-01 Thread Dave Chinner
On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims to force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by the __GFP_HIGH flag. Would that help
> in your fallback case?

That dips into GFP_ATOMIC reserves, right? What is the impact on the
GFP_ATOMIC allocations that need it? We typically see network cards
fail GFP_ATOMIC allocations before XFS starts complaining about
allocation failures, so I suspect that this might just make things
worse rather than better...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-30 Thread Johannes Weiner
On Mon, Mar 30, 2015 at 11:32:40AM +1100, Dave Chinner wrote:
> On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > the allocator, even though many of them seem to have fallback code.
> > My reasoning here is that if you *have* an exit strategy for failing
> > allocations that is smarter than hanging, we should probably use that.
> 
> We already do that for allocations where we can handle failure in
> GFP_NOFS conditions. It is, however, somewhat useless if we can't
> tell the allocator to try really hard if we've already had a failure
> and we are already in memory reclaim conditions (e.g. a shrinker
> trying to clean dirty objects so they can be reclaimed).

What do you mean you already do that?  These allocations currently
won't fail.  They loop forever in the allocator.  Fallback code is
dead code right now.  (Unless you do order-4 and up, which I doubt.)

> From that perspective, I think that this patch set aims to force us
> away from handling fallbacks ourselves because a) it makes GFP_NOFS
> more likely to fail, and b) provides no mechanism to "try harder"
> when we really need the allocation to succeed.

If by "more likely" you mean "at all possible", then yes.

However, as far as trying harder goes, that sounds like a good idea.
It should be possible for NOFS contexts to use the OOM killer and its
reserves.  But still, they should be allowed to propagate allocation
failures rather than just hanging in the allocator.

> > > >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > > > 
> > > > An exacerbation of the victim-stuck-behind-allocation scenario is
> > > > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > > > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > > > just the OOM reserves but also the system's emergency reserves.
> > > > 
> > > > This is basically a poor man's reservation system, which could or
> > > > should be replaced later on with an explicit reservation system that
> > > > e.g. filesystems have control over for use by transactions.
> > > > 
> > > > It's obviously not bulletproof and might still lock up, but it should
> > > > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > > > been using this successfully for some time.
> > > 
> > > So, if we want GFP_NOFS allocations to be able to dip into a
> > > small extra reservation to make progress at ENOMEM, we have to
> > > use __GFP_NOFAIL because looping ourselves won't allow use of these
> > > extra reserves?
> > 
> > As I said, this series is not about providing reserves just yet.  It
> > is about using the fallback strategies you already implemented.  And
> > where you don't have any, it's about making the allocator's last way
> > of forward progress, the OOM killer, more reliable.
> 
> Sure - but you're doing that by adding a special reserve for
> GFP_NOFAIL allocations to dip into when the OOM killer is active.
> That can only be accessed by GFP_NOFAIL allocations - anyone who
> has a fallback but really needs the allocation to succeed if at all
> possible (i.e. should only fail to avoid a deadlock situation) can't
> communicate that fact to the allocator.

Hm?  It's not restricted to NOFAIL at all, look closer at my patch
series.  What you are describing is exactly how I propose the
allocator should handle all regular allocations: exhaust reclaimable
pages, use the OOM killer, dip into OOM reserves, but ultimately fail.
The only thing __GFP_NOFAIL does in *addition* to that is use the last
emergency reserves of the system in an attempt to avoid deadlocking.

[ Once those reserves are depleted, however, the system will deadlock,
  so we can only give them to allocations that would otherwise lock up
  anyway, i.e. __GFP_NOFAIL.  It would be silly to risk a system
  deadlock for an allocation that has a fallback strategy.  That is
  why you have to let the allocator know whether you can fall back. ]

The notable exception to this behavior is NOFS callers, because of their
current OOM kill restrictions.  But as I said, I'm absolutely open to
addressing this and either let them generally use the OOM killer after
some time, or provide you with another annotation that lets you come
back to try harder.  I don't really care which way, that depends on
your requirements.

> > > > This patch makes NOFS allocations fail if reclaim can't free anything.
> > > > 
> > > > It would be good if the filesystem people could weigh in on whether
> > > > they can deal with failing GFP_NOFS allocations, or annotate the
> > > > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > > > ground is required that allows using the OOM killer before giving up.
> > > 
> > > ... which looks to me like a catch-22 situation for us: We
> > > have reserves, but callers need to use __GFP_NOFAIL to access them.
> > GFP_NOFS is going to fail more often, so callers need to handle that
> > in some way, either by looping or erroring out.

Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-29 Thread Dave Chinner
On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> On Fri, Mar 27, 2015 at 06:58:22AM +1100, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> > > Hi everybody,
> > > 
> > > in the recent past we've had several reports and discussions on how to
> > > deal with allocations hanging in the allocator upon OOM.
> > > 
> > > The idea of this series is mainly to make the mechanism of detecting
> > > OOM situations reliable enough that we can be confident about failing
> > > allocations, and then leave the fallback strategy to the caller rather
> > > than looping forever in the allocator.
> > > 
> > > The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> > > least for the short term while we don't have a reservation system yet.
> > 
> > A valid goal, but I think this series goes about it the wrong way.
> > i.e. it forces us to use __GFP_NOFAIL rather than providing us a
> > valid fallback mechanism to access reserves.
> 
> I think you misunderstood the goal.
> 
> While I agree that reserves would be the optimal fallback strategy,
> this series is about avoiding deadlocks in existing callsites that
> currently can not fail.  This is about getting the best out of our
> existing mechanisms until we have universal reservation coverage,
> which will take time to devise and transition our codebase to.

That might be the goal, but it looks like the wrong path to me.

> GFP_NOFS sites are currently one of the sites that can deadlock inside
> the allocator, even though many of them seem to have fallback code.
> My reasoning here is that if you *have* an exit strategy for failing
> allocations that is smarter than hanging, we should probably use that.

We already do that for allocations where we can handle failure in
GFP_NOFS conditions. It is, however, somewhat useless if we can't
tell the allocator to try really hard if we've already had a failure
and we are already in memory reclaim conditions (e.g. a shrinker
trying to clean dirty objects so they can be reclaimed).

From that perspective, I think that this patch set aims to force us
away from handling fallbacks ourselves because a) it makes GFP_NOFS
more likely to fail, and b) provides no mechanism to "try harder"
when we really need the allocation to succeed.

> > >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > > 
> > > An exacerbation of the victim-stuck-behind-allocation scenario is
> > > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > > just the OOM reserves but also the system's emergency reserves.
> > > 
> > > This is basically a poor man's reservation system, which could or
> > > should be replaced later on with an explicit reservation system that
> > > e.g. filesystems have control over for use by transactions.
> > > 
> > > It's obviously not bulletproof and might still lock up, but it should
> > > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > > been using this successfully for some time.
> > 
> > So, if we want GFP_NOFS allocations to be able to dip into a
> > small extra reservation to make progress at ENOMEM, we have to
> > use __GFP_NOFAIL because looping ourselves won't allow use of these
> > extra reserves?
> 
> As I said, this series is not about providing reserves just yet.  It
> is about using the fallback strategies you already implemented.  And
> where you don't have any, it's about making the allocator's last way
> of forward progress, the OOM killer, more reliable.

Sure - but you're doing that by adding a special reserve for
GFP_NOFAIL allocations to dip into when the OOM killer is active.
That can only be accessed by GFP_NOFAIL allocations - anyone who
has a fallback but really needs the allocation to succeed if at all
possible (i.e. should only fail to avoid a deadlock situation) can't
communicate that fact to the allocator.



> > > This patch makes NOFS allocations fail if reclaim can't free anything.
> > > 
> > > It would be good if the filesystem people could weigh in on whether
> > > they can deal with failing GFP_NOFS allocations, or annotate the
> > > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > > ground is required that allows using the OOM killer before giving up.
> > 
> > ... which looks to me like a catch-22 situation for us: We
> > have reserves, but callers need to use __GFP_NOFAIL to access them.
> > GFP_NOFS is going to fail more often, so callers need to handle that
> > in some way, either by looping or erroring out.
> > 
> > But if we loop manually because we try to handle ENOMEM situations
> > gracefully (e.g. try a number of times before erroring out) we can't
> > dip into the reserves because the only semantics being provided are
> > "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
> > what we actually need here is "try-once-with-reserves" semantics so
> > that we can make progress after a failing GFP_NOFS
> > "try-once-without-reserves" allocation.
> > 
> > IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |

Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-27 Thread Johannes Weiner
On Fri, Mar 27, 2015 at 06:58:22AM +1100, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> > Hi everybody,
> > 
> > in the recent past we've had several reports and discussions on how to
> > deal with allocations hanging in the allocator upon OOM.
> > 
> > The idea of this series is mainly to make the mechanism of detecting
> > OOM situations reliable enough that we can be confident about failing
> > allocations, and then leave the fallback strategy to the caller rather
> > than looping forever in the allocator.
> > 
> > The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> > least for the short term while we don't have a reservation system yet.
> 
> A valid goal, but I think this series goes about it the wrong way.
> i.e. it forces us to use __GFP_NOFAIL rather than providing us a
> valid fallback mechanism to access reserves.

I think you misunderstood the goal.

While I agree that reserves would be the optimal fallback strategy,
this series is about avoiding deadlocks in existing callsites that
currently can not fail.  This is about getting the best out of our
existing mechanisms until we have universal reservation coverage,
which will take time to devise and transition our codebase to.

GFP_NOFS sites are currently one of the sites that can deadlock inside
the allocator, even though many of them seem to have fallback code.
My reasoning here is that if you *have* an exit strategy for failing
allocations that is smarter than hanging, we should probably use that.

> >  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> > 
> > An exacerbation of the victim-stuck-behind-allocation scenario is
> > __GFP_NOFAIL allocations, because they will actually deadlock.  To
> > avoid this, or try to, give __GFP_NOFAIL allocations access to not
> > just the OOM reserves but also the system's emergency reserves.
> > 
> > This is basically a poor man's reservation system, which could or
> > should be replaced later on with an explicit reservation system that
> > e.g. filesystems have control over for use by transactions.
> > 
> > It's obviously not bulletproof and might still lock up, but it should
> > greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> > been using this successfully for some time.
> 
> So, if we want GFP_NOFS allocations to be able to dip into a
> small extra reservation to make progress at ENOMEM, we have to
> use __GFP_NOFAIL because looping ourselves won't allow use of these
> extra reserves?

As I said, this series is not about providing reserves just yet.  It
is about using the fallback strategies you already implemented.  And
where you don't have any, it's about making the allocator's last way
of forward progress, the OOM killer, more reliable.

If you have an allocation site that is endlessly looping around calls
to the allocator, it means you DON'T have a fallback strategy.  In
that case, it would be in your interest to tell the allocator, such
that it can take measures to break the infinite loop.

However, those measures are not without their own risk and they need
to be carefully sequenced to reduce the risk for deadlocks.  E.g. we
can not give __GFP_NOFAIL allocations access to the statically-sized
emergency reserves without taking steps to free memory at the same
time, because then we'd just trade forward progress of that allocation
against forward progress of some memory reclaimer later on which finds
the emergency reserves exhausted.

> >  mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
> > 
> > Another hang that was reported was from NOFS allocations.  The trouble
> > with these is that they can't issue or wait for writeback during page
> > reclaim, and so we don't want to OOM kill on their behalf.  However,
> > with such restrictions on making progress, they are prone to hangs.
> 
> And because this effectively means GFP_NOFS allocations are
> going to fail much more often, we're either going to have to loop
> ourselves or use __GFP_NOFAIL...
> 
> > This patch makes NOFS allocations fail if reclaim can't free anything.
> > 
> > It would be good if the filesystem people could weigh in on whether
> > they can deal with failing GFP_NOFS allocations, or annotate the
> > exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> > ground is required that allows using the OOM killer before giving up.
> 
> ... which looks to me like a catch-22 situation for us: We
> have reserves, but callers need to use __GFP_NOFAIL to access them.
> GFP_NOFS is going to fail more often, so callers need to handle that
> in some way, either by looping or erroring out.
> 
> But if we loop manually because we try to handle ENOMEM situations
> gracefully (e.g. try a number of times before erroring out) we can't
> dip into the reserves because the only semantics being provided are
> "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
> what we actually need here is "try-once-with-reserves" semantics so
> that we can make progress after a failing GFP_NOFS
> "try-once-without-reserves" allocation.

Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-27 Thread Johannes Weiner
On Fri, Mar 27, 2015 at 06:58:22AM +1100, Dave Chinner wrote:
 On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
  Hi everybody,
  
  in the recent past we've had several reports and discussions on how to
  deal with allocations hanging in the allocator upon OOM.
  
  The idea of this series is mainly to make the mechanism of detecting
  OOM situations reliable enough that we can be confident about failing
  allocations, and then leave the fallback strategy to the caller rather
  than looping forever in the allocator.
  
  The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
  least for the short term while we don't have a reservation system yet.
 
 A valid goal, but I think this series goes about it the wrong way.
 i.e. it forces us to use __GFP_NOFAIL rather than providing us a
 valid fallback mechanism to access reserves.

I think you misunderstood the goal.

While I agree that reserves would be the optimal fallback strategy,
this series is about avoiding deadlocks in existing callsites that
currently can not fail.  This is about getting the best out of our
existing mechanisms until we have universal reservation coverage,
which will take time to devise and transition our codebase to.

GFP_NOFS sites are currently one of the sites that can deadlock inside
the allocator, even though many of them seem to have fallback code.
My reasoning here is that if you *have* an exit strategy for failing
allocations that is smarter than hanging, we should probably use that.

   mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
  
  An exacerbation of the victim-stuck-behind-allocation scenario are
  __GFP_NOFAIL allocations, because they will actually deadlock.  To
  avoid this, or try to, give __GFP_NOFAIL allocations access to not
  just the OOM reserves but also the system's emergency reserves.
  
  This is basically a poor man's reservation system, which could or
  should be replaced later on with an explicit reservation system that
  e.g. filesystems have control over for use by transactions.
  
  It's obviously not bulletproof and might still lock up, but it should
  greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
  been using this successfully for some time.
 
 So, if we want GFP_NOFS allocations to be able to dip into a
 small extra reservation to make progress at ENOMEM, we have to use
 use __GFP_NOFAIL because looping ourselves won't allow use of these
 extra reserves?

As I said, this series is not about providing reserves just yet.  It
is about using the fallback strategies you already implemented.  And
where you don't have any, it's about making the allocator's last way
of forward progress, the OOM killer, more reliable.

If you have an allocation site that is endlessly looping around calls
to the allocator, it means you DON'T have a fallback strategy.  In
that case, it would be in your interest to tell the allocator, such
that it can take measures to break the infinite loop.

However, those measures are not without their own risk and they need
to be carefully sequenced to reduce the risk for deadlocks.  E.g. we
can not give __GFP_NOFAIL allocations access to the statically-sized
emergency reserves without taking steps to free memory at the same
time, because then we'd just trade forward progress of that allocation
against forward progress of some memory reclaimer later on which finds
the emergency reserves exhausted.

   mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
  
  Another hang that was reported was from NOFS allocations.  The trouble
  with these is that they can't issue or wait for writeback during page
  reclaim, and so we don't want to OOM kill on their behalf.  However,
  with such restrictions on making progress, they are prone to hangs.
 
 And because this effectively means GFP_NOFS allocations are
 going to fail much more often, we're either going to have to loop
 ourselves or use __GFP_NOFAIL...
 
  This patch makes NOFS allocations fail if reclaim can't free anything.
  
  It would be good if the filesystem people could weigh in on whether
  they can deal with failing GFP_NOFS allocations, or annotate the
  exceptions with __GFP_NOFAIL etc.  It could well be that a middle
  ground is required that allows using the OOM killer before giving up.
 
 ... which looks to me like a catch-22 situation for us: We
 have reserves, but callers need to use __GFP_NOFAIL to access them.
 GFP_NOFS is going to fail more often, so callers need to handle that
 in some way, either by looping or erroring out.
 
 But if we loop manually because we try to handle ENOMEM situations
 gracefully (e.g. try a number of times before erroring out) we can't
 dip into the reserves because the only semantics being provided are
 "try-once-without-reserves" or "try-forever-with-reserves".  i.e.
 what we actually need here is "try-once-with-reserves" semantics so
 that we can make progress after a failing GFP_NOFS
 "try-once-without-reserves" allocation.

Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-26 Thread Dave Chinner
On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> Hi everybody,
> 
> in the recent past we've had several reports and discussions on how to
> deal with allocations hanging in the allocator upon OOM.
> 
> The idea of this series is mainly to make the mechanism of detecting
> OOM situations reliable enough that we can be confident about failing
> allocations, and then leave the fallback strategy to the caller rather
> than looping forever in the allocator.
> 
> The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> least for the short term while we don't have a reservation system yet.

A valid goal, but I think this series goes about it the wrong way.
i.e. it forces us to use __GFP_NOFAIL rather than providing us a
valid fallback mechanism to access reserves.



>  mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
> 
> An exacerbation of the victim-stuck-behind-allocation scenario is
> __GFP_NOFAIL allocations, because they will actually deadlock.  To
> avoid this, or try to, give __GFP_NOFAIL allocations access to not
> just the OOM reserves but also the system's emergency reserves.
> 
> This is basically a poor man's reservation system, which could or
> should be replaced later on with an explicit reservation system that
> e.g. filesystems have control over for use by transactions.
> 
> It's obviously not bulletproof and might still lock up, but it should
> greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
> been using this successfully for some time.

So, if we want GFP_NOFS allocations to be able to dip into a
small extra reservation to make progress at ENOMEM, we have to use
__GFP_NOFAIL because looping ourselves won't allow use of these
extra reserves?

>  mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
> 
> Another hang that was reported was from NOFS allocations.  The trouble
> with these is that they can't issue or wait for writeback during page
> reclaim, and so we don't want to OOM kill on their behalf.  However,
> with such restrictions on making progress, they are prone to hangs.

And because this effectively means GFP_NOFS allocations are
going to fail much more often, we're either going to have to loop
ourselves or use __GFP_NOFAIL...

> This patch makes NOFS allocations fail if reclaim can't free anything.
> 
> It would be good if the filesystem people could weigh in on whether
> they can deal with failing GFP_NOFS allocations, or annotate the
> exceptions with __GFP_NOFAIL etc.  It could well be that a middle
> ground is required that allows using the OOM killer before giving up.

... which looks to me like a catch-22 situation for us: We
have reserves, but callers need to use __GFP_NOFAIL to access them.
GFP_NOFS is going to fail more often, so callers need to handle that
in some way, either by looping or erroring out.

But if we loop manually because we try to handle ENOMEM situations
gracefully (e.g. try a number of times before erroring out) we can't
dip into the reserves because the only semantics being provided are
"try-once-without-reserves" or "try-forever-with-reserves".  i.e.
what we actually need here is "try-once-with-reserves" semantics so
that we can make progress after a failing GFP_NOFS
"try-once-without-reserves" allocation.

IOWS, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
__GFP_USE_RESERVE that we need on the failure fallback path. Which,
incidentally, is trivial to add to the XFS allocation code. Indeed,
I'll request that you test series like this on metadata intensive
filesystem workloads on XFS under memory stress and quantify how
many new "XFS: possible deadlock in memory allocation" warnings are
emitted. If the patch set floods the system with such warnings, then
it means the proposed fallback for "caller handles allocation
failure" is not making progress.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 00/12] mm: page_alloc: improve OOM mechanism and policy

2015-03-25 Thread Johannes Weiner
Hi everybody,

in the recent past we've had several reports and discussions on how to
deal with allocations hanging in the allocator upon OOM.

The idea of this series is mainly to make the mechanism of detecting
OOM situations reliable enough that we can be confident about failing
allocations, and then leave the fallback strategy to the caller rather
than looping forever in the allocator.

The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
least for the short term while we don't have a reservation system yet.

Here is a breakdown of the proposed changes:

 mm: oom_kill: remove pointless locking in oom_enable()
 mm: oom_kill: clean up victim marking and exiting interfaces
 mm: oom_kill: remove misleading test-and-clear of known TIF_MEMDIE
 mm: oom_kill: remove pointless locking in exit_oom_victim()
 mm: oom_kill: generalize OOM progress waitqueue
 mm: oom_kill: simplify OOM killer locking
 mm: page_alloc: inline should_alloc_retry() contents

These are preparatory patches to clean up parts in the OOM killer
and the page allocator.  Filesystem folks and others that only care
about allocation semantics may want to skip over these.

 mm: page_alloc: wait for OOM killer progress before retrying

One of the hangs we have seen reported is from lower order allocations
that loop infinitely in the allocator.  In an attempt to address that,
it has been proposed to limit the number of retry loops - possibly
even make that number configurable from userspace - and return NULL
once we are certain that the system is "truly OOM".  But it wasn't
clear how high that number needs to be to reliably determine a global
OOM situation from the perspective of an individual allocation.

An issue is that OOM killing is currently an asynchronous operation
and the optimal retry number depends on how long it takes an OOM kill
victim to exit and release its memory - which of course varies with
system load and exiting task.

To address this, this patch makes OOM killing synchronous and only
returns to the allocator once the victim has actually exited.  With
that, the allocator no longer requires retry loops just to poll for
the victim releasing memory.

 mm: page_alloc: private memory reserves for OOM-killing allocations

Once out_of_memory() is synchronous, there are still two issues that
can make determining system-wide OOM from a single allocation context
unreliable.  For one, concurrent allocations can swoop in right after
a kill and steal the memory, causing spurious allocation failures for
contexts that actually freed memory.  But also, the OOM victim could
get blocked on some state that the allocation is holding, which would
delay the release of the memory (and refilling of the reserves) until
after the allocation has completed.

This patch creates private reserves for allocations that have issued
an OOM kill.  Once these reserves run dry, it seems reasonable to
assume that other allocations are not succeeding either anymore.

 mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations

An exacerbation of the victim-stuck-behind-allocation scenario is
__GFP_NOFAIL allocations, because they will actually deadlock.  To
avoid this, or try to, give __GFP_NOFAIL allocations access to not
just the OOM reserves but also the system's emergency reserves.

This is basically a poor man's reservation system, which could or
should be replaced later on with an explicit reservation system that
e.g. filesystems have control over for use by transactions.

It's obviously not bulletproof and might still lock up, but it should
greatly reduce the likelihood.  AFAIK Andrea, whose idea this was, has
been using this successfully for some time.

 mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM

Another hang that was reported was from NOFS allocations.  The trouble
with these is that they can't issue or wait for writeback during page
reclaim, and so we don't want to OOM kill on their behalf.  However,
with such restrictions on making progress, they are prone to hangs.

This patch makes NOFS allocations fail if reclaim can't free anything.

It would be good if the filesystem people could weigh in on whether
they can deal with failing GFP_NOFS allocations, or annotate the
exceptions with __GFP_NOFAIL etc.  It could well be that a middle
ground is required that allows using the OOM killer before giving up.

 mm: page_alloc: do not lock up low-order allocations upon OOM

With both OOM killing and "true OOM situation" detection more
reliable, this patch finally allows allocations up to order 3 to
actually fail on OOM and leave the fallback strategy to the caller -
as opposed to the current policy of hanging in the allocator.

Comments?

 drivers/staging/android/lowmemorykiller.c |   2 +-
 include/linux/mmzone.h                    |   2 +
 include/linux/oom.h                       |  12 +-
 kernel/exit.c                             |   2 +-
 mm/internal.h                             |   3 +-
 mm/memcontrol.c  
