Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread David Rientjes
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

> > > > You are thinking about an opportunistic allocation attempt in SLAB?
> > > > 
> > > > AFAICT SLAB allocations should trigger reclaim.
> > > > 
> > > 
> > > Well, let me quote your commit 952f3b51beb5:
> > 
> > This was about global reclaim. Local reclaim is good and that can be
> > done via zone_reclaim.
> 
> Right, so the patch is a functional change for zone_reclaim_mode == 1, where
> !__GFP_WAIT will prevent it.
> 

My patch is not a functional change, get_page_from_freelist() handles 
zone_reclaim_mode == 1 properly in the page allocator fastpath.  This 
patch only touches the slowpath.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka

On 03/02/2015 05:08 PM, Christoph Lameter wrote:

On Mon, 2 Mar 2015, Vlastimil Babka wrote:


You are thinking about an opportunistic allocation attempt in SLAB?

AFAICT SLAB allocations should trigger reclaim.



Well, let me quote your commit 952f3b51beb5:


This was about global reclaim. Local reclaim is good and that can be
done via zone_reclaim.


Right, so the patch is a functional change for zone_reclaim_mode == 1, 
where !__GFP_WAIT will prevent it.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Christoph Lameter
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

> > You are thinking about an opportunistic allocation attempt in SLAB?
> >
> > AFAICT SLAB allocations should trigger reclaim.
> >
>
> Well, let me quote your commit 952f3b51beb5:

This was about global reclaim. Local reclaim is good and that can be
done via zone_reclaim.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka
On 03/02/2015 04:46 PM, Christoph Lameter wrote:
> On Mon, 2 Mar 2015, Vlastimil Babka wrote:
> 
>> So it would be IMHO better for longer-term maintainability to have the
>> relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these
>> opportunistic allocation attempts, instead of having this subtle semantic
> 
> You are thinking about an opportunistic allocation attempt in SLAB?
> 
> AFAICT SLAB allocations should trigger reclaim.
> 

Well, let me quote your commit 952f3b51beb5:


commit 952f3b51beb592f3f1de15adcdef802fc086ea91
Author: Christoph Lameter 
Date:   Wed Dec 6 20:33:26 2006 -0800

[PATCH] GFP_THISNODE must not trigger global reclaim

The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node.  If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
then we will currently do some reclaim before returning NULL.  The caller
may want memory from other nodes before reclaim should be triggered.  (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d123b3..dc8753b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1151,6 +1151,17 @@ restart:
if (page)
goto got_pg;
 
+   /*
+* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
+* __GFP_NOWARN set) should not cause reclaim since the subsystem
+* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
+* using a larger set of nodes after it has established that the
+* allowed per node queues are empty and that nodes are
+* over allocated.
+*/
+   if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
+   goto nopage;
+
for (z = zonelist->zones; *z; z++)
wakeup_kswapd(*z, order);


So it seems to me that *at least some* allocations in slab are supposed
to work like this? Of course it's possible that since 2006, more
allocation sites in slab started passing GFP_THISNODE without realizing
this semantics. In that case, such sites should be fixed. (I think David
already mentioned some in this thread?)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Christoph Lameter
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

> So it would be IMHO better for longer-term maintainability to have the
> relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these
> opportunistic allocation attempts, instead of having this subtle semantic

You are thinking about an opportunistic allocation attempt in SLAB?

AFAICT SLAB allocations should trigger reclaim.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka

On 02/27/2015 11:16 PM, David Rientjes wrote:

NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is
a no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes 


Acked-by: Vlastimil Babka 

So you've convinced me that this is a non-functional change for slab and 
a prerequisity for patch 2/3 which is the main point of this series for 
4.0. That said, the new 'goto nopage' condition is still triggered by a 
combination of flag states, and the less we have those, the better for 
us IMHO.


Looking at commit 952f3b51be which introduced this particular check 
using GFP_THISNODE, it seemed like it was a workaround to avoid 
triggering reclaim, without needing a new gfp flag. Nowadays, we have 
such flag called __GFP_NO_KSWAPD and as I explained in my reply to v1 
(where I missed the new condition), passing the flag would already 
prevent wake_all_kswapds() and treating the allocation as atomic in 
gfp_to_alloc_flags(). So the whole difference would be another 
get_page_from_freelist() attempt (possibly less constrained than the 
fast path one) and then bail out on !wait.


So it would be IMHO better for longer-term maintainability to have the 
relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote 
these opportunistic allocation attempts, instead of having this subtle 
semantic difference attached to __GFP_THISNODE && !__GFP_WAIT. It would 
be probably too risky for 4.0. On the other hand, I don't think even 
this series is really urgent. It's true that patch 2/3 fixes two 
commits, including a 4.0 one, but those commits are already not 
regressions without the fix. But maybe current -rcX is low enough to 
proceed with this series and catch any regressions in time, allowing the 
larger cleanups later.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread David Rientjes
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

You are thinking about an opportunistic allocation attempt in SLAB?

AFAICT SLAB allocations should trigger reclaim.

   
   Well, let me quote your commit 952f3b51beb5:
  
  This was about global reclaim. Local reclaim is good and that can be
  done via zone_reclaim.
 
 Right, so the patch is a functional change for zone_reclaim_mode == 1, where
 !__GFP_WAIT will prevent it.
 

My patch is not a functional change, get_page_from_freelist() handles 
zone_reclaim_mode == 1 properly in the page allocator fastpath.  This 
patch only touches the slowpath.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka

On 02/27/2015 11:16 PM, David Rientjes wrote:

NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 (mm: fix
GFP_THISNODE callers and clarify) fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is
a no-op in these cases since the page allocator special-cases
__GFP_THISNODE  __GFP_NORETRY  __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY  __GFP_NOWARN to reclaim.  The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes rient...@google.com


Acked-by: Vlastimil Babka vba...@suse.cz

So you've convinced me that this is a non-functional change for slab and 
a prerequisity for patch 2/3 which is the main point of this series for 
4.0. That said, the new 'goto nopage' condition is still triggered by a 
combination of flag states, and the less we have those, the better for 
us IMHO.


Looking at commit 952f3b51be which introduced this particular check 
using GFP_THISNODE, it seemed like it was a workaround to avoid 
triggering reclaim, without needing a new gfp flag. Nowadays, we have 
such flag called __GFP_NO_KSWAPD and as I explained in my reply to v1 
(where I missed the new condition), passing the flag would already 
prevent wake_all_kswapds() and treating the allocation as atomic in 
gfp_to_alloc_flags(). So the whole difference would be another 
get_page_from_freelist() attempt (possibly less constrained than the 
fast path one) and then bail out on !wait.


So it would be IMHO better for longer-term maintainability to have the 
relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote 
these opportunistic allocation attempts, instead of having this subtle 
semantic difference attached to __GFP_THISNODE  !__GFP_WAIT. It would 
be probably too risky for 4.0. On the other hand, I don't think even 
this series is really urgent. It's true that patch 2/3 fixes two 
commits, including a 4.0 one, but those commits are already not 
regressions without the fix. But maybe current -rcX is low enough to 
proceed with this series and catch any regressions in time, allowing the 
larger cleanups later.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka

On 03/02/2015 05:08 PM, Christoph Lameter wrote:

On Mon, 2 Mar 2015, Vlastimil Babka wrote:


You are thinking about an opportunistic allocation attempt in SLAB?

AFAICT SLAB allocations should trigger reclaim.



Well, let me quote your commit 952f3b51beb5:


This was about global reclaim. Local reclaim is good and that can be
done via zone_reclaim.


Right, so the patch is a functional change for zone_reclaim_mode == 1, 
where !__GFP_WAIT will prevent it.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Christoph Lameter
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

 So it would be IMHO better for longer-term maintainability to have the
 relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these
 opportunistic allocation attempts, instead of having this subtle semantic

You are thinking about an opportunistic allocation attempt in SLAB?

AFAICT SLAB allocations should trigger reclaim.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Vlastimil Babka
On 03/02/2015 04:46 PM, Christoph Lameter wrote:
 On Mon, 2 Mar 2015, Vlastimil Babka wrote:
 
 So it would be IMHO better for longer-term maintainability to have the
 relevant __GFP_THISNODE callers pass also __GFP_NO_KSWAPD to denote these
 opportunistic allocation attempts, instead of having this subtle semantic
 
 You are thinking about an opportunistic allocation attempt in SLAB?
 
 AFAICT SLAB allocations should trigger reclaim.
 

Well, let me quote your commit 952f3b51beb5:


commit 952f3b51beb592f3f1de15adcdef802fc086ea91
Author: Christoph Lameter clame...@sgi.com
Date:   Wed Dec 6 20:33:26 2006 -0800

[PATCH] GFP_THISNODE must not trigger global reclaim

The intent of GFP_THISNODE is to make sure that an allocation occurs on a
particular node.  If this is not possible then NULL needs to be returned so
that the caller can choose what to do next on its own (the slab allocator
depends on that).

However, GFP_THISNODE currently triggers reclaim before returning a failure
(GFP_THISNODE means GFP_NORETRY is set).  If we have over allocated a node
then we will currently do some reclaim before returning NULL.  The caller
may want memory from other nodes before reclaim should be triggered.  (If
the caller wants reclaim then he can directly use __GFP_THISNODE instead).

There is no flag to avoid reclaim in the page allocator and adding yet
another GFP_xx flag would be difficult given that we are out of available
flags.

So just compare and see if all bits for GFP_THISNODE (__GFP_THISNODE,
__GFP_NORETRY and __GFP_NOWARN) are set.  If so then we return NULL before
waking up kswapd.

Signed-off-by: Christoph Lameter clame...@sgi.com
Signed-off-by: Andrew Morton a...@osdl.org
Signed-off-by: Linus Torvalds torva...@osdl.org

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d123b3..dc8753b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1151,6 +1151,17 @@ restart:
if (page)
goto got_pg;
 
+   /*
+* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
+* __GFP_NOWARN set) should not cause reclaim since the subsystem
+* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
+* using a larger set of nodes after it has established that the
+* allowed per node queues are empty and that nodes are
+* over allocated.
+*/
+   if (NUMA_BUILD  (gfp_mask  GFP_THISNODE) == GFP_THISNODE)
+   goto nopage;
+
for (z = zonelist-zones; *z; z++)
wakeup_kswapd(*z, order);


So it seems to me that *at least some* allocations in slab are supposed
to work like this? Of course it's possible that since 2006, more
allocation sites in slab started passing GFP_THISNODE without realizing
this semantics. In that case, such sites should be fixed. (I think David
already mentioned some in this thread?)


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-03-02 Thread Christoph Lameter
On Mon, 2 Mar 2015, Vlastimil Babka wrote:

  You are thinking about an opportunistic allocation attempt in SLAB?
 
  AFAICT SLAB allocations should trigger reclaim.
 

 Well, let me quote your commit 952f3b51beb5:

This was about global reclaim. Local reclaim is good and that can be
done via zone_reclaim.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread David Rientjes
On Fri, 27 Feb 2015, Christoph Lameter wrote:

> > +/*
> > + * Construct gfp mask to allocate from a specific node but do not invoke 
> > reclaim
> > + * or warn about failures.
> > + */
> 
> We should be triggering reclaim from slab allocations. Why would we not do
> this?
> 
> Otherwise we will be going uselessly off node for slab allocations.
> 
> > +static inline gfp_t gfp_exact_node(gfp_t flags)
> > +{
> > +   return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> > +}
> >  #endif
> 
> Reclaim needs to be triggered. In particular zone reclaim was made to be
> triggered from slab allocations to create more room if needed.
> 

This illustrates the precise need for a patch like this that removes 
GFP_THISNODE entirely: notice there's no functional change with this 
patch.

GFP_THISNODE is __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY.

The calls to cache_alloc_node() and cache_grow() modified by this 
patch in mm/slab.c that pass GFP_THISNODE get caught in the page allocator 
slowpath by this:

if (IS_ENABLED(CONFIG_NUMA) &&
(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

with today's kernel.  In fact, there is no way for the slab allocator to 
currently allocate exactly on one node, allow reclaim, and avoid looping 
forever while suppressing the page allocation failure warning.  The reason 
is because of how GFP_THISNODE is defined above.

With this patch, it would be possible to modify gfp_exact_node() so that 
instead of doing

return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;

which has no functional change from today, it could be

return flags | __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY;

so that we _can_ do reclaim for that node and avoid looping forever when 
the allocation fails.  These three flags are the exact same bits set in 
today's GFP_THISNODE and it is, I agree, what the slab allocator really 
wants to do in cache_grow().  But the conditional above is what 
short-circuits such an allocation and needs to be removed, which is what 
this patch does.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread Christoph Lameter
On Fri, 27 Feb 2015, David Rientjes wrote:

> +/*
> + * Construct gfp mask to allocate from a specific node but do not invoke 
> reclaim
> + * or warn about failures.
> + */

We should be triggering reclaim from slab allocations. Why would we not do
this?

Otherwise we will be going uselessly off node for slab allocations.

> +static inline gfp_t gfp_exact_node(gfp_t flags)
> +{
> + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> +}
>  #endif

Reclaim needs to be triggered. In particular zone reclaim was made to be
triggered from slab allocations to create more room if needed.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread David Rientjes
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 ("mm: fix 
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is
a no-op in these cases since the page allocator special-cases
__GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes 
---
 v2: fix typos in commit message per Vlastimil

 include/linux/gfp.h| 10 --
 mm/page_alloc.c| 22 ++
 mm/slab.c  | 22 ++
 net/openvswitch/flow.c |  4 +++-
 4 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -117,16 +117,6 @@ struct vm_area_struct;
 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
 __GFP_NO_KSWAPD)
 
-/*
- * GFP_THISNODE does not perform any reclaim, you most likely want to
- * use __GFP_THISNODE to allocate from a given node without fallback!
- */
-#ifdef CONFIG_NUMA
-#define GFP_THISNODE   (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-#else
-#define GFP_THISNODE   ((__force gfp_t)0)
-#endif
-
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer does not compensate for light reclaim */
if (!(gfp_mask & __GFP_FS))
goto out;
-   /*
-* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
-* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
-* The caller should handle page allocation failure by itself if
-* it specifies __GFP_THISNODE.
-* Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
-*/
+   /* The OOM killer may not free memory on a specific node */
if (gfp_mask & __GFP_THISNODE)
goto out;
}
@@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
}
 
/*
-* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
-* __GFP_NOWARN set) should not cause reclaim since the subsystem
-* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
-* using a larger set of nodes after it has established that the
-* allowed per node queues are empty and that nodes are
-* over allocated.
+* If this allocation cannot block and it is for a specific node, then
+* fail early.  There's no need to wakeup kswapd or retry for a
+* speculative node-specific allocation.
 */
-   if (IS_ENABLED(CONFIG_NUMA) &&
-   (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
+   if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
goto nopage;
 
 retry:
@@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/*
 * Check the zones suitable for the gfp_mask contain at least one
 * valid zone. It's possible to have an empty zonelist as a result
-* of GFP_THISNODE and a memoryless node
+* of __GFP_THISNODE and a memoryless node
 */
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,6 +857,11 @@ static inline void *cache_alloc_node(struct kmem_cache 
*cachep,

Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread David Rientjes
On Fri, 27 Feb 2015, Christoph Lameter wrote:

  +/*
  + * Construct gfp mask to allocate from a specific node but do not invoke 
  reclaim
  + * or warn about failures.
  + */
 
 We should be triggering reclaim from slab allocations. Why would we not do
 this?
 
 Otherwise we will be going uselessly off node for slab allocations.
 
  +static inline gfp_t gfp_exact_node(gfp_t flags)
  +{
  +   return (flags | __GFP_THISNODE | __GFP_NOWARN)  ~__GFP_WAIT;
  +}
   #endif
 
 Reclaim needs to be triggered. In particular zone reclaim was made to be
 triggered from slab allocations to create more room if needed.
 

This illustrates the precise need for a patch like this that removes 
GFP_THISNODE entirely: notice there's no functional change with this 
patch.

GFP_THISNODE is __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY.

The calls to cache_alloc_node() and cache_grow() modified by this 
patch in mm/slab.c that pass GFP_THISNODE get caught in the page allocator 
slowpath by this:

if (IS_ENABLED(CONFIG_NUMA) 
(gfp_mask  GFP_THISNODE) == GFP_THISNODE)
goto nopage;

with today's kernel.  In fact, there is no way for the slab allocator to 
currently allocate exactly on one node, allow reclaim, and avoid looping 
forever while suppressing the page allocation failure warning.  The reason 
is because of how GFP_THISNODE is defined above.

With this patch, it would be possible to modify gfp_exact_node() so that 
instead of doing

return (flags | __GFP_THISNODE | __GFP_NOWARN)  ~__GFP_WAIT;

which has no functional change from today, it could be

return flags | __GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY;

so that we _can_ do reclaim for that node and avoid looping forever when 
the allocation fails.  These three flags are the exact same bits set in 
today's GFP_THISNODE and it is, I agree, what the slab allocator really 
wants to do in cache_grow().  But the conditional above is what 
short-circuits such an allocation and needs to be removed, which is what 
this patch does.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread David Rientjes
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 (mm: fix 
GFP_THISNODE callers and clarify) fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even it was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is
a no-op in these cases since the page allocator special-cases
__GFP_THISNODE  __GFP_NORETRY  __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and
its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
wants to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do
__GFP_THISNODE but also __GFP_NORETRY  __GFP_NOWARN to reclaim.  The
rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes rient...@google.com
---
 v2: fix typos in commit message per Vlastimil

 include/linux/gfp.h| 10 --
 mm/page_alloc.c| 22 ++
 mm/slab.c  | 22 ++
 net/openvswitch/flow.c |  4 +++-
 4 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -117,16 +117,6 @@ struct vm_area_struct;
 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
 __GFP_NO_KSWAPD)
 
-/*
- * GFP_THISNODE does not perform any reclaim, you most likely want to
- * use __GFP_THISNODE to allocate from a given node without fallback!
- */
-#ifdef CONFIG_NUMA
-#define GFP_THISNODE   (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
-#else
-#define GFP_THISNODE   ((__force gfp_t)0)
-#endif
-
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2355,13 +2355,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer does not compensate for light reclaim */
if (!(gfp_mask  __GFP_FS))
goto out;
-   /*
-* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
-* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
-* The caller should handle page allocation failure by itself if
-* it specifies __GFP_THISNODE.
-* Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
-*/
+   /* The OOM killer may not free memory on a specific node */
if (gfp_mask  __GFP_THISNODE)
goto out;
}
@@ -2615,15 +2609,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
}
 
/*
-* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
-* __GFP_NOWARN set) should not cause reclaim since the subsystem
-* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
-* using a larger set of nodes after it has established that the
-* allowed per node queues are empty and that nodes are
-* over allocated.
+* If this allocation cannot block and it is for a specific node, then
+* fail early.  There's no need to wakeup kswapd or retry for a
+* speculative node-specific allocation.
 */
-   if (IS_ENABLED(CONFIG_NUMA) 
-   (gfp_mask  GFP_THISNODE) == GFP_THISNODE)
+   if (IS_ENABLED(CONFIG_NUMA)  (gfp_mask  __GFP_THISNODE)  !wait)
goto nopage;
 
 retry:
@@ -2816,7 +2806,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/*
 * Check the zones suitable for the gfp_mask contain at least one
 * valid zone. It's possible to have an empty zonelist as a result
-* of GFP_THISNODE and a memoryless node
+* of __GFP_THISNODE and a memoryless node
 */
if (unlikely(!zonelist-_zonerefs-zone))
return NULL;
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,6 +857,11 @@ static inline void *cache_alloc_node(struct kmem_cache 
*cachep,
return 

Re: [patch v2 1/3] mm: remove GFP_THISNODE

2015-02-27 Thread Christoph Lameter
On Fri, 27 Feb 2015, David Rientjes wrote:

 +/*
 + * Construct gfp mask to allocate from a specific node but do not invoke 
 reclaim
 + * or warn about failures.
 + */

We should be triggering reclaim from slab allocations. Why would we not do
this?

Otherwise we will be going uselessly off node for slab allocations.

 +static inline gfp_t gfp_exact_node(gfp_t flags)
 +{
 + return (flags | __GFP_THISNODE | __GFP_NOWARN)  ~__GFP_WAIT;
 +}
  #endif

Reclaim needs to be triggered. In particular zone reclaim was made to be
triggered from slab allocations to create more room if needed.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/